• Title/Summary/Keyword: Word Extraction


Multi-cue Integration for Automatic Annotation (자동 주석을 위한 멀티 큐 통합)

  • Shin, Seong-Yoon;Rhee, Yang-Won
    • Proceedings of the Korean Society of Computer Information Conference / 2010.07a / pp.151-152 / 2010
  • WWW images are located in structured, networked documents, so the importance of a word can be indicated by its location and frequency. There are two patterns for multi-cue integration annotation. The multi-cue integration algorithm shows initial promise as an indicator of semantic keyphrases for web images. Latent semantic automatic keyphrase extraction, improved through the use of multiple cues, is expected to be preferable.


A Study on Word Semantic Categories for Natural Language Question Type Classification and Answer Extraction (자연어 질의 유형판별과 응답 추출을 위한 어휘 의미체계에 관한 연구)

  • Yoon Sung-Hee
    • Proceedings of the KAIS Fall Conference / 2004.11a / pp.141-144 / 2004
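  • What distinguishes a question answering system from an information retrieval system is its query processing: identifying the user's intent from a natural language question and classifying the question type. This paper proposes a question type classification method that, without relying on complex classification rules or large-scale dictionary information, extracts the words corresponding to interrogatives from the question sentence and uses the semantic information of the nouns appearing around them to determine a fine-grained answer type. It also proposes a way to handle questions in which the interrogative is omitted, and a way to improve question type classification performance using synonym and suffix information.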


Extraction of the Latent Index Terms Using the Word Frequency and Part of Speech in Automatic Indexing (자동색인에서 단어의 품사와 빈도를 이용한 색인후보어 발췌)

  • 이태영;남궁황
    • Proceedings of the Korean Society for Information Management Conference / 2001.08a / pp.181-184 / 2001
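  • In this paper, well-known statistical and syntactic-analysis techniques are combined to automatically extract suitable index terms. Rather than reporting the results as retrieval effectiveness, the words extracted by each method are shown empirically so that readers can judge performance for themselves. Applying frequency and part of speech together produced better results than using either one alone.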

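The idea of combining word frequency with part of speech, as described in the abstract above, can be sketched roughly as follows. This is only an illustration, not the authors' method: the function name `index_candidates`, the tag label `"NOUN"`, the threshold `min_freq`, and the pre-tagged sample tokens are all assumptions made for the example.

```python
from collections import Counter

def index_candidates(tagged_tokens, allowed_pos=("NOUN",), min_freq=2):
    """Keep words whose part of speech is allowed AND whose frequency is high enough."""
    # Count only tokens that pass the part-of-speech filter.
    counts = Counter(word for word, pos in tagged_tokens if pos in allowed_pos)
    # Apply the frequency filter on top of the part-of-speech filter.
    return sorted(word for word, count in counts.items() if count >= min_freq)

# Hypothetical, already-tagged input (a real system would use a morphological analyzer).
tagged = [
    ("index", "NOUN"), ("extract", "VERB"), ("index", "NOUN"),
    ("term", "NOUN"), ("extract", "VERB"), ("term", "NOUN"),
    ("method", "NOUN"),
]
candidates = index_candidates(tagged)
```

Here "extract" is frequent but not a noun, and "method" is a noun but not frequent, so neither survives when both cues are applied together.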

A Study on the Deduction of Social Issues Applying Word Embedding: With an Empasis on News Articles related to the Disables (단어 임베딩(Word Embedding) 기법을 적용한 키워드 중심의 사회적 이슈 도출 연구: 장애인 관련 뉴스 기사를 중심으로)

  • Choi, Garam;Choi, Sung-Pil
    • Journal of the Korean Society for Information Management / v.35 no.1 / pp.231-250 / 2018
  • In this paper, we propose a new methodology for extracting and formalizing subjective topics at a specific time using a set of keywords extracted automatically from online news articles. To do this, we first extracted a set of keywords by applying a TF-IDF method, selected through a series of comparative experiments on various statistical weighting schemes that measure the importance of individual words in a large set of texts. To calculate the semantic relations between the extracted keywords effectively, a set of word embedding vectors was constructed from about 1,000,000 separately collected news articles. The extracted keywords were represented as numerical vectors and clustered with the K-means algorithm. A qualitative in-depth analysis of the resulting keyword clusters showed that most of the clusters were appropriate topics with sufficient semantic concentration for us to assign labels to them easily.
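As a rough illustration of the keyword extraction step mentioned in the abstract above, the sketch below scores terms with a plain TF-IDF weight (term frequency times log of inverse document frequency). The abstract does not say which TF-IDF variant was selected, so this particular formula, the function name `tfidf_keywords`, and the toy documents are assumptions; the embedding and K-means steps are not shown.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results

# Hypothetical toy corpus, whitespace-tokenized.
docs = [
    "the welfare policy for disabled people".split(),
    "the new transport policy".split(),
    "the welfare budget debate".split(),
]
keywords = tfidf_keywords(docs, top_k=2)
```

A term that occurs in every document, such as "the", gets a weight of zero and so never ranks as a keyword.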

Keyword Spotting on Hangul Document Images Using Character Feature Models (문자 별 특징 모델을 이용한 한글 문서 영상에서 키워드 검색)

  • Park, Sang-Cheol;Kim, Soo-Hyung;Choi, Deok-Jai
    • The KIPS Transactions:PartB / v.12B no.5 s.101 / pp.521-526 / 2005
  • In this paper, we propose a keyword spotting system as an alternative search system for poor-quality Korean document images and compare the proposed system with an OCR-based document retrieval system. The system is composed of character segmentation, feature extraction for the query keyword, and word-to-word matching. In the character segmentation step, we propose an effective method for removing the connectivity between adjacent characters and a character segmentation method that minimizes the variance of character widths. In the query creation step, the feature vector for the query is constructed by combining character models by typeface. In the matching step, word-to-word matching is applied based on character-to-character matching. We demonstrate that the proposed keyword spotting system is more efficient than the OCR-based one for searching keywords in Korean document images, especially when the quality of the documents is quite poor and the point size is small.

An Artificial Intelligence Approach for Word Semantic Similarity Measure of Hindi Language

  • Younas, Farah;Nadir, Jumana;Usman, Muhammad;Khan, Muhammad Attique;Khan, Sajid Ali;Kadry, Seifedine;Nam, Yunyoung
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.6 / pp.2049-2068 / 2021
  • AI combined with NLP techniques has promoted the use of virtual assistants and has made people rely on them for many diverse uses. Conversational agents are the most promising technique for assisting computer users through their operation. An important challenge in developing conversational agents globally is transferring the groundbreaking expertise obtained in English to other languages. AI is making it possible to transfer this learning. There is a dire need to develop systems that understand secular languages. One such difficult language is Hindi, which is the fourth most spoken language in the world. Semantic similarity is an important part of natural language processing, with applications such as ontology learning and information extraction, for developing conversational agents. Most of the research is concentrated on English and other European languages. This paper presents a corpus-based word semantic similarity measure for Hindi. An experiment involving the translation of the English benchmark dataset into Hindi is performed, investigating the incorporation of the corpus, with human and machine similarity ratings. A significant correlation between human intuition and the algorithm's ratings was calculated to analyze the accuracy of the proposed similarity measures. The method can be adapted to various applications of word semantic similarity or to other languages.
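The abstract above evaluates a similarity measure by correlating its ratings with human ratings. A minimal sketch of such a comparison is given below using the Pearson coefficient; the abstract does not state which correlation statistic was used, so Pearson is an assumption here, and the rating values are invented for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings for five word pairs (not taken from the paper).
human = [3.9, 3.1, 0.4, 1.2, 2.5]         # human judgments on a 0-4 scale
machine = [0.92, 0.70, 0.10, 0.35, 0.61]  # algorithm scores on a 0-1 scale
r = pearson(human, machine)
```

Because the coefficient is scale-invariant, the two rating scales do not need to match; a value of `r` close to 1 indicates that the algorithm orders word pairs much as the human raters do.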

Heuristic-based Korean Coreference Resolution for Information Extraction

  • Euisok Chung;Soojong Lim;Yun, Bo-Hyun
    • Proceedings of the Korean Society for Language and Information Conference / 2002.02a / pp.50-58 / 2002
  • Information extraction delimits in advance, as part of the specification of the task, the semantic range of the output and filters information from large volumes of text. The most representative words of a document are named entities and pronouns. Therefore, it is important to resolve coreference in order to extract meaningful information. Coreference resolution finds the named entities that co-refer to real-world entities in the documents. The results of coreference resolution are used for named entity detection and template generation. This paper presents a heuristic-based approach to coreference resolution in Korean. We constructed heuristics that were expanded gradually using a corpus, and derived the salience factors of antecedents as an importance measure in Korean. Our approach consists of antecedent selection and antecedent weighting. We used three kinds of salience factors to weight each antecedent of the anaphor. The experimental result shows 80% precision.


Cross-Domain Text Sentiment Classification Method Based on the CNN-BiLSTM-TE Model

  • Zeng, Yuyang;Zhang, Ruirui;Yang, Liang;Song, Sujuan
    • Journal of Information Processing Systems / v.17 no.4 / pp.818-833 / 2021
  • To address the problems of low precision, insufficient feature extraction, and poor contextual ability in existing text sentiment analysis methods, a mixed CNN-BiLSTM-TE (convolutional neural network, bidirectional long short-term memory, and topic extraction) model was proposed. First, Chinese text data was converted into vectors through transfer learning with Word2Vec. Second, local features were extracted by the CNN model. Then, contextual information was extracted by the BiLSTM neural network, and the emotional tendency was obtained using softmax. Finally, topics were extracted using term frequency-inverse document frequency and K-means. Compared with the CNN, BiLSTM, and gated recurrent unit (GRU) models, the CNN-BiLSTM-TE model's F1-score was higher by 0.0147, 0.006, and 0.0052, respectively. Compared with the CNN-LSTM, LSTM-CNN, and BiLSTM-CNN models, the F1-score was higher by 0.0071, 0.0038, and 0.0049, respectively. Experimental results showed that the CNN-BiLSTM-TE model can effectively improve various indicators in application. Lastly, scalability verification was performed on a takeaway dataset, which has great value in practical applications.

Rule Based Document Conversion and Information Extraction on the Word Document (워드문서 콘텐츠의 사용자 XML 콘텐츠로의 변환 및 저장 시스템 개발)

  • Joo, Won-Kyun;Yang, Myung-Seok;Kim, Tae-Hyun;Lee, Min-Ho;Choi, Ki-Seok
    • Proceedings of the Korea Contents Association Conference / 2006.11a / pp.555-559 / 2006
  • This paper aims to contribute to extracting and storing various forms of information of interest to users by using user-defined structural rules and XML-based word document conversion techniques. The system, named PPE, consists of three essential elements: a converting element, which converts word documents such as HWP and DOC into XML documents; an extracting element, which prepares structural rules and extracts the relevant information from the XML documents according to those rules; and a storing element, which produces the final XML document or stores it in a database system. For word document conversion, we developed an OCX-based word conversion daemon. To help users extract information, we developed a script language with a native function/variable processing engine extended from XSLT. The system can be used to construct word document content databases or to provide various information services based on raw word documents. We applied it to a project management system and a project result management system.


Text extraction from camera based document image (카메라 기반 문서영상에서의 문자 추출)

  • 박희주;김진호
    • Journal of Korea Society of Industrial Information Systems / v.8 no.2 / pp.14-20 / 2003
  • This paper presents a text extraction method for camera-based document images. Camera-based document images are more difficult to recognize than scanner-based images because of segmentation problems caused by variable lighting conditions and varied fonts. Both document binarization and character extraction are important steps in recognizing camera-based document images. After the color image is converted into a gray-level image, gray-level normalization is used to extract character regions independently of the lighting conditions and background image. After noise removal, a local adaptive binarization method is then used to separate characters from the background. In the character extraction step, the horizontal and vertical projections and the connected components are used to extract character lines, word regions, and character regions. To evaluate the proposed method, we experimented with documents from the ETRI database that mix Hangul, English, symbols, and digits. Encouraging binarization and character extraction results were obtained.

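The use of horizontal and vertical projections to locate text lines and word regions, mentioned in the last abstract, can be sketched as follows. This is a simplified illustration on an already-binarized image (1 = ink, 0 = background); the function names, the zero threshold, and the tiny sample image are assumptions, and the paper's normalization, adaptive binarization, and connected-component steps are not reproduced.

```python
def horizontal_projection(binary):
    """Number of foreground pixels in each row."""
    return [sum(row) for row in binary]

def vertical_projection(binary):
    """Number of foreground pixels in each column."""
    return [sum(col) for col in zip(*binary)]

def find_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends at a blank gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

# Hypothetical 5x6 binary image with two text lines.
image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = find_runs(horizontal_projection(image))           # row ranges of text lines
top, bottom = lines[0]
words = find_runs(vertical_projection(image[top:bottom])) # column ranges in line 1
```

Blank rows separate text lines, and blank columns within a line separate word or character regions; in practice the threshold would be raised above zero to tolerate residual noise.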