• Title/Summary/Keyword: Word Extraction


Korean Emotion Vocabulary: Extraction and Categorization of Feeling Words (한국어 감정표현단어의 추출과 범주화)

  • Sohn, Sun-Ju;Park, Mi-Sook;Park, Ji-Eun;Sohn, Jin-Hun
    • Science of Emotion and Sensibility / v.15 no.1 / pp.105-120 / 2012
  • This study aimed to develop a Korean emotion vocabulary list that can serve as a tool for understanding human feelings. The focus was on carefully extracting the most widely used feeling words and categorizing them into groups of emotions according to their meaning in real-life use. A total of 12 professionals (including graduate students majoring in Korean) took part in the extraction. Using the Korean 'word frequency list' developed by Yonsei University and several sorting processes, the study condensed the original 64,666 emotion words into a final list of 504 words. In the next step, 80 social work students evaluated the meaning of each word and classified it into whichever of the following categories seemed most appropriate: 'happiness', 'sadness', 'fear', 'anger', 'disgust', 'surprise', 'interest', 'boredom', 'pain', 'neutral', and 'other'. Of the 504 feeling words, 426 expressed a single emotion, 72 reflected two emotions (i.e., the same word indicating two distinct emotions), and 6 reflected three emotions. Among the 426 single-emotion words, 'sadness' was predominant, followed by 'anger' and 'happiness'. Among the 72 words that showed two emotions, the most common combination was 'anger' and 'disgust', followed by 'sadness' and 'fear', and 'happiness' and 'interest'. The significance of the study lies in developing a list of Korean feeling words that can be combined with other emotion signals, such as facial expressions, to improve emotion recognition research, particularly in the Human-Computer Interface (HCI) area. The identification of feeling words that connote more than one emotion is also noteworthy.


CNN-based Distant Supervision Relation Extraction Model with Multi-sense Word Embedding (다중-어의 단어 임베딩을 적용한 CNN 기반 원격 지도 학습 관계 추출 모델)

  • Nam, Sangha;Han, Kijong;Kim, Eun-Kyung;Gwon, Seong-Gu;Jeong, Yu-Seong;Choi, Key-Sun
    • Annual Conference on Human and Language Technology / 2017.10a / pp.137-142 / 2017
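  • Distant supervision automatically generates annotated data by aligning a very large corpus with a knowledge base, so the training data needed for machine learning can be produced cheaply and without human labor; for this reason, many studies apply distant supervision to the relation extraction problem. However, the word embeddings that existing studies use as model input do not reflect homography. Because homographs with different meanings share a single embedding, the relation extraction model is trained without accurately capturing word meaning. This paper proposes a distant-supervision relation extraction model that uses multi-sense word embeddings. A word sense disambiguation module is used to learn the multi-sense word embeddings, and the relation extraction models are CNN and PCNN, which efficiently capture the key features of a sentence. To evaluate the proposed model, two additional kinds of word embeddings were trained for comparison; the results show that relation extraction performance improved when the embeddings built with the word sense disambiguation module were used.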
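
The idea of giving each sense of a homograph its own embedding can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the cue-word disambiguator, the `word#sense` key format, the English example word, and the toy vectors are all assumptions introduced here as a stand-in for the word sense disambiguation module the abstract mentions.

```python
# Hypothetical cue words per sense; a real system would use a trained
# word sense disambiguation module instead of this lookup.
SENSE_CUES = {
    "bank": {"finance": {"money", "loan"}, "river": {"water", "shore"}},
}

# Toy multi-sense embedding table keyed by "word#sense" (values invented).
EMBEDDINGS = {
    "bank#finance": [0.9, 0.1],
    "bank#river": [0.1, 0.9],
}


def tag_senses(tokens):
    """Append a sense label to each ambiguous token, based on context overlap."""
    context = set(tokens)
    tagged = []
    for tok in tokens:
        senses = SENSE_CUES.get(tok)
        if not senses:
            tagged.append(tok)  # unambiguous word: leave unchanged
            continue
        # Pick the sense whose cue words overlap most with the sentence.
        best = max(senses, key=lambda s: len(senses[s] & context))
        tagged.append(f"{tok}#{best}")
    return tagged


def embed(tagged_tokens):
    """Look up sense-specific vectors; tokens without a vector are skipped."""
    return [EMBEDDINGS[t] for t in tagged_tokens if t in EMBEDDINGS]
```

With sense-tagged tokens, the two uses of the same surface form map to different vectors before they reach the relation extraction model.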



Performance Improvement of Bilingual Lexicon Extraction via Pivot Language and Word Alignment Tool (중간언어와 단어정렬을 통한 이중언어 사전의 자동 추출에 대한 성능 개선)

  • Kwon, Hong-Seok;Seo, Hyeung-Won;Kim, Jae-Hoon
    • Annual Conference on Human and Language Technology / 2013.10a / pp.27-32 / 2013
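  • This paper proposes a method for automatically extracting bilingual lexicons from parallel corpora for lesser-known language pairs. The method uses a pivot language as an intermediary and builds context vectors with Anymalign, a publicly available word alignment tool. As a result, context vectors no longer need to be translated with a seed dictionary, and translation accuracy improves for low-frequency words, a known weakness of statistical methods. In addition, using bidirectional translation probabilities above a given threshold as the element values of the context vectors substantially improves top-5 translation accuracy. Experiments extracted bilingual lexicons in both directions for two language pairs, Korean-Spanish and Korean-French. Compared with the results reported in previous work, translation accuracy improved by at least 3.41% and at most 67.91% for high-frequency words, and by at least 5.06% and at most 990% for low-frequency words.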
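
A minimal sketch of the pivot-based ranking described above, under stated assumptions: each source and target word is represented by a context vector over pivot-language words, entries below a threshold are dropped, and candidates are ranked by cosine similarity. The function names, the threshold value, the use of cosine similarity, and the toy scores are all illustrative assumptions; the abstract does not specify them, and Anymalign itself is not called here.

```python
import math


def prune(vector, threshold=0.1):
    """Keep only pivot words whose translation score reaches the threshold."""
    return {pivot: w for pivot, w in vector.items() if w >= threshold}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_translations(source_vec, target_vecs, k=5, threshold=0.1):
    """Rank target words by similarity of their pivot-language context vectors."""
    src = prune(source_vec, threshold)
    scored = [(cosine(src, prune(vec, threshold)), word)
              for word, vec in target_vecs.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy scores over an English pivot vocabulary (invented for illustration).
korean = {"집": {"house": 0.8, "home": 0.6, "the": 0.05}}
spanish = {
    "casa": {"house": 0.7, "home": 0.5},
    "hogar": {"home": 0.8, "house": 0.2},
    "perro": {"dog": 0.9},
}
candidates = top_translations(korean["집"], spanish, k=2)
```

Because both vectors live in the pivot vocabulary, no seed dictionary is needed to compare them, which matches the benefit the abstract reports.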


Status Report on the Korean Speech Recognition Platform (한국어 음성인식 플랫폼 개발현황)

  • Kwon, Oh-Wook;Kwon, Suk-Bong;Jang, Gyu-Cheol;Yun, Sung-rack;Kim, Yong-Rae;Jang, Kwang-Dong;Kim, Hoi-Rin;Yoo, Chang-Dong;Kim, Bong-Wan;Lee, Yong-Ju
    • Proceedings of the KSPS conference / 2005.11a / pp.215-218 / 2005
  • This paper reports the current status of development of the Korean speech recognition platform (ECHOS). We implement new modules including ETSI feature extraction, backward search with trigram, and utterance verification. The ETSI feature extraction module is implemented by converting the public software into an object-oriented program. We show that trigram language modeling in the backward search pass reduces the word error rate from 23.5% to 22% on a large-vocabulary continuous speech recognition task. We validate the utterance verification module by examining word graphs with confidence scores.
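
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below shows that standard definition and the relative reduction implied by the reported figures; it is not the ECHOS scoring code, and the function names are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Going from 23.5% to 22% is a 1.5-point absolute drop,
# which is roughly a 6.4% relative reduction.
reduction = relative_reduction(23.5, 22.0)
```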


Concept-based Question Answering System

  • Kang Yu-Hwan;Shin Seung-Eun;Ahn Young-Min;Seo Young-Hoon
    • International Journal of Contents / v.2 no.1 / pp.17-21 / 2006
  • In this paper, we describe a concept-based question-answering system in which concepts, rather than keywords themselves, play an important role in both question analysis and answer extraction. Our idea is that the concepts occurring in questions of the same type are similar, and that analyzing a question according to those concepts yields more accurate answers because the semantic role of each word or phrase in the question is known. A concept frame is defined for each question type and is composed of the important concepts in that type. Currently there are 79 question types, including 34 types for person and 14 types for location. We evaluate this concept-based approach on questions whose answer is a person's name. Experimental results show that our system achieves high accuracy in answer extraction. This concept-based approach can also be used in combination with conventional approaches.
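
One way to picture a concept frame is as a per-question-type record whose slots are filled with words or phrases from the question. The sketch below is a hypothetical data structure only: the class name, the slot names, and the example question type are assumptions, since the abstract does not list the actual concepts or describe how slots are filled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ConceptFrame:
    """A question type plus the concepts expected in questions of that type."""
    question_type: str
    concepts: List[str]
    fillers: Dict[str, Optional[str]] = field(init=False)

    def __post_init__(self):
        # Every concept starts out unfilled.
        self.fillers = {c: None for c in self.concepts}

    def fill(self, concept, phrase):
        """Attach a word or phrase from the question to one concept slot."""
        if concept not in self.fillers:
            raise KeyError(f"unknown concept: {concept}")
        self.fillers[concept] = phrase

    def missing(self):
        """Concepts the question did not supply."""
        return [c for c in self.concepts if self.fillers[c] is None]


# Hypothetical frame for a person-type question such as
# "Who invented the telephone?"
frame = ConceptFrame("person:inventor", ["action", "object", "time"])
frame.fill("action", "invented")
frame.fill("object", "the telephone")
```

Once the slots are filled, answer extraction can match candidate passages against the semantic role of each filler rather than against bare keywords.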


Conceptual Extraction of Compound Korean Keywords

  • Lee, Samuel Sangkon
    • Journal of Information Processing Systems / v.16 no.2 / pp.447-459 / 2020
  • After reading a document, people form a concept of the information they consumed and combine multiple words into keywords that represent the material. With that in mind, this study proposes a more efficient keyword extraction method in which scholarly journals serve as the basis for production rules built on the conceptual information of words appearing in a document, so that author-provided keywords can be handled even when they do not appear in the body of the document. The study also presents a new way to determine the importance of each keyword while excluding non-relevant keywords. To validate the extracted keywords, titles and abstracts of journals on natural language and auditory language were collected for analysis. Comparing author-provided keywords with the keywords produced by the developed system showed that the system was highly useful, with an accuracy of up to 96%.

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.11 / pp.3991-4010 / 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q&A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when a general sentence-embedding method ignores word boundary features and local context information. These factors make the task challenging. To address these problems, a corpus on Chinese crop diseases and insect pests (CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, AgrCQS, based on mixed information extraction. Specifically, the hybrid embedding layer enriches character information and improves the model's ability to recognize word boundaries. Multi-scale local information is extracted by a multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism enhances the model's ability to fuse global information. The performance of the AgrCQS is verified on the CCDIP and on three benchmark datasets, namely AFQMC, LCQMC, and BQ. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than those of baseline systems that use no external knowledge. Additionally, the modules of the proposed method can be extracted separately and applied to other models, thus providing a reference for related research.

Arabic Words Extraction and Character Recognition from Picturesque Image Macros with Enhanced VGG-16 based Model Functionality Using Neural Networks

  • Ayed Ahmad Hamdan Al-Radaideh;Mohd Shafry bin Mohd Rahim;Wad Ghaban;Majdi Bsoul;Shahid Kamal;Naveed Abbas
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.7 / pp.1807-1822 / 2023
  • Innovation and rapidly increasing functionality in user-friendly smartphones have encouraged shutterbugs to capture picturesque image macros at work or while traveling. Formal signboards are placed for marketing purposes and carry text to attract people. Extracting and recognizing text in natural images is an emerging research problem. Compared with conventional optical character recognition (OCR), the complex background, implicit noise, lighting, and orientation of scene text photos make this problem more difficult, and Arabic scene text extraction and recognition adds further complications. The method described in this paper uses a two-phase approach to extract Arabic text, with word-boundary awareness, from scene images with varying text orientations. The first phase uses a convolutional autoencoder, and the second uses Arabic Character Segmentation (ACS) followed by traditional two-layer neural networks for recognition. This study also shows how a synthetic Arabic training dataset can be created to exemplify superimposed text in different scene images. For this purpose, a dataset of 10K cropped images containing Arabic text was created for the detection phase, and a dataset of 127K Arabic characters was created for the recognition phase. The phase-1 labels were generated from an Arabic corpus of 15K quotes and sentences. The Arabic Word Awareness Region Detection (AWARD) approach, which is highly flexible in handling complex Arabic scene text such as arbitrarily oriented, curved, or deformed text, is used to detect these texts. Experiments show that the system achieves 91.8% word segmentation accuracy and 94.2% character recognition accuracy. We believe that future researchers will advance the image processing of text images, reducing noise in scene images in any language, by enhancing the functionality of VGG-16-based models using neural networks.

Concept-based Question Analysis for Accurate Answer Extraction (정확한 해답 추출을 위한 개념 기반의 질의 분석)

  • Shin, Seung-Eun;Kang, Yu-Hwan;Ahn, Young-Min;Park, Hee-Guen;Seo, Young-Hoon
    • The Journal of the Korea Contents Association / v.7 no.1 / pp.10-20 / 2007
  • This paper describes concept-based question analysis, which analyzes concepts, treated as more important than keywords, for accurate answer extraction. Our idea is that well-defined concepts let us extract correct answers from paragraphs with different structures, because the concepts that occur in questions of the same answer type are similar. That is, we analyze the syntactic and semantic role of each word or phrase in a question in order to retrieve more relevant documents and extract more accurate answers from them. For each answer type, we define a concept frame composed of the concepts that commonly occur in that type of question, and we analyze a user's question by filling the concept frame with words or phrases. Empirical results show that our concept-based question analysis extracts more accurate answers than conventional approaches. The concept-based approach has the additional merits of being a language-universal model and of being combinable with arbitrary conventional approaches.