• Title/Summary/Keyword: Morpheme embedding

Automatic Bias Classification of Political News Articles by using Morpheme Embedding and SVM (형태소 임베딩과 SVM을 이용한 뉴스 기사 정치적 편향성의 자동 분류)

  • Cho, Dan-Bi;Lee, Hyun-Young;Park, Ji-Hoon;Kang, Seung-Shik
    • Proceedings of the Korea Information Processing Society Conference / 2020.05a / pp.451-454 / 2020
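  • To classify political bias using deep learning techniques, we collected newspaper articles and built training data for machine learning. The training data were built as binary classification data for political orientation from the news of six media outlets that represent conservative and progressive leanings. To collect the articles, we selected 15 keywords closely related to political orientation from recent issues and gathered news articles about them. As a result, we built 11,584 items of training and test data and designed a machine learning model for political bias classification. For training and experiments, morpheme-level embeddings were extended to sentence and document embeddings, and a political bias classification experiment using an SVM (Support Vector Machine) achieved 75% accuracy.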
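The pipeline in this abstract (morpheme embeddings pooled into a document vector, then a binary SVM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embedding table, the toy morphemes and labels, the mean pooling, and the subgradient-descent trainer are all assumptions made for the example.

```python
import random

# Hypothetical 2-d morpheme embeddings; a real system would learn these from a corpus.
EMBEDDINGS = {
    "감세": [1.0, 0.1],
    "규제완화": [0.9, 0.2],
    "복지": [0.1, 1.0],
    "노동": [0.2, 0.9],
    "정책": [0.5, 0.5],
}

def document_vector(morphemes):
    """Mean-pool the embeddings of known morphemes into one document vector."""
    vectors = [EMBEDDINGS[m] for m in morphemes if m in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def score(w, b, x):
    """Signed distance-like score of x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(xs, ys, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Linear SVM trained by stochastic subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * score(w, b, xs[i])
            # L2 regularisation shrinks the weights on every step.
            w = [wj * (1 - lr * reg) for wj in w]
            if margin < 1:  # inside the margin: move toward this example
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * ys[i]
    return w, b

def predict(w, b, x):
    return 1 if score(w, b, x) >= 0 else -1

# Toy training set: (morphemes of a document, arbitrary class label +1 / -1).
training_docs = [
    (["감세", "정책"], 1),
    (["규제완화", "감세"], 1),
    (["복지", "정책"], -1),
    (["노동", "복지"], -1),
]
xs = [document_vector(doc) for doc, _ in training_docs]
ys = [label for _, label in training_docs]
w, b = train_linear_svm(xs, ys)
print(predict(w, b, document_vector(["감세"])))
```

Mean pooling is only one way to extend morpheme vectors to documents; the abstract does not say which composition the authors used.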

A Reranking Model for Korean Morphological Analysis Based on Sequence-to-Sequence Model (Sequence-to-Sequence 모델 기반으로 한 한국어 형태소 분석의 재순위화 모델)

  • Choi, Yong-Seok;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.7 no.4 / pp.121-128 / 2018
  • A Korean morphological analyzer can adopt a sequence-to-sequence (seq2seq) model, which can generate an output sequence whose length differs from that of the input. In general, a seq2seq-based Korean morphological analyzer takes a syllable-unit sequence as input and outputs a syllable-unit sequence. Syllable-based morphological analysis handles unknown words easily, but it ignores morpheme-level information. In this paper, we propose a reranking model, used as a post-processor for the seq2seq model, that improves the accuracy of morphological analysis. The seq2seq-based morphological analyzer generates K candidate results by beam search. The reranking model exploits morpheme-unit embedding information as well as morpheme n-grams to reorder the K results. The experimental results show that the reranking model improves the F1 score by 1.17% compared with the original seq2seq model.
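The reranking idea can be illustrated with a small sketch: each of the K beam candidates keeps its seq2seq score, and a morpheme-level score is added before sorting. Everything concrete here is assumed for illustration: the bigram table, the interpolation weight, and the candidate scores are made up, and the morpheme-embedding features mentioned in the abstract are left out.

```python
# Hypothetical morpheme-bigram log-probabilities (toy values, not from the paper).
BIGRAM_LOGPROB = {
    ("<s>", "나/NP"): -0.5,
    ("나/NP", "는/JX"): -0.3,
    ("는/JX", "</s>"): -0.4,
    ("<s>", "날/VV"): -2.0,
    ("날/VV", "는/ETM"): -0.5,
    ("는/ETM", "</s>"): -3.0,
}
UNSEEN_LOGPROB = -5.0  # back-off value for bigrams missing from the table

def morpheme_lm_score(morphemes):
    """Sum bigram log-probabilities over the morpheme sequence with boundary markers."""
    padded = ["<s>"] + list(morphemes) + ["</s>"]
    return sum(
        BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB)
        for pair in zip(padded, padded[1:])
    )

def rerank(candidates, weight=0.5):
    """Reorder beam candidates by seq2seq score plus weighted morpheme n-gram score."""
    def combined(candidate):
        return candidate["seq2seq_score"] + weight * morpheme_lm_score(candidate["morphemes"])
    return sorted(candidates, key=combined, reverse=True)

# Two beam-search outputs for the same input; the seq2seq model prefers the first.
beam = [
    {"morphemes": ["날/VV", "는/ETM"], "seq2seq_score": -1.0},
    {"morphemes": ["나/NP", "는/JX"], "seq2seq_score": -1.2},
]
best = rerank(beam)[0]
print(best["morphemes"])
```

With these toy numbers the morpheme-level score overturns the seq2seq ranking, which is the behaviour a reranker is meant to provide when syllable-level decoding misses morpheme-level evidence.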

Predicate Recognition Method using BiLSTM Model and Morpheme Features (BiLSTM 모델과 형태소 자질을 이용한 서술어 인식 방법)

  • Nam, Chung-Hyeon;Jang, Kyung-Sik
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.1 / pp.24-29 / 2022
  • Semantic role labeling, which is used in various natural language processing applications such as information extraction and question answering systems, is the task of identifying the arguments of a given sentence and predicate. The predicates used as input to semantic role labeling are extracted from lexical analysis results such as POS tagging, but this approach cannot capture every linguistic pattern, because Korean predicates take various forms depending on the meaning of the sentence. In this paper, we propose a Korean predicate recognition method that uses a neural network model with pre-trained embedding models and lexical features. The experiments compare performance across model hyperparameters and with or without the embedding models and lexical features. As a result, we confirm that the proposed neural network model achieves a performance of 92.63%.

Measuring Sentence Similarity using Morpheme Embedding Model and GRU Encoder for Question and Answering System (질의응답 시스템에서 형태소임베딩 모델과 GRU 인코더를 이용한 문장유사도 측정)

  • Lee, DongKeon;Oh, KyoJoong;Choi, Ho-Jin;Heo, Jeong
    • 한국어정보학회:학술대회논문집 / 2016.10a / pp.128-133 / 2016
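  • Sentence similarity analysis is an important technique that can be used to automate document evaluation. Recently, encoder-decoder language models based on recurrent neural networks have achieved remarkable results in machine learning. In this paper, we present a Korean morpheme embedding model and a GRU (Gated Recurrent Unit)-based encoder, use them to train a language model on the Korean Wikipedia corpus, and propose a method for measuring sentence similarity so that a Korean question answering system can find evidence sentences from which the answer to a question can be inferred. Using the proposed morpheme embedding model and GRU-based encoding model, we obtained better sentence similarity results than with the existing character embedding method, and we found that the approach can also be usefully applied in question answering systems.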
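As a rough illustration of the approach, the sketch below runs a sequence of morpheme vectors through a single GRU layer and compares the final hidden states of two sentences by cosine similarity. It is a sketch under stated assumptions: the morpheme vectors are invented, the GRU weights are random and untrained (so the similarities are not meaningful), and the names `TinyGRUEncoder` and `rank_evidence` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

class TinyGRUEncoder:
    """Single-layer GRU with random, untrained weights and no bias terms."""

    def __init__(self, emb_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {gate: rand_matrix(hidden_dim, emb_dim) for gate in "zrh"}
        self.U = {gate: rand_matrix(hidden_dim, hidden_dim) for gate in "zrh"}

    def _pre(self, gate, x, h):
        return [a + b for a, b in zip(matvec(self.W[gate], x), matvec(self.U[gate], h))]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._pre("z", x, h)]
        r = [sigmoid(v) for v in self._pre("r", x, h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._pre("h", x, reset_h)]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def encode(self, vectors):
        h = [0.0] * self.hidden_dim
        for x in vectors:
            h = self.step(x, h)
        return h

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

# Hypothetical 3-d morpheme embeddings.
MORPHEME_VECTORS = {
    "한국": [0.7, 0.3, 0.0], "수도": [0.8, 0.2, 0.1], "서울": [0.9, 0.1, 0.0],
    "김치": [0.1, 0.9, 0.2], "맛있": [0.2, 0.8, 0.1], "다": [0.1, 0.0, 0.8],
}

def encode_sentence(encoder, morphemes):
    return encoder.encode([MORPHEME_VECTORS[m] for m in morphemes if m in MORPHEME_VECTORS])

def rank_evidence(encoder, question, candidates):
    """Sort candidate sentences by cosine similarity to the question encoding."""
    q = encode_sentence(encoder, question)
    scored = [(cosine(q, encode_sentence(encoder, c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

encoder = TinyGRUEncoder(emb_dim=3, hidden_dim=4)
question = ["한국", "수도"]
ranking = rank_evidence(encoder, question, [["서울", "수도", "다"], ["김치", "맛있", "다"]])
```

In a trained system the encoder weights and the morpheme embeddings would come from language-model training on a corpus such as the Korean Wikipedia data mentioned in the abstract.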

Measuring Sentence Similarity using Morpheme Embedding Model and GRU Encoder for Question and Answering System (질의응답 시스템에서 형태소임베딩 모델과 GRU 인코더를 이용한 문장유사도 측정)

  • Lee, DongKeon;Oh, KyoJoong;Choi, Ho-Jin;Heo, Jeong
    • Annual Conference on Human and Language Technology / 2016.10a / pp.128-133 / 2016
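  • Sentence similarity analysis is an important technique that can be used to automate document evaluation. Recently, encoder-decoder language models based on recurrent neural networks have achieved remarkable results in machine learning. In this paper, we present a Korean morpheme embedding model and a GRU (Gated Recurrent Unit)-based encoder, use them to train a language model on the Korean Wikipedia corpus, and propose a method for measuring sentence similarity so that a Korean question answering system can find evidence sentences from which the answer to a question can be inferred. Using the proposed morpheme embedding model and GRU-based encoding model, we obtained better sentence similarity results than with the existing character embedding method, and we found that the approach can also be usefully applied in question answering systems.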

Product Evaluation Criteria Extraction through Online Review Analysis: Using LDA and k-Nearest Neighbor Approach (온라인 리뷰 분석을 통한 상품 평가 기준 추출: LDA 및 k-최근접 이웃 접근법을 활용하여)

  • Lee, Ji Hyeon;Jung, Sang Hyung;Kim, Jun Ho;Min, Eun Joo;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.97-117 / 2020
  • Product evaluation criteria are indicators that describe the attributes or values of products and enable users or manufacturers to measure and understand those products. When companies analyze their products or compare them with competitors' products, appropriate criteria must be selected for objective evaluation. The criteria should reflect the product features that consumers considered when they purchased, used, and evaluated the products. However, current evaluation criteria do not reflect how consumer opinions differ from product to product. Previous studies tried to use online reviews from e-commerce sites, which reflect consumer opinions, to extract product features and topics and use them as evaluation criteria. However, these approaches still produce criteria that are irrelevant to the products, because extracted or improper words are not refined. To overcome this limitation, this research proposes an LDA-k-NN model, which extracts candidate criteria words from online reviews using LDA and refines them with the k-nearest neighbor method. The proposed approach starts with a preparation phase consisting of six steps. First, review data are collected from e-commerce websites. Most e-commerce websites classify their items into high-level, middle-level, and low-level categories. Review data for the preparation phase are gathered from each middle-level category and later merged to represent a single high-level category. Next, nouns, adjectives, adverbs, and verbs are extracted from the reviews using part-of-speech information from a morpheme analysis module. After preprocessing, LDA produces the words for each topic in the reviews, and only the nouns among the topic words are chosen as potential criteria words. Then, the words are tagged according to whether they can serve as criteria for each middle-level category. Next, every tagged word is vectorized with a pre-trained word embedding model. Finally, a k-nearest neighbor case-based approach is used to classify each word by its tag.
After the preparation phase is set up, the criteria extraction phase is conducted on low-level categories. This phase starts with crawling reviews in the corresponding low-level category. The same preprocessing as in the preparation phase is conducted using the morpheme analysis module and LDA. Candidate criteria words are extracted by taking nouns from the data and are vectorized with the pre-trained word embedding model. Finally, evaluation criteria are extracted by refining the candidate words using the k-nearest neighbor approach and the reference proportion of each word in the word set. To evaluate the performance of the proposed model, an experiment was conducted with reviews from '11st', one of the biggest e-commerce companies in Korea. The review data came from the 'Electronics/Digital' section, one of the high-level categories of 11st. The proposed model was compared with three alternatives: the actual criteria used by 11st, a model that extracts nouns with the morpheme analysis module and refines them by word frequency, and a model that extracts nouns from LDA topics and refines them by word frequency. The evaluation task was to predict the evaluation criteria of 10 low-level categories with the proposed model and the three alternatives. The criteria words extracted by each model were combined into a single word set, which was used in survey questionnaires. In the survey, respondents chose every item they considered an appropriate criterion for each category. Each model scored a point when a chosen word had been extracted by that model. The proposed model scored higher than the other models in 8 of the 10 low-level categories. Paired t-tests on the scores of each model confirmed that the proposed model performs better in 26 of 30 tests. In addition, the proposed model was the best model in terms of accuracy.
This research proposes an evaluation criteria extraction method that combines topic extraction using LDA with refinement by the k-nearest neighbor approach. The method overcomes the limits of previous dictionary-based models and frequency-based refinement models. This study can contribute to improving review analysis for deriving business insights in the e-commerce market.
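The k-nearest neighbor refinement step can be sketched as follows: words tagged in the preparation phase act as labeled cases, and each candidate noun from a low-level category is kept only if most of its nearest labeled neighbors are criteria words. The 2-d vectors, the tag names, the Euclidean distance, and k=3 are assumptions for illustration; the LDA step and the reference-proportion filter from the abstract are omitted.

```python
import math
from collections import Counter

# Hypothetical labeled cases from the preparation phase: (word vector, tag).
LABELED_WORDS = [
    ([0.90, 0.10], "criterion"),
    ([0.80, 0.20], "criterion"),
    ([0.85, 0.30], "criterion"),
    ([0.10, 0.90], "not_criterion"),
    ([0.20, 0.80], "not_criterion"),
    ([0.15, 0.70], "not_criterion"),
]

def knn_label(vector, labeled, k=3):
    """Return the majority tag among the k labeled vectors closest to `vector`."""
    nearest = sorted(labeled, key=lambda item: math.dist(vector, item[0]))[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]

def refine_candidates(candidates, labeled, k=3):
    """Keep only the candidate words that k-NN classifies as criteria."""
    return [
        word for word, vector in candidates.items()
        if knn_label(vector, labeled, k) == "criterion"
    ]

# Hypothetical candidate nouns with vectors from a pre-trained embedding model.
candidates = {
    "weight": [0.70, 0.20],
    "yesterday": [0.20, 0.90],
    "screen": [0.90, 0.20],
}
print(refine_candidates(candidates, LABELED_WORDS))
```

Using an odd k with two tags avoids tied votes; a cosine distance would be an equally plausible choice for word embeddings.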

Multi-Dimensional Analysis Method of Product Reviews for Market Insight (마켓 인사이트를 위한 상품 리뷰의 다차원 분석 방안)

  • Park, Jeong Hyun;Lee, Seo Ho;Lim, Gyu Jin;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.2 / pp.57-78 / 2020
  • With the development of the Internet, consumers can easily check product information through e-commerce. Product reviews consulted during purchasing are based on user experience, so consumers act as producers of information as well as readers of it. From the consumer's perspective, reviews can make purchasing decisions more efficient; from the seller's perspective, they can help develop products and strengthen competitiveness. However, it takes a lot of time and effort to read the vast number of reviews that e-commerce sites offer for the products a consumer wants to compare and to grasp both the overall assessment and the assessment dimensions that the consumer considers important. This is because product reviews are unstructured information, and the sentiment and assessment dimension of a review are difficult to read at a glance. For example, consumers who want to purchase a laptop would like to check how the compared products are assessed on each dimension, such as performance, weight, delivery, speed, and design. In this paper, we therefore propose a method that automatically generates multi-dimensional product assessment scores from the reviews of the products to be compared. The method consists of two phases: a pre-preparation phase and an individual product scoring phase. In the pre-preparation phase, a dimension classification model and a sentiment analysis model are built from the reviews of a large product category. By combining word embedding and association analysis, the dimension classification model compensates for a limitation of existing studies, in which word embedding methods for finding the relevance between dimensions and words consider only the distance between words in sentences.
The sentiment analysis model is a CNN trained on data tagged as positive or negative at the phrase level for accurate polarity detection. The individual product scoring phase then applies these pre-built models to reviews split into phrases. Multi-dimensional assessment scores are obtained by grouping the phrases judged to describe each dimension and aggregating them by assessment dimension according to their proportion. In the experiment, approximately 260,000 reviews of the large product category were collected to build the dimension classification model and the sentiment analysis model. In addition, reviews of laptops from companies S and L sold on an e-commerce site were collected and used as experimental data. The dimension classification model classified individual product reviews, broken down into phrases, into six assessment dimensions and combined the existing word embedding method with an association analysis that indicates the frequency between words and dimensions. Combining word embedding and association analysis increased the accuracy of the model by 13.7%. The sentiment analysis model analyzed assessments more closely when it was trained at the phrase level rather than the sentence level; its accuracy was 29.4% higher than that of the sentence-based model. Through this study, both sellers and consumers can expect more efficient decision making in purchasing and product development, because they can compare products along multiple dimensions. In addition, text reviews, which are unstructured data, were transformed into objective values such as frequency and morpheme, and they were analyzed together using word embedding and association analysis to make the multi-dimensional analysis more objective and precise.
This will be an attractive analysis model, not only because it enables more effective service deployment in an evolving and fiercely competitive e-commerce market, but also because it can satisfy both sellers and customers.
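The scoring phase can be pictured with a short sketch that takes phrases already labeled with an assessment dimension and a polarity and aggregates them per dimension. The abstract does not give the exact aggregation formula, so the share of positive phrases used here is an assumption, as are the sample labels; in the paper the two labels would come from the dimension classification model and the CNN sentiment model.

```python
from collections import defaultdict

def dimension_scores(labeled_phrases):
    """Return, per dimension, the fraction of its phrases tagged positive."""
    counts = defaultdict(lambda: {"pos": 0, "total": 0})
    for dimension, polarity in labeled_phrases:
        counts[dimension]["total"] += 1
        if polarity == "pos":
            counts[dimension]["pos"] += 1
    return {
        dimension: c["pos"] / c["total"]
        for dimension, c in counts.items()
    }

# Hypothetical phrase-level predictions for one laptop: (dimension, polarity).
phrases = [
    ("performance", "pos"),
    ("performance", "pos"),
    ("performance", "neg"),
    ("weight", "neg"),
    ("design", "pos"),
]
scores = dimension_scores(phrases)
for dimension, value in scores.items():
    print(f"{dimension}: {value:.2f}")
```

Running the same aggregation on two products gives directly comparable per-dimension values, which is the kind of side-by-side comparison the abstract describes.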

Automatic Word Spacing of the Korean Sentences by Using End-to-End Deep Neural Network (종단 간 심층 신경망을 이용한 한국어 문장 자동 띄어쓰기)

  • Lee, Hyun Young;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering / v.8 no.11 / pp.441-448 / 2019
  • Previous research on automatic spacing of Korean sentences has corrected spacing errors by using n-gram-based statistical techniques or a morpheme analyzer to insert blanks at word boundaries. In this paper, we propose end-to-end automatic word spacing using a deep neural network. The automatic word spacing problem can be defined as a tag classification problem at the syllable level rather than the word level. For contextual representation between syllables, a Bi-LSTM encodes the dependency relationships between syllables into a fixed-length vector in a continuous vector space using forward and backward LSTM cells. To perform automatic word spacing, the fixed-length contextual vector produced by the Bi-LSTM is classified into an auto-spacing tag (B or I), and a blank is inserted in front of each B tag. For tag classification, we build three types of classification networks: a feedforward neural network, a neural network language model, and a linear-chain CRF. To compare the models, we measure automatic word spacing performance for each of the three classification networks. Of these, the linear-chain CRF shows better performance than the other models. We used the KCC150 corpus as training and test data.
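The B/I tagging formulation can be made concrete with a small sketch: correctly spaced text is converted into per-syllable tags (the training target), and predicted tags are turned back into spaced text by inserting a blank before every B except the first. The function names are hypothetical, and the tagger itself (Bi-LSTM plus a classification layer) is not shown.

```python
def tags_from_spaced(text):
    """Convert spaced text into (syllables, tags): B starts a word, I continues it."""
    syllables, tags = [], []
    at_word_start = True
    for ch in text:
        if ch == " ":
            at_word_start = True
            continue
        syllables.append(ch)
        tags.append("B" if at_word_start else "I")
        at_word_start = False
    return syllables, tags

def apply_tags(syllables, tags):
    """Rebuild spaced text by inserting a blank in front of each non-initial B tag."""
    pieces = []
    for index, (syllable, tag) in enumerate(zip(syllables, tags)):
        if tag == "B" and index > 0:
            pieces.append(" ")
        pieces.append(syllable)
    return "".join(pieces)

syllables, tags = tags_from_spaced("나는 학교에 간다")
print(tags)
print(apply_tags(syllables, tags))
```

Because the two functions are inverses on normally spaced text, a model only needs to predict one tag per syllable of the unspaced input to recover the spacing.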