• Title/Summary/Keyword: Word embedding

Search Result 237, Processing Time 0.027 seconds

Detecting Improper Sentences in a News Article Using Text Mining (텍스트 마이닝을 이용한 기사 내 부적합 문단 검출 시스템)

  • Kim, Kyu-Wan;Sin, Hyun-Ju;Kim, Seon-Jin;Lee, Hyun Ah
    • 한국어정보학회:학술대회논문집
    • /
    • 2017.10a
    • /
    • pp.294-297
    • /
    • 2017
  • SNS와 스마트기기의 발전으로 온라인을 통한 뉴스 배포가 용이해지면서 악의적으로 조작된 뉴스가 급속도로 생성되어 확산되고 있다. 뉴스 조작은 다양한 형태로 이루어지는데, 이 중에서 정상적인 기사 내에 광고나 낚시성 내용을 포함시켜 독자가 의도하지 않은 정보에 노출되게 하는 형태는 독자가 해당 내용을 진짜 뉴스로 받아들이기 쉽다. 본 논문에서는 뉴스 기사 내에 포함된 문단 중에서 부적합한 문단이 포함 되었는지를 판정하기 위한 방법을 제안한다. 제안하는 방식에서는 자연어 처리에 유용한 Convolutional Neural Network(CNN)모델 중 Word2Vec과 tf-idf 알고리즘, 로지스틱 회귀를 함께 이용하여 뉴스 부적합 문단을 검출한다. 본 시스템에서는 로지스틱 회귀를 이용하여 문단의 카테고리를 분류하여 본문의 카테고리 분포도를 계산하고 Word2Vec을 이용하여 문단간의 유사도를 계산한 결과에 가중치를 부여하여 부적합 문단을 검출한다.

  • PDF

A Study on Patent Literature Classification Using Distributed Representation of Technical Terms (기술용어 분산표현을 활용한 특허문헌 분류에 관한 연구)

  • Choi, Yunsoo;Choi, Sung-Pil
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.53 no.2
    • /
    • pp.179-199
    • /
    • 2019
  • In this paper, we propose optimal methodologies for classifying patent literature by examining various feature extraction methods, machine learning and deep learning models, and provide optimal performance through experiments. We compared the traditional BoW method and a distributed representation method (word embedding vector) as a feature extraction, and compared the morphological analysis and multi gram as the method of constructing the document collection. In addition, classification performance was verified using traditional machine learning model and deep learning model. Experimental results show that the best performance is achieved when we apply the deep learning model with distributed representation and morphological analysis based feature extraction. In Section, Class and Subclass classification experiments, We improved the performance by 5.71%, 18.84% and 21.53%, respectively, compared with traditional classification methods.

A Study on Yin Yang, Wuxing, Mutual Collision, and Zangfu Combination of the Ten Heavenly Stems (십간(十干)의 음양(陰陽), 오행(五行), 상충(相沖), 장부배합(臟腑配合)에 관(關)한 연구)

  • Yoo, Young-Joon;Yun, Chang-Yeol
    • Journal of Korean Medical classics
    • /
    • v.32 no.2
    • /
    • pp.17-31
    • /
    • 2019
  • Objectives : Understanding the Ten Stems and Twelve Branches is necessary to grasp the laws of change in Heaven and Earth. Methods : Based on relevant contents in East Asian classics, the Yin Yang, Sibling Wuxing, Husband-Wife Wuxing combinations as well as Mutual Collision and Zangfu combination were examined. Results & Conclusion : Yin Yang combination of the Ten Stems are divided according to odd/evenness. The Sibling Wuxing combination is categorized according to one life cycle of vegetation, resulting in Jia Yi Wood, Bing Ding Fire, Wu Ji Earth, Geng Xin Metal, Ren Gui Water. The Husband-Wife Wuxing combination of the Ten Stems are Jia Ji Earth, Yi Geng Metal, Bing Xin Water, Ding Ren Wood, Wu Gui Fire, which corresponds to the principles of the Duihuazuoyong Theory. Within the Husband-Wife Wuxing combination lies three principles which are Yin Yang combination, Mutual Restraining combination, and the Yang Stem restraining the Yin Stem. The Mutual Collision of the Ten Stems are Jia and Geng, Yi and Xin, Ren and Bing, Gui and Ding against each other. In matching Zangfu to the Ten Stems, Jia matches with Gallbladder, Yi matches with Liver; Bing matches with Small Intestine, Ding matches with Heart; Wu matches with Stomach, Ji matches with Spleen; Geng matches with Large Intestine, Xin matches with Lung; Ren matches with Bladder, Gui matches with Kidney. : When the adjacent vectors are extracted, the count-based word embedding method derives the main herbs that are frequently used in conjunction with each other. On the other hand, in the prediction-based word embedding method, the synonyms of the herbs were derived.

Development of a Fake News Detection Model Using Text Mining and Deep Learning Algorithms (텍스트 마이닝과 딥러닝 알고리즘을 이용한 가짜 뉴스 탐지 모델 개발)

  • Dong-Hoon Lim;Gunwoo Kim;Keunho Choi
    • Information Systems Review
    • /
    • v.23 no.4
    • /
    • pp.127-146
    • /
    • 2021
  • Fake news isexpanded and reproduced rapidly regardless of their authenticity by the characteristics of modern society, called the information age. Assuming that 1% of all news are fake news, the amount of economic costs is reported to about 30 trillion Korean won. This shows that the fake news isvery important social and economic issue. Therefore, this study aims to develop an automated detection model to quickly and accurately verify the authenticity of the news. To this end, this study crawled the news data whose authenticity is verified, and developed fake news prediction models using word embedding (Word2Vec, Fasttext) and deep learning algorithms (LSTM, BiLSTM). Experimental results show that the prediction model using BiLSTM with Word2Vec achieved the best accuracy of 84%.

Effective Text Question Analysis for Goal-oriented Dialogue (목적 지향 대화를 위한 효율적 질의 의도 분석에 관한 연구)

  • Kim, Hakdong;Go, Myunghyun;Lim, Heonyeong;Lee, Yurim;Jee, Minkyu;Kim, Wonil
    • Journal of Broadcast Engineering
    • /
    • v.24 no.1
    • /
    • pp.48-57
    • /
    • 2019
  • The purpose of this study is to understand the intention of the inquirer from the single text type question in Goal-oriented dialogue. Goal-Oriented Dialogue system means a dialogue system that satisfies the user's specific needs via text or voice. The intention analysis process is a step of analysing the user's intention of inquiry prior to the answer generation, and has a great influence on the performance of the entire Goal-Oriented Dialogue system. The proposed model was used for a daily chemical products domain and Korean text data related to the domain was used. The analysis is divided into a speech-act which means independent on a specific field concept-sequence and which means depend on a specific field. We propose a classification method using the word embedding model and the CNN as a method for analyzing speech-act and concept-sequence. The semantic information of the word is abstracted through the word embedding model, and concept-sequence and speech-act classification are performed through the CNN based on the semantic information of the abstract word.

Comparative study of text representation and learning for Persian named entity recognition

  • Pour, Mohammad Mahdi Abdollah;Momtazi, Saeedeh
    • ETRI Journal
    • /
    • v.44 no.5
    • /
    • pp.794-804
    • /
    • 2022
  • Transformer models have had a great impact on natural language processing (NLP) in recent years by realizing outstanding and efficient contextualized language models. Recent studies have used transformer-based language models for various NLP tasks, including Persian named entity recognition (NER). However, in complex tasks, for example, NER, it is difficult to determine which contextualized embedding will produce the best representation for the tasks. Considering the lack of comparative studies to investigate the use of different contextualized pretrained models with sequence modeling classifiers, we conducted a comparative study about using different classifiers and embedding models. In this paper, we use different transformer-based language models tuned with different classifiers, and we evaluate these models on the Persian NER task. We perform a comparative analysis to assess the impact of text representation and text classification methods on Persian NER performance. We train and evaluate the models on three different Persian NER datasets, that is, MoNa, Peyma, and Arman. Experimental results demonstrate that XLM-R with a linear layer and conditional random field (CRF) layer exhibited the best performance. This model achieved phrase-based F-measures of 70.04, 86.37, and 79.25 and word-based F scores of 78, 84.02, and 89.73 on the MoNa, Peyma, and Arman datasets, respectively. These results represent state-of-the-art performance on the Persian NER task.

Encoding Dictionary Feature for Deep Learning-based Named Entity Recognition

  • Ronran, Chirawan;Unankard, Sayan;Lee, Seungwoo
    • International Journal of Contents
    • /
    • v.17 no.4
    • /
    • pp.1-15
    • /
    • 2021
  • Named entity recognition (NER) is a crucial task for NLP, which aims to extract information from texts. To build NER systems, deep learning (DL) models are learned with dictionary features by mapping each word in the dataset to dictionary features and generating a unique index. However, this technique might generate noisy labels, which pose significant challenges for the NER task. In this paper, we proposed DL-dictionary features, and evaluated them on two datasets, including the OntoNotes 5.0 dataset and our new infectious disease outbreak dataset named GFID. We used (1) a Bidirectional Long Short-Term Memory (BiLSTM) character and (2) pre-trained embedding to concatenate with (3) our proposed features, named the Convolutional Neural Network (CNN), BiLSTM, and self-attention dictionaries, respectively. The combined features (1-3) were fed through BiLSTM - Conditional Random Field (CRF) to predict named entity classes as outputs. We compared these outputs with other predictions of the BiLSTM character, pre-trained embedding, and dictionary features from previous research, which used the exact matching and partial matching dictionary technique. The findings showed that the model employing our dictionary features outperformed other models that used existing dictionary features. We also computed the F1 score with the GFID dataset to apply this technique to extract medical or healthcare information.

Researcher and Research Area Recommendation System for Promoting Convergence Research Using Text Mining and Messenger UI (텍스트 마이닝 방법론과 메신저UI를 활용한 융합연구 촉진을 위한 연구자 및 연구 분야 추천 시스템의 제안)

  • Yang, Nak-Yeong;Kim, Sung-Geun;Kang, Ju-Young
    • The Journal of Information Systems
    • /
    • v.27 no.4
    • /
    • pp.71-96
    • /
    • 2018
  • Purpose Recently, social interest in the convergence research is at its peak. However, contrary to the keen interest in convergence research, an infrastructure that makes it easier to recruit researchers from other fields is not yet well established, which is why researchers are having considerable difficulty in carrying out real convergence research. In this study, we implemented a researcher recommendation system that helps researchers who want to collaborate easily recruit researchers from other fields, and we expect it to serve as a springboard for growth in the convergence research field. Design/methodology/approach In this study, we implemented a system that recommends proper researchers when users enter keyword in the field of research that they want to collaborate using word embedding techniques, word2vec. In addition, we also implemented function of keyword suggestions by using keywords drawn from LDA Topicmodeling Algorithm. Finally, the UI of the researcher recommendation system was completed by utilizing the collaborative messenger Slack to facilitate immediate exchange of information with the recommended researchers and to accommodate various applications for collaboration. Findings In this study, we validated the completed researcher recommendation system by ensuring that the list of researchers recommended by entering a specific keyword is accurate and that words learned as a similar word with a particular researcher match the researcher's field of research. The results showed 85.89% accuracy in the former, and in the latter case, mostly, the words drawn as similar words were found to match the researcher's field of research, leading to excellent performance of the researcher recommendation system.

The Research Trends and Keywords Modeling of Shoulder Rehabilitation using the Text-mining Technique (텍스트 마이닝 기법을 활용한 어깨 재활 연구분야 동향과 키워드 모델링)

  • Kim, Jun-hee;Jung, Sung-hoon;Hwang, Ui-jae
    • Journal of the Korean Society of Physical Medicine
    • /
    • v.16 no.2
    • /
    • pp.91-100
    • /
    • 2021
  • PURPOSE: This study analyzed the trends and characteristics of shoulder rehabilitation research through keyword analysis, and their relationships were modeled using text mining techniques. METHODS: Abstract data of 10,121 articles in which abstracts were registered on the MEDLINE of PubMed with 'shoulder' and 'rehabilitation' as keywords were collected using python. By analyzing the frequency of words, 10 keywords were selected in the order of the highest frequency. Word-embedding was performed using the word2vec technique to analyze the similarity of words. In addition, the groups were classified and analyzed based on the distance (cosine similarity) through the t-SNE technique. RESULTS: The number of studies related to shoulder rehabilitation is increasing year after year, keywords most frequently used in relation to shoulder rehabilitation studies are 'patient', 'pain', and 'treatment'. The word2vec results showed that the words were highly correlated with 12 keywords from studies related to shoulder rehabilitation. Furthermore, through t-SNE, the keywords of the studies were divided into 5 groups. CONCLUSION: This study was the first study to model the keywords and their relationships that make up the abstracts of research in the MEDLINE of Pub Med related to 'shoulder' and 'rehabilitation' using text-mining techniques. The results of this study will help increase the diversifying research topics of shoulder rehabilitation studies to be conducted in the future.

Web Attack Classification Model Based on Payload Embedding Pre-Training (페이로드 임베딩 사전학습 기반의 웹 공격 분류 모델)

  • Kim, Yeonsu;Ko, Younghun;Euom, Ieckchae;Kim, Kyungbaek
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.30 no.4
    • /
    • pp.669-677
    • /
    • 2020
  • As the number of Internet users exploded, attacks on the web increased. In addition, the attack patterns have been diversified to bypass existing defense techniques. Traditional web firewalls are difficult to detect attacks of unknown patterns.Therefore, the method of detecting abnormal behavior by artificial intelligence has been studied as an alternative. Specifically, attempts have been made to apply natural language processing techniques because the type of script or query being exploited consists of text. However, because there are many unknown words in scripts and queries, natural language processing requires a different approach. In this paper, we propose a new classification model which uses byte pair encoding (BPE) technology to learn the embedding vector, that is often used for web attack payloads, and uses an attention mechanism-based Bi-GRU neural network to extract a set of tokens that learn their order and importance. For major web attacks such as SQL injection, cross-site scripting, and command injection attacks, the accuracy of the proposed classification method is about 0.9990 and its accuracy outperforms the model suggested in the previous study.