• Title/Summary/Keyword: skip-gram

Sentence model based subword embeddings for a dialog system

  • Chung, Euisok; Kim, Hyun Woo; Song, Hwa Jeon
    • ETRI Journal / v.44 no.4 / pp.599-612 / 2022
  • This study focuses on improving a word embedding model to enhance the performance of downstream tasks such as those of dialog systems. To improve traditional word embedding models such as skip-gram, it is critical to refine the word features and expand the context model. In this paper, we approach the word model from the perspective of subword embedding and attempt to extend the context model by integrating various sentence models. Our proposed sentence model is a subword-based skip-thought model that integrates self-attention and relative position encoding techniques. We also propose a clustering-based dialog model for downstream task verification and evaluate its relationship with the sentence-model-based subword embedding technique. The proposed subword embedding method produces better results than previous methods on word and sentence similarity evaluations. In addition, the downstream task verification, a clustering-based dialog system, demonstrates an improvement of up to 4.86% over the FastText results of previous research.
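A minimal sketch of the subword skip-gram idea this paper builds on, using gensim's FastText as a stand-in; the paper's sentence-model extension (skip-thought with self-attention and relative position encoding) is far more elaborate and is not reproduced here. The toy corpus and all hyperparameters are assumptions.

```python
# Subword-based skip-gram via FastText: each word vector is composed from
# character n-gram vectors, so rare and unseen words still get embeddings.
from gensim.models import FastText

corpus = [
    ["how", "can", "i", "reset", "my", "password"],
    ["please", "reset", "the", "password", "for", "me"],
    ["what", "time", "does", "the", "store", "open"],
]

# sg=1 selects skip-gram; min_n/max_n set the character n-gram (subword) range.
model = FastText(
    sentences=corpus,
    vector_size=64,
    window=3,
    min_count=1,
    sg=1,
    min_n=3,
    max_n=6,
    epochs=50,
)

print(model.wv.most_similar("reset", topn=3))
print(model.wv["resetting"][:5])  # out-of-vocabulary word, composed from n-grams
```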

A Word Embedding used Word Sense and Feature Mirror Model (단어 의미와 자질 거울 모델을 이용한 단어 임베딩)

  • Lee, JuSang; Shin, JoonChoul; Ock, CheolYoung
    • KIISE Transactions on Computing Practices / v.23 no.4 / pp.226-231 / 2017
  • Word representation, an important area in natural language processing (NLP) that uses machine learning, is a method of representing a word not as raw text but as a distinguishable symbol. Existing word embedding methods employ large corpora so that words appearing in similar contexts are positioned near one another. However, corpus-based word embedding requires several corpora, because performance depends on word occurrence frequency and the growing number of words. In this paper, word embedding is performed using dictionary definitions and semantic relationship information (hypernyms and antonyms). Words are trained using the feature mirror model (FMM), a modified skip-gram (Word2Vec). Words with similar senses obtain similar vectors, and the vectors of antonymous words can also be distinguished.
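This is not the feature mirror model itself, only a hedged sketch of the underlying idea: train skip-gram-style vectors from dictionary glosses instead of a large corpus, so a headword co-occurs with its defining features. The tiny glossary below is invented for illustration.

```python
from gensim.models import Word2Vec

glossary = {
    "dog":  ["domestic", "animal", "that", "barks"],
    "cat":  ["domestic", "animal", "that", "meows"],
    "bark": ["sound", "made", "by", "a", "dog"],
}

# Treat "headword + definition" as one training sentence, so the headword
# is embedded next to the features that define it.
sentences = [[head] + gloss for head, gloss in glossary.items()]

model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1, epochs=200)

# Words defined by similar features should end up with similar vectors.
print(model.wv.similarity("dog", "cat"))
```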

SMS Text Messages Filtering using Word Embedding and Deep Learning Techniques (워드 임베딩과 딥러닝 기법을 이용한 SMS 문자 메시지 필터링)

  • Lee, Hyun Young; Kang, Seung Shik
    • Smart Media Journal / v.7 no.4 / pp.24-29 / 2018
  • Text analysis for natural language processing in deep learning represents words in vector form through word embedding. In this paper, we propose a method of constructing document vectors and classifying text messages as spam or normal using word embedding and deep learning. Automatic word spacing applied in preprocessing ensures that words with similar contexts are represented adjacently in vector space. In addition, the method accounts for intentional word-formation errors with non-alphabetic or unusual characters, which are designed to evade spam filters. Two embedding algorithms, CBOW and skip-gram, are used to produce the sentence vectors, and the performance and accuracy of the deep-learning-based spam filter model are measured against those of SVM Light.
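A hedged sketch of the pipeline described above: build message vectors by averaging CBOW or skip-gram word vectors, then train a classifier. The messages and labels are invented, and a plain logistic regression stands in for the paper's deep model and SVM Light baseline.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

messages = [
    (["free", "prize", "click", "now"], 1),    # spam
    (["win", "cash", "click", "here"], 1),     # spam
    (["see", "you", "at", "lunch"], 0),        # normal
    (["meeting", "moved", "to", "three"], 0),  # normal
]
tokens = [m for m, _ in messages]
labels = [y for _, y in messages]

def message_vectors(sg):
    # sg=0 trains CBOW, sg=1 trains skip-gram.
    w2v = Word2Vec(tokens, vector_size=32, window=3, min_count=1, sg=sg, epochs=100)
    return np.array([np.mean([w2v.wv[w] for w in msg], axis=0) for msg in tokens])

for sg, name in [(0, "CBOW"), (1, "skip-gram")]:
    X = message_vectors(sg)
    clf = LogisticRegression().fit(X, labels)
    print(name, "training accuracy:", clf.score(X, labels))
```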

A Study on Word Vector Models for Representing Korean Semantic Information

  • Yang, Hejung; Lee, Young-In; Lee, Hyun-jung; Cho, Sook Whan; Koo, Myoung-Wan
    • Phonetics and Speech Sciences / v.7 no.4 / pp.41-47 / 2015
  • This paper examines whether the Global Vector model is applicable to Korean data as a universal learning algorithm. The main purpose of this study is to compare the global vector model (GloVe) with word2vec models such as the continuous bag-of-words (CBOW) model and the skip-gram (SG) model. For this purpose, we conducted an experiment employing an evaluation corpus consisting of 70 target words and 819 pairs of Korean words for word similarity and analogy tasks, respectively. The word similarity task yielded Pearson correlation coefficients with human judgment of 0.3133 for GloVe, 0.2637 for CBOW, and 0.2177 for SG. The word analogy task showed an overall accuracy over semantic and syntactic relations of 67% for GloVe, 66% for CBOW, and 57% for SG.
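A hedged sketch of the evaluation protocol named above: correlate model cosine similarities with human similarity ratings using a Pearson coefficient. The word pairs and human scores below are invented placeholders, not the study's 70-word Korean evaluation set.

```python
from gensim.models import Word2Vec
from scipy.stats import pearsonr

corpus = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["dog", "chases", "the", "cat"],
    ["cat", "avoids", "the", "dog"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=200)

# (word1, word2, human similarity rating) -- illustrative values only.
eval_pairs = [("king", "queen", 0.9), ("dog", "cat", 0.7), ("king", "dog", 0.1)]

model_scores = [model.wv.similarity(a, b) for a, b, _ in eval_pairs]
human_scores = [h for _, _, h in eval_pairs]
r, _ = pearsonr(model_scores, human_scores)
print("Pearson r vs. human judgment:", r)
```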

Word Embedding using Semantic Restriction of Predicate (용언의 의미 제약을 이용한 단어 임베딩)

  • Lee, Ju-Sang; Ock, Cheol-Young
    • Annual Conference on Human and Language Technology / 2015.10a / pp.181-183 / 2015
  • Recently, deep learning has been widely used in natural language processing, where word representation is important for improving its performance. Word embedding represents words as multi-dimensional vectors using an artificial neural network. In this paper, word embeddings are trained with word2vec's skip-gram and negative sampling. The training data are constructed from the mandatory-argument semantic constraint information of predicates in UWordMap, a Korean lexical map, and a vocabulary of 250,183 words is built for training. Experimental results show that, with the semantic constraint information, words with similar meanings are embedded adjacently.
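A minimal sketch of the training setup the abstract names: word2vec skip-gram with negative sampling. The UWordMap argument-constraint data is not reproduced; the toy predicate-argument sentences below are stand-ins.

```python
from gensim.models import Word2Vec

# Stand-in for (predicate, permitted-argument) training pairs.
sentences = [
    ["eat", "food"], ["eat", "bread"], ["eat", "apple"],
    ["drink", "water"], ["drink", "milk"], ["drink", "juice"],
]

model = Word2Vec(
    sentences,
    vector_size=32,
    window=2,
    min_count=1,
    sg=1,        # skip-gram
    negative=5,  # 5 negative samples per positive pair
    epochs=200,
)

# Arguments constrained by the same predicate should cluster together.
print(model.wv.most_similar("bread", topn=3))
```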

Research Paper Classification Scheme based on Word Embedding (워드 임베딩 기반 연구 논문 분류 기법)

  • Dipto, Biswas; Gil, Joon-Min
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.494-497 / 2021
  • Text classification, which sorts large amounts of text data into areas of interest based on techniques for extracting information from raw text, has recently been drawing attention. In this paper, we propose a method of classifying and recommending research papers in a specific field using word embedding. CBOW (Continuous Bag-of-Words) and SG (Skip-gram) embeddings are applied to research paper classification, and their performance is compared with the conventional TF-IDF (Term Frequency-Inverse Document Frequency) approach. The evaluation results show that the word-embedding-based classification outperforms the TF-IDF-based classification.
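A hedged sketch of the comparison in the abstract: classify short documents with TF-IDF features versus averaged skip-gram vectors. Titles, labels, and the logistic regression classifier are illustrative assumptions; the paper's corpus and classifier may differ.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    ("deep learning for image recognition", "vision"),
    ("convolutional networks classify images", "vision"),
    ("word embedding improves text classification", "nlp"),
    ("skip gram learns word vectors from text", "nlp"),
]
texts = [d for d, _ in docs]
labels = [y for _, y in docs]

# Baseline: TF-IDF bag-of-words features.
tfidf = TfidfVectorizer().fit_transform(texts)
print("TF-IDF acc:", LogisticRegression().fit(tfidf, labels).score(tfidf, labels))

# Word embedding features: average skip-gram vectors per document (sg=0 for CBOW).
tokens = [t.split() for t in texts]
w2v = Word2Vec(tokens, vector_size=32, window=3, min_count=1, sg=1, epochs=200)
emb = np.array([np.mean([w2v.wv[w] for w in doc], axis=0) for doc in tokens])
print("Embedding acc:", LogisticRegression().fit(emb, labels).score(emb, labels))
```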

Recommender System Design with Item2vec and LSTM (Item2vec과 LSTM을 사용한 추천 시스템 설계)

  • Minsu Cha; Jiyoung Woo
    • Proceedings of the Korean Society of Computer Information Conference / 2023.01a / pp.145-146 / 2023
  • In this paper, we implement a recommender system by applying Item2vec and LSTM to a user-information dataset collected from Steam, the largest game platform. Item2vec is applied to the dataset to convert the unique App IDs owned by each user into 200-dimensional vectors. The dataset is then divided into four sequential stages by period, and LSTM generates up to five recommendations per user. To capture active users, the dataset was collected through the Steam API from users who had left reviews on games. The LSTM experiment was evaluated with RMSE, achieving a score of 0.1357.
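A hedged sketch of the Item2vec step only: treat each user's list of owned App IDs as a "sentence" and train word2vec over it, yielding a 200-dimensional vector per item as in the abstract. The user libraries are invented, and the follow-up LSTM ranking stage is omitted.

```python
from gensim.models import Word2Vec

# Each inner list is one user's owned App IDs (as strings).
user_libraries = [
    ["570", "730", "440"],     # e.g., Dota 2, CS:GO, TF2
    ["570", "730", "578080"],  # e.g., Dota 2, CS:GO, PUBG
    ["440", "620", "400"],     # e.g., TF2, Portal 2, Portal
]

item2vec = Word2Vec(
    user_libraries,
    vector_size=200,  # 200-dimensional item vectors, per the abstract
    window=50,        # large window: co-ownership matters, not item order
    min_count=1,
    sg=1,
    epochs=300,
)

# Items owned by similar users should land near each other.
print(item2vec.wv.most_similar("570", topn=2))
```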

An Efficient BotNet Detection Scheme Exploiting Word2Vec and Accelerated Hierarchical Density-based Clustering (Word2Vec과 가속화 계층적 밀집도 기반 클러스터링을 활용한 효율적 봇넷 탐지 기법)

  • Lee, Taeil; Kim, Kwanhyun; Lee, Jihyun; Lee, Suchul
    • Journal of Internet Computing and Services / v.20 no.6 / pp.11-20 / 2019
  • Numerous enterprises, organizations, and individual users are exposed to large DDoS (Distributed Denial of Service) attacks. DDoS attacks are performed through a botnet, which is composed of a number of malware-infected computers (zombie PCs) and a special computer that controls them within a hierarchical command-and-control system. To detect malware, detection software or a vaccine program must identify the malware's signature through in-depth analysis, and these signatures need to be updated in advance, which is time-consuming and costly. In this paper, we propose a botnet detection scheme that does not require periodic signature updates, using an artificial neural network model. The proposed scheme exploits Word2Vec and accelerated hierarchical density-based clustering. Botnet detection performance was evaluated on the CTU-13 dataset. The experimental results show a detection rate of 99.9%, which outperforms the conventional method.
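A hedged sketch of the scheme's two stages: embed categorical traffic tokens with Word2Vec, then cluster flow vectors with HDBSCAN (an accelerated hierarchical density-based clusterer). The token "flows" below are invented; real input would be derived from CTU-13 traffic features, and the paper's exact feature construction may differ.

```python
import numpy as np
import hdbscan
from gensim.models import Word2Vec

# Each flow is a sequence of categorical traffic tokens (protocol, port, ...).
flows = [
    ["tcp", "port80", "small", "frequent"],
    ["tcp", "port80", "small", "frequent"],
    ["udp", "port53", "tiny", "burst"],
    ["udp", "port53", "tiny", "burst"],
    ["tcp", "port6667", "beacon", "periodic"],  # IRC-like botnet traffic
    ["tcp", "port6667", "beacon", "periodic"],
] * 3  # repeat so clusters have enough members

w2v = Word2Vec(flows, vector_size=16, window=4, min_count=1, sg=1, epochs=100)
X = np.array([np.mean([w2v.wv[t] for t in f], axis=0) for f in flows])

# HDBSCAN labels dense groups of similar flows; -1 marks noise.
labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(X)
print(labels)
```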

Korean Idiom Classification Using Word Embedding (워드 임베딩을 활용한 관용표현 인식 연구)

  • Park, Seo-Yoon; Kang, Ye-Jee; Kang, Hye-Rin; Jang, Yeon-Ji; Kim, Han-Saem
    • Annual Conference on Human and Language Technology / 2020.10a / pp.548-553 / 2020
  • Everyday language contains idiomatic expressions whose meaning is difficult to grasp without linguistic intuition, because understanding an idiom requires both morphological and semantic understanding of the expression. Machines likewise lack linguistic intuition, so natural language processing of idiomatic expressions is difficult. In particular, idioms that are ambiguous with literal expressions are at high risk of being analyzed only literally, without their idiomatic character being considered. In this study, starting from the hypothesis that idiomatic expressions have low relevance to their surrounding context, we attempted to distinguish idiomatic from literal expressions using word embeddings. Experiments were conducted on four expressions, and methods based on skip-gram and FastText confirmed that idiomatic expressions have lower similarity to their surrounding words.
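A hedged sketch of the study's test: score how similar an expression's words are to the surrounding context words; under the hypothesis above, idiomatic uses should score lower than literal ones. The English sentences are invented stand-ins, and a large pre-trained model would normally replace this toy one.

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["he", "kicked", "the", "ball", "across", "the", "field"],
    ["she", "kicked", "the", "bucket", "of", "water", "over"],
    ["the", "player", "kicked", "the", "ball", "hard"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=300)

def context_similarity(expression, context):
    """Mean cosine similarity between expression words and context words."""
    return np.mean([
        model.wv.similarity(e, c) for e in expression for c in context
    ])

# Compare each expression against its surrounding words.
print("literal:  ", context_similarity(["kicked", "ball"], ["player", "field"]))
print("candidate:", context_similarity(["kicked", "bucket"], ["water", "over"]))
```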

Utilizing Local Bilingual Embeddings on Korean-English Law Data (한국어-영어 법률 말뭉치의 로컬 이중 언어 임베딩)

  • Choi, Soon-Young; Matteson, Andrew Stuart; Lim, Heui-Seok
    • Journal of the Korea Convergence Society / v.9 no.10 / pp.45-53 / 2018
  • Recently, studies on bilingual word embedding have been gaining much attention. However, bilingual word embedding with Korean is not actively pursued due to the difficulty of obtaining a sizable, high-quality corpus. Local embeddings that can be applied to specific domains are relatively rare. Additionally, multi-word vocabulary is problematic due to the lack of one-to-one word-level correspondence in translation pairs. In this paper, we crawl 868,163 paragraphs from a Korean-English law corpus and propose three mapping strategies for word embedding. These strategies address the aforementioned issues, including multi-word translation, and improve translation pair quality on paragraph-aligned data. We demonstrate a twofold increase in translation pair quality compared to the global bilingual word embedding baseline.
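A hedged sketch of one standard way to align two monolingual embedding spaces with a seed dictionary (orthogonal Procrustes); the paper proposes three mapping strategies of its own, which are not reproduced here. The "embeddings" below are random placeholders for Korean and English vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 32, 100

# Source (e.g., Korean) vectors for seed translation pairs, and target
# (e.g., English) vectors generated by a hidden rotation plus noise.
X = rng.normal(size=(n_pairs, dim))
true_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
Y = X @ true_rotation + 0.01 * rng.normal(size=(n_pairs, dim))

# Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W orthogonal,
# solved by the SVD of X^T Y.
u, _, vt = np.linalg.svd(X.T @ Y)
W = u @ vt

# Mapped source vectors should now sit close to their translations.
err = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print("relative alignment error:", err)
```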