• Title/Summary/Keyword: Word Embedding Approach

Text Classification Using Parallel Word-level and Character-level Embeddings in Convolutional Neural Networks

  • Geonu Kim;Jungyeon Jang;Juwon Lee;Kitae Kim;Woonyoung Yeo;Jong Woo Kim
    • Asia pacific journal of information systems
    • v.29 no.4
    • pp.771-788
    • 2019
  • Deep learning techniques such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) outperform traditional approaches such as Support Vector Machines (SVMs) and Naïve Bayes approaches in text classification. When CNNs are used for text classification, word embedding or character embedding transforms words or characters into fixed-size vectors before they are fed into the convolutional layers. In this paper, we propose a parallel word-level and character-level embedding approach in CNNs for text classification. The proposed approach can capture word-level and character-level patterns concurrently in CNNs. To show the usefulness of the proposed approach, we perform experiments with two English and three Korean text datasets. The experimental results show that character-level embedding works better in Korean and word-level embedding performs well in English. The results also reveal that the proposed approach outperforms traditional CNNs that use only word-level or only character-level embedding, on both Korean and English documents. A more detailed investigation shows that the proposed approach tends to outperform the traditional embedding approaches when the amount of data is relatively small.
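
The parallel embedding idea in this abstract can be illustrated with a small forward-pass sketch. The abstract does not give the architecture, so everything below is an assumption made for illustration: the helper names, the random weights standing in for learned parameters, the filter counts and widths, and the use of ReLU with max-over-time pooling. It is not the authors' model. The sketch only shows two branches, one over word vectors and one over character vectors, whose pooled outputs are concatenated.

```python
import random

def make_embedding(vocab, dim, rng):
    """Assign each symbol a random vector (a stand-in for learned embedding weights)."""
    return {sym: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for sym in vocab}

def make_filters(count, width, dim, rng):
    """Create `count` convolution filters, each spanning `width` positions of `dim` values."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
            for _ in range(count)]

def conv_max_pool(seq, filters):
    """Slide each filter over the sequence, apply ReLU, and keep the maximum activation."""
    pooled = []
    for filt in filters:
        width, dim = len(filt), len(filt[0])
        # Zero-pad sequences shorter than the filter so at least one window exists.
        padded = seq + [[0.0] * dim] * max(0, width - len(seq))
        best = 0.0  # a ReLU output is never below zero
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            act = sum(w * x for wv, xv in zip(filt, window) for w, x in zip(wv, xv))
            best = max(best, act)
        pooled.append(best)
    return pooled

def parallel_features(text, word_emb, char_emb, word_filters, char_filters):
    """Run the word branch and the character branch side by side and concatenate them."""
    text = text.lower()
    word_seq = [word_emb[w] for w in text.split() if w in word_emb]
    char_seq = [char_emb[c] for c in text if c in char_emb]
    return conv_max_pool(word_seq, word_filters) + conv_max_pool(char_seq, char_filters)

# Hypothetical toy vocabulary and sizes, chosen only to keep the example small.
rng = random.Random(0)
word_emb = make_embedding(["good", "bad", "movie"], dim=4, rng=rng)
char_emb = make_embedding("abcdefghijklmnopqrstuvwxyz ", dim=3, rng=rng)
word_filters = make_filters(count=2, width=2, dim=4, rng=rng)
char_filters = make_filters(count=3, width=3, dim=3, rng=rng)

features = parallel_features("Good movie", word_emb, char_emb, word_filters, char_filters)
print(len(features))  # 5 (2 word-branch features + 3 character-branch features)
```

In a trained classifier, the concatenated feature vector would typically be passed to a fully connected layer; that step is omitted here.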

Addressing the New User Problem of Recommender Systems Based on Word Embedding Learning and Skip-gram Modelling

  • Shin, Su-Mi;Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information
    • v.21 no.7
    • pp.9-16
    • 2016
  • Collaborative filtering (CF) uses the purchase or item rating history of other users and does not need additional properties or attributes of users and items. Hence, CF is known to be the most successful recommendation technology. However, the conventional CF approach has some significant weaknesses, such as the new user problem. In this paper, we propose an approach that uses word embedding with skip-gram to learn distributed item representations. In particular, we show that this approach can be used to capture precise item representations for solving the "new user problem." The proposed approach has been tested on the MovieLens databases. We compare the performance of user-based CF, item-based CF, and our approach by observing how the recommendation results change with the amount of item rating information. The experimental results show that our approach improves precision in new user problem situations.
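
As a rough illustration of how skip-gram can be applied to items instead of words, the sketch below generates (center, context) training pairs from one user's rating history, treating the history as a "sentence" of item IDs. Treating a history this way, the function name `skipgram_pairs`, the window size, and the item IDs are all assumptions made for this example; the abstract does not describe the authors' preprocessing.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every item and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an item is not its own context
                pairs.append((center, sequence[j]))
    return pairs

# Hypothetical rating history of one user (item IDs are made up).
history = ["m1", "m2", "m3", "m4"]
pairs = skipgram_pairs(history, window=1)
print(pairs)
# [('m1', 'm2'), ('m2', 'm1'), ('m2', 'm3'), ('m3', 'm2'), ('m3', 'm4'), ('m4', 'm3')]
```

A skip-gram model would then be trained to predict each context item from its center item, which yields one embedding vector per item.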

Sentence model based subword embeddings for a dialog system

  • Chung, Euisok;Kim, Hyun Woo;Song, Hwa Jeon
    • ETRI Journal
    • v.44 no.4
    • pp.599-612
    • 2022
  • This study focuses on improving a word embedding model to enhance the performance of downstream tasks, such as those of dialog systems. To improve traditional word embedding models, such as skip-gram, it is critical to refine the word features and expand the context model. In this paper, we approach the word model from the perspective of subword embedding and attempt to extend the context model by integrating various sentence models. Our proposed sentence model is a subword-based skip-thought model that integrates self-attention and relative position encoding techniques. We also propose a clustering-based dialog model for downstream task verification and evaluate its relationship with the sentence-model-based subword embedding technique. The proposed subword embedding method produces better results than previous methods in evaluating word and sentence similarity. In addition, the downstream task verification, a clustering-based dialog system, demonstrates an improvement of up to 4.86% over the results of FastText in previous research.

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

  • Al-Sabahi, Kamal;Zuping, Zhang;Kang, Yang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • v.13 no.1
    • pp.254-276
    • 2019
  • Since the amount of information on the internet is growing rapidly, it is not easy for users to find information relevant to their queries. To tackle this issue, researchers are paying much attention to document summarization. The key to any successful document summarizer is a good document representation. The traditional approaches based on word overlap mostly fail to produce that kind of representation. Word embedding has shown good performance by allowing words to match on a semantic level. However, naively concatenating word embeddings makes common words dominant, which in turn diminishes the representation quality. In this paper, we employ word embeddings to improve the weighting schemes for calculating the Latent Semantic Analysis input matrix. Two embedding-based weighting schemes are proposed and then combined to calculate the values of this matrix. They are modified versions of the augment weight and the entropy frequency that combine the strengths of traditional weighting schemes and word embedding. The proposed approach is evaluated on three English datasets: DUC 2002, DUC 2004, and Multilingual 2015 Single-document Summarization. Experimental results on the three datasets show that the proposed model achieves competitive performance compared to the state of the art, leading to the conclusion that it provides a better document representation and, as a result, a better document summary.

A Method for Learning the Specialized Meaning of Terminology through Mixed Word Embedding (혼합 임베딩을 통한 전문 용어 의미 학습 방안)

  • Kim, Byung Tae;Kim, Nam Gyu
    • The Journal of Information Systems
    • v.30 no.2
    • pp.57-78
    • 2021
  • Purpose: This study first aims to produce embedding results that reflect the characteristics of both specialized and general documents. In addition, for cases where disparate documents are combined as training material for natural language processing, we propose a method that quantitatively measures how strongly the characteristics of each individual domain are reflected. Approach: The Korean Supreme Court precedent documents and Korean Wikipedia are selected as the specialized documents and general documents, respectively. After extracting the most similar word pairs, and their similarities, for unique words observed only in the specialized documents, we observe how those values change when the embedding is trained together with general documents. Findings: According to the measurement methods proposed in this study, the specificity of the specialized documents was diluted when they were combined with general documents, and the degree of dilution may be positively correlated with the size of the general documents.

Triplet loss based domain adversarial training for robust wake-up word detection in noisy environments (잡음 환경에 강인한 기동어 검출을 위한 삼중항 손실 기반 도메인 적대적 훈련)

  • Lim, Hyungjun;Jung, Myunghun;Kim, Hoirin
    • The Journal of the Acoustical Society of Korea
    • v.39 no.5
    • pp.468-475
    • 2020
  • A good acoustic word embedding that expresses the characteristics of a word well plays an important role in wake-up word detection (WWD). However, the representation ability of an acoustic word embedding may be weakened by the various types of environmental noise that occur where WWD operates, causing performance degradation. In this paper, we propose triplet loss based Domain Adversarial Training (tDAT), which mitigates environmental factors that can affect acoustic word embedding. Through experiments in noisy environments, we verify that the proposed method effectively improves on the conventional DAT approach, and we check its scalability by combining it with other methods proposed for robust WWD.
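
For readers unfamiliar with triplet loss, the following is a minimal sketch of the commonly used formulation, max(0, d(a, p) - d(a, n) + margin), with Euclidean distance. The margin value and the toy vectors are assumptions, and the sketch covers only the triplet term; how the paper combines it with domain adversarial training is not described in the abstract and is not shown here.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the positive is not closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings, chosen so the distances are easy to check by hand.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, so the loss is 0
loss_hard = triplet_loss(anchor, negative, positive)  # 5 - 1 + 1 = 5
print(loss_easy, loss_hard)
```

The loss is zero once embeddings of the same word are closer together than embeddings of different words by at least the margin, which is the property an acoustic word embedding needs.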

Expansion of Word Representation for Named Entity Recognition Based on Bidirectional LSTM CRFs (Bidirectional LSTM CRF 기반의 개체명 인식을 위한 단어 표상의 확장)

  • Yu, Hongyeon;Ko, Youngjoong
    • Journal of KIISE
    • v.44 no.3
    • pp.306-313
    • 2017
  • Named entity recognition (NER) seeks to locate named entities in text and classify them into pre-defined categories such as names of persons, organizations, locations, and expressions of time. Recently, many state-of-the-art NER systems have been implemented with bidirectional LSTM CRFs. Deep learning models based on long short-term memory (LSTM) generally depend on word representations as input. In this paper, we propose an approach that expands the word representation by using pre-trained word embedding, part-of-speech (POS) tag embedding, syllable embedding, and named entity dictionary feature vectors. Our experiments show that the proposed approach creates useful word representations as input to bidirectional LSTM CRFs. Our final representation performs 8.05%p (percentage points) higher than baseline NER systems that use only the pre-trained word embedding vector.

Ontology Matching Method Based on Word Embedding and Structural Similarity

  • Hongzhou Duan;Yuxiang Sun;Yongju Lee
    • International journal of advanced smart convergence
    • v.12 no.3
    • pp.75-88
    • 2023
  • In a specific domain, experts have different understandings of domain knowledge or different purposes for constructing an ontology. These differences lead to multiple different ontologies in the domain, a phenomenon called ontology heterogeneity. For research fields that require cross-ontology operations, such as knowledge fusion and knowledge reasoning, ontology heterogeneity causes certain difficulties. In this paper, we propose a novel ontology matching model that combines word embedding and a concatenated continuous bag-of-words model. Our goal is to improve word vectors and to distinguish semantic similarity from descriptive associations. Moreover, we make the most of textual and structural information from the ontology and external resources. We represent the ontology as a graph and use the SimRank algorithm to calculate the structural similarity. Our approach employs a similarity queue to achieve one-to-many matching results, which provide a wider range of insights for subsequent mining and analysis. This enhances and refines the methodology used in ontology matching.
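
The structural similarity step mentioned here relies on SimRank, whose basic iteration can be sketched as follows. The sketch assumes the standard definition: a node has similarity 1 with itself, and two distinct nodes are similar in proportion to the average similarity of their in-neighbors, scaled by a decay factor. The toy graph, node names, decay value, and iteration count are hypothetical; how the paper builds its ontology graph is not stated in the abstract.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Compute SimRank scores; `in_neighbors` maps every node to its list of in-neighbors."""
    nodes = list(in_neighbors)
    # Start from the identity: each node is fully similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0  # nodes without in-neighbors share no structure
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

# Hypothetical three-node graph: "Car" and "Automobile" share the parent "Thing".
graph = {
    "Thing": [],
    "Car": ["Thing"],
    "Automobile": ["Thing"],
}
scores = simrank(graph)
print(scores["Car"]["Automobile"])  # 0.8
```

Two concepts that hang under the same parent receive a high score even when their labels differ, which is why a structural measure complements embedding-based label similarity.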

Improving The Performance of Triple Generation Based on Distant Supervision By Using Semantic Similarity (의미 유사도를 활용한 Distant Supervision 기반의 트리플 생성 성능 향상)

  • Yoon, Hee-Geun;Choi, Su Jeong;Park, Seong-Bae
    • Journal of KIISE
    • v.43 no.6
    • pp.653-661
    • 2016
  • Existing pattern-based triple generation systems based on distant supervision can be flawed by the assumption that distant supervision makes. To mitigate the flaws caused by this excessive assumption, previous studies have commonly used statistical information to measure the confidence of patterns. In this study, we propose a more accurate confidence measure based on the semantic similarity between patterns and properties. An unsupervised learning method, word embedding, and WordNet-based similarity measures were adopted for learning the meanings of words and measuring semantic similarity. To resolve the language mismatch between patterns and properties, we adopted CCA for aligning bilingual word embedding models and a translation-based approach for the WordNet-based measure. The results of our experiments indicate that the accuracy of triples filtered by the semantic similarity-based confidence measure was 16% higher than that of the statistics-based approach. These results suggest that the semantic similarity-based confidence measure is more effective than the statistics-based approach for generating high-quality triples.

Application of Domain Knowledge in Transaction-based Recommender Systems through Word Embedding (트랜잭션 기반 추천 시스템에서 워드 임베딩을 통한 도메인 지식 반영)

  • Choi, Yeoungje;Moon, Hyun Sil;Cho, Yoonho
    • Knowledge Management Research
    • v.21 no.1
    • pp.117-136
    • 2020
  • In studies of recommender systems, which address the information overload problem of users, the use of transactional data has been continuously attempted. In particular, because firms can easily obtain transactional data thanks to the development of IoT technologies, transaction-based recommender systems have recently been used in various areas. However, transactional data have limitations: it is hard to reflect domain knowledge, and the data do not directly show user preferences for individual items. Therefore, in this study, we propose a method that applies word embedding in a transaction-based recommender system to reflect domain knowledge and preference differences among users. Our approach is based on SAR, which shows high performance in recommender systems, and we improve its components by using FastText, one of the word embedding techniques. Experimental results show that reflecting domain knowledge and preference differences has a significant effect on the performance of recommender systems. We therefore expect our study to contribute to the improvement of transaction-based recommender systems and to suggest an expansion of the data used in recommender systems.