• Title/Summary/Keyword: word similarity

Sentence model based subword embeddings for a dialog system

  • Chung, Euisok;Kim, Hyun Woo;Song, Hwa Jeon
    • ETRI Journal / v.44 no.4 / pp.599-612 / 2022
  • This study focuses on improving a word embedding model to enhance the performance of downstream tasks, such as those of dialog systems. To improve traditional word embedding models, such as skip-gram, it is critical to refine the word features and expand the context model. In this paper, we approach the word model from the perspective of subword embedding and attempt to extend the context model by integrating various sentence models. Our proposed sentence model is a subword-based skip-thought model that integrates self-attention and relative position encoding techniques. We also propose a clustering-based dialog model for downstream task verification and evaluate its relationship with the sentence-model-based subword embedding technique. The proposed subword embedding method produces better results than previous methods in evaluating word and sentence similarity. In addition, the downstream task verification, a clustering-based dialog system, demonstrates an improvement of up to 4.86% over the results of FastText in previous research.

Webtoon Search utilizing Genre Similarity with Word2Vec (Word2Vec 기반 장르 유사성을 활용한 웹툰 검색)

  • Lee, ChangMin;Ahn, JeJeong;Kang, DongYeon;Lee, Hyunah
    • Annual Conference on Human and Language Technology / 2019.10a / pp.503-505 / 2019
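  • In this paper, we propose a keyword-based similar-genre search system to compensate for the shortcomings of existing webtoon genre search systems. We analyze the genres and keywords of existing webtoons to define 44 genres and collect webtoons that fit each genre. Using a Word2Vec model trained on Namuwiki and Wikipedia documents, we compute the similarity between the user's input keyword and each of the 44 genres and find the genre most similar to the input. The webtoons belonging to that genre are returned as the result, presenting webtoons in the user's preferred genre. In the experiments, the highest search performance was obtained by the Word2Vec model trained on a small document set retrieved by searching Namuwiki for "장르" ("genre").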
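
The genre-matching step described in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses hypothetical three-dimensional vectors and genre names; the paper's actual vectors come from a trained Word2Vec model and cover 44 genres, and its exact scoring may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_genre(keyword_vec, genre_vecs):
    """Return the genre name whose vector is closest to the keyword vector."""
    return max(genre_vecs, key=lambda g: cosine(keyword_vec, genre_vecs[g]))

# Hypothetical toy embeddings (assumption): real Word2Vec vectors would have
# hundreds of dimensions and be looked up from a trained model.
genre_vecs = {
    "romance": [0.9, 0.1, 0.0],
    "thriller": [0.1, 0.9, 0.2],
    "comedy": [0.2, 0.1, 0.9],
}
keyword_vec = [0.8, 0.3, 0.1]  # stand-in for the embedding of a user keyword

best = most_similar_genre(keyword_vec, genre_vecs)
print(best)
```

Webtoons tagged with the returned genre would then be listed as the search result.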

Identifying Similar Overseas Patent Using Word2Vec-Based Semantic Text Analytics (Word2Vec 학습을 통한 의미 기반 해외 유사 특허 검색 방안)

  • Paek, Minji;Kim, Namgyu
    • Journal of Information Technology Services / v.17 no.2 / pp.129-142 / 2018
  • Recently, the number of patent applications has been increasing rapidly every year as the protection of intellectual property rights becomes more important. Patents must be inventive and novel. In particular, novelty implies that the corresponding invention is not the same as a previous invention. To confirm novelty, a prior art search must be conducted before and after the application. The target of the prior art search should include not only Korean patents but also foreign patents. Searching foreign patents should be supported by multilingual search techniques. However, a naive dictionary-based approach is limited because some technical concepts are represented by different terms in each nation. For example, a Korean term and a Japanese term may not be synonyms even though they represent the same technical concept. In this paper, we propose a new method to map semantic similarity between technical terms in Korean patents and Japanese patents. To investigate the different representations of the same technical concept in each nation, we identified and analyzed pairs of patents that are mutually connected by a priority claim relationship. By performing an experiment with real-world data, we showed that our approach can successfully reveal semantically similar technical terms in another language.

Spontaneous Speech Language Modeling using N-gram based Similarity (N-gram 기반의 유사도를 이용한 대화체 연속 음성 언어 모델링)

  • Park Young-Hee;Chung Minhwa
    • MALSORI / no.46 / pp.117-126 / 2003
  • This paper presents our language model adaptation for Korean spontaneous speech recognition. Compared with written text corpora, Korean spontaneous speech shows various characteristics of content and style, such as filled pauses, word omission, and contraction. Our approach focuses on improving the estimation of domain-dependent n-gram models by relevance weighting out-of-domain text data, where style is represented by n-gram based tf*idf similarity. In addition to relevance weighting, we use disfluencies as predictors of the neighboring words. The best result is a 9.7% relative reduction in word error rate, which shows that n-gram based relevance weighting reflects style differences well and that disfluencies are also good predictors.

Semantic Similarity-Based Contributable Task Identification for New Participating Developers

  • Kim, Jungil;Choi, Geunho;Lee, Eunjoo
    • Journal of information and communication convergence engineering / v.16 no.4 / pp.228-234 / 2018
  • In software development, the quality of a product often depends on whether its developers can rapidly find and contribute to the proper tasks. Currently, the word data of projects to which newcomers have previously contributed are mainly utilized to find appropriate source files in an ongoing project. However, because of the vocabulary gap between software projects, the accuracy of source file identification based on information retrieval is not guaranteed. In this paper, we propose a novel source file identification method to reduce the vocabulary gap between software projects. The proposed method employs DBPedia Spotlight to identify proper source files based on semantic similarity between source files of software projects. In an experiment based on the Spring Framework project, we evaluate the accuracy of the proposed method in the identification of contributable source files. The experimental results show that the proposed approach can achieve better accuracy than the existing method based on comparison of word vocabularies.

A Semantic Representation Based-on Term Co-occurrence Network and Graph Kernel

  • Noh, Tae-Gil;Park, Seong-Bae;Lee, Sang-Jo
    • International Journal of Fuzzy Logic and Intelligent Systems / v.11 no.4 / pp.238-246 / 2011
  • This paper proposes a new semantic representation and its associated similarity measure. The representation expresses the textual context observed around a certain term as a network in which nodes are terms and edge weights are the numbers of co-occurrences between connected terms. To compare terms represented as networks, a graph kernel is adopted as the similarity measure. The proposed representation has two notable merits compared with previous semantic representations. First, it can process polysemous words better than a vector representation. The network of a polysemous term is regarded as a combination of sub-networks that represent senses, and the appropriate sub-network is identified by context before being compared by the kernel. Second, the representation permits not only words but also senses or contexts to be represented directly from the corresponding set of terms. The validity of the representation and its similarity measure is evaluated with two tasks: a synonym test and unsupervised word sense disambiguation. The method performed well and could compete with state-of-the-art unsupervised methods.
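
As a rough illustration of the network representation described in this abstract, the sketch below counts co-occurrences among the terms of every context that contains a target term. The function name, the toy contexts, and the choice to count each term pair once per context are assumptions made for illustration; the paper's construction details and its graph kernel are not reproduced here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(contexts, target):
    """Build edge weights from every context that contains `target`.

    Returns a Counter mapping an alphabetically sorted term pair to the
    number of contexts in which both terms occur (an assumed weighting).
    """
    edges = Counter()
    for tokens in contexts:
        if target not in tokens:
            continue  # only contexts of the target term contribute
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical tokenized contexts for the polysemous term "bank".
contexts = [
    ["bank", "river", "water"],
    ["bank", "money", "loan"],
    ["bank", "river", "fishing"],
    ["tree", "river", "water"],  # skipped: does not contain the target
]
network = cooccurrence_network(contexts, "bank")
print(network[("bank", "river")])
```

In this toy network the "river" terms and the "money" terms form two loosely connected groups, which is the kind of sense-level sub-network the abstract refers to.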

Middle School Students' Analogical Transfer in Algebra Word Problem Solving (중학생을 대상으로 한 대수 문장제 해결에서의 유추적 전이)

  • 이종희;김진화;김선희
    • The Mathematical Education / v.42 no.3 / pp.353-368 / 2003
  • Analogy, which is based on similarity, infers the properties of a similar object from the properties of a known object. It can be a very useful thinking tool for learning mathematical patterns and laws and for noticing relational properties among various situations. The purpose of this study is to verify the effects of analogical transfer in solving equivalent, isomorphic, and similar algebra word problems, according to the similarity between source and target problems, while manipulating the hint condition, the figure and table conditions, and the amount of original learning. Five research questions were set up for this purpose. The participants were 354 first-grade students of S and G middle schools in Seoul. The data were processed by MANOVA using the statistical program SPSS 10.0. The results indicate that most students were poor at solving isomorphic and similar problems in analogical transfer according to the similarity of source and target problems. The hint, figure, and table conditions did not facilitate analogical transfer. Analogical transfer was facilitated only when the amount of learning was increased. Therefore, it is necessary to give students much more analogical problem-solving experience, through the development of instructional programs in education, to improve their analogical reasoning ability.

Word Embeddings-Based Pseudo Relevance Feedback Using Deep Averaging Networks for Arabic Document Retrieval

  • Farhan, Yasir Hadi;Noah, Shahrul Azman Mohd;Mohd, Masnizah;Atwan, Jaffar
    • Journal of Information Science Theory and Practice / v.9 no.2 / pp.1-17 / 2021
  • Pseudo relevance feedback (PRF) is a powerful query expansion (QE) technique that reformulates queries by choosing expansion elements from the top k pseudo-relevant documents. Traditional PRF frameworks have robustly handled vocabulary mismatch between user queries and pertinent documents; nevertheless, expansion elements are chosen without regard to their similarity to the original query's elements. Word embedding (WE) schemes are of significant interest for QE, which falls within the information retrieval domain. Deep averaging networks (DANs) define a framework in which the average of the word embeddings is passed through multiple linear layers. The complete query can therefore be represented by the average vector of the query terms, and this vector may be employed to determine expansion elements pertinent to the entire query. In this study, we suggest a DAN-based technique that augments PRF frameworks by integrating WE similarities to facilitate Arabic information retrieval. The technique is based on the principle that the top pseudo-relevant document set is assessed to determine the candidate element distribution and to select expansion terms appropriately, considering their similarity to the average vector representing the initial query elements. The Word2Vec model is selected for executing the experiments on a standard Arabic TREC 2001/2002 set. The majority of the evaluations indicate that the PRF implementation in the present study offers a significant performance improvement compared to that of the baseline PRF frameworks.
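
The idea of scoring expansion candidates against the average query vector can be sketched as follows. This is only the averaging and ranking step under stated assumptions: the linear layers of a DAN are omitted, cosine similarity is assumed as the scoring function, and the function names, two-dimensional embeddings, and English terms are hypothetical stand-ins for Word2Vec vectors of Arabic terms.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_expansion_terms(query_terms, candidates, embeddings, k):
    """Rank candidate terms by similarity to the averaged query vector."""
    query_vec = mean_vector([embeddings[t] for t in query_terms])
    ranked = sorted(
        candidates,
        key=lambda t: cosine(query_vec, embeddings[t]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings (assumption); candidates would normally be
# drawn from the top-k pseudo-relevant documents.
embeddings = {
    "oil": [1.0, 0.0],
    "price": [0.0, 1.0],
    "market": [0.7, 0.7],
    "petrol": [0.9, 0.1],
    "football": [-0.8, -0.6],
}
expansion = select_expansion_terms(
    ["oil", "price"], ["market", "petrol", "football"], embeddings, k=2
)
print(expansion)
```

Scoring against the averaged vector favors terms related to the query as a whole rather than to any single query term.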

Deep Learning Application for Core Image Analysis of the Poems by Ki Hyung-Do (딥러닝을 이용한 기형도 시의 핵심 이미지 분석)

  • Ko, Kwang-Ho
    • The Journal of the Convergence on Culture Technology / v.7 no.3 / pp.591-598 / 2021
  • Word vectors can be obtained with statistical SVD or with the deep-learning CBOW and LSTM methods, which learn the contexts of preceding and following words or the sequence of following words. Here they are used to analyze the poems of Ki Hyung-do through the similar words that the word vectors recommend for the core images of the poetry. At first sight the recommended words do not seem to fit the images, but a close look at the contexts of the specific poems shows that they express a style similar to that described by the reference words. The word vectors can also find, by analogy, words that stand in the same relation as the one between the representative words for the core images of the poems. The poems can therefore be analyzed in depth and in variety using the similarity and analogy operations on word vectors estimated with statistical SVD or the deep-learning CBOW and LSTM methods.

Following Firms on Twitter: Determinants of Continuance and Word-of-Mouth Intentions (트위터를 통한 기업과 고객과의 소통: 지속적인 팔로윙과 구전 의도에 영향을 미치는 요인에 대한 연구)

  • Kim, Hongki;Son, Jai-Yeol;Suh, Kil-Soo
    • Asia pacific journal of information systems / v.22 no.3 / pp.1-27 / 2012
  • Many companies have recently become interested in using social networking sites such as Twitter and Facebook as a new channel to communicate with their customers. For example, companies often offer "special deals" (e.g., coupons, discounts, free samples, etc.) to their customers who participate in promotions or events on social networking sites. Companies often make important announcements on their products or services on social networking sites. By doing so, customers are encouraged to continue to have relationships with companies on social networking sites and to recommend the companies' presence on social networking sites to other potential customers. Moreover, customers who keep close relationships with companies on social networking sites often provide the companies with valuable suggestions and feedback. For instance, Starbucks has more than 2 million followers on Twitter, and often receives suggestions and feedback on its product offerings and services from its followers on Twitter. Although companies realize the potential benefits of using social networking sites as a channel to communicate with their customers, it appears that many companies have difficulty forging long-lasting relationships with customers on social networking sites. It is often reported that many customers who had followed companies on Twitter later stopped following them for various reasons. Therefore, it is important to understand what motivates customers to continue to keep relationships with companies on social networking sites. Nonetheless, due attention had not been paid to this issue until recently. This study intends to contribute to our understanding of customers' intention to continue to follow companies on Twitter and to spread positive word-of-mouth about companies on Twitter. Specifically, we identify seven potential factors that customers perceive as important in evaluating their experience with companies on Twitter.
The seven factors include similarity, receptivity, interactivity, ubiquitous connectivity, enjoyment, usefulness and transparency. We posit that the seven perception factors can affect two types of satisfaction, emotional and cognitive, which can in turn influence customers' intention to follow companies on Twitter and to spread positive word-of-mouth about companies on Twitter. The research hypotheses formulated in this study were tested with data collected from a questionnaire survey administered to customers who had been following companies on Twitter. The data were analyzed with the partial least squares (PLS) approach to structural equation modeling. The results of the data analysis, based on 177 usable responses, were generally supportive of our predictions for the effects of the seven factors identified and the two types of satisfaction. In particular, our results suggest that emotional satisfaction was strongly influenced by perceived similarity, perceived receptivity, perceived enjoyment, and perceived transparency. Cognitive satisfaction was significantly influenced by perceived similarity, perceived interactivity, perceived enjoyment, and perceived transparency. While cognitive satisfaction was found to have significant and positive effects on both continued following and word-of-mouth intentions, emotional satisfaction had a significant and positive effect only on word-of-mouth intention.
