• Title/Summary/Keyword: Sentence Similarity

Search Result 81, Processing Time 0.031 seconds

Performance Improvement of Web Information Retrieval Using Sentence-Query Similarity (문장-질의 유사성을 이용한 웹 정보 검색의 성능 향상)

  • Park Eui-Kyu;Ra Dong-Yul;Jang Myung-Gil
    • Journal of KIISE:Software and Applications
    • /
    • v.32 no.5
    • /
    • pp.406-415
    • /
    • 2005
  • Prosperity of Internet led to the web containing huge number of documents. Thus increasing importance is given to the web information retrieval technology that can provide users with documents that contain the right information they want. This paper proposes several techniques that are effective for the improvement of web information retrieval. Similarity between a document and the query is a major source of information exploited by conventional systems. However, we suggest a technique to make use of similarity between a sentence and the query. We introduce a technique to compute the approximate score of the sentence-query similarity even without a mature technology of natural language processing. It was shown that the amount of computation for this task is linear to the number of documents in the total collection, which implies that practical systems can make use of this technique. The next important technique proposed in this paper is to use stratification of documents in re-ranking the documents to output. It was shown that it can lead to significant improvement in performance. We furthermore showed that using hyper links, anchor texts, and titles can result in enhancement of performance. To justify the proposed techniques we developed a large scale web information retrieval system and used it for experiments.

Perception of Korean Prosody by Native Speakers of English and Native Speakers of Korean (영어 원어민과 한국어 원어민의 한국어운율 인식)

  • Yi, So-Pae
    • MALSORI
    • /
    • no.65
    • /
    • pp.1-11
    • /
    • 2008
  • This study explored the perception of transplanted Korean prosody by NE (Native speakers of English) and NK (Native speakers of Korean) listeners. The Korean utterances of various sentence types produced by NE and NK were employed to transplant the original Korean prosody contours to the Korean utterances read by NE. Then, other NE and NK were instructed to rate the transplanted prosodic components. Results showed that the interactions between the two rater groups with the three factors (e.g., transplantation types & rater groups, sentence types & rater groups, sentence length & rater groups) turned out to be meaningful. Both rater groups preferred the combined effect of transplanted prosodic components (e.g. DP, DPI) to that of individual transplantation (e.g. I, D, P). Compared to NK, NE were more sensitive to duration change than pitch change whereas NK showed equal preference to the both. In sentence types such as De, Ex, Im, and Ta, NE perceived higher similarity than NK.

  • PDF

A Method for Measuring Inter-Utterance Similarity Considering Various Linguistic Features (다양한 언어적 자질을 고려한 발화간 유사도 측정 방법)

  • Lee, Yeon-Su;Shin, Joong-Hwi;Hong, Gum-Won;Song, Young-In;Lee, Do-Gil;Rim, Hae-Chang
    • The Journal of the Acoustical Society of Korea
    • /
    • v.28 no.1
    • /
    • pp.61-69
    • /
    • 2009
  • This paper presents an improved method measuring inter-utterance similarity in an example-based dialogue system, which searches the most similar utterance in a dialogue database to generate a response to a given user utterance. Unlike general inter-sentence similarity measures, the inter-utterance similarity measure for example-based dialogue system should consider not only word distribution but also various linguistic features, such as affirmation/negation, tense, modality, sentence type, which affects the natural conversation. However, previous approaches do not sufficiently reflect these features. This paper proposes a new utterance similarity measure by analyzing and reflecting various linguistic features to improve performance in accuracy. Also, by considering substitutability of the features, the proposed method can utilize limited number of examples. Experimental results show that the proposed method achieves 10%p improvement in accuracy compared to the previous method.

Implementation of Korean Sentence Similarity using Sent2Vec Sentence Embedding (Sent2Vec 문장 임베딩을 통한 한국어 유사 문장 판별 구현)

  • Park, Sang-Kil;Shin, MyeongCheol
    • Annual Conference on Human and Language Technology
    • /
    • 2018.10a
    • /
    • pp.541-545
    • /
    • 2018
  • 본 논문에서는 Sent2Vec을 이용한 문장 임베딩으로 구현한 유사 문장 판별 시스템을 제안한다. 또한 한국어 특성에 맞게 모델을 개선하여 성능을 향상시키는 방법을 소개한다. 고성능 라이브러리 구현과 제품화 가능한 수준의 완성도 높은 구현을 보였으며, 자체 구축한 평가셋으로 한국어 특성을 반영한 모델에 대한 P@1 평가 결과 Word2Vec CBOW에 비해 9.25%, Sent2Vec에 비해 1.93% 더 높은 성능을 보였다.

  • PDF

A Study on the Establishment of Comparison System between the Statement of Military Reports and Related Laws (군(軍) 보고서 등장 문장과 관련 법령 간 비교 시스템 구축 방안 연구)

  • Jung, Jiin;Kim, Mintae;Kim, Wooju
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.3
    • /
    • pp.109-125
    • /
    • 2020
  • The Ministry of National Defense is pushing for the Defense Acquisition Program to build strong defense capabilities, and it spends more than 10 trillion won annually on defense improvement. As the Defense Acquisition Program is directly related to the security of the nation as well as the lives and property of the people, it must be carried out very transparently and efficiently by experts. However, the excessive diversification of laws and regulations related to the Defense Acquisition Program has made it challenging for many working-level officials to carry out the Defense Acquisition Program smoothly. It is even known that many people realize that there are related regulations that they were unaware of until they push ahead with their work. In addition, the statutory statements related to the Defense Acquisition Program have the tendency to cause serious issues even if only a single expression is wrong within the sentence. Despite this, efforts to establish a sentence comparison system to correct this issue in real time have been minimal. Therefore, this paper tries to propose a "Comparison System between the Statement of Military Reports and Related Laws" implementation plan that uses the Siamese Network-based artificial neural network, a model in the field of natural language processing (NLP), to observe the similarity between sentences that are likely to appear in the Defense Acquisition Program related documents and those from related statutory provisions to determine and classify the risk of illegality and to make users aware of the consequences. Various artificial neural network models (Bi-LSTM, Self-Attention, D_Bi-LSTM) were studied using 3,442 pairs of "Original Sentence"(described in actual statutes) and "Edited Sentence"(edited sentences derived from "Original Sentence"). Among many Defense Acquisition Program related statutes, DEFENSE ACQUISITION PROGRAM ACT, ENFORCEMENT RULE OF THE DEFENSE ACQUISITION PROGRAM ACT, and ENFORCEMENT DECREE OF THE DEFENSE ACQUISITION PROGRAM ACT were selected. Furthermore, "Original Sentence" has the 83 provisions that actually appear in the Act. "Original Sentence" has the main 83 clauses most accessible to working-level officials in their work. "Edited Sentence" is comprised of 30 to 50 similar sentences that are likely to appear modified in the county report for each clause("Original Sentence"). During the creation of the edited sentences, the original sentences were modified using 12 certain rules, and these sentences were produced in proportion to the number of such rules, as it was the case for the original sentences. After conducting 1 : 1 sentence similarity performance evaluation experiments, it was possible to classify each "Edited Sentence" as legal or illegal with considerable accuracy. In addition, the "Edited Sentence" dataset used to train the neural network models contains a variety of actual statutory statements("Original Sentence"), which are characterized by the 12 rules. On the other hand, the models are not able to effectively classify other sentences, which appear in actual military reports, when only the "Original Sentence" and "Edited Sentence" dataset have been fed to them. The dataset is not ample enough for the model to recognize other incoming new sentences. Hence, the performance of the model was reassessed by writing an additional 120 new sentences that have better resemblance to those in the actual military report and still have association with the original sentences. Thereafter, we were able to check that the models' performances surpassed a certain level even when they were trained merely with "Original Sentence" and "Edited Sentence" data. If sufficient model learning is achieved through the improvement and expansion of the full set of learning data with the addition of the actual report appearance sentences, the models will be able to better classify other sentences coming from military reports as legal or illegal. Based on the experimental results, this study confirms the possibility and value of building "Real-Time Automated Comparison System Between Military Documents and Related Laws". The research conducted in this experiment can verify which specific clause, of several that appear in related law clause is most similar to the sentence that appears in the Defense Acquisition Program-related military reports. This helps determine whether the contents in the military report sentences are at the risk of illegality when they are compared with those in the law clauses.

Content-based Korean journal recommendation system using Sentence BERT (Sentence BERT를 이용한 내용 기반 국문 저널추천 시스템)

  • Yongwoo Kim;Daeyoung Kim;Hyunhee Seo;Young-Min Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.3
    • /
    • pp.37-55
    • /
    • 2023
  • With the development of electronic journals and the emergence of various interdisciplinary studies, the selection of journals for publication has become a new challenge for researchers. Even if a paper is of high quality, it may face rejection due to a mismatch between the paper's topic and the scope of the journal. While research on assisting researchers in journal selection has been actively conducted in English, the same cannot be said for Korean journals. In this study, we propose a system that recommends Korean journals for submission. Firstly, we utilize SBERT (Sentence BERT) to embed abstracts of previously published papers at the document level, compare the similarity between new documents and published papers, and recommend journals accordingly. Next, the order of recommended journals is determined by considering the similarity of abstracts, keywords, and title. Subsequently, journals that are similar to the top recommended journal from previous stage are added by using a dictionary of words constructed for each journal, thereby enhancing recommendation diversity. The recommendation system, built using this approach, achieved a Top-10 accuracy level of 76.6%, and the validity of the recommendation results was confirmed through user feedback. Furthermore, it was found that each step of the proposed framework contributes to improving recommendation accuracy. This study provides a new approach to recommending academic journals in the Korean language, which has not been actively studied before, and it has also practical implications as the proposed framework can be easily applied to services.

A Text Summarization Model Based on Sentence Clustering (문장 클러스터링에 기반한 자동요약 모형)

  • 정영미;최상희
    • Journal of the Korean Society for information Management
    • /
    • v.18 no.3
    • /
    • pp.159-178
    • /
    • 2001
  • This paper presents an automatic text summarization model which selects representative sentences from sentence clusters to create a summary. Summary generation experiments were performed on two sets of test documents after learning the optimum environment from a training set. Centroid clustering method turned out to be the most effective in clustering sentences, and sentence weight was found more effective than the similarity value between sentence and cluster centroid vectors in selecting a representative sentence from each cluster. The result of experiments also proves that inverse sentence weight as well as title word weight for terms and location weight for sentences are effective in improving the performance of summarization.

  • PDF

Analysis of Intention in Spoken Dialogue based on Classifying Sentence Patterns (문형구조의 분류에 따른 대화음성의 의도분석에 관한 연구)

  • Choi, Hwan-Jin;Song, Chang-Hwan;Oh, Yung-Hwan
    • The Journal of the Acoustical Society of Korea
    • /
    • v.15 no.1
    • /
    • pp.61-70
    • /
    • 1996
  • According to topics or speaker's intentions in a dialogue, utterance spoken by speaker has a different sentence structure of word combinations. Based on these facts, we have proposed the statistical approach. IDT(intention decision table), which is modeling the correlations between sentence patterns and the intention. In a IDT, the sentence is splitted into 5 different factors, and the intention of a sentence is determined by the similarity between and intention and 5 factors that have represent a sentence. From the experimental results, the IDT has indicated that the prediction rate of an intention is improved 10~18% over the word-intention correlations and is enhanced 3~12% compared with the MIG(Markov intention graph) that models the intention with a transition graph for word categories in a sentence. Based on these facts, we have found that the IDT is effective method for the prediction of an intention.

  • PDF

Deep Learning Based Semantic Similarity for Korean Legal Field (딥러닝을 이용한 법률 분야 한국어 의미 유사판단에 관한 연구)

  • Kim, Sung Won;Park, Gwang Ryeol
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.2
    • /
    • pp.93-100
    • /
    • 2022
  • Keyword-oriented search methods are mainly used as data search methods, but this is not suitable as a search method in the legal field where professional terms are widely used. In response, this paper proposes an effective data search method in the legal field. We describe embedding methods optimized for determining similarities between sentences in the field of natural language processing of legal domains. After embedding legal sentences based on keywords using TF-IDF or semantic embedding using Universal Sentence Encoder, we propose an optimal way to search for data by combining BERT models to check similarities between sentences in the legal field.

The Sentence Similarity Measure Using Deep-Learning and Char2Vec (딥러닝과 Char2Vec을 이용한 문장 유사도 판별)

  • Lim, Geun-Young;Cho, Young-Bok
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.22 no.10
    • /
    • pp.1300-1306
    • /
    • 2018
  • The purpose of this study is to see possibility of Char2Vec as alternative of Word2Vec that most famous word embedding model in Sentence Similarity Measure Problem by Deep-Learning. In experiment, we used the Siamese Ma-LSTM recurrent neural network architecture for measure similarity two random sentences. Siamese Ma-LSTM model was implemented with tensorflow. We train each model with 200 epoch on gpu environment and it took about 20 hours. Then we compared Word2Vec based model training result with Char2Vec based model training result. as a result, model of based with Char2Vec that initialized random weight record 75.1% validation dataset accuracy and model of based with Word2Vec that pretrained with 3 million words and phrase record 71.6% validation dataset accuracy. so Char2Vec is suitable alternate of Word2Vec to optimize high system memory requirements problem.