• Title/Abstract/Keywords: Short text-similarity

Search results: 14 items; processing time: 0.023 s

Research on Keyword-Overlap Similarity Algorithm Optimization in Short English Text Based on Lexical Chunk Theory

  • Na Li;Cheng Li;Honglie Zhang
    • Journal of Information Processing Systems / Vol. 19, No. 5 / pp. 631-640 / 2023
  • Short-text similarity calculation is one of the active topics in natural language processing research. Conventional keyword-overlap similarity algorithms consider only lexical item information and neglect the effect of word order. Some optimized variants incorporate word order, but their weights are hard to determine. Building on the keyword-overlap similarity algorithm, this paper proposes a short English text similarity algorithm based on lexical chunk theory (LC-SETSA), which introduces lexical chunk theory from cognitive psychology into short English text similarity calculation for the first time. Lexical chunks are used to segment short English texts, so that the segmentation results capture both the semantic connotation and the fixed word order of the chunks, and the overlap similarity of the lexical chunks is then calculated. Comparative experiments show that the proposed algorithm is feasible, stable, and effective to a large extent.
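The contrast this abstract draws between word-level overlap and chunk-level overlap can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the abstract gives no formula, so a Dice-style overlap is used, and the chunk lexicon and the greedy longest-match segmentation are hypothetical stand-ins, not the LC-SETSA procedure.

```python
def overlap_similarity(items_a, items_b):
    """Dice-style overlap of two item collections (order is ignored)."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def segment_by_chunks(text, chunk_lexicon):
    """Greedy longest-match segmentation: known chunks first, else single words."""
    words = text.lower().split()
    max_len = max((len(chunk.split()) for chunk in chunk_lexicon), default=1)
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in chunk_lexicon:
                units.append(candidate)
                i += n
                break
    return units


# Hypothetical chunk lexicon; a real one would come from a lexical-chunk resource.
lexicon = {"take care of"}
text_a = "she will take care of the cat"
text_b = "care she will take of the cat"  # same words, scrambled order

word_level = overlap_similarity(text_a.split(), text_b.split())
chunk_level = overlap_similarity(
    segment_by_chunks(text_a, lexicon), segment_by_chunks(text_b, lexicon)
)
print(word_level, chunk_level)
```

Word-level overlap cannot tell the two sentences apart because they contain the same words, whereas the chunk-level score drops once "take care of" is treated as a single ordered unit.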

정서 차원 공간에서 소설의 지배 정서 분석 및 분류 (Analyzing and classifying emotional flow of story in emotion dimension space)

  • 이신영;함준석;고일주
    • 인지과학 (Cognitive Science) / Vol. 22, No. 3 / pp. 299-326 / 2011
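  • Texts such as novels, blogs, chat messages, and product reviews carry an overall emotional flow. Comparing the similarity of emotional flows between texts makes it possible to classify texts with similar emotional flows, which can be used for product recommendation or opinion gathering. In this paper, emotion words are extracted sequentially from a text and analyzed along two dimensions, pleasure-displeasure and activation, to capture the emotional flow of the text. The sequential flow of the text is also treated as a time dimension, and the emotional flow is explored in the three-dimensional space of pleasure-displeasure, activation, and time in order to identify the 'dominant emotion', the overall emotional flow of the text. The paper further proposes a method for classifying texts with similar emotional flows by computing the similarity of dominant-emotion flows with the Euclidean distance in this three-dimensional space. Using the proposed method, modern Korean short stories were analyzed to identify their dominant emotions, and stories with similar dominant emotions were classified together.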

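The idea of comparing emotion flows by Euclidean distance in a (pleasure-displeasure, activation, time) space can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how points are paired or how time is scaled, so time is normalised to [0, 1] and the two trajectories are assumed to have the same number of points. The function names and the sample scores are hypothetical.

```python
import math


def to_trajectory(emotion_scores):
    """Attach a normalised time coordinate to each (valence, arousal) pair."""
    n = len(emotion_scores)
    if n == 0:
        return []
    if n == 1:
        valence, arousal = emotion_scores[0]
        return [(valence, arousal, 0.0)]
    return [(v, a, i / (n - 1)) for i, (v, a) in enumerate(emotion_scores)]


def mean_euclidean_distance(traj_a, traj_b):
    """Average point-to-point distance; smaller means more similar flows."""
    if not traj_a or len(traj_a) != len(traj_b):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, q) for p, q in zip(traj_a, traj_b))
    return total / len(traj_a)


# Hypothetical (valence, arousal) scores of emotion words, in reading order.
story_a = [(0.5, 0.2), (0.1, 0.6), (-0.4, 0.8)]
story_b = [(0.6, 0.1), (0.0, 0.5), (-0.5, 0.9)]

distance = mean_euclidean_distance(to_trajectory(story_a), to_trajectory(story_b))
print(distance)
```

Texts whose pairwise distances fall below some chosen cutoff could then be grouped together; the cutoff itself would have to be tuned on data.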

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 11 / pp. 3991-4010 / 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of agricultural knowledge question answering (Q&A) systems, information retrieval, and other tasks. However, the corpora and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when general sentence-embedding methods ignore word boundary features and local context information. These factors make the task challenging. To address these problems, a corpus on Chinese crop diseases and insect pests (CCDIP), which contains 13 categories, was established. Taking the CCDIP as the research object, this study then proposes a Chinese agricultural text similarity matching model, AgrCQS, which is based on mixed information extraction. Specifically, the hybrid embedding layer enriches character information and improves the model's ability to recognize word boundaries. Multi-scale local information is extracted by a multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism enhances the model's ability to fuse global information. The performance of AgrCQS is verified on the CCDIP and on three benchmark datasets, namely AFQMC, LCQMC, and BQ. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than those of baseline systems that use no external knowledge. Additionally, the proposed modules can be extracted separately and applied to other models, thus providing a reference for related research.

Do Words in Central Bank Press Releases Affect Thailand's Financial Markets?

  • CHATCHAWAN, Sapphasak
    • The Journal of Asian Finance, Economics and Business / Vol. 8, No. 4 / pp. 113-124 / 2021
  • The study investigates how financial markets respond to a shock to the tone and semantic similarity of Bank of Thailand press releases. Natural language processing techniques are employed to quantify the tone and the semantic similarity of 69 press releases from 2010 to 2018. The corpus of press releases is accessible to the general public. Stock market returns and bond yields are measured by the logged return on the SET50 and by short-term and long-term government bonds, respectively. Data are daily from January 4, 2010, to August 8, 2019. The study uses the structural vector autoregressive (SVAR) model to analyze the effects of unanticipated and temporary shocks to the tone and the semantic similarity on bond yields and stock market returns. Impulse response functions are also constructed for the analysis. The results show that 1-month, 3-month, 6-month and 1-year bond yields significantly increase in response to a positive shock to the tone of press releases, and that 1-month, 3-month, 6-month, 1-year and 25-year bond yields significantly increase in response to a positive shock to the semantic similarity. Interestingly, stock market returns obtained from the SET50 index do not respond significantly to shocks to the tone or the semantic similarity of the press releases.

Fast, Flexible Text Search Using Genomic Short-Read Mapping Model

  • Kim, Sung-Hwan;Cho, Hwan-Gue
    • ETRI Journal / Vol. 38, No. 3 / pp. 518-528 / 2016
  • Searching an extensive document database for documents that are locally similar to a given query document, and subsequently detecting the similar regions between such documents, is considered an essential task in information retrieval and data management. In this paper, we present a framework for this task. The proposed framework employs the method of short-read mapping, which is used in bioinformatics to reveal similarities between genomic sequences. Documents are treated as biological objects; consequently, edit operations between locally similar documents are viewed as an evolutionary process. Accordingly, we are able to apply the method of evolution tracing to the detection of similar regions between documents. In addition, we propose heuristic methods to address issues associated with the different stages of the proposed framework, for example, a frequency-based fragment ordering method and a locality-aware interval aggregation method. Extensive experiments covering various scenarios of this search task are conducted, and the results indicate that the proposed framework outperforms existing methods.

Modern Methods of Text Analysis as an Effective Way to Combat Plagiarism

  • Myronenko, Serhii;Myronenko, Yelyzaveta
    • International Journal of Computer Science & Network Security / Vol. 22, No. 8 / pp. 242-248 / 2022
  • The article presents an analysis of modern methods for automatically comparing original and unoriginal text to detect textual plagiarism. The study covers two types of plagiarism: literal, in which plagiarists copy the text exactly without changing anything, and intelligent, which uses more sophisticated techniques that are harder to detect because the text is manipulated, for example by replacing words and signs. Standard techniques for extrinsic detection are string-based, vector space, and semantic-based. The first, most common, and most successful target models for detecting literal plagiarism, N-gram and Vector Space, are analyzed, and their advantages and disadvantages are evaluated. The most effective target models for detecting intelligent plagiarism, particularly those that identify paraphrases by measuring the semantic similarity of short components of the text, are investigated. Models that use neural network architectures and are based on natural language sentence matching approaches, such as the Densely Interactive Inference Network (DIIN), Bilateral Multi-Perspective Matching (BiMPM), and Bidirectional Encoder Representations from Transformers (BERT) and its family of models, are considered. Progress in improving plagiarism detection systems, techniques, and related models is summarized. Relevant and urgent problems that remain unresolved in detecting intelligent plagiarism, namely effective recognition of unoriginal ideas and of skillfully paraphrased text, are outlined.

Route matching delivery recommendation system using text similarity

  • Song, Jeongeun;Song, Yoon-Ah
    • 한국컴퓨터정보학회논문지 (Journal of the Korea Society of Computer and Information) / Vol. 27, No. 8 / pp. 151-160 / 2022
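  • This study proposes an algorithm that enables faster short-distance delivery at the lowest cost, in response to the rapidly growing demand for delivery services. In the proposed algorithm, subway passengers take part in moving goods as couriers. A passenger can select deliveries that match his or her travel route, and a service user can select a nearby courier whose route matches. Courier recommendation is performed with a text-similarity measure that combines TF-IDF & N-gram with BERT. Unlike conventional parcel delivery systems, the approach therefore supports two-way selection between consumer and courier in a man-to-man manner. Because passengers who are already riding take part in moving the goods, both cost minimization and shorter delivery times can be ensured. Furthermore, since the transport itself requires no special skills, the approach is also meaningful in that it can offer an opportunity for economic participation to workers whose job prospects have shrunk.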
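The TF-IDF and N-gram part of the route-matching idea in this abstract can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the paper's system: routes are treated as plain strings of station names, character bigrams are used as the N-grams, the IDF is a smoothed variant chosen here for convenience, and the BERT component and the way the paper combines the two scores are left out because the abstract does not describe them. The function names and station sequences are illustrative.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-grams of a route string, ignoring spaces."""
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(documents, n=2):
    """Sparse TF-IDF vectors (dicts) over character n-grams."""
    grams = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(documents)
    # Smoothed IDF keeps weights positive even for n-grams found in every document.
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in counts.items()}
        for counts in grams
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_couriers(request_route, courier_routes, n=2):
    """Return (courier index, similarity) pairs, best match first."""
    vectors = tfidf_vectors([request_route] + list(courier_routes), n)
    query, candidates = vectors[0], vectors[1:]
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative routes written as station-name sequences.
couriers = [
    "Hongdae Sinchon Ewha",
    "Gangnam Yeoksam Seolleung",
    "Gangnam Yeoksam Samseong",
]
for index, score in rank_couriers("Gangnam Yeoksam Seolleung", couriers):
    print(index, round(score, 3))
```

A lexical score like this only rewards shared station names; a semantic model such as BERT would be needed to relate routes that are described differently.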

Conceptual Graph Matching Method for Reading Comprehension Tests

  • Zhang, Zhi-Chang;Zhang, Yu;Liu, Ting;Li, Sheng
    • Journal of information and communication convergence engineering / Vol. 7, No. 4 / pp. 419-430 / 2009
  • Reading comprehension (RC) systems aim to understand a given text and return answers to questions about it. Many previous studies extract the sentences most similar to the question as answers. However, texts for RC tests are generally short, and facts about an event or entity are often expressed across multiple sentences. The answers to some questions may be presented indirectly in sentences that share few words with the question. This paper proposes a conceptual graph matching method for RC tests to extract answer strings. The method first represents the text and the questions as conceptual graphs, and then extracts a subgraph for every candidate answer concept from the text graph. All candidate answer concepts are scored and ranked according to the matching similarity between their subgraphs and the question graph. The top-ranked concept is returned as the answer seed to form a concise answer string. Since the subgraphs for candidate answer concepts are not restricted to a single sentence, the approach improved the performance of answer extraction on the Remedia test data.

인스턴트 메시징에서의 대화 주제 및 주제 전환 탐지 (Topic and Topic Change Detection in Instance Messaging)

  • 최윤정;신욱현;정윤재;맹성현;한경수
    • 한국컴퓨터정보학회논문지 (Journal of the Korea Society of Computer and Information) / Vol. 13, No. 7 / pp. 59-66 / 2008
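  • This paper describes a technique that, in text-based conversations such as instant messaging and chat, identifies the topic of the conversation relative to the current utterance and determines whether the topic has changed. Unlike other kinds of writing, conversational utterances are very short and use few words, involve two or more participants, and are influenced by the conversation history. Given these characteristics, the paper detects topics based on keywords extracted not only from the user's utterances but also from the conversation partner's utterances, and reports results in which taking the conversation history into account improves the accuracy of topic detection. Topic change detection computes the similarity between the topics detected in the previous and current utterances and judges that a change has occurred when the similarity is low. In the experiments, topic detection achieved an accuracy of 88.20% and topic change detection an accuracy of 87.36%.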

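The topic-change rule described in this abstract (compare the topics of the previous and current utterances and flag a change when their similarity is low) can be sketched as follows. The sketch assumes that each utterance has already been reduced to a set of topic keywords, uses Jaccard similarity as the similarity measure, and uses a hypothetical threshold of 0.2; the paper's actual measure, its threshold, and its use of conversation history are not given in the abstract.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard similarity of two keyword collections."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def detect_topic_changes(utterance_keywords, threshold=0.2):
    """Indices of utterances whose keywords diverge from the previous utterance."""
    changes = []
    for i in range(1, len(utterance_keywords)):
        similarity = jaccard(utterance_keywords[i - 1], utterance_keywords[i])
        if similarity < threshold:  # low similarity is read as a topic change
            changes.append(i)
    return changes


# Hypothetical keyword sets, one per utterance, in conversation order.
conversation = [
    {"movie", "ticket"},
    {"movie", "weekend", "ticket"},
    {"exam", "homework"},
    {"exam", "professor"},
]
print(detect_topic_changes(conversation))
```

Comparing only adjacent utterances is the simplest reading of the rule; a window over earlier utterances would be one way to bring in the conversation history that the abstract mentions.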

WV-BTM: SNS 단문의 주제 분석을 위한 토픽 모델 정확도 개선 기법 (WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS)

  • 송애린;박영호
    • 디지털콘텐츠학회 논문지 (Journal of Digital Contents Society) / Vol. 19, No. 1 / pp. 51-58 / 2018
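  • As the number of SNS users and the volume of SNS data grow explosively, research based on SNS big data is being actively conducted. In social mining in particular, LDA, a representative topic model, is used to identify the similarity among texts in large volumes of unclassified SNS text data and to extract trends from them. However, LDA has the limitation that, for short texts, high-quality topic inference is difficult because of the semantic sparsity caused by infrequent words. BTM addressed this limitation of LDA by using combinations of two words, but BTM is in turn influenced more strongly by the higher-frequency word in each combination and therefore cannot compute weights that take the relevance to each topic into account. This paper seeks a way to improve the accuracy of the existing BTM by reflecting the semantic relatedness between words.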
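The "combination of two words" that BTM relies on is usually called a biterm: an unordered pair of words that co-occur in the same short text. The sketch below only shows how biterms can be extracted and counted from tokenised short texts. It is not the WV-BTM method; the word-vector weighting that the paper proposes is not described in the abstract and is not attempted here. The function names and sample posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """All unordered word pairs from one short text, each stored in sorted order."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(corpus):
    """Corpus-wide biterm frequencies, the raw input a biterm topic model works on."""
    counts = Counter()
    for tokens in corpus:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical tokenised short posts.
posts = [
    ["coffee", "seoul", "cafe"],
    ["coffee", "cafe"],
]
for biterm, count in biterm_counts(posts).most_common():
    print(biterm, count)
```

Because counts are pooled over the whole corpus, word pairs still provide co-occurrence evidence even when each individual post is only a few words long, which is the sparsity problem the abstract attributes to LDA.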