• Title/Summary/Keyword: document embedding

Development of a Regulatory Q&A System for KAERI Utilizing Document Search Algorithms and Large Language Model (거대언어모델과 문서검색 알고리즘을 활용한 한국원자력연구원 규정 질의응답 시스템 개발)

  • Hongbi Kim;Yonggyun Yu
    • Journal of Korea Society of Industrial Information Systems / v.28 no.5 / pp.31-39 / 2023
  • The evolution of Natural Language Processing (NLP) and the rise of large language models (LLMs) such as ChatGPT have paved the way for specialized question-answering (QA) systems tailored to specific domains. This study outlines a system that combines an LLM with document search algorithms to interpret and address user inquiries using documents from the Korea Atomic Energy Research Institute (KAERI). Initially, the system refines multiple documents for optimized search and analysis, breaking the content into manageable paragraphs suitable for the language model's processing. Each paragraph's content is converted into a vector via an embedding model and archived in a database. Upon receiving a user query, the system matches the vector extracted from the question against the stored vectors, pinpointing the most pertinent content. The chosen paragraphs, combined with the user's query, are then processed by the language generation model to formulate a response. Tests encompassing a spectrum of questions verified the system's proficiency in discerning question intent, understanding diverse documents, and delivering rapid and precise answers.
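
The retrieval steps in this abstract (split into paragraphs, embed, store, match the query, build the input for the generation model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for the vector database, and all function names, sample documents, and the prompt wording are assumptions made for the example.

```python
import math
import re
from collections import Counter


def split_into_paragraphs(document: str) -> list[str]:
    """Break a document into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Embed every paragraph and keep (paragraph, vector) pairs in memory."""
    return [(p, embed(p)) for doc in documents for p in split_into_paragraphs(doc)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Return the k paragraphs whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [paragraph for paragraph, _ in ranked[:k]]


def build_prompt(query: str, paragraphs: list[str]) -> str:
    """Combine the retrieved paragraphs with the question for the generator."""
    context = "\n\n".join(paragraphs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical regulation text, invented for this example.
docs = [
    "Annual leave must be requested five days in advance.\n\n"
    "Leave requests are approved by the team leader.",
    "Laboratory access requires a valid safety training certificate.\n\n"
    "Visitors must be escorted at all times.",
]
index = build_index(docs)
top = retrieve(index, "Who approves leave requests?", k=1)
prompt = build_prompt("Who approves leave requests?", top)
```

In a real system the `prompt` string would be sent to the language generation model, and a dense embedding model would replace the word-count vectors so that paraphrased questions still match the right paragraph.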

Document Embedding for Entity Linking in Social Media (문서 임베딩을 이용한 소셜 미디어 문장의 개체 연결)

  • Park, Youngmin;Jeong, Soyun;Lee, Jeong-Eom;Shin, Dongsoo;Kim, Seona;Seo, Junyun
    • Annual Conference on Human and Language Technology / 2017.10a / pp.194-196 / 2017
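  • Entity linking based on conventional word-based approaches cannot be expected to perform well on informal sentences, in which word variations and newly coined words appear frequently. In this paper, we propose an entity linking method that uses document embedding and a linear transformation to overcome the shortcomings of word-based approaches. Document embedding represents an entire document in a vector space, which makes it possible to compute the semantic similarity between documents. We also apply a linear transformation between Wikipedia sentences, which are relatively formal, and social media sentences, which are informal, to close the representation gap between the two sentence types. On sentences from Twitter, a representative social media platform, the proposed entity linking method showed a large performance improvement over the word-based approach.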

Technology-Focused Business Diversification Support Methodology Using Item Network (아이템 네트워크를 활용한 기술 중심 사업 다각화 기회 탐색 지원 방법론)

  • Bae, Kukjin;Kim, Ji-Eun;Kim, Namgyu
    • Journal of Information Technology Services / v.19 no.3 / pp.17-34 / 2020
  • Recently, various attempts have been made to discover promising items and technologies. However, there are very few data-driven approaches to support business diversification by companies with specific technologies. Therefore, there is a need for a methodology that can detect items related to a specific technology and recommend highly marketable items among them as business diversification targets. In this paper, we devise a Labeled Item Network for a Business Diversification Consulting Support System. Our research consists of three sub-studies. In Sub-study 1, we identify suitable source documents for building the item network and construct an item dictionary. In Sub-study 2, we derive the Labeled Item Network and devise four indices for item evaluation. Finally, in Sub-study 3, we introduce an application scenario for our methodology and describe the results of a real-case analysis. The Labeled Item Network, one of the main outcomes of this study, can identify the relationships between items as well as the meaning of each relationship. We expect that more specific business item diversification opportunities can be found with the Labeled Item Network. The proposed methodology can help many SMEs diversify their business on the basis of their technology.

Development of Deep Learning Models for Multi-class Sentiment Analysis (딥러닝 기반의 다범주 감성분석 모델 개발)

  • Syaekhoni, M. Alex;Seo, Sang Hyun;Kwon, Young S.
    • Journal of Information Technology Services / v.16 no.4 / pp.149-160 / 2017
  • Sentiment analysis is the process of determining whether a document, text, or conversation expresses a positive, negative, neutral, or other emotion. Sentiment analysis has been applied in several real-world applications, such as chatbots. In the last five years, the practical use of chatbots has become prevalent in many fields of industry. In chatbot applications, sentiment analysis must be performed in advance to recognize the user's emotion and understand the speaker's intent. A specific emotion conveys more than whether a sentence is positive or negative. In this context, we propose deep learning models for multi-class sentiment analysis that identify the speaker's emotion, categorized as joy, fear, guilt, sadness, shame, disgust, or anger. We develop convolutional neural network (CNN), long short-term memory (LSTM), and multi-layer neural network models as deep neural network models for detecting emotion in a sentence. In addition, a word embedding process was applied in our research. In our experiments, we found that the LSTM model performs best compared with the CNN and multi-layer neural network models. Moreover, we show the practical applicability of the deep learning models to sentiment analysis for chatbots.

Study on CEO New Year's Address: Using Text Mining Method (텍스트마이닝을 활용한 주요 대기업 신년사 분석)

  • YuKyoung Kim;Daegon Cho
    • Journal of Information Technology Services / v.22 no.2 / pp.93-127 / 2023
  • This study analyzed the CEO New Year's addresses of major Korean companies, extracting key topics for employees via text mining techniques. An intended contribution of this study is to help reporters, analysts, and researchers better understand the New Year's addresses by elucidating the implicit features of the messages within them. To this end, this study collected and analyzed 545 New Year's addresses published between 2012 and 2021 by the top 66 Korean companies in terms of market capitalization. The research methodologies applied include text clustering, word embedding of keywords, frequency analysis, and topic modeling. Our main findings suggest that the messages in the New Year's addresses fall into nine topics: organizational culture, global advancement, substantial management, business reorganization, capacity building, market leadership, management innovation, sustainable management, and technology development. Next, this study further analyzed the managerial significance of each topic and discussed their characteristics from the perspectives of time, industry, and corporate groups. Companies were typically found to emphasize sound management, market leadership, and business reorganization during economic downturns while stressing capacity building and organizational culture during market transition periods. Also, companies belonging to corporate groups tended to emphasize founding philosophy and corporate culture.

A Study on the Connecting Method of Query and Legal Cases Using Doc2Vec Document Embedding (Doc2Vec 문서 임베딩을 이용한 질의문과 판례 자동 연결 방안 연구)

  • Kang, Ye-Jee;Kang, Hye-Rin;Park, Seo-Yoon;Jang, Yeon-Ji;Kim, Han-Saem
    • Annual Conference on Human and Language Technology / 2020.10a / pp.76-81 / 2020
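  • For people without legal expertise to search legal information successfully, a search using everyday terms must still return legal information written in technical terms. However, the current case law search system does not support searching by user preference, and searching with everyday terms makes it difficult to retrieve the specialized material the user wants. This paper therefore aims to automatically connect queries written in everyday terms with legal cases written in technical terms. To automatically connect a query with related cases, we apply Doc2Vec, which performs well in document classification, to expert answers that use technical terms. Using the Doc2Vec document embedding technique, we proposed answers similar to a given expert answer, thereby grouping answers on similar topics. We also proposed cases with high similarity to the expert answers, automatically connecting each query to the corresponding cases.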

Korean End-to-End Coreference Resolution with BERT for Long Document (긴 문서를 위한 BERT 기반의 End-to-End 한국어 상호참조해결)

  • Jo, Kyeongbin;Jung, Youngjun;Lee, Changki;Ryu, Jihee;Lim, Joonho
    • Annual Conference on Human and Language Technology / 2021.10a / pp.259-263 / 2021
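  • Coreference resolution is a natural language processing task that identifies the mentions in a given document that are candidates for coreference and groups together the mentions that refer to the same entity. Recent work on coreference resolution has mainly studied end-to-end models that use BERT to obtain contextual word representations and then perform mention detection and coreference resolution simultaneously. However, to process long documents of 512 tokens or more, these models split the document into pieces of at most 512 tokens, which lowers coreference resolution performance on long documents. In this paper, we propose a BERT-based end-to-end coreference resolution model for long documents of 512 tokens or more. The model splits a long document into pieces of at most 512 tokens, obtains first-level contextual word representations from the existing BERT, concatenates them, adds the Global Positional Encoding or Embedding values for the long document, and passes the result through a Global BERT layer to obtain the final contextual word representations, to which the end-to-end coreference resolution model is then applied. Experimental results show that the proposed model performs similarly to the existing model (a 0.16% improvement on the test set), while GPU memory usage is reduced by a factor of 1.4 and speed is improved by a factor of 2.1.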
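
The segment-then-recombine idea in this abstract can be illustrated with a small sketch. It is only an outline under stated assumptions: `encode_segment` is a placeholder that returns zero vectors where the per-segment BERT encoder would run, the global position signal uses the standard sinusoidal encoding as one possible choice, the vector size of 8 is arbitrary, and the Global BERT layer that would follow is omitted. None of the function names come from the paper.

```python
import math

MAX_LEN = 512  # per-segment token limit named in the abstract


def split_segments(tokens, max_len=MAX_LEN):
    """Cut a token sequence into consecutive pieces of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal positional encoding vector for one position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def encode_segment(segment, dim):
    """Placeholder for the per-segment encoder: one zero vector per token."""
    return [[0.0] * dim for _ in segment]


def encode_long_document(tokens, dim=8, max_len=MAX_LEN):
    """Encode each segment separately, concatenate, then add global positions."""
    vectors = []
    for segment in split_segments(tokens, max_len):
        vectors.extend(encode_segment(segment, dim))  # segment-local representations
    # The position index runs over the whole document, not over each segment.
    return [
        [v + p for v, p in zip(vec, sinusoidal_encoding(pos, dim))]
        for pos, vec in enumerate(vectors)
    ]
```

Because the position index is document-wide, two tokens that sit at the same offset inside different segments still receive different position signals, which is the information a later global layer would need to relate mentions across segment boundaries.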

A Design of HTML Tag Stack and HTML Embedding Method to Improve Web Document Question Answering Performance of BERT (BERT 의 웹 문서 질의 응답 성능 향상을 위한 HTML 태그 스택 및 HTML 임베딩 기법 설계)

  • Mok, Jin-Wang;Lee, Hyun-Seob
    • Proceedings of the Korea Information Processing Society Conference / 2022.11a / pp.583-585 / 2022
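  • Recent advances in technology have been improving the performance of natural language processing models. Accordingly, a growing body of research addresses machine reading comprehension tasks whose passages are web documents, such as KorQuAD 2.0, rather than plain text. Most recent models for machine reading comprehension are based on the Transformer. BERT, a representative example, receives information about the order of the input sequence during the embedding process. Web documents, meanwhile, have a tag structure, so tag information can be useful for understanding a document in addition to positional information. However, BERT's existing embeddings do not pass the tag information of web documents to the model. In this paper, we introduce an HTML embedding method that can effectively pass web document tag information to BERT, along with the HTML tag stack, a preprocessing method that supports it. The HTML tag stack extracts information from HTML tags, and the HTML embedding method adds this information as input to BERT's embedding process, which is expected to improve performance on web document question answering.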
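
One way to picture a tag stack as a preprocessing step is the sketch below, which walks an HTML string and records, for every word, the tags that enclose it. It is an illustration built on Python's standard `html.parser` module and is not the authors' design: the class name, the `VOID_TAGS` set, and the output format are assumptions, and the step that would turn each tag stack into an embedding added to BERT's input is left out.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed (assumed subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagStackParser(HTMLParser):
    """Collect (word, enclosing tags) pairs while tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (word, tuple of enclosing tags)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append((word, tuple(self.stack)))


def tag_stack_tokens(html):
    """Return every whitespace-separated word with a snapshot of its tag stack."""
    parser = TagStackParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


sample = "<table><tr><td>Seoul</td><td>9.7 million</td></tr></table><p>Source</p>"
pairs = tag_stack_tokens(sample)
```

Here the word "Seoul" is paired with `("table", "tr", "td")` and "Source" with `("p",)`, so a model could in principle distinguish table cells from body text even though both appear as plain words after tag stripping.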

Automatic Text Summarization based on Selective Copy mechanism against for Addressing OOV (미등록 어휘에 대한 선택적 복사를 적용한 문서 자동요약)

  • Lee, Tae-Seok;Seon, Choong-Nyoung;Jung, Youngim;Kang, Seung-Shik
    • Smart Media Journal / v.8 no.2 / pp.58-65 / 2019
  • Automatic text summarization is the process of shortening a text document by either extraction or abstraction. Recent work applies abstractive approaches based on deep learning methods that scale to large document collections. Abstractive text summarization relies on pre-generated word embedding information. Low-frequency but salient words, such as technical terms, are seldom included in dictionaries; this is the so-called out-of-vocabulary (OOV) problem. OOV words degrade the performance of neural Encoder-Decoder models. To address OOV words in abstractive text summarization, we propose a copy mechanism that facilitates copying new words in the target document and generating summary sentences. Unlike previous studies, the proposed approach combines accurate pointing information and a selective copy mechanism based on a bidirectional RNN and a bidirectional LSTM. In addition, a neural network gate model that estimates the generation probability and a loss function that optimizes the entire abstraction model have been applied. The dataset was constructed from a collection of abstracts and titles of journal articles. Experimental results demonstrate that both ROUGE-1 (based on word recall) and ROUGE-L (based on the longest common subsequence) of the proposed Encoder-Decoder model improved to 47.01 and 29.55, respectively.
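
To show how a copy mechanism lets a summarizer emit a word that is missing from its vocabulary, the sketch below mixes a vocabulary distribution with attention weights over the input words, weighted by a generation probability. This is the common pointer-generator style formulation, used here as an assumption; the paper's selective copy mechanism and gate model may differ in detail, and all names and numbers are invented for the example.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copying into one distribution over output words.

    p_gen         probability of generating from the vocabulary (gate output)
    vocab_probs   dict mapping in-vocabulary words to generation probabilities
    attention     attention weights over the input positions (sums to 1)
    source_tokens input words, aligned with the attention weights
    """
    # Generation share: scale every in-vocabulary word by p_gen.
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    # Copy share: each input word receives (1 - p_gen) times its attention,
    # which gives out-of-vocabulary input words a non-zero probability.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# "zeolite" is not in the toy vocabulary, so it can only be produced by copying.
vocab_probs = {"the": 0.5, "model": 0.3, "<unk>": 0.2}
source_tokens = ["the", "zeolite", "model"]
attention = [0.1, 0.7, 0.2]
final = final_distribution(0.4, vocab_probs, attention, source_tokens)
best = max(final, key=final.get)  # "zeolite", with probability 0.42
```

With a low generation probability and strong attention on the unknown word, the mixed distribution prefers copying "zeolite" over emitting the `<unk>` placeholder, which is the behaviour an OOV-aware summarizer is after.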

On Developing a Semantic Annotation Tool for Managing Metadata of Web Documents based on XMP and Ontology (웹 문서의 메타데이터 관리를 위한 XMP 및 온톨로지 기반의 시맨틱 어노테이션 지원도구 개발)

  • Yang, Kyoung-Mo;Hwang, Suk-Hyung;Choi, Sung-Hee
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.7 / pp.1585-1600 / 2009
  • The goal of the Semantic Web is to provide efficient and effective semantic search and web services based on machine-processable semantic information about web resources. Therefore, semantic annotation, the process of creating and adding computer-understandable metadata for a variety of web content, is one of the fundamental technologies of the Semantic Web. Recently, semantic annotation has mainly managed annotation metadata by embedding the metadata directly into the document. However, many semantic annotation tools for web documents work mainly with HTML documents, and most of these tools do not support semantic search using the metadata. In this paper, based on these problems and previous work, we propose the Ontology-based Semantic Annotation tool (OSA) to efficiently support semantic annotation for web documents (such as HTML and PDF). We define a semantic annotation model that represents ontological-semantic information using RDFS (RDF Schema). Based on the XMP (eXtensible Metadata Platform) standard, the model is encoded directly into the document. By using OSA with XMP, users can perform semantic annotation on web documents while keeping compatibility for managing annotation metadata. Eventually, the integrated semantic annotation metadata can be used effectively in semantic search for a variety of web content.