• Title/Summary/Keyword: Machine Reading Comprehension(MRC)


Korean Machine Reading Comprehension for Patent Consultation Using BERT (BERT를 이용한 한국어 특허상담 기계독해)

  • Min, Jae-Ok;Park, Jin-Woo;Jo, Yu-Jeong;Lee, Bong-Gun
    • KIPS Transactions on Software and Data Engineering / v.9 no.4 / pp.145-152 / 2020
  • Machine reading comprehension (MRC) is an NLP task that predicts the answer to a user's query by understanding a relevant document, and it can be used in automated consultation services such as chatbots. BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding), which has recently shown high performance across many areas of natural language processing, is trained in two phases: pre-training on a large corpus from the target domain, followed by fine-tuning the model for each downstream NLP task. In this paper, we build a Patent MRC dataset and show how to construct patent consultation training data for the MRC task. We also propose a method for improving MRC performance that combines a Patent-BERT model pre-trained on a patent consultation corpus with language processing algorithms suited to machine learning on patent counseling data. Experimental results show that the proposed method improves performance in answering patent counseling queries.

HTML Tag Depth Embedding: An Input Embedding Method of the BERT Model for Improving Web Document Reading Comprehension Performance (HTML 태그 깊이 임베딩: 웹 문서 기계 독해 성능 개선을 위한 BERT 모델의 입력 임베딩 기법)

  • Mok, Jin-Wang;Jang, Hyun Jae;Lee, Hyun-Seob
    • Journal of Internet of Things and Convergence / v.8 no.5 / pp.17-25 / 2022
  • Massive amounts of data are now being generated as the number of edge devices increases, and the number of raw, unstructured HTML documents in particular has grown. Machine reading comprehension (MRC), in which a natural language processing model finds the important information within an HTML document, is therefore becoming more important. In this paper, we propose HTDE (HTML Tag Depth Embedding), a method that allows BERT to learn the depth of the HTML document structure. HTDE builds a tag stack from the HTML document for each BERT input token and extracts the token's depth. We then add an HTML embedding layer, which takes the token's depth as input, to BERT's input embedding step. Because tokenization with HTDE identifies the HTML document structure through the relationships among surrounding tokens, HTDE improves BERT's accuracy on HTML documents. Finally, we demonstrate that the proposed method achieves higher accuracy than BERT's conventional embedding.
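
The tag-stack step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `TagDepthParser`, the function `token_depths`, and the use of whitespace tokens in place of BERT's WordPiece tokens are all assumptions made for illustration. It only shows how a stack of open tags yields a depth value per token, which a model could then feed to an extra embedding layer.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagDepthParser(HTMLParser):
    """Hypothetical helper: records (token, depth) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (token, depth) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closes.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        depth = len(self.stack)
        # Assumption: whitespace tokens stand in for WordPiece tokens.
        for word in data.split():
            self.tokens.append((word, depth))


def token_depths(html):
    """Return a list of (token, tag depth) pairs for an HTML string."""
    parser = TagDepthParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print(token_depths("<html><body><p>Hello <b>world</b></p></body></html>"))
```

In a real pipeline the depth of a word would have to be copied to each of its subword pieces so that the depth sequence lines up with the model's input ids.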

Risk Prediction Model of Legal Contract Based on Korean Machine Reading Comprehension (한국어 기계독해 기반 법률계약서 리스크 예측 모델)

  • Lee, Chi Hoon;Woo, Noh Ji;Jeong, Jae Hoon;Joo, Kyung Sik;Lee, Dong Hee
    • Journal of Information Technology Services / v.20 no.1 / pp.131-143 / 2021
  • Commercial transactions, one of the pillars of the capitalist economy, occur countless times every day, especially among small and medium-sized enterprises (SMEs). However, SMEs tend to be the legal underdogs in commercial contracts and do not receive the legal support needed for fair and legitimate transactions. When subcontracting contracts are concluded among SMEs, 58.2% of them do not use standard contracts and are signed without legal review. SMEs can be protected from legal threats if the risk of signing a contract is reduced by analyzing the various risks in the contract and by flagging toxic clauses and omitted contract terms in advance. We propose a risk prediction model for legal contracts based on machine reading comprehension, to minimize legal damage to small and medium-sized business owners in legal blind spots. To build a model specialized in legal contracts, we constructed our own set of legal questions and answers from publicly disclosed legal data. We applied fine-tuning and adversarial training to pre-trained machine reading comprehension models and verified them quantitatively using metrics such as EM and F1 score. The highest F1 score was 87.93, with an EM of 72.41.
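
For readers unfamiliar with the EM and F1 metrics reported here, the following is a minimal sketch of how span-extraction answers are commonly scored. The function names and the whitespace tokenization are assumptions; the paper does not state its exact scoring script, and Korean benchmarks often compute F1 over characters rather than whitespace tokens.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the predicted answer equals the reference after trimming, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("standard contract", "standard contract"))
    print(token_f1("the standard contract", "standard contract"))
```

Dataset-level scores are the averages of these per-question values, usually reported as percentages.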

KorPatELECTRA: A Pre-trained Language Model for Korean Patent Literature to improve performance in the field of natural language processing (Korean Patent ELECTRA)

  • Jang, Ji-Mo;Min, Jae-Ok;Noh, Han-Sung
    • Journal of the Korea Society of Computer and Information / v.27 no.2 / pp.15-23 / 2022
  • NLP (Natural Language Processing) in the patent field is challenging because of the linguistic specificity of patent literature, so there is an urgent need for research on language models optimized for Korean patent literature. Recently, there have been continuous attempts in NLP to build pre-trained language models for specific domains in order to improve performance on tasks in those domains. Among such models, ELECTRA is a pre-trained language model from Google that followed BERT and uses a new method called RTD (Replaced Token Detection) to increase training efficiency. This paper proposes KorPatELECTRA, which is pre-trained on a large amount of Korean patent literature. The pre-training was optimized by preprocessing the training corpus according to the characteristics of patent literature and by applying a patent-specific vocabulary and tokenizer. To confirm its performance, KorPatELECTRA was tested on NER (Named Entity Recognition), MRC (Machine Reading Comprehension), and patent classification tasks using actual patent data, and it achieved the best performance on all three tasks compared with general-purpose language models.

Korean TableQA: Structured data question answering based on span prediction style with S3-NET

  • Park, Cheoneum;Kim, Myungji;Park, Soyoon;Lim, Seungyoung;Lee, Jooyoul;Lee, Changki
    • ETRI Journal / v.42 no.6 / pp.899-911 / 2020
  • The data in tables are accurate and rich in information, which facilitates the performance of information extraction and question answering (QA) tasks. TableQA, which is based on tables, solves problems by understanding the table structure and searching for answers to questions. In this paper, we introduce both novice and intermediate Korean TableQA tasks that involve deducing the answer to a question from structured tabular data and using it to build a question answering pair. To solve Korean TableQA tasks, we use S3-NET, which has shown a good performance in machine reading comprehension (MRC), and propose a method of converting structured tabular data into a record format suitable for MRC. Our experimental results show that the proposed method outperforms a baseline in both the novice task (exact match (EM) 96.48% and F1 97.06%) and intermediate task (EM 99.30% and F1 99.55%).
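
The idea of converting tabular data into a record format that a span-prediction MRC model can read may be sketched as follows. The abstract does not specify the record format, so the `header: value` layout and the function name `table_to_records` are purely hypothetical; the sketch only shows the general flattening step.

```python
def table_to_records(headers, rows):
    """Flatten a table into one text passage, one 'header: value' record per row.

    Hypothetical format: the paper's actual record layout is not given in the
    abstract, so this is only one plausible serialization.
    """
    records = []
    for row in rows:
        pairs = [f"{header}: {value}" for header, value in zip(headers, row)]
        records.append(", ".join(pairs) + ".")
    return " ".join(records)


if __name__ == "__main__":
    passage = table_to_records(
        ["City", "Population"],
        [["Seoul", "9.7M"], ["Busan", "3.4M"]],
    )
    print(passage)
```

Because each cell value stays a contiguous substring of the passage, an extractive reader can return it as an answer span.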

Machine Reading Comprehension System to Solve Unanswerable Problems using Method of Mimicking Reading Comprehension Patterns (기계독해 시스템에서 답변 불가능 문제 해결을 위한 독해 패턴 모방 방법)

  • Lee, Yejin;Jang, Youngjin;Lee, Hyeon-gu;Shin, Dongwook;Park, Chanhoon;Kang, Inho;Kim, Harksoo
    • Annual Conference on Human and Language Technology / 2021.10a / pp.139-143 / 2021
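  • With the recent development of language models based on large corpora, systems that outperform humans have been proposed in various natural language processing fields. Accordingly, datasets for solving harder and more complex problems have been released; notably, for the machine reading comprehension task, datasets have been released to evaluate whether a system can judge that it cannot answer a question. Because judging that input data cannot be answered is an important problem in real applications, a variety of research has addressed it. In this paper, we propose a machine reading comprehension system that understands the document and can effectively identify unanswerable data. The proposed model is inspired by the human reading pattern in which a reader with a poor understanding of the document and the question fails to give the correct answer, and it aims to raise the system's level of document understanding. In experiments on the KLUE-MRC development data, the model achieved 71.73% EM and 76.80% ROUGE-W.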

Korean Machine Reading Comprehension using Continual Learning (Continual Learning을 이용한 한국어 기계독해)

  • Shin, JoongMin;Cho, Sanghyun;Choi, Jaehoon;Kwon, Hyuk-Chul
    • Annual Conference on Human and Language Technology / 2021.10a / pp.609-611 / 2021
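  • Machine reading comprehension is the problem of having a machine find and give the answer to a question within a given passage. In deep learning, when a model is trained on several datasets in sequence, the weights learned from earlier data gradually fade, and performance drops when the model is tested on that earlier data. This can be addressed with continual learning, which lets a model learn new data while retaining information from previously learned data. In this paper, we apply this approach to MRC and measure performance on the MRC dev set of KorQuAD 1.0, a Korean natural language processing task. The 50K model was trained for 10 stages on 50,000 examples randomly sampled from three datasets; the 50K-LWF model, obtained by further training it with Learning without Forgetting (a continual learning method), achieved F1 92.57 and EM 80.14. Compared with the BERT baseline model (F1 91.68, EM 79.92), this is an improvement of 0.89 in F1 and 0.22 in EM.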
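
As background for the Learning without Forgetting (LwF) method named in this abstract, the sketch below shows the general shape of an LwF-style objective: a normal task loss on the new data plus a distillation term that keeps the new model's softened outputs close to those of the previously trained model. It is an assumption-laden illustration, not the paper's setup: the function names, the weighting `lam`, the temperature of 2.0, and the use of a single logit vector (rather than separate start and end position logits for MRC) are all chosen here for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def lwf_loss(new_logits, gold_index, old_logits, lam=1.0, temperature=2.0):
    """Task loss on the new data plus a distillation term toward the old model.

    new_logits: outputs of the model being trained
    old_logits: outputs of the frozen, previously trained model on the same input
    lam, temperature: hypothetical hyperparameter values for this sketch
    """
    task = -math.log(softmax(new_logits)[gold_index])
    distill = cross_entropy(softmax(old_logits, temperature),
                            softmax(new_logits, temperature))
    return task + lam * distill


if __name__ == "__main__":
    print(lwf_loss([2.0, 0.5, -1.0], gold_index=0, old_logits=[1.5, 1.0, -0.5]))
```

The distillation term is smallest when the new model reproduces the old model's softened distribution, which is what discourages forgetting.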

Machine Reading Comprehension-based Q&A System in Educational Environment (교육환경에서의 기계독해 기반 질의응답 시스템)

  • Jun-Ha Ju;Sang-Hyun Park;Seung-Wan Nam;Kyung-Tae Lim
    • Annual Conference on Human and Language Technology / 2022.10a / pp.541-544 / 2022
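  • Since COVID-19, education has shifted from offline to online. However, online lecture services are limited in real-time communication. To address this shortcoming, this paper proposes a real-time lecture question answering system based on machine reading comprehension. To build the system, we fine-tuned BERT on the KorQuAD 1.0 training data and used the resulting model to construct a machine reading comprehension-based question answering system. However, because a chatbot built this way is not optimized for questions about lecture content, we constructed a sentence-form dataset of lecture-content questions and answers and performed additional training to solve the problem. The question-answer table in the experimental results confirms that performance on sentence-form answers improved.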

Korean Q&A Chatbot for COVID-19 News Domains Using Machine Reading Comprehension (기계 독해를 이용한 COVID-19 뉴스 도메인의 한국어 질의응답 챗봇)

  • Lee, Taemin;Park, Kinam;Park, Jeongbae;Jeong, Younghee;Chae, Jeongmin;Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2020.10a / pp.540-542 / 2020
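  • To meet the demand for checking a variety of information related to COVID-19, we designed and implemented a question answering chatbot based on Korean news data. The system is designed around three modules: a BM25-based document retriever, a document reader based on KoBERT (a pre-trained language model), and an answer generator. We collected news, wiki, and statistical information and implemented the system so that question answering is available through a web-based chatbot interface. The implementation can be accessed and used at http://demo.tmkor.com:36200/mrcv2.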
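
The BM25 retrieval step in this pipeline can be illustrated with a small self-contained scorer. This is a sketch under stated assumptions rather than the system's code: it uses whitespace tokenization (a Korean system would normally use a morphological analyzer), the common defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log((N - n + 0.5) / (n + 0.5) + 1)`; the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in term_freq:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Length normalization penalizes documents longer than average.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = ["covid vaccine news today", "stock market news", "weather forecast"]
    print(bm25_scores("covid vaccine", docs))
```

In a retriever-reader design such as the one described, the top-scoring documents would be passed to the reader model, which extracts the answer span.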
