Document Analysis based Main Requisite Extraction System

  • Lee, Jongwon (Department of Computer Engineering, Paichai University) ;
  • Yeo, Ilyeon (Department of Computer Engineering, Paichai University) ;
  • Jung, Hoekyung (Department of Computer Engineering, Paichai University)
  • Received : 2019.01.22
  • Accepted : 2019.02.14
  • Published : 2019.04.30

Abstract

In this paper, we propose a system for analyzing documents, such as papers and reports, written in XML format. The system extracts the keywords specified in a paper or report and shows them to the user; when the user then enters the keywords to search for within the document, it extracts the paragraphs containing each of those keywords. The system checks the frequency of the keywords entered by the user, calculates their weights, and removes the paragraphs that contain only the keyword with the lowest weight. It then divides the refined paragraphs into 10 regions, calculates the importance of the paragraphs in each region, compares the importance of the regions, and informs the user of the main region, the one with the highest importance. With these features, the proposed system can provide the main paragraphs at a higher compression ratio than existing document analysis systems do when analyzing papers or reports. This is expected to reduce the time required to understand a document.
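
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weight or importance formulas, so the sketch assumes that a keyword's weight is its share of all keyword occurrences, that a paragraph's importance is the weighted sum of its keyword counts, and that regions are contiguous chunks of the refined paragraphs. All function names are hypothetical, and plain substring matching stands in for whatever keyword matching the real system uses.

```python
from collections import Counter


def count_keywords(paragraph, keywords):
    """Count case-insensitive substring occurrences of each keyword."""
    text = paragraph.lower()
    return Counter({kw: text.count(kw.lower()) for kw in keywords})


def keyword_weights(paragraphs, keywords):
    """Assumed weighting: a keyword's share of all keyword occurrences."""
    totals = Counter()
    for p in paragraphs:
        totals.update(count_keywords(p, keywords))
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {kw: totals[kw] / grand_total for kw in keywords}


def refine(paragraphs, keywords, weights):
    """Keep paragraphs that contain a keyword, unless the only keyword
    they contain is the lowest-weight one (ties: first keyword wins)."""
    lowest = min(keywords, key=lambda kw: weights[kw])
    kept = []
    for p in paragraphs:
        present = {kw for kw, n in count_keywords(p, keywords).items() if n > 0}
        if not present:
            continue
        # With a single keyword there is nothing to compare, so skip the filter.
        if len(keywords) > 1 and present == {lowest}:
            continue
        kept.append(p)
    return kept


def split_regions(items, n_regions=10):
    """Split a list into n_regions contiguous, near-equal chunks."""
    size, extra = divmod(len(items), n_regions)
    regions, start = [], 0
    for i in range(n_regions):
        end = start + size + (1 if i < extra else 0)
        regions.append(items[start:end])
        start = end
    return regions


def main_region(paragraphs, keywords, n_regions=10):
    """Return (index, paragraphs, all scores) for the highest-scoring region."""
    weights = keyword_weights(paragraphs, keywords)
    refined = refine(paragraphs, keywords, weights)
    regions = split_regions(refined, n_regions)

    def importance(region):
        # Assumed importance: weighted sum of keyword counts in the region.
        return sum(
            weights[kw] * n
            for p in region
            for kw, n in count_keywords(p, keywords).items()
        )

    scores = [importance(r) for r in regions]
    best = max(range(n_regions), key=lambda i: scores[i])
    return best, regions[best], scores
```

Calling `main_region(paragraphs, ["xml", "weight", "region"])` on a list of paragraph strings returns the index of the highest-scoring region, the paragraphs in it, and the score of every region. When fewer than 10 paragraphs survive refinement, the trailing regions are empty and score zero.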

Keywords

Fig. 1 System Architecture

Fig. 2 System Flowchart

Fig. 3 Screen of Insert Keyword

Fig. 4 Screen of System Result

Fig. 5 Screen of Analysis Processing

Fig. 6 Screen of Centrality Output

Fig. 7 Test Result Graph 1

Fig. 8 Test Result Graph 2
