Transformation of Text Contents of Engineering Documents into an XML Document by using a Technique of Document Structure Extraction

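  • Sang-Ho Lee (Department of Civil and Environmental Engineering, Yonsei University)
  • Jun-Won Park (Department of Civil and Environmental Engineering, Yonsei University)
  • Sang Il Park (Department of Civil and Environmental Engineering, Yonsei University)
  • Bong-Geun Kim (Department of Civil and Environmental Engineering, Yonsei University)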
  • Received : 2011.06.09
  • Accepted : 2011.07.21
  • Published : 2011.12.31

Abstract

This paper proposes a method for transforming the unstructured text contents of engineering documents, which have a complex hierarchical structure of subtitles marked with various heading symbols, into a semi-structured XML document that follows the hierarchical subtitle structure. To extract the hierarchical structure from plain text, the study employs document structure extraction, a technique of document structure analysis. In addition, a method for processing enumerative text contents was developed to increase overall accuracy when extracting subtitles and constructing the hierarchical subtitle structure. An application module was developed based on the proposed method, and its performance was evaluated with 40 test documents containing structural calculation records of bridges. The first test group of 20 documents, which concern the superstructure of steel girder bridges and were also used in a previous study, served to verify the enhanced performance of the proposed method. The test results show that the new module is more accurate and reliable than the results reported in the previous study. The remaining 20 test documents were used to evaluate the applicability of the method. The final mean accuracy exceeded 99%, with a standard deviation of 1.52. These results demonstrate that the proposed method can be applied to the diverse heading symbols used in various types of engineering documents to represent the hierarchical subtitle structure in a semi-structured XML document.

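This study presents a method for converting the unstructured text information of engineering documents, such as structural calculation documents of bridges, which use several kinds of heading symbols and have a complex title hierarchy, into a semi-structured XML document that follows the title hierarchy. To extract the title hierarchy automatically from text information, a method based on document structure extraction, one of the document structure analysis techniques, was developed. In particular, a method for identifying enumerative (itemized) phrases was developed to improve the overall accuracy of title extraction and hierarchy classification for structural calculation documents. An application module based on the proposed method was developed, and its performance was evaluated on a total of 40 bridge structural calculation documents. First, for 20 structural calculation documents of steel girder superstructures, a comparison with the results of a previous study showed that the accuracy and reliability of the module developed in this study were improved. In addition, the applicability of the module was evaluated on 20 structural calculation documents for other structural types. The final accuracy of the document hierarchy analysis by the proposed method averaged 99% or higher with a standard deviation of 1.52, showing that the proposed method is also applicable to various engineering documents that use diverse forms of heading symbols to distinguish titles.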
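The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' module: the three heading-symbol styles, the `MAX_TITLE_LEN` cutoff, the sentence-ending check used as a stand-in for enumerative-text handling, and the `document`/`section`/`text` element names are all assumptions made for illustration. The nesting rule is also assumed: a symbol style that has not yet appeared on the current path opens a deeper level, and a style already on the path closes sections back to that level.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heading-symbol styles (assumptions, not the paper's rule set).
STYLES = [
    ("num_dot", re.compile(r"^(\d+\.)\s+(.+)$")),    # "1. Title"
    ("paren",   re.compile(r"^(\(\d+\))\s+(.+)$")),  # "(1) Title"
    ("rparen",  re.compile(r"^(\d+\))\s+(.+)$")),    # "1) Title"
]
MAX_TITLE_LEN = 40  # assumed cutoff: longer items are treated as body text


def classify(line):
    """Return (style, symbol, title) for a subtitle line, or None for body text."""
    for style, pattern in STYLES:
        m = pattern.match(line)
        if m:
            symbol, title = m.groups()
            # Crude stand-in for enumerative-text handling: an item that reads
            # like a sentence is kept as body text rather than a subtitle.
            if title.endswith(".") or len(title) > MAX_TITLE_LEN:
                return None
            return style, symbol, title
    return None


def to_xml(text):
    """Build a nested XML tree from plain text using the heading symbols."""
    root = ET.Element("document")
    stack = [("root", root)]  # (style, element) pairs along the current path
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        hit = classify(line)
        if hit is None:
            # Body text is attached to the innermost open section.
            ET.SubElement(stack[-1][1], "text").text = line
            continue
        style, symbol, title = hit
        styles_on_path = [s for s, _ in stack]
        if style in styles_on_path:
            # Same symbol style seen higher up: close sections back to its parent.
            del stack[styles_on_path.index(style):]
        section = ET.SubElement(stack[-1][1], "section", symbol=symbol, title=title)
        stack.append((style, section))
    return root


sample = """1. Design Conditions
(1) Loads
Dead load and live load are considered.
1) The dead load includes the self-weight of the girder and the slab.
(2) Materials
2. Section Analysis
"""
root = to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

In this sample, the line beginning with `1)` carries a heading symbol but ends like a sentence, so the sketch keeps it as body text under "(1) Loads" instead of opening a new section. Real structural calculation documents would need a much richer set of symbol styles and a more careful rule for enumerated items.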
