Kernelized Structure Feature for Discriminating Meaningful Table from Decorative Table

Kernel-Based Structural Features for Identifying Decorative Tables and Meaningful Tables

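  • 손정우 (Department of Computer Engineering, College of IT Engineering, Kyungpook National University) ;
  • 고준호 (Department of Computer Engineering, College of IT Engineering, Kyungpook National University) ;
  • 박성배 (Department of Computer Engineering, College of IT Engineering, Kyungpook National University) ;
  • 김권양 (Department of Computer Engineering, Kyungil University)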
  • Received : 2011.06.16
  • Accepted : 2011.09.25
  • Published : 2011.10.25

Abstract

This paper proposes a novel method for discriminating meaningful tables from decorative ones, using a composite kernel to handle the structural information of tables. The structural information of a table is extracted as two types of parse trees: a context tree and a table tree. A context tree captures the structure surrounding a table, while a table tree represents the structure within the table. A composite kernel, based on a parse tree kernel, is proposed to handle these two types of trees efficiently. Support vector machines equipped with the proposed kernel use this rich structural information to distinguish meaningful tables from decorative ones.

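This paper proposes a new method for separating meaningful web tables from decorative web tables, based on a composite kernel that exploits structural information. The structural information of a table is extracted from two kinds of parse trees. The context tree reflects the structure that appears around the table, and the table tree captures the structure inside the table. To handle the structural information expressed by the two trees effectively, a composite kernel based on the parse tree kernel is proposed. Support vector machines that use the proposed composite kernel classify meaningful and decorative tables by drawing on rich structural information.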
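
The abstract does not spell out how the two tree kernels are combined, so the following Python sketch is only an illustration under stated assumptions, not the authors' formulation. It counts shared subtrees in the style of a convolution parse tree kernel and then combines a context-tree kernel and a table-tree kernel as a weighted sum. The nested-tuple tree encoding, the `decay` and `alpha` parameters, the normalization step, and the weighted-sum combination are all hypothetical choices made for this example.

```python
from functools import lru_cache

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a leaf is a 1-tuple such as ("td",).

def nodes(tree):
    """Yield every internal node (a node with at least one child)."""
    if len(tree) > 1:
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """A node's label together with the ordered labels of its children."""
    return (node[0], tuple(child[0] for child in node[1:]))

def tree_kernel(t1, t2, decay=0.5):
    """Decay-weighted count of subtrees shared by t1 and t2."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        # Leaves and nodes with different productions share no subtree.
        if len(n1) == 1 or len(n2) == 1:
            return 0.0
        if production(n1) != production(n2):
            return 0.0
        score = decay
        for c1, c2 in zip(n1[1:], n2[1:]):
            score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

def normalized_kernel(a, b, decay=0.5):
    """Scale the tree kernel so that a tree compared with itself scores 1."""
    denom = (tree_kernel(a, a, decay) * tree_kernel(b, b, decay)) ** 0.5
    return tree_kernel(a, b, decay) / denom if denom else 0.0

def composite_kernel(x, y, alpha=0.5, decay=0.5):
    """Weighted sum of context-tree and table-tree kernels (an assumption)."""
    (ctx_x, tab_x), (ctx_y, tab_y) = x, y
    return (alpha * normalized_kernel(ctx_x, ctx_y, decay)
            + (1.0 - alpha) * normalized_kernel(tab_x, tab_y, decay))

# Hypothetical examples: each item is (context tree, table tree).
data_page = (
    ("div", ("h2",), ("table",)),
    ("table", ("tr", ("th",), ("th",)), ("tr", ("td",), ("td",))),
)
other_data_page = (
    ("div", ("h2",), ("table",)),
    ("table", ("tr", ("th",), ("th",)),
     ("tr", ("td",), ("td",)), ("tr", ("td",), ("td",))),
)
layout_page = (
    ("body", ("table",)),
    ("table", ("tr", ("td", ("img",)))),
)

print(composite_kernel(data_page, other_data_page))
print(composite_kernel(data_page, layout_page))
```

A Gram matrix built from `composite_kernel` over a labeled set of tables could then be handed to any support vector machine implementation that accepts precomputed kernels. A nonnegative weighted sum of valid kernels is itself a valid kernel, which is why this simple combination is a common first choice.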
