A Wrapper System for Extraction and Integration of Web Information  

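정재목 (School of Computer Science and Engineering, Seoul National University)
김형주 (School of Electrical and Computer Engineering, Seoul National University)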
Abstract
This paper describes the data model and software development of XWS, the XWEET Web-wrapper System for generating wrapper programs. To access information from various information sources, the source data must be converted and integrated into a common data model. XWS was developed as part of the XWEET project. We implemented XWS in the Perl programming language, with an emphasis on efficiency and ease of use. XWS has three distinguishing features. First, the data model and operators used for extracting information from HTML support a unified model of the different views of an HTML document. Second, it provides a user-friendly interface program that lets wrapper programmers generate wrappers easily. Third, XWS uses a high-level script language designed with an object-oriented methodology. We also present a detailed demonstration of its use in extracting article information from the DBLP site.
Keywords
XWEET; Web; HTML; WWW; XML;
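
The abstract describes wrappers that turn HTML pages into records of a common data model, using article extraction from a bibliography site as the demonstration. As a rough illustration of that idea only, the following minimal sketch uses Python's standard `html.parser` module rather than Perl or the XWS script language, neither of which is shown in this record. The sample markup, the field names (`author`, `title`, `year`), the placeholder values, and the `ArticleWrapper` and `extract_articles` names are all assumptions made for this example; they do not reflect the real DBLP page structure or the XWS operators.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely shaped like a bibliography listing.
# It is NOT the real DBLP page structure; all values are placeholders.
SAMPLE_HTML = """
<ul>
  <li><span class="author">Author One</span>
      <span class="title">First Title</span>
      <span class="year">2000</span></li>
  <li><span class="author">Author Two</span>
      <span class="title">Second Title</span>
      <span class="year">2001</span></li>
</ul>
"""

# Field names assumed for this sketch.
FIELDS = {"author", "title", "year"}


class ArticleWrapper(HTMLParser):
    """Builds one dict per <li>, keyed by the class of each <span> inside it."""

    def __init__(self):
        super().__init__()
        self.articles = []   # finished records
        self._record = None  # record being built, or None outside <li>
        self._field = None   # field being read, or None outside a known <span>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._record = {}
        elif tag == "span" and self._record is not None:
            cls = dict(attrs).get("class")
            if cls in FIELDS:
                self._field = cls
                self._record[cls] = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._field is not None:
            self._record[self._field] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._record[self._field] = self._record[self._field].strip()
            self._field = None
        elif tag == "li" and self._record is not None:
            self.articles.append(self._record)
            self._record = None


def extract_articles(html):
    """Run the wrapper over an HTML string and return the extracted records."""
    parser = ArticleWrapper()
    parser.feed(html)
    parser.close()
    return parser.articles


articles = extract_articles(SAMPLE_HTML)
```

Here `articles` holds one dictionary per list item, which a later integration step could map onto whatever shared data model the target system uses. A hand-written event handler like this one is tied to a single page layout, and that coupling is the kind of maintenance burden a wrapper-generation tool is meant to reduce.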