Web Information Extraction using HTML Tag Pattern

Park, Byung-Kwon

Proceedings of the Korea Association of Information Systems Conference (한국정보시스템학회:학술대회논문집)

2005.05a / Pages 79-92 / 2005

Korea Association of Information Systems (한국정보시스템학회)

HTML 태그페턴을 이용한 웹정보추출시스템 (Web Information Extraction System Using HTML Tag Patterns)

Park, Byung-Kwon (박병권) (Division of Management Information Science, Dong-A University)

Published : 2005.05.28


Abstract

To query the vast number of web pages available on the Internet, the information encoded in those pages must first be extracted and converted into structured data (e.g., relational data for SQL) or semistructured data (e.g., XML data for XQuery). In this paper, we propose a new web information extraction system, PIES, which converts web information into XML documents. PIES is based on a user-specified target schema and HTML tag pattern descriptions: the web information is extracted using the pattern descriptions and validated against the target schema. We designed a new language to describe extraction rules and a new regular expression to describe HTML tag patterns. We implemented PIES and applied it to the US patent web site to evaluate its correctness. It successfully extracted thousands of US patent records and converted them into XML documents.
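
The extract-then-validate flow described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce PIES, its rule language, or its regular expression notation, none of which are given here. The HTML fragment, the names `ROW_PATTERN` and `TARGET_SCHEMA`, and the element names are all hypothetical, and an ordinary Python regular expression stands in for an HTML tag pattern description.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment; not taken from the paper or the US patent site.
HTML = """
<table>
<tr><td>Patent No.</td><td>1234567</td></tr>
<tr><td>Title</td><td>Widget fastener</td></tr>
</table>
"""

# Hypothetical tag pattern: a table row with exactly two cells,
# the first holding a label and the second holding its value.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>(?P<label>[^<]+)</td>\s*<td>(?P<value>[^<]+)</td>\s*</tr>"
)

# Hypothetical target schema: maps each expected label to the XML
# element name that must appear in the output record.
TARGET_SCHEMA = {"Patent No.": "number", "Title": "title"}


def extract(html):
    """Build a <patent> element from every row that matches the tag pattern."""
    record = ET.Element("patent")
    for match in ROW_PATTERN.finditer(html):
        label = match.group("label").strip()
        if label in TARGET_SCHEMA:
            child = ET.SubElement(record, TARGET_SCHEMA[label])
            child.text = match.group("value").strip()
    return record


def validate(record):
    """Accept the record only if its element names match the schema's set."""
    found = {child.tag for child in record}
    return found == set(TARGET_SCHEMA.values())


record = extract(HTML)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
print("valid:", validate(record))
```

In this sketch a page that lacks one of the expected rows still yields an XML fragment, but `validate` rejects it, which mirrors the separation between pattern-driven extraction and schema-driven validation.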
