• Title/Summary/Keyword: HTML Documents

Web Information Extraction using HTML Tag Pattern (HTML 태그페턴을 이용한 웹정보추출시스템)

  • Park, Byung-Kwon
    • Proceedings of the Korea Association of Information Systems Conference
    • /
    • 2005.05a
    • /
    • pp.79-92
    • /
    • 2005
  • To query the vast number of web pages available on the Internet, it is necessary to extract the information encoded in those pages and convert it into structured data (e.g. relational data for SQL) or semistructured data (e.g. XML data for XQuery). In this paper, we propose a new web information extraction system, PIES, to convert web information into XML documents. PIES is based on a user-specified target schema and HTML tag pattern descriptions. The web information is extracted by the pattern descriptions and validated by the target schema. We designed a new language to describe extraction rules, and a new regular expression to describe HTML tag patterns. We implemented PIES and applied it to the US patent web site to evaluate its correctness. It successfully extracted thousands of US patent records and converted them into XML documents.

Improving Performance of Change Detection Algorithms through the Efficiency of Matching (대응효율성을 통한 변화 탐지 알고리즘의 성능 개선)

  • Lee, Suk-Kyoon;Kim, Dong-Ah
    • The KIPS Transactions:PartD
    • /
    • v.14D no.2
    • /
    • pp.145-156
    • /
    • 2007
  • Recently, the need for effective real-time change detection algorithms for XML/HTML documents has increased in fields such as detecting defacement attacks on web documents and version management. In particular, applications that perform real-time change detection over large numbers of XML/HTML documents require fast heuristic algorithms rather than algorithms that compute minimum-cost edit scripts. Existing heuristic algorithms are fast, but do not produce satisfactory edit scripts. In this paper, we present the existing algorithms XyDiff and X-tree Diff, analyze their problems, and propose the algorithm X-tree Diff+, which addresses them. X-tree Diff+ has execution time similar to that of the existing algorithms, but it improves the matching ratio between nodes of the two documents by refining the matching process based on the notion of efficiency of matching.
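
As a rough illustration of the kind of node matching such heuristic algorithms rely on, the sketch below hashes every subtree of two document versions and counts how many nodes of the old version have an identical subtree in the new one. This is a simplified stand-in, not XyDiff, X-tree Diff, or X-tree Diff+; the function names `subtree_hashes` and `match_ratio` and the sample documents are assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET

def subtree_hashes(root):
    """Map each element to a hash of its tag, text, and children's hashes."""
    hashes = {}

    def visit(node):
        h = hashlib.md5()
        h.update(node.tag.encode())
        h.update(b"\0")  # separator so tag and text cannot run together
        h.update((node.text or "").strip().encode())
        for child in node:
            h.update(b"\0")
            h.update(visit(child).encode())
        hashes[node] = h.hexdigest()
        return hashes[node]

    visit(root)
    return hashes

def match_ratio(old_xml, new_xml):
    """Fraction of old nodes whose whole subtree also appears in the new tree."""
    old_h = subtree_hashes(ET.fromstring(old_xml))
    new_h = subtree_hashes(ET.fromstring(new_xml))
    available = list(new_h.values())
    matched = 0
    for digest in old_h.values():
        if digest in available:
            available.remove(digest)  # each new node is matched at most once
            matched += 1
    return matched / len(old_h)

# Hypothetical sample versions: only the text of <b> changes.
old_doc = "<r><a>1</a><b>2</b></r>"
new_doc = "<r><a>1</a><b>3</b></r>"
ratio = match_ratio(old_doc, new_doc)
```

Here only `<a>` is matched exactly, because the change inside `<b>` also changes the hash of the root. Exact-subtree matching alone therefore leaves many nodes unmatched, which is the kind of gap a refined matching step would aim to close.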

A Comparative Study of XML and HTML: Focusing on Their Characteristics and Retrieval Functions (디지털도서관 문서양식으로서의 XML과 HTML의 특성 및 검색 기능 비교 연구)

  • 김현희;장혜원
    • Journal of the Korean Society for Information Management
    • /
    • v.16 no.2
    • /
    • pp.105-134
    • /
    • 1999
  • For efficient and precise searches in the Web environment, resources should be coded in a structured way. HTML does not cover semantic structure because of its fixed tagging. XML, which has emerged as an alternative standard markup language, uses custom tags that allow structural searching. Therefore, this study aims to compare XML with HTML in terms of their characteristics and retrieval functions. In order to test the retrieval functions of XML- and HTML-based systems, we constructed an experimental XML-based system. The XML-based system has several advantages over the HTML system. However, some improvements are needed to make the XML system more comprehensive and effective. First, XML document search engines with user-friendly interfaces are needed. Second, popular Web browsers such as Explorer and Communicator need to fully support the XML 1.0 specification. Third, an open DTD format, which will allow information retrieval systems to retrieve documents and compress them into one single format, is also needed to control Web documents more efficiently.

A Study on Tools to Develop Electronic Documents (전자문헌 개발도구에 관한 고찰 - SGML, HTML과 PDF를 중심으로 -)

  • Kim, Yong;NamKoong, Hwang
    • Journal of Information Management
    • /
    • v.29 no.1
    • /
    • pp.1-19
    • /
    • 1998
  • With advances in computing and networking technologies, national support for and attention to building digital libraries, which aim to overcome the limits of time and location in using information resources, are increasing. To accomplish the main goal of a digital library, which is to freely share and transfer information over a network, standardization in developing electronic documents is increasingly important. Several tools for developing electronic documents for use in digital libraries have been developed for documents used on the WWW, but none of them has an absolute advantage over the others; each has comparative advantages and disadvantages. This study reviews the features of SGML, HTML, and PDF, analyzes their comparative advantages and disadvantages for developing electronic documents in digital libraries, and proposes which electronic document formats suit which types of information resources.

The Design and Implementation of HTML-based Intelligent Help System (HTML 기반 지능형 도움말 시스템의 설계 및 구현)

  • 주예찬;권기항
    • Journal of Korea Multimedia Society
    • /
    • v.2 no.2
    • /
    • pp.120-128
    • /
    • 1999
  • This paper proposes the design and implementation of an HTML-based intelligent help system for application developers and users. In existing help systems, developers had to write the topics, index, and contents of the whole document by themselves. Furthermore, those files are linked into one project file in a precompiled form, so users cannot modify the topics and index information of their help documents. Especially in RAD environments, when new features or packages are announced, users should be able to access new help documents and replace existing ones with them, but in practice these procedures are very complex. The proposed help system is designed to analyze existing HTML documents, extract help data according to the user's interests, and let users author help content through the user interface. Removing the inconvenience of implementing a context-sensitive help system is also considered. In conclusion, the proposed system can be useful when adopted as the help system of a typical Java RAD tool such as Bluette.

User Profile Generation using Visual Differences of HTML Document (HTML 문서의 시각적 분석을 이용한 사용자 프로파일 생성)

  • Gwak, Ju-Hyeon;Lee, Chang-Hun
    • The Transactions of the Korea Information Processing Society
    • /
    • v.7 no.6
    • /
    • pp.1827-1833
    • /
    • 2000
  • In this study, I suggest how to improve the ability of web agents to find the web documents users prefer. Web agents employ TFIDF, which treats all the words used in a document as equal in importance when determining users' preferences. Web documents such as HTML, however, create visual differences by using different letter sizes and highlighting according to the importance of words. In this study, I attempt to improve web agents by differentiating the weight of each word in accordance with the visual importance of each paragraph. To this end, I suggest building a profile from each paragraph and consolidating the profiles later. I tested the suggested algorithms by comparing them with the established TFIDF algorithm on how well they help users find the documents they prefer.
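
The idea of weighting words by visual importance can be sketched as a TF-IDF variant in which each occurrence of a word counts for more when it appears in a more prominent HTML element. This is only a minimal illustration, not the paper's algorithm: the `TAG_WEIGHTS` values, the function names, and the sample documents are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical visual weights per HTML tag; the abstract gives no values.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "b": 1.5, "p": 1.0}

def weighted_tf(segments):
    """segments: list of (tag, text) pairs taken from one HTML document."""
    tf = Counter()
    for tag, text in segments:
        weight = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags count as plain text
        for word in re.findall(r"[a-z]+", text.lower()):
            tf[word] += weight
    return tf

def tfidf(docs):
    """docs: list of segment lists. Returns one {word: score} dict per document."""
    tfs = [weighted_tf(d) for d in docs]
    df = Counter(word for tf in tfs for word in tf)  # document frequency
    n = len(docs)
    return [{w: f * math.log(n / df[w]) for w, f in tf.items()} for tf in tfs]

# Hypothetical sample documents, already split into (tag, text) segments.
docs = [
    [("h1", "XML search"), ("p", "search engines index documents")],
    [("p", "documents about cooking")],
]
scores = tfidf(docs)
```

In the first sample document, "search" appears once in a heading and once in body text, so its weighted frequency is 4.0 rather than 2, and it scores higher than "engines", which appears only in body text.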

A Management Method for hierarchical Information Structures on Web Systems (계층적 정보 구조의 Web 시스템 관리 기술)

  • Choi, Yong-Jun;Lim, Kyung-Su;Hwang, Do-Sam;Kim, Chong-Gun
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.5
    • /
    • pp.1300-1310
    • /
    • 1998
  • Web information systems have many static HTML documents and dynamic CGI application programs. A hyperlinked information environment on Web systems includes many mutually referenced documents. This causes data consistency problems both within a document and among documents. To solve these problems, we propose a management method for Web systems that have a hierarchical information structure, and a unified problem-solving approach. We constructed a large-scale practical Web system based on the proposed architecture. The proposed results can provide many advantages to Webmasters.

Design of Document-HTML Generation Technique for Authorized Electronic Document Communication (공인전자문서 소통을 위한 Document-HTML 문서 생성 기법의 설계)

  • Hwang, Hyun-Cheon;Kim, Woo-Je
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.44 no.1
    • /
    • pp.51-59
    • /
    • 2021
  • Electronic document communication over digital channels is becoming increasingly important with the advent of the paperless age. Electronic documents in PDF format replace paper documents by providing content integrity and independence from particular devices and software, but they do not provide a powerful customer experience for mobile device users. Electronic documents in HTML5 format, on the other hand, offer an enhanced customer experience, such as responsive web technology for mobile device users, but are weak in content integrity because there is no HTML5 specification for it. In this paper, we design Document-HTML, which provides both content integrity and a powerful customer experience by declaring HTML5 constraint rules and extended tags that contain a PKI-based digital signature. We analyze existing electronic documents used in a major financial enterprise to develop a sample. We also verify Document-HTML by experimenting with the sample HTML electronic communication documents and analyze the PKI equation. In the future, Document-HTML documents can be used for authorized electronic document communication and provide a powerful customer experience in the mobile environment between an enterprise and a user.

Electronic Data Interchange System for Hospital Demand Using XML (XML을 이용한 요양기관 청구 전자문서거래(EDI) 시스템)

  • 김진호;김경태
    • Journal of Information Technology Applications and Management
    • /
    • v.9 no.1
    • /
    • pp.97-110
    • /
    • 2002
  • Many companies use EDI (Electronic Data Interchange) for the electronic transmission of documents and information to and from other companies. The appearance of the Internet can enhance existing EDI systems, which have several problems such as poor system interoperability and the high cost of VANs. This paper proposes a new Internet-based EDI system that provides an open communication environment by using XML (eXtensible Markup Language), and applies it to the EDI service for Hospital Demand. XML is a markup language extending HTML, the standard language for expressing WWW (World-Wide Web) pages. XML is more structured than HTML, so it is more suitable for the repetitive tasks of EDI and for the maintenance of databases. XML can transmit EDI documents in the open communication environment of the Internet, and users can easily access the documents with web browsers. Therefore, we can provide EDI services in a more open environment and build an EDI system at lower cost.
