• Title/Summary/Keyword: HTML parsing

An Extraction Method of Bibliographic Information from the US Patents: Using an HTML Parsing Technique (미국 특허 서지정보 추출 방법에 대한 연구: HTML 파싱 기법의 활용을 중심으로)

  • Han, Yoo-Jin;Oh, Seung-Woo
    • Journal of the Korean Society for Information Management / v.27 no.2 / pp.7-20 / 2010
  • This study aims to provide a method of extracting the most recent information on US patent documents. It adopts an HTML parsing technique that connects directly to the US Patent and Trademark Office (USPTO) Web page. After obtaining a list of 50 documents through a keyword search, the study proposes an algorithm, based on HTML parsing, that extracts the patent number, the applicant, and the US patent class information. It also presents an algorithm for extracting both patents and their subsequent patents by exploiting their closely connected relationship, which is a distinctive characteristic of US patent documents. Although the proposed method has several limitations, it can effectively supplement existing databases in terms of timeliness and comprehensiveness.
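
The abstract does not give the algorithm itself, so the following is only a minimal sketch of label-based field extraction with Python's standard `html.parser`. The page layout (a table of `<th>` label / `<td>` value rows), the label strings, and the names `LabelValueParser`, `extract_fields`, and `SAMPLE` are all assumptions made for illustration; they do not describe the actual USPTO markup or the authors' method.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collects <th>label</th><td>value</td> pairs (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._cell = None    # "th" or "td" while inside such a cell
        self._label = None   # most recent label waiting for its value
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._buf = []

    def handle_data(self, data):
        if self._cell:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        # Collapse runs of whitespace inside the cell text.
        text = " ".join("".join(self._buf).split())
        if tag == "th":
            self._label = text
        elif self._label is not None:
            self.fields[self._label] = text
            self._label = None
        self._cell = None


# Placeholder values, not real patent data.
SAMPLE = """
<table>
  <tr><th>Patent No.</th><td>0,000,001</td></tr>
  <tr><th>Applicant</th><td>Example  Corp.</td></tr>
  <tr><th>Current U.S. Class</th><td>000/001</td></tr>
</table>
"""


def extract_fields(html):
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = extract_fields(SAMPLE)
```

In practice the page would be fetched over HTTP first, and the label strings would have to be matched against whatever the live page actually uses.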

Implementation of HTML Filter in Wireless Application Protocol for Scalable Web Services (확장성 있는 웹 서비스를 위한 무선 응용 프로토콜 기반의 HTML Filter 구현)

  • 이승진;김대건;최린;강철희
    • Proceedings of the Korean Information Science Society Conference / 2001.04a / pp.391-393 / 2001
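  • This paper deals with the implementation of an HTML Filter for a WAP Gateway. It designs an HTML Filter architecture for converting Web content into WML documents suited to the wireless environment, and defines the functions of the associated RuleSet Database, Parsing Engine, and Markup Language Translator. Finally, for scalable Web services, it analyzes a performance evaluation of the implemented HTML Filter through experiments on actual Web content, and discusses points to consider during implementation and directions for future research.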

Design and Implemetation of EasyWeb that searching and sharing to Informations (정보 검색 및 공유가 가능한 EasyWeb 설계 및 구현)

  • Gang, Sang-Eun;Kim, Taek-Hwan;Kang, Min-Young;Joo, Ok-Chan;Kim, Jin-Mook
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.1411-1413 / 2011
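  • Existing browsers that make Internet searching convenient act passively, in response to user requests. Moreover, despite efforts to satisfy advanced search requirements such as RSS, it remains difficult for them to act proactively according to user needs. This study therefore designs and implements EasyWeb, an effective Web browser that supports proactive information retrieval and delivery such as RSS and conforms to the HTML 2.0 standard. Unlike existing browsers, the proposed EasyWeb browser can parse both HTML and XML so that pages are composed according to the standard specifications. It can also proactively collect and provide information in response to user needs. The implementation results of EasyWeb are expected to help explore the future direction of Web browsers.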

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information / v.20 no.7 / pp.1-7 / 2015
  • Although data on the Internet can be used in various fields, for example as a data source for IR (Information Retrieval), data mining, and knowledge information services, it contains a lot of unnecessary information. Removing the unnecessary data is a problem that must be solved before studying knowledge-based information services built on Web page data. In this paper, we solve the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the attribute data and the data elements in an XML document, the XPath analysis is carried out through XTractor. XTractor extracts the main text by HTML parsing, XPath grouping, and detecting the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively, except for a few cases, confirming that the main text of news pages can be extracted properly.
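
The three stages named in the abstract (HTML parsing, XPath grouping, detecting the XPath that holds the main data) can be illustrated roughly as follows. This is a sketch under assumptions, not XTractor itself: paths are simplified to tag names without indices or attributes, and "the group with the most text" is a stand-in heuristic chosen here for illustration. `PathTextParser`, `main_text`, and `PAGE` are hypothetical names.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
SKIP = {"script", "style"}                            # non-content text


class PathTextParser(HTMLParser):
    """Groups text nodes by a simplified XPath such as /html/body/div/p."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP.intersection(self.stack):
            self.groups["/" + "/".join(self.stack)].append(text)


def main_text(html):
    """Return (path, text) of the path group holding the most text."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    if not parser.groups:
        return None, ""
    best = max(parser.groups,
               key=lambda path: sum(len(t) for t in parser.groups[path]))
    return best, " ".join(parser.groups[best])


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/s">Sports</a></div>
<div id="article"><p>First paragraph of the story, with some length.</p>
<p>Second paragraph continues the story.</p></div>
<div id="footer"><span>Copyright</span></div>
</body></html>
"""

path, text = main_text(PAGE)
```

Grouping by path is what lets the two article paragraphs count together against the navigation links and the footer, even though all three live under sibling `div` elements.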

Development of Internet tools and web site for the visual disabled (시각장애인을 위한 Web Site 구축에 관한 연구)

  • 고민수;김보성;길세기;김낙환;장영건;홍승홍
    • Proceedings of the IEEK Conference / 2000.06e / pp.214-217 / 2000
  • To help blind users find information easily on the World Wide Web, this research developed a tool that converts general-purpose HTML into HTML for the blind. The program consists of the following components: 1. An Internet browser and a Web robot that gathers general HTML. 2. A parsing process that stores the gathered HTML in a database. 3. A multimedia editor that uses the Web DB and adds text and audio descriptions as part of its editing function. 4. A converter that takes the DB contents and turns them into HTML for the blind. The project is designed to make it easy for a site manager to set up a Web site for the blind. We expect that this program will help the blind overcome inequality in access to common information.

FastIO: High Speed Launching of Smart TV Apps (FastIO: 스마트 TV 앱의 고속 구동 기법)

  • Lee, Cheolhee;Hwang, Taeho;Won, Youjip;Lee, Seongjin
    • Journal of KIISE / v.43 no.7 / pp.725-735 / 2016
  • Smart TVs use WebKit as the Web browser engine to provide content such as Web surfing, VOD, and games. WebKit uses Web resources, such as HTML, CSS, JavaScript, and images, to run applications. At the start of an application, WebKit loads the resources into memory and creates the DOM tree and the render tree, which is a time-consuming process. However, the DOM tree and render tree created by a smart TV application do not change over time, because the application uses Web resources stored on disk. If the DOM tree and render tree can be stored and reused, the loading time of an application can be reduced. In this paper, we propose FastIO, a technique that selectively adds persistency to dynamically allocated memory. FastIO reduces overall application loading time by eliminating the steps of loading resources from storage, parsing the HTML documents, and creating the DOM tree and render tree. A comparison of application resource loading times indicates that the Web browser with FastIO is 7.9x, 44.8x, and 2.9x faster than the legacy Web browser in SSD, Ramdisk, and eMMC environments, respectively.

Algorithm Embodiment for XQuery2SQL Converter (XQuery2SQL 변환기 위한 알고리즘 구현)

  • 서현호;김영국;김덕만
    • Proceedings of the Korea Contents Association Conference / 2004.05a / pp.335-341 / 2004
  • With the rapid growth of Internet use and of the volume of information, HTML, the language at the center of Web technology, has shown its limits for making use of information on the Web. As an alternative, XML, which expresses the meaning and interrelation of the data itself, appeared as a W3C standard for free document transmission and exchange on the World Wide Web. There have been many efforts to store XML documents in an RDBMS, but because an XML document is structurally a tree, it is difficult to query it fully with SQL, the query language of relational databases. In this paper, XML documents are stored in an RDBMS via parsing and a DOM tree, XQuery is converted into SQL through a converter called XQuery2SQL, and we implement the XQuery2SQL conversion algorithm that retrieves the information from the RDBMS.

A Structured Markup Language for the Object-Oriented Representation and Management of Decision Models on the Web (웹상에서의 의사결정모형의 객체지향적 표현과 관리를 위한 구조적 마크업 언어)

  • Kim, Hyoung-Do
    • Asia Pacific Journal of Information Systems / v.8 no.2 / pp.53-67 / 1998
  • The explosive growth of the Web is giving end users access to ever-increasing volumes of information. The resources of legacy systems and relational databases have also been made available to the Web browser, which has become an essential business tool. Recently, model management on the Internet/Web has also been proposed, with conceptual designs or prototype systems such as DecisionNet and DSS Web. However, these systems suffer from the same symptoms as the Web. Although we can identify the elements of a page with HTML tags and declare the relationships among the various document elements, they are semantically opaque to computer systems and have no domain-specific meaning. Moreover, HTML is not extensible, so developers are forced to invent convoluted, non-standard solutions for embedding and parsing data. Extensible Markup Language (XML) is a simplified subset of SGML that offers many benefits to those who want to improve the structure, maintainability, searchability, presentation, and other aspects of their document management. This paper proposes a structured markup language for model representation and management on the Web as an XML application. The language is based on a conceptual modeling framework, Object-Oriented Structured Modeling (OOSM), which is an extension of structured modeling.

URL Signatures for Improving URL Normalization (URL 정규화 향상을 위한 URL 서명)

  • Soon, Lay-Ki;Lee, Sang-Ho
    • Journal of KIISE:Databases / v.36 no.2 / pp.139-149 / 2009
  • In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to complement the standard URL normalization by incorporating the semantically meaningful metadata of the web pages. The metadata taken into consideration are the body texts and the page size of the web pages, which can be extracted during HTML parsing. The results from our first exploratory experiment indicate that the body texts are effective in identifying equivalent URLs. Hence, given a URL which has undergone the standard normalization, we construct its URL signature by hashing the body text of the associated web page using Message-Digest algorithm 5 in the second experiment. URLs which share identical signatures are considered to be equivalent in our scheme. The results in the second experiment show that our proposed URL signatures were able to further reduce redundant URLs by 32.94% in comparison with the standard URL normalization.
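
The signature idea described above (hash the body text of the page behind an already-normalized URL with MD5, then treat URLs with identical hashes as equivalent) can be sketched as follows. The abstract does not specify how body text is extracted or cleaned, so the whitespace collapsing and the skipping of `script`/`style` content are assumptions, and `normalize_url` applies only a few common syntactic steps rather than the full standard procedure. All function and class names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


class BodyText(HTMLParser):
    """Collects text inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def normalize_url(url):
    """Lowercase scheme/host, drop default port and fragment, fix empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def url_signature(html):
    """MD5 hex digest of the whitespace-collapsed body text (assumed cleanup)."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    body = " ".join("".join(parser.chunks).split())
    return hashlib.md5(body.encode("utf-8")).hexdigest()


page_a = "<html><head><title>A</title></head><body><p>Hello  world</p></body></html>"
page_b = "<html><head><title>B</title></head><body><p>Hello world</p>\n</body></html>"
page_c = "<html><body><p>Something else</p></body></html>"
same = url_signature(page_a) == url_signature(page_b)
```

MD5 is used here only because the abstract names it; it serves as a content fingerprint, not as a security measure.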

Preparation of Soil Input Files to a Crop Model Using the Korean Soil Information System (흙토람 데이터베이스를 활용한 작물 모델의 토양입력자료 생성)

  • Yoo, Byoung Hyun;Kim, Kwang Soo
    • Korean Journal of Agricultural and Forest Meteorology / v.19 no.3 / pp.174-179 / 2017
  • Soil parameters are required inputs to crop models, which estimate crop yield under given environmental conditions. The Korean Soil Information System (KSIS), which provides detailed soil profile records of 390 soil series in HTML (HyperText Markup Language) format, would be useful for preparing soil input files. The Korean Soil Information System Processing Tool (KSISPT) was developed to aid the generation of soil input data from the KSIS database. The tool was implemented in Java and consists of a set of modules for parsing the HTML documents of the KSIS, storing the data required for preparing a soil input file, calculating additional soil parameters, and writing the soil input file to a local disk. Using this automated soil data preparation tool, about 940 soil input data were created for each of the DSSAT and ORYZA 2000 models. In combination with a soil series distribution map at 30 m resolution, this enables spatial analysis of projected crop yield under climate change, which would help the development of adaptation strategies.
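
The four module roles listed above (parse, store, calculate, write) can be illustrated with a small pipeline. The original tool is written in Java; this sketch uses Python for brevity and is not a port of KSISPT. The table layout, the column names (`top_cm`, `bottom_cm`, `sand_pct`), the derived `thickness_cm` value, and the CSV output are all hypothetical; the real KSIS pages and the DSSAT and ORYZA 2000 input formats are not reproduced here.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Parsing step: collect the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up layer data in a made-up layout.
SAMPLE = """
<table>
<tr><th>top_cm</th><th>bottom_cm</th><th>sand_pct</th></tr>
<tr><td>0</td><td>20</td><td>40.5</td></tr>
<tr><td>20</td><td>50</td><td>35.0</td></tr>
</table>
"""


def parse_layers(html):
    """Storing and calculating steps: rows become dicts plus one derived value."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    layers = [dict(zip(header, map(float, row))) for row in body]
    for layer in layers:
        layer["thickness_cm"] = layer["bottom_cm"] - layer["top_cm"]
    return layers


def write_layers(layers, out):
    """Writing step: emit the layers as CSV to any text stream."""
    writer = csv.DictWriter(out, fieldnames=list(layers[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(layers)


layers = parse_layers(SAMPLE)
buf = io.StringIO()
write_layers(layers, buf)
```

Passing an open file instead of `io.StringIO()` to `write_layers` would write the result to a local disk, as the abstract describes.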