• Title/Summary/Keyword: web content extraction

Main Content Extraction from Web Pages Based on Node Characteristics

  • Liu, Qingtang;Shao, Mingbo;Wu, Linjing;Zhao, Gang;Fan, Guilin;Li, Jun
    • Journal of Computing Science and Engineering / v.11 no.2 / pp.39-48 / 2017
  • Main content extraction from web pages is widely used in search engines, web content aggregation, and mobile Internet browsing. However, web pages contain a large amount of irrelevant information, such as advertisements, unrelated navigation, and junk content, which reduces the efficiency of web content processing in content-based applications. This paper proposes an automatic method for extracting the main content of web pages. The method uses two indicators to describe the characteristics of web pages: text density and hyperlink density. Because similar content tends to be distributed continuously on a page, an estimation algorithm judges whether a node is a content node or a noise node based on the characteristics of the node and its neighboring nodes. This algorithm makes it possible to filter out advertisement nodes and irrelevant navigation. Experimental results on 10 news websites showed that the algorithm achieved an average acceptable rate of 96.34%.
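
The two indicators named in the abstract can be illustrated with a short Python sketch. The abstract does not give the paper's exact formulas, so the definitions here are assumptions: text density is taken as characters of text per tag in a node's subtree, and hyperlink density as the share of a node's text that sits inside `<a>` elements. The names `parse_html`, `text_density`, `hyperlink_density`, and `is_content_node`, and both thresholds, are hypothetical, and the neighbor-based estimation step described in the abstract is not reproduced.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = 0    # characters of text directly inside this node
        self.text_len = 0    # characters of text in the whole subtree
        self.link_len = 0    # characters of subtree text that sit inside <a>
        self.tag_count = 1   # tags in the subtree, including this one


class TreeBuilder(HTMLParser):
    """Builds a minimal node tree; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore closing tags with no open match
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in ("script", "style"):
            return  # code and CSS are not page text
        self.current.own_text += len(data.strip())


def aggregate(node, in_link=False):
    """Fill in subtree totals bottom-up."""
    in_link = in_link or node.tag == "a"
    node.text_len = node.own_text
    node.link_len = node.own_text if in_link else 0
    node.tag_count = 1
    for child in node.children:
        aggregate(child, in_link)
        node.text_len += child.text_len
        node.link_len += child.link_len
        node.tag_count += child.tag_count


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    aggregate(builder.root)
    return builder.root


def text_density(node):
    # Assumed definition: characters of text per tag in the subtree.
    return node.text_len / node.tag_count


def hyperlink_density(node):
    # Assumed definition: fraction of the subtree's text that is link text.
    return node.link_len / node.text_len if node.text_len else 0.0


def is_content_node(node, min_text_density=10.0, max_link_density=0.5):
    # Both thresholds are arbitrary illustration values, not from the paper.
    return (text_density(node) >= min_text_density
            and hyperlink_density(node) <= max_link_density)
```

With this reading, a navigation bar made only of links has a hyperlink density of 1.0 and a low text density, while a paragraph-heavy block scores the opposite way on both indicators.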

Korean Web Content Extraction using Tag Rank Position and Gradient Boosting (태그 서열 위치와 경사 부스팅을 활용한 한국어 웹 본문 추출)

  • Mo, Jonghoon;Yu, Jae-Myung
    • Journal of KIISE / v.44 no.6 / pp.581-586 / 2017
  • For automatic web scraping, unnecessary components such as menus and advertisements need to be removed from web pages, and the main content should be extracted automatically. A content block tends to be located in the middle of a web page. Korean web documents in particular rarely include metadata and have complex designs, so a suitable content extraction method is needed. Existing content extraction algorithms use the textual and structural features of content blocks because processing visual features requires heavy computation for rendering and image processing. In this paper, we propose a new content extraction method that uses tag positions in HTML as a quasi-visual feature. In addition, we develop the tag rank position, a type of tag position that is not affected by text length, and show that gradient boosting with the tag rank position is a very accurate content extraction method. The results show that the method can be used to collect high-quality text data automatically from a variety of web pages.
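
The abstract does not define the tag rank position precisely. The Python sketch below shows one possible reading, offered only as an assumption: a tag's character position is its offset divided by the document length, while its rank position is its ordinal index among all opening tags divided by the number of tags, which stays the same when text grows or shrinks. The function name `tag_positions` and the regex-based tag scan are hypothetical simplifications, and the gradient boosting classifier is not shown.

```python
import re

# Matches opening tags only; closing tags, comments, and doctype are skipped.
OPEN_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tag_positions(html):
    """Return one feature dict per opening tag, in document order."""
    matches = list(OPEN_TAG_RE.finditer(html))
    total = len(matches)
    features = []
    for rank, match in enumerate(matches):
        features.append({
            "tag": match.group(1).lower(),
            # Depends on how much text precedes the tag.
            "char_position": match.start() / len(html),
            # Depends only on the tag's order among tags (assumed definition).
            "rank_position": rank / (total - 1) if total > 1 else 0.0,
        })
    return features
```

Two pages with the same tag sequence but different amounts of text produce identical `rank_position` values and different `char_position` values, which is the property the abstract attributes to the feature.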

A Study on Extracting News Contents from News Web Pages (뉴스 웹 페이지에서 기사 본문 추출에 관한 연구)

  • Lee, Yong-Gu
    • Journal of the Korean Society for Information Management / v.26 no.1 / pp.305-320 / 2009
  • News pages provided through the web contain unnecessary information, which lowers the performance and efficiency of news processing systems. This study proposes news content extraction methods based on sentence identification and on the block-level tags of news web pages. To obtain optimal performance, combinations of these methods were applied. The results showed good performance for an extraction method that applied sentence identification and eliminated hyperlink text from web pages. This method showed even better results when combined with the extraction method that used block-level tags. Extraction methods that used sentence identification were effective for raising extraction recall.
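
A rough Python sketch of how the three ideas in this abstract could be combined is given below. It is not the study's implementation: the regular expressions, the list of block-level tags, the sentence heuristic (at least 20 characters followed by a sentence terminator), and the function name `extract_news_text` are all assumptions made for illustration, and script or style content is not handled.

```python
import re

# Whole <a>...</a> elements, so that hyperlink text is dropped with the tag.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# A small, assumed set of block-level tags used as block boundaries.
BLOCK_TAG_RE = re.compile(
    r"</?(?:p|div|br|li|td|tr|table|h[1-6]|ul|ol)\b[^>]*>", re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]+>")
# Crude sentence test: 20+ non-terminator characters, then . ! or ?
SENTENCE_RE = re.compile(r"[^.!?]{20,}[.!?](?:\s|$)")


def extract_news_text(html, min_sentences=1):
    """Return the text blocks that look like article sentences."""
    html = ANCHOR_RE.sub(" ", html)          # eliminate hyperlink text
    blocks = BLOCK_TAG_RE.split(html)        # cut the page at block-level tags
    kept = []
    for block in blocks:
        # Strip remaining inline tags and normalize whitespace.
        text = " ".join(ANY_TAG_RE.sub(" ", block).split())
        if len(SENTENCE_RE.findall(text)) >= min_sentences:
            kept.append(text)
    return kept
```

Menus and footers usually fail the sentence test because they are short labels without terminating punctuation, so only sentence-bearing blocks survive.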

Design and Implementation of Web Crawler with Real-Time Keyword Extraction based on the RAKE Algorithm

  • Zhang, Fei;Jang, Sunggyun;Joe, Inwhee
    • Proceedings of the Korea Information Processing Society Conference / 2017.11a / pp.395-398 / 2017
  • In this paper, we propose a web crawler system with a keyword extraction function. Existing research on keyword extraction in text mining is mostly based on databases of documents or corpora that have already been collected. The purpose of this paper, in contrast, is to establish a real-time keyword extraction system that extracts the keywords of a text while the web page is being crawled and stores the text and keywords together in the database. We design and implement a crawler that incorporates the RAKE keyword extraction algorithm, so that keywords are extracted from the content as the page content is fetched. The performance of the RAKE algorithm is improved by increasing the weight of important features (such as nouns appearing in the title). The experimental results show that this method is superior to the existing method and that it extracts keywords satisfactorily.
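
The following Python sketch shows the core of RAKE (candidate phrases split at stopwords and punctuation, word score = degree / frequency, phrase score = sum of word scores) with a simple title boost added. The boost mechanism, the `title_boost` parameter, the tiny stopword list, and the function names are assumptions for illustration; the abstract does not say how the authors weight title nouns, and no part-of-speech tagging is done here.

```python
import re
from collections import defaultdict

# A deliberately small stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is",
             "are", "with", "by", "it", "this", "that", "from", "as", "at"}


def candidate_phrases(text):
    """Split text into word lists, breaking at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        phrase = []
        for word in fragment.split():
            if word in STOPWORDS:
                if phrase:
                    phrases.append(phrase)
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            phrases.append(phrase)
    return phrases


def rake(text, title="", title_boost=2.0):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself

    title_words = {w for p in candidate_phrases(title) for w in p}
    word_score = {}
    for word in freq:
        score = degree[word] / freq[word]
        if word in title_words:
            score *= title_boost  # assumed way of weighting title words
        word_score[word] = score

    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

In a crawler, `rake(page_text, title=page_title)` could be called right after a page is fetched, so the keywords are stored alongside the text in the same database write.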

The Concept and Application Methods of Intelligent Content

  • Yoon Yong-Bae;Chae Song-Hwa;Kim Won-Il
    • International Journal of Contents / v.2 no.3 / pp.1-5 / 2006
  • Intelligent Content is defined as detailed information, or a fragment of content, that contains a semantic data structure. This semantic structure makes various intelligent operations possible. There is a wide range of content-oriented applications, such as classification, retrieval, extraction, translation, presentation, and question answering. The concept of Intelligent Content is applied to various fields such as MPEG and the Semantic Web. In this paper, we discuss several important studies of Intelligent Content and how to apply the concept to these fields.

A Study on the Real-time Distributed Content-based Web Image Retrieval System using PC Cluster (PC 클러스터를 이용한 실시간 분산 웹 영상 내용기반 검색 시스템에 관한 연구)

  • 이은애;하석운
    • Journal of Korea Multimedia Society / v.4 no.6 / pp.534-542 / 2001
  • Recent content-based image retrieval systems rely on a single local server containing a limited number of images, so they do not satisfy Web users who request a wide variety of images on the Web. A content-based image retrieval system that covers a large number of Web images must above all operate in real time. To implement such a system, the large amount of time spent on image collection and feature extraction has to be reduced. Recently, PC clusters with load distribution have been implemented for high-performance data processing. In this paper, we decreased the overall retrieval time by distributing the time-consuming tasks of image collection and feature extraction among the slave computers of a PC cluster, and thereby showed that real-time retrieval of Web images is possible.

Cloth Product Recognition based on Siamese Network with Body Region Extraction method

  • Budiman, Sutanto Edward;Kurniawan, Edwin;Lee, Seung Heon;Lee, Jae Seung;Lee, Suk-Ho
    • International journal of advanced smart convergence / v.11 no.2 / pp.128-134 / 2022
  • Nowadays, people consume a lot of content such as web dramas and K-pop videos on mobile devices such as smartphones, and the market for indirect advertising through this content is growing every year. To turn indirect advertising in web dramas into immediate purchases, a system is needed that lets consumers buy products at the moment they appear in the drama. In this paper, we propose a system that allows viewers to purchase products worn by celebrities as soon as they see and click on them. When a user clicks on a video, the system recognizes the product worn by the celebrity and displays on-screen information about the most similar product, allowing the user to go to the seller's site and purchase it. To make such a system operate stably, we propose a system based on pose estimation and a Siamese network. The proposed system will primarily be released as a streaming service, in the form of an app or web page, that connects the products in web dramas or other K-pop video content shown on mobile devices with e-commerce. In the future, the technology is expected to be used globally in various industries such as smart mobility and display kiosks.

Topic-Specific Mobile Web Contents Adaptation (주제기반 모바일 웹 콘텐츠 적응화)

  • Lee, Eun-Shil;Kang, Jin-Beom;Choi, Joong-Min
    • Journal of KIISE:Software and Applications / v.34 no.6 / pp.539-548 / 2007
  • Mobile content adaptation is a technology for effectively presenting content originally built for the desktop PC on wireless mobile devices. Previous approaches to Web content adaptation are mostly device-dependent, and the content transformation to suit a smaller device is done manually. Furthermore, the same content is provided to different users regardless of their individual preferences. As a result, the user has difficulty selecting relevant information from a large volume of content, since context information related to the content is not provided. To resolve these problems, this paper proposes an enhanced method of Web content adaptation for mobile devices. In our system, the process of Web content adaptation consists of 4 stages: block filtering, block title extraction, block content summarization, and personalization through learning. Learning is initiated when the user selects the full content menu from the content summary page. As a result of learning, personalization is realized by showing the information for the relevant block at the top of the content list. A series of experiments was performed to evaluate the content adaptation on a number of Web sites, including online newspapers. The evaluation results are satisfactory, both in block filtering accuracy and in user satisfaction with personalization.

An effective approach to generate Wikipedia infobox of movie domain using semi-structured data

  • Bhuiyan, Hanif;Oh, Kyeong-Jin;Hong, Myung-Duk;Jo, Geun-Sik
    • Journal of Internet Computing and Services / v.18 no.3 / pp.49-61 / 2017
  • Wikipedia infoboxes have emerged as an important structured information source on the web. Composing an infobox for an article requires a considerable amount of manual effort from the author. Because of this manual involvement, infoboxes suffer from inconsistency, data heterogeneity, incompleteness, schema drift, and similar problems. Prior work attempted to solve these problems by generating infoboxes automatically from the corresponding article text. However, many Wikipedia articles do not have enough text content to generate an infobox. In this paper, we present an automated approach that generates infoboxes for the movie domain of Wikipedia by extracting information from several web sources instead of relying on article text alone. The proposed methodology uses semantic relations in the article content together with semi-structured information available on the web. It processes the article text through several classification steps to identify the template from a large pool of templates. Finally, it extracts the information for the corresponding template attributes from the web and thus generates the infobox. A comprehensive experimental evaluation demonstrated that the proposed scheme is an effective and efficient approach to generating Wikipedia infoboxes.

Case Study to Setup Web-Service Strategy of National Wind Atlas (해외사례 분석을 통한 국가바람지도 웹서비스 전략수립)

  • Kim, Hyun-Goo;Hwang, Hyo-Jung
    • New & Renewable Energy / v.5 no.4 / pp.3-8 / 2009
  • This global case study aims to diversify and strengthen the application system of the national wind atlas, which has been developed to support national strategy building and the promotion of wind energy dissemination. We chose the national wind atlases of nine countries and compared their map area, extraction height, temporal and spatial resolutions, download services, and so on, to derive best practices for the Korea wind atlas application system. Based on this comparison, the web service content is designed to offer high-resolution height information covering the wind turbine rotor swept area, as well as time-series datasets that users can download for further analysis. The system and web service are expected to contribute greatly to wind energy policy making and to the business and research sectors.
