• Title/Summary/Keyword: Web Scraping

A Brief Survey into the Field of Automatic Image Dataset Generation through Web Scraping and Query Expansion

  • Bart Dikmans;Dongwann Kang
    • Journal of Information Processing Systems / v.19 no.5 / pp.602-613 / 2023
  • High-quality image datasets are in high demand for various applications. With many online sources providing manually collected datasets, a persisting challenge is to fully automate the dataset collection process. In this study, we survey the field of automatic image dataset generation by analyzing a collection of existing studies. Moreover, we examine fields closely related to automated dataset generation, such as query expansion, web scraping, and dataset quality. We assess how both noise and regional search-engine differences can be addressed using automated search-query expansion focused on hypernyms, while still allowing for user-specific manual query expansion. Combining these aspects provides an outline of how a modern web scraping application can produce large-scale image datasets.
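
As a concrete illustration of the hypernym-driven query expansion the survey discusses, here is a minimal sketch using NLTK's WordNet interface. The survey does not prescribe a specific library, so the function name and example term are assumptions for illustration only (NLTK's WordNet corpus must be downloaded first).

```python
# Minimal sketch of hypernym-based query expansion (illustrative, not the
# survey's implementation). Requires: pip install nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_query(term: str, max_terms: int = 5) -> list[str]:
    """Expand an image-search term with WordNet hypernyms so queries
    disambiguate the intended sense and cover broader context."""
    queries = [term]
    for synset in wn.synsets(term, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            # e.g. "jaguar" -> "jaguar big cat" vs. "jaguar car"
            name = hypernym.lemmas()[0].name().replace("_", " ")
            expanded = f"{term} {name}"
            if expanded not in queries:
                queries.append(expanded)
            if len(queries) >= max_terms:
                return queries
    return queries

print(expand_query("jaguar"))  # separates the animal and car senses
```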

An Implementation and Performance Evaluation of Fast Web Crawler with Python

  • Kim, Cheong Ghil
    • Journal of the Semiconductor & Display Technology / v.18 no.3 / pp.140-143 / 2019
  • The Internet has expanded constantly and greatly, so that there is now a vast number of dynamically changing web pages. In particular, the fast development of wireless communication technology and the wide spread of smart devices enable information to be created and changed anywhere, at any time. In this situation, web crawling, also known as web scraping, an organized, automated computer process for systematically navigating web pages and for automatically searching and indexing information, is used broadly in many fields today. This paper aims to implement a prototype web crawler with Python and to improve its execution speed using threads on a multicore CPU. The implementation was verified by crawling reference web sites, and the performance improvement was confirmed by evaluating execution speed under different thread configurations on a multicore CPU.
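
For illustration, a minimal sketch of a threaded fetcher in the paper's spirit, using only the Python standard library; the URL list and worker count are placeholders, not the paper's configuration.

```python
# Minimal threaded-crawler sketch (illustrative). Network-bound work releases
# the GIL, so threads speed up crawling even on CPython; vary max_workers to
# reproduce the kind of thread-configuration comparison the paper evaluates.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [f"https://example.com/page{i}" for i in range(20)]  # placeholder targets

def fetch(url: str) -> tuple[str, int]:
    """Download one page and return (url, bytes received)."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, URLS):
        print(url, size)
```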

A Scraping Method of In-Frame Web Sources Using Python (파이썬을 이용한 프레임내 웹 페이지 스크래핑 기법)

  • Yun, Sujin;Seung, Li;Woo, Young Woon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2019.05a / pp.271-274 / 2019
  • In this paper, we propose a detailed-address acquisition scheme for automatically collecting data from a web page inside a frame, which is difficult to access with ordinary web-access methods. Using the Python language and the Beautiful Soup library together with the proposed address-resolution technique and HTML selectors, we were able to automatically collect all the bulletin-board text data spread across several pages. With the proposed method, a Python web-scraping program can automatically collect large amounts of data from web pages regardless of how their addresses are formed, and we expect it to be useful for big data analysis.
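
A minimal sketch of the general idea, assuming requests and Beautiful Soup: resolve the frame's real source URL first (the "detailed address"), then scrape the board text inside it. The URL and CSS selectors here are hypothetical placeholders, not the paper's.

```python
# Minimal in-frame scraping sketch (illustrative).
# Requires: pip install requests beautifulsoup4
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

outer_url = "https://example.com/board"  # hypothetical page whose content lives in a frame
outer = BeautifulSoup(requests.get(outer_url).text, "html.parser")

# The frame's src attribute is the detailed address to acquire.
frame = outer.find(["frame", "iframe"])
inner_url = urljoin(outer_url, frame["src"])

inner = BeautifulSoup(requests.get(inner_url).text, "html.parser")
for cell in inner.select("table.board td.title"):  # hypothetical selector
    print(cell.get_text(strip=True))
```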

A Study on the Analysis of Accident Types in Public and Private Construction Using Web Scraping and Text Mining (웹 스크래핑과 텍스트마이닝을 이용한 공공 및 민간공사의 사고유형 분석)

  • Yoon, Younggeun;Oh, Taekeun
    • The Journal of the Convergence on Culture Technology / v.8 no.5 / pp.729-734 / 2022
  • Various studies using accident cases are being conducted to identify the causes of accidents in the construction industry, but studies on the differences between public and private construction are scarce. In this study, web scraping and text mining technologies were applied to analyze the causes of accidents by order type (public versus private construction). Through statistical analysis and word cloud analysis of more than 10,000 structured and unstructured records collected, it was confirmed that public and private construction differ in both the types and the causes of accidents. In addition, the identified correlations between major accident causes can contribute to the establishment of future safety management measures.
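
As a rough illustration of the word-frequency step that underlies a word cloud, here is a minimal Python sketch with toy English records; the study itself analyzes Korean accident reports with proper text-mining preprocessing.

```python
# Minimal word-frequency sketch behind a word cloud comparison (illustrative).
from collections import Counter

records = [  # toy (owner type, accident description) pairs
    ("public",  "worker fell from scaffold during formwork installation"),
    ("private", "worker struck by excavator while guiding equipment"),
    ("public",  "fall from ladder during finishing work"),
]

for owner in ("public", "private"):
    words = Counter(
        w for o, text in records if o == owner
        for w in text.split() if len(w) > 3  # crude stopword filter
    )
    print(owner, words.most_common(3))  # top terms would feed a word cloud
```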

Analysis of accident types at small and medium-sized construction sites based on web scraping and text mining (웹 스크래핑 및 텍스트마이닝에 기반한 중소규모 건설현장 사고유형 분석)

  • Younggeun Yoon
    • The Journal of the Convergence on Culture Technology / v.10 no.1 / pp.609-615 / 2024
  • The construction industry's fatality count stands at 402, comprising approximately 46% of all industrial accident fatalities. Notably, sites with construction costs of less than 5 billion won account for about 69% of these, so safety management at small and medium-sized construction sites needs strengthening. In this study, 19,511 accident investigation records were collected using web scraping. Through statistical analysis of the collected structured data and text mining of the unstructured data, accident types and causes were analyzed by construction cost for sites below 5 billion won. As a result, it was confirmed that accident types and causes differ with construction cost. It is hoped that these results will be used for customized safety management at small and medium-sized construction sites.
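
A minimal sketch, in pandas, of grouping accident types by construction-cost band; the data and band edges shown are illustrative assumptions, not the study's 19,511 records.

```python
# Minimal cost-band grouping sketch (illustrative data and bands).
import pandas as pd

df = pd.DataFrame({
    "cost_billion_won": [0.3, 1.2, 4.0, 0.8, 2.5],
    "accident_type":    ["fall", "struck-by", "fall", "caught-in", "fall"],
})

# Band sites below 5 billion won, matching the paper's scope.
bands = pd.cut(df["cost_billion_won"], bins=[0, 1, 3, 5],
               labels=["<1bn", "1-3bn", "3-5bn"])
print(df.groupby(bands, observed=False)["accident_type"]
        .value_counts(normalize=True))  # accident-type share per cost band
```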

Smart Synthetic Path Search System for Prevention of Hazardous Chemical Accidents and Analysis of Reaction Risk (반응 위험성분석 및 사고방지를 위한 스마트 합성경로 탐색시스템)

  • Jeong, Joonsoo;Kim, Chang Won;Kwak, Dongho;Shin, Dongil
    • Korean Chemical Engineering Research / v.57 no.6 / pp.781-789 / 2019
  • Accidents caused by chemicals occur frequently during laboratory experiments and pilot-plant and reactor operations. It is necessary to find and understand the relevant information to prevent accidents before starting synthesis experiments, and in the process design stage reaction information is likewise needed to prevent runaway reactions. Although various sources of synthesis information are available, including the Internet, searching takes a long time and choosing the right path is difficult because the substances used in each synthesis method differ. To solve these problems, we propose an intelligent synthetic-path search system that helps researchers shorten the search for synthetic paths and identify hazardous intermediates that may exist along them. The proposed system automatically updates its database by collecting information from the Internet through web scraping and crawling with Selenium, a Python package. The path search is based on depth-first search: starting from the target substance, it distinguishes hazardous-chemical grades, yields, etc., and suggests all synthetic paths within a defined limit of path steps. For the benefit of each research institution, researchers can register their private data and expand the database according to the format type. The system is being released as open source for free use. It is expected to help researchers find safer routes and prevent accidents by consulting the suggested paths.
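
A minimal sketch of depth-first enumeration over a toy precursor table; the substance names and step limit are placeholders, and the real system additionally tracks hazard grades and yields along each route.

```python
# Minimal depth-first synthesis-route search sketch (illustrative).
PRECURSORS = {  # toy mapping: product -> known precursor substances
    "target": ["intermediate_a", "intermediate_b"],
    "intermediate_a": ["feedstock_1"],
    "intermediate_b": ["feedstock_2"],
}

def search_paths(substance, limit=4, path=None):
    """Yield every synthesis route ending at `substance` within `limit` steps."""
    path = (path or []) + [substance]
    if substance not in PRECURSORS or len(path) > limit:
        yield path  # reached a feedstock or the defined step limit
        return
    for precursor in PRECURSORS[substance]:
        yield from search_paths(precursor, limit, path)

for route in search_paths("target"):
    print(" <- ".join(route))  # e.g. target <- intermediate_a <- feedstock_1
```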

Design and implement Web sites for greater user convenience through R based data analysis (R기반의 data분석을 통한 사용자 편의성 증진을 위한 웹사이트 설계 및 구현)

  • Yoon, Kyung Seob;Kim, Yeon Hong
    • Proceedings of the Korean Society of Computer Information Conference / 2018.07a / pp.307-310 / 2018
  • Society is evolving around data, and statistical packages for data analysis are now in common commercial use. In this paper, we apply the statistical package R within a Model2 MVC structure rather than a Model1 structure, in order to improve website maintainability and code efficiency. On this basis, we collect data through web scraping and then design and implement a website that improves convenience and supports search, so that users can easily understand the results of the data analysis.

Korean Web Content Extraction using Tag Rank Position and Gradient Boosting (태그 서열 위치와 경사 부스팅을 활용한 한국어 웹 본문 추출)

  • Mo, Jonghoon;Yu, Jae-Myung
    • Journal of KIISE / v.44 no.6 / pp.581-586 / 2017
  • For automatic web scraping, unnecessary components such as menus and advertisements need to be removed from web pages, and the main content should be extracted automatically. A content block tends to be located in the middle of a web page. Korean web documents in particular rarely include metadata and often have complex designs, so a suitable content-extraction method is needed. Existing content-extraction algorithms use the textual and structural features of content blocks, because processing visual features requires heavy computation for rendering and image processing. In this paper, we propose a new content-extraction method that uses tag positions in HTML as a quasi-visual feature. In addition, we develop the tag rank position, a form of tag position that is not affected by text length, and show that gradient boosting with the tag rank position is a very accurate content-extraction method. The results show that the method can be used to collect high-quality text data automatically from a wide variety of web pages.
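
A minimal sketch of the two named ingredients, assuming beautifulsoup4 and scikit-learn: a length-independent tag rank position feature fed to a gradient-boosting classifier. The toy HTML, feature set, and labels are illustrative only; the paper's feature engineering and training data are richer.

```python
# Minimal tag-rank-position + gradient-boosting sketch (illustrative).
# Requires: pip install beautifulsoup4 scikit-learn
from bs4 import BeautifulSoup
from sklearn.ensemble import GradientBoostingClassifier

def tag_rank_positions(html: str):
    """Rank of each text-bearing tag in document order, normalized to [0, 1],
    so the feature does not depend on text length."""
    tags = [t for t in BeautifulSoup(html, "html.parser").find_all()
            if t.get_text(strip=True)]
    n = max(len(tags) - 1, 1)
    return [(t.name, i / n) for i, t in enumerate(tags)]

html = "<nav>menu</nav><p>main content here</p><footer>ads</footer>"
features = [[pos] for _, pos in tag_rank_positions(html)]
labels = [0, 1, 0]  # 1 = content block (toy labels; the paper learns these)

clf = GradientBoostingClassifier().fit(features, labels)
print(clf.predict(features))  # middle-of-page tags score as content
```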

A Study on the 4D Traffic Condition Board based on a Mash-up Technology (Mash-up 기술을 이용한 4D Wall-Map 구성체계)

  • Kim, Joo-Hwan;Yang, Seung-Mook;Nam, Doo-Hee
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.8 no.3 / pp.27-33 / 2009
  • Content used in mashups is typically obtained from a third-party source through a public interface or API (web services). Other methods of obtaining content for mashups include Web feeds (e.g., RSS or Atom) and screen scraping. A mashup Web application has two parts: a new service delivered through a Web page, using its own data and data from other sources; and the blended data, made available across the Web through an API or other protocols such as HTTP, RSS, and REST. There are many types of mashups, such as consumer mashups, data mashups, and business mashups. The most common is the consumer mashup, which is aimed at the general public; examples include Google Maps, iGuide, and RadioClouds. The 4D Wall-Map display is a data mashup that combines similar types of media and information from multiple sources into a single representation. This technology focuses data into a single presentation and allows for collaborative action among ITS-related information sources.
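
A minimal sketch of the data-mashup pattern described above, using only the Python standard library to blend a hypothetical RSS traffic feed with local data; the feed URL and sensor table are placeholders, not part of the paper's system.

```python
# Minimal data-mashup sketch: one Web feed blended with local data (illustrative).
from urllib.request import urlopen
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/traffic.rss"  # hypothetical traffic feed
local_sensors = {"Route 1": "congested", "Route 2": "clear"}  # own data

tree = ET.parse(urlopen(FEED_URL, timeout=10))
for item in tree.iter("item"):  # standard RSS 2.0 item elements
    title = item.findtext("title", "")
    status = local_sensors.get(title, "unknown")
    # The blended record is what a wall-map view would render in one place.
    print(f"{title}: feed item blended with sensor status '{status}'")
```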

Analysis on Topic Trends and Topic Modeling of KSHSM Journal Papers using Text Mining (텍스트마이닝을 활용한 보건의료산업학회지의 토픽 모델링 및 토픽트렌드 분석)

  • Cho, Kyoung-Won;Bae, Sung-Kwon;Woo, Young-Woon
    • The Korean Journal of Health Service Management / v.11 no.4 / pp.213-224 / 2017
  • Objectives : The purpose of this study was to analyze the representative topics and topic trends of papers in the Korean Society of Health Service Management (KSHSM) Journal. Methods : We collected the English abstracts and keywords of 516 papers in the KSHSM Journal from 2007 to 2017. We used Python web-scraping programs to collect the papers from the Korea Citation Index web site, and RStudio software for topic analysis based on the latent Dirichlet allocation algorithm. Results : Perplexity analysis identified 9 as the best number of topics, and the 9 topics for all papers were extracted using Gibbs sampling. We then refined the 9 topics to 5 by carefully considering the meaning of each topic and analyzing an intertopic distance map. In the topic trend analysis from 2007 to 2017, 'Health Management' and 'Hospital Service' were the two representative topics; 'Hospital Service' was the prevalent topic through 2011, but the two topics appeared in similar proportions from 2012. Conclusions : We found that 5 topics best represented the journal, and that the topic trends reflected its main issues, such as the name revision of the society in 2012.
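
A minimal sketch of perplexity-guided topic-count selection, using scikit-learn's variational LDA rather than the Gibbs-sampling implementation used in the paper; the sample abstracts are placeholders.

```python
# Minimal LDA topic-count selection sketch (illustrative).
# Requires: pip install scikit-learn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [  # toy stand-ins for the 516 English abstracts
    "hospital service quality patient satisfaction",
    "health management policy community program",
    "hospital nurses job stress turnover",
    "health insurance cost management analysis",
]
X = CountVectorizer().fit_transform(abstracts)

for k in range(2, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    print(k, "topics -> perplexity", round(lda.perplexity(X), 1))
# Pick the k with the lowest perplexity, then inspect topic meanings and an
# intertopic distance map to merge similar topics, as the study does.
```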