A Scraping Method of In-Frame Web Sources Using Python

Yun, Sujin;Seung, Li;Woo, Young Woon;

Proceedings of the Korean Institute of Information and Communication Sciences Conference (한국정보통신학회:학술대회논문집)

2019.05a / Pages 271-274 / 2019

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

파이썬을 이용한 프레임내 웹 페이지 스크래핑 기법

Yun, Sujin (Dong-eui University) ;
Seung, Li (Dong-eui University) ;
Woo, Young Woon (Dong-eui University)

윤수진 (동의대학교) ;
승리 (동의대학교) ;
우영운 (동의대학교)

Published: 2019.05.23


Abstract

In this paper, we propose a detailed address acquisition technique for automatically collecting data from web pages inside frames, which are difficult to reach with ordinary web access methods. Combining the proposed technique with the Python language and the Beautiful Soup library, which support HTML selectors, we automatically collected all of the text data from a bulletin board spanning multiple pages. With the proposed method, a Python web scraping program can automatically collect large amounts of data from web pages regardless of their address format, and we expect the collected data to be useful for big data analysis.
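
The abstract does not reproduce the authors' program, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes that the framed document is referenced through the `src` attribute of a `<frame>` or `<iframe>` tag, and that board pages are addressed with a `page` query parameter. The function names, the `td.title a` selector, and the `page` parameter are all hypothetical and would need to be adapted to the target site.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def frame_urls(outer_html, outer_url):
    """Return the absolute URL of every <frame>/<iframe> document in the outer page."""
    soup = BeautifulSoup(outer_html, "html.parser")
    # select() takes a CSS selector; [src] skips frames that have no source.
    frames = soup.select("frame[src], iframe[src]")
    # Frame sources are often relative, so resolve them against the outer URL.
    return [urljoin(outer_url, tag["src"]) for tag in frames]


def post_titles(board_html, selector="td.title a"):
    """Extract post titles from one board page (the default selector is hypothetical)."""
    soup = BeautifulSoup(board_html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select(selector)]


def fetch(url):
    """Download a page as text; assumes UTF-8 and does no retries or rate limiting."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_board(outer_url, last_page, page_param="page"):
    """Collect titles from pages 1..last_page of the first framed document."""
    inner_url = frame_urls(fetch(outer_url), outer_url)[0]
    titles = []
    for page in range(1, last_page + 1):
        # Append the (assumed) page parameter to the framed document's address.
        sep = "&" if "?" in inner_url else "?"
        titles += post_titles(fetch(f"{inner_url}{sep}{page_param}={page}"))
    return titles
```

The key step is `frame_urls`: requesting the outer address returns only the frameset, so the scraper first recovers the address of the framed document and then requests that address directly. A call such as `scrape_board("http://example.com/index.html", last_page=3)` (a placeholder URL) would then loop over the board pages.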

