Refresh Cycle Optimization for Web Crawlers

Cho, Wan-Sup; Lee, Jeong-Eun; Choi, Chi-Hwan

doi:10.5392/JKCA.2013.13.06.030

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 13, Issue 6, pp. 30-39, 2013; pISSN 1598-4877; eISSN 2508-6723

The Korea Contents Association (한국콘텐츠학회)


웹크롤러의 수집주기 최적화

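Wan-Sup Cho (Department of Management Information Systems / Graduate Department of Business Data Convergence, Chungbuk National University);
Jeong-Eun Lee (Department of Business Data Convergence, Chungbuk National University);
Chi-Hwan Choi (Department of Bio-Information Technology, Chungbuk National University)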

Received : 2013.05.23
Accepted : 2013.05.31
Published : 2013.06.28

https://doi.org/10.5392/JKCA.2013.13.06.030


Abstract

A web crawler should keep its collected data fresh while imposing minimal overhead on the servers of the web sites it crawls, even when those sites hold large amounts of data. Server overhead rises rapidly as data volumes explode in the big data era. The amount of web information is growing quickly with advanced wireless networks and the emergence of diverse smart devices, and easy-to-use web platforms and smart devices allow information to be produced and updated continuously, anywhere and at any time. How often frequently updated web data should be refreshed has therefore become an important issue in data collection and integration. In this paper, we propose dynamic web-data crawling methods that combine sensitive checking of web site changes with dynamic retrieval of web pages from target web sites based on historical update patterns. We also implemented a Java-based web crawling application and compared the efficiency of conventional static approaches with that of our dynamic approach. Our experimental results showed an overhead benefit of 46.2%, with fresher data, compared to the static crawling methods.
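The abstract does not say how web site changes are checked. As a minimal sketch of one common technique (an assumption, not necessarily the authors' method), a crawler can keep a digest of each page's content and treat the page as changed only when the digest differs from the one stored at the previous fetch. The class name `ChangeDetector` and the method `hasChanged` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: detects page changes by comparing content digests. */
final class ChangeDetector {
    // Last digest seen for each URL (kept in memory only, for illustration).
    private final Map<String, byte[]> lastDigestByUrl = new HashMap<>();

    /** Returns true if the content is new or differs from the previous fetch. */
    boolean hasChanged(String url, String content) {
        byte[] digest = sha256(content);
        // put() returns the previously stored digest, or null on first sight.
        byte[] previous = lastDigestByUrl.put(url, digest);
        return previous == null || !Arrays.equals(previous, digest);
    }

    private static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch a first fetch counts as a change, so a newly discovered page is always collected; later fetches with identical content are reported as unchanged and can be skipped by downstream processing.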

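A web crawler must collect and maintain up-to-date data from web sites while minimizing the burden on their servers. In an era of explosive data growth such as the big data era, frequently extracting all data from a data source places a serious burden on the server. With the spread of wireless communication technology and diverse smart devices, information is being generated at a rapid pace and is continuously created and changed anywhere and at any time. Against this background, keeping information current with low overhead is emerging as an important issue for web crawlers. This paper presents an effective way to check for changes to a web site and a way to maintain freshness at low cost by dynamically changing the collection cycle for the web site. The key idea is to identify, from past history, the times at which web site changes are concentrated and to reflect them in determining the collection cycle. We developed a Java crawler that extracts data from specific web sites and compared the usefulness of the proposed method with that of the existing method. With the proposed technique, server overhead is cut roughly in half (46.2%) compared with the static method, while a higher degree of freshness is ensured.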
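The idea of deriving the collection cycle from historical update patterns can be illustrated with a small sketch. It is not the authors' algorithm: the class `AdaptiveRefreshScheduler`, the per-hour histogram, and the linear mapping from activity to interval are all assumptions chosen for illustration. Hours in which many updates were observed get a short crawl interval, and quiet hours get a long one.

```java
import java.util.Arrays;

/** Hypothetical sketch: picks a crawl interval per hour from past update counts. */
final class AdaptiveRefreshScheduler {
    private final int[] updatesPerHour = new int[24]; // observed changes by hour of day
    private final int minIntervalMinutes;             // used for the busiest hour
    private final int maxIntervalMinutes;             // used for hours with no updates

    AdaptiveRefreshScheduler(int minIntervalMinutes, int maxIntervalMinutes) {
        if (minIntervalMinutes <= 0 || maxIntervalMinutes < minIntervalMinutes) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        this.minIntervalMinutes = minIntervalMinutes;
        this.maxIntervalMinutes = maxIntervalMinutes;
    }

    /** Records one observed page change in the given hour (0-23). */
    void recordUpdate(int hourOfDay) {
        updatesPerHour[Math.floorMod(hourOfDay, 24)]++;
    }

    /** Returns the crawl interval, in minutes, to use during the given hour. */
    int intervalMinutesFor(int hourOfDay) {
        int peak = Arrays.stream(updatesPerHour).max().getAsInt();
        if (peak == 0) {
            return maxIntervalMinutes; // no history yet: crawl at the slowest rate
        }
        // Activity is 0.0 for a silent hour and 1.0 for the busiest hour.
        double activity = (double) updatesPerHour[Math.floorMod(hourOfDay, 24)] / peak;
        int range = maxIntervalMinutes - minIntervalMinutes;
        return (int) Math.round(maxIntervalMinutes - activity * range);
    }
}
```

A static crawler corresponds to using the same interval for every hour; the sketch instead spends requests where the history says changes cluster, which is the trade-off between freshness and server overhead that the abstract describes.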



