http://dx.doi.org/10.3745/KIPSTC.2003.10C.4.509

Implementation of a Web Robot and Statistics on the Korean Web  

Kim, Sung-Jin (Department of Computer Science, Graduate School, Soongsil University)
Lee, Sang-Ho (School of Computing, Soongsil University)
Abstract
A web robot is a program that downloads and stores web pages. Implementation issues in developing web robots have been studied widely, and various web statistics have been reported in the literature. First, this paper describes the overall architecture of our robot and the implementation decisions we made on several important issues. Second, we present empirical statistics on approximately 74 million Korean web pages. Third, we monitored 1,424 Korean web sites to observe how their web pages change, and we identify which factors of a web page could affect those changes. These factors may be used to select the web pages to be updated incrementally.
Keywords
Information Retrieval; Web Retrieval System; Web Robot; Web Crawler; Korean Web Statistics
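
The abstract mentions monitoring web sites to observe page changes but does not say how a change is detected. As a minimal sketch only, and assuming (this is not stated in the source) that a page counts as changed when a digest of its downloaded bytes differs between two visits, the comparison of two crawl snapshots could look like the following. The names `digest` and `classify_changes`, and the snapshot layout (a dict mapping URL to page bytes), are hypothetical and chosen for illustration.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(content).hexdigest()


def classify_changes(old: dict, new: dict) -> dict:
    """Compare two snapshots (URL -> page bytes) taken at different times.

    Assumption for illustration: a page is 'changed' when its digest differs.
    This is not necessarily the criterion used in the paper.
    """
    report = {"unchanged": [], "changed": [], "added": [], "removed": []}
    for url in sorted(old.keys() | new.keys()):
        if url not in new:
            report["removed"].append(url)
        elif url not in old:
            report["added"].append(url)
        elif digest(old[url]) == digest(new[url]):
            report["unchanged"].append(url)
        else:
            report["changed"].append(url)
    return report


# Hypothetical snapshots from two visits to the same site.
first_visit = {
    "http://example.kr/": b"<html>home v1</html>",
    "http://example.kr/news": b"<html>news</html>",
    "http://example.kr/old": b"<html>old page</html>",
}
second_visit = {
    "http://example.kr/": b"<html>home v2</html>",
    "http://example.kr/news": b"<html>news</html>",
    "http://example.kr/new": b"<html>new page</html>",
}
result = classify_changes(first_visit, second_visit)
```

A byte-level digest treats any difference, including a rotating banner or timestamp, as a change; a real monitor would probably need a looser comparison, which is a design choice outside what the abstract specifies.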