An Empirical Study on Changes of Web Pages  

Kim Sung Jin (Automation and Systems Research Institute, Seoul National University)
Lee Sang Ho (School of Computing, Soongsil University)
Abstract
Because web pages are created, destroyed, and updated frequently, web databases must be refreshed to keep their copies of web pages up to date. To keep web databases fresh effectively, we need to understand how real web pages change. Previous research on web page change has focused only on content modification and has not taken the creation and destruction of web pages into account. This paper investigates web page changes, including content modification, page creation, and page destruction. We introduce three metrics, DR (Download Rate), MR (Modification Rate), and CAV (Coefficient of Age Variation), to represent the change of web pages. We monitored three million web pages, collected from famous sites and random sites, every other day for one hundred days. With the Download Rate and the Modification Rate, we learned that the download success and the modification of a page depend on its past changes, and we propose two estimation formulae that predict download success and modification. With the Coefficient of Age Variation, we show that web pages do not change periodically.
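The abstract names the three metrics but does not define them. The following is a minimal sketch of how such metrics might be computed from one page's monitoring history. All three definitions are assumptions made for illustration, not formulae taken from the paper: DR is treated as the fraction of successful download attempts, MR as the fraction of consecutive successful downloads whose content differs, and CAV as the standard deviation of the intervals between observed changes divided by their mean. The function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def download_rate(outcomes):
    """Assumed DR: successful downloads divided by download attempts.

    `outcomes` is a list of booleans, one per attempt (True = success).
    """
    if not outcomes:
        raise ValueError("at least one download attempt is required")
    return sum(outcomes) / len(outcomes)


def modification_rate(checksums):
    """Assumed MR: share of consecutive successful downloads that differ.

    `checksums` holds one content checksum per successful download,
    in chronological order.
    """
    pairs = list(zip(checksums, checksums[1:]))
    if not pairs:
        return 0.0  # fewer than two downloads: no change can be observed
    return sum(a != b for a, b in pairs) / len(pairs)


def coefficient_of_age_variation(ages):
    """Assumed CAV: population std. deviation of ages divided by their mean.

    `ages` lists the intervals (for example, in days) between observed
    changes of a page. All ages are assumed to be positive.
    """
    if not ages:
        raise ValueError("at least one age is required")
    return pstdev(ages) / mean(ages)
```

Under these assumed definitions, a page that changes at perfectly regular intervals has a CAV of 0, so values well above 0 would be consistent with the abstract's claim that pages do not change periodically.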
Keywords
web databases; change of web pages; incremental robot; web statistics