http://dx.doi.org/10.3745/KIPSTA.2003.10A.5.469

A Document Collection Method for a More Accurate Search Engine

Ha, Eun-Yong (School of Computer Engineering, Anyang University)
Gwon, Hui-Yong (School of Computer Engineering, Anyang University)
Hwang, Ho-Yeong (School of Digital Media, Anyang University)
Abstract
Internet search engines use web robots to visit servers connected to the Internet, periodically or aperiodically. They extract and classify the collected data according to their own methods and construct the databases on which web information search is based. This procedure is repeated very frequently across the Web, and many search engine sites operate it strategically in order to become popular Internet portal sites that help users find information on the Web. A web search engine contacts many thousands of web servers, both to maintain its existing databases and to discover newly connected servers. These jobs, however, are decided and conducted by the search engines alone: they run web robots to collect data from web servers without any knowledge of the servers' states, so each search engine issues a large number of requests and receives the corresponding responses. This is one cause of increased Internet traffic. If each web server instead notified web robots with a summary of its public documents, and each web robot then collected only the corresponding documents based on that summary, the unnecessary traffic would be eliminated, the accuracy of the data held by search engines would improve, and the processing overhead of web-related jobs on both web servers and search engines would decrease. In this paper, a monitoring system for the web server is designed and implemented; it monitors the states of the documents on the server, summarizes the changes to modified documents, and sends the summary information to web robots that want to retrieve documents from the server. An efficient web robot for the search engine is also designed and implemented; it uses the notified summary to fetch the corresponding documents from the web servers, extract index terms, and update its databases.
Keywords
Web Robot; Document Collection; Search Engine;
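
The abstract does not specify a concrete summary format or protocol, so the following is a minimal sketch in Python of the two cooperating components it describes: a server-side monitor that summarizes added, modified, and deleted documents, and a robot that fetches only the documents named in that summary. The document root, the JSON summary shape, and names such as build_change_summary and collect are assumptions for illustration, not the paper's implementation.

import hashlib
import json
import os
import urllib.request

DOC_ROOT = "/var/www/html"       # assumed public document root on the web server
STATE_FILE = "doc_state.json"    # previous snapshot kept by the monitor

def snapshot(root):
    """Map each public document to a content hash and modification time."""
    state = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            rel = os.path.relpath(path, root)
            state[rel] = {"sha1": digest, "mtime": os.path.getmtime(path)}
    return state

def build_change_summary(root, state_file):
    """Server side: compare the current snapshot with the stored one and
    summarize added, modified, and deleted documents for interested robots."""
    old = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            old = json.load(f)
    new = snapshot(root)
    summary = {
        "added":    [p for p in new if p not in old],
        "modified": [p for p in new
                     if p in old and new[p]["sha1"] != old[p]["sha1"]],
        "deleted":  [p for p in old if p not in new],
    }
    with open(state_file, "w") as f:
        json.dump(new, f)
    return summary

def collect(summary, base_url):
    """Robot side: fetch only the documents the summary reports as new or
    changed, instead of re-crawling the whole server."""
    pages = {}
    for rel in summary["added"] + summary["modified"]:
        url = base_url.rstrip("/") + "/" + rel
        with urllib.request.urlopen(url) as resp:
            pages[rel] = resp.read()   # hand off to the indexer / database update
    return pages

if __name__ == "__main__":
    s = build_change_summary(DOC_ROOT, STATE_FILE)
    print(json.dumps(s, indent=2))
    # A robot receiving this summary would then call, for example:
    # collect(s, "http://www.example.org/")

Because the robot downloads only the paths listed in the summary, repeated full crawls of unchanged documents are avoided, which is the traffic and overhead reduction the paper argues for.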