Implementation of a Large-scale Web Query Processing System Using the Multi-level Cache Scheme

계층적 캐시 기법을 이용한 대용량 웹 검색 질의 처리 시스템의 구현

  • 임성채 (Department of Computer Science, Dongduk Women's University)
  • Published : 2008.10.15

Abstract

With the increasing demand for information sharing and search via the web, web search engines have drawn much attention. Although much research has addressed the technical challenges of building a web search engine, its query processing system is rarely dealt with. Because the software architecture and operational schemes of a query processing system are difficult to devise, we present the relevant techniques as implemented in a commercial system. The implemented system is very large in scale: it processes 5 million user queries per day using index files built over about 65 million web pages. For performance, we implement a multi-level cache scheme that saves previously returned query results, and the cache is managed across four levels of cache storage. With the multi-level cache, system throughput improves by a factor of 4, which reduces server cost by around 70%.

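As the publication and retrieval of information over the web expand, web search engines continue to draw attention. Although research has accordingly sought to solve the various technical problems of web search engines, the technical details of their query processing systems have rarely been covered. Because the software architecture and operational techniques of a query processing system are difficult to devise, this paper introduces the relevant techniques on the basis of an implemented commercial system. The implemented query processing system is large in scale: it indexes 65 million web documents and serves more than 5 million user query requests per day. To reuse query processing results, the system applies a hierarchical cache technique, and the cached data is distributed across a data store organized into four tiers. The hierarchical cache technique raised query processing capacity to roughly 400%, which in turn cut the cost of building out servers by about 70%.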
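
The abstract does not say what the four cache storage areas are, how entries move between them, or which eviction policy is used. The following Python sketch is therefore only an illustration of the general idea of a multi-level query result cache, not the paper's design. The class names `CacheLevel` and `MultiLevelCache`, the LRU eviction, the demotion of evicted entries to the next level, and the promotion of entries on a hit are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable, List, Optional, Tuple


class CacheLevel:
    """One cache storage area with a fixed capacity (LRU eviction is an assumption)."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, query: str) -> Optional[str]:
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query: str, result: str) -> Optional[Tuple[str, str]]:
        """Store an entry and return the evicted (query, result) pair, if any."""
        self.entries[query] = result
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # least recently used entry
        return None

    def remove(self, query: str) -> None:
        self.entries.pop(query, None)


class MultiLevelCache:
    """Looks up query results level by level before calling the backend."""

    def __init__(self, levels: List[CacheLevel], backend: Callable[[str], str]) -> None:
        self.levels = levels
        self.backend = backend  # hypothetical stand-in for the index search
        self.backend_calls = 0

    def _insert_at_top(self, query: str, result: str) -> None:
        # Insert into the first level; each evicted entry is demoted to the
        # next level, and whatever the last level evicts is dropped.
        item: Optional[Tuple[str, str]] = (query, result)
        for level in self.levels:
            item = level.put(*item)
            if item is None:
                return

    def search(self, query: str) -> str:
        for index, level in enumerate(self.levels):
            result = level.get(query)
            if result is not None:
                if index > 0:
                    # Promote a hit from a lower level back to the first level.
                    level.remove(query)
                    self._insert_at_top(query, result)
                return result
        # Miss in every level: run the query against the index and cache it.
        self.backend_calls += 1
        result = self.backend(query)
        self._insert_at_top(query, result)
        return result


if __name__ == "__main__":
    levels = [CacheLevel(f"level-{i}", capacity=2) for i in range(1, 5)]
    cache = MultiLevelCache(levels, backend=lambda q: f"results for {q}")
    for q in ["web cache", "search engine", "web cache"]:
        print(q, "->", cache.search(q))
    print("backend calls:", cache.backend_calls)
```

In this sketch a repeated query is answered from one of the cache levels without touching the backend, which is the effect the abstract credits for the higher throughput. How much throughput a real deployment gains depends on the query repetition rate and the cost of each level, neither of which is given here.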
