Performance Improvement of Web Information Retrieval Using Sentence-Query Similarity  

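Park Eui-Kyu (Department of Computer Science, Yonsei University)
Ra Dong-Yul (Department of Computer Science, Yonsei University)
Jang Myung-Gil (Knowledge Mining Research Team, Electronics and Telecommunications Research Institute)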
Abstract
The growth of the Internet has produced a web containing an enormous number of documents. Web information retrieval technology that can provide users with documents containing exactly the information they want has therefore become increasingly important. This paper proposes several techniques that improve web information retrieval. Conventional systems rely mainly on the similarity between a document and the query. In contrast, we suggest a technique that makes use of the similarity between a sentence and the query. We introduce a technique for computing an approximate sentence-query similarity score that does not require mature natural language processing technology. The amount of computation for this task was shown to be linear in the number of documents in the collection, which implies that the technique can be used in practical systems. The second major technique proposed in this paper is to use stratification of documents when re-ranking the documents to be output, which was shown to yield a significant improvement in performance. We further show that using hyperlinks, anchor texts, and titles can enhance performance. To validate the proposed techniques, we developed a large-scale web information retrieval system and used it for experiments.
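The abstract does not give the scoring formula, so the following is only a minimal sketch of what an approximate sentence-query similarity could look like without any deep natural language processing. It assumes a naive punctuation-based sentence splitter and scores each sentence by the fraction of distinct query terms it contains. All function names here (`tokenize`, `split_sentences`, `sentence_query_similarity`, `best_sentence_score`) are hypothetical and are not taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(document):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def sentence_query_similarity(sentence, query):
    """Fraction of distinct query terms that occur in the sentence (0.0 to 1.0)."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    sentence_terms = set(tokenize(sentence))
    return len(query_terms & sentence_terms) / len(query_terms)


def best_sentence_score(document, query):
    """Highest sentence-query similarity over all sentences in the document."""
    scores = (sentence_query_similarity(s, query) for s in split_sentences(document))
    return max(scores, default=0.0)


if __name__ == "__main__":
    docs = [
        "Web retrieval is hard. Ranking matters.",
        "Anchor text helps. Sentence query similarity improves web retrieval.",
    ]
    for doc in docs:
        print(best_sentence_score(doc, "sentence query similarity"))
```

Each sentence is visited once per query, so the cost of this sketch grows linearly with the total amount of text scored, which is in the spirit of the linear-cost claim in the abstract. How such a score would be combined with the document-level score or with stratification is not specified here and is left open.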
Keywords
web; information retrieval; sentence-query similarity; stratification; hyperlink; anchor text