Resampling Feedback Documents Using Overlapping Clusters


  • Lee, Kyung-Soon (Division of Electrical, Electronics and Computer Engineering / Center for Advanced Image and Information Technology, Chonbuk National University)
  • Published : 2009.06.30

Abstract

Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select better pseudo-relevant documents based on the relevance model. The main idea is to use document clusters to find dominant documents for the initial retrieval set, and to repeatedly feed the documents to emphasize the core topics of a query. Experimental results on large-scale web TREC collections show significant improvements over the relevance model. For justification of the resampling approach, we examine relevance density of feedback documents. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback. This result indicates that the proposed method is effective for pseudo-relevance feedback.
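The clustering-and-selection idea in the abstract can be sketched in a few lines. Everything below is an illustrative assumption rather than the paper's exact implementation: cosine similarity over term-frequency vectors stands in for the paper's language-model scoring, and each top-retrieved document seeds an overlapping cluster of its k nearest neighbours, so that documents appearing in many clusters are treated as "dominant".

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def overlapping_clusters(docs, k=3):
    """Each top-retrieved document seeds a cluster containing itself and its
    k nearest neighbours, so clusters are query-specific and may overlap."""
    clusters = []
    for i, d in enumerate(docs):
        sims = sorted(((cosine(d, e), j) for j, e in enumerate(docs) if j != i),
                      reverse=True)
        clusters.append({i} | {j for _, j in sims[:k]})
    return clusters

def dominant_documents(docs, k=3, n_feedback=5):
    """Documents that fall into many overlapping clusters reflect the core
    topics shared by the initial retrieval set; repeated cluster membership
    acts as a resampling weight when picking feedback documents."""
    counts = Counter(j for c in overlapping_clusters(docs, k) for j in c)
    return [j for j, _ in counts.most_common(n_feedback)]
```

With four toy documents where three share the term `a` and one is an outlier, the outlier joins few clusters and is excluded from the feedback set, which is the noise-filtering effect the abstract describes.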

Most pseudo-relevance feedback methods assume that the top-ranked documents retrieved for a query are relevant and use them as feedback documents for query expansion. In practice, however, the initial retrieval results contain a substantial number of non-relevant documents. This paper proposes a resampling method that uses overlapping clusters to select better feedback documents. The main idea is to apply overlapping document clusters to the query-biased initial retrieval set so that inter-document relationships reveal the dominant documents that play a central role for the query, and to feed these documents back repeatedly to emphasize the core topics the query implies. In experiments on the large-scale TREC GOV2 and WT10g test collections, the resampling method outperformed the relevance model, one of the best-performing recent pseudo-relevance feedback techniques. To validate the proposed method, we measured relevance density, the proportion of relevant documents among the feedback documents. The resampling method showed higher relevance density than the relevance model on the TREC collections, which in turn improved retrieval effectiveness in relevance feedback. This indicates that the proposed method is effective for pseudo-relevance feedback.
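The relevance-density diagnostic used in the abstract, i.e. the proportion of truly relevant documents among the pseudo-relevant feedback set, is simple to compute. The function name and inputs below are illustrative assumptions; the relevant set would come from the TREC relevance judgments (qrels):

```python
def relevance_density(feedback_docs, relevant_set):
    """Fraction of feedback documents that are truly relevant according to
    the judgments; a higher density means less noisy query expansion."""
    if not feedback_docs:
        return 0.0
    hits = sum(1 for d in feedback_docs if d in relevant_set)
    return hits / len(feedback_docs)
```

A higher density for the resampled feedback set than for the baseline's top-ranked set is the justification the abstract reports for the improved retrieval accuracy.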

References

  1. Attar, R. and Fraenkel, A. S. 1977. Local Feedback in Full-Text Retrieval Systems. Journal of the ACM 24, 3 (Jul. 1977), pp.397-417 https://doi.org/10.1145/322017.322021
  2. Buckley, C. and Harman, D. 2004. Reliable information access final workshop report. http://nrrc.mitre.org/NRRC/publications.htm
  3. Buckley, C., Mitra, M., Walz, J., and Cardie, C. 1998. Using clustering and superconcepts within SMART: TREC 6. In Proc. 6th Text REtrieval Conference (TREC-6) https://doi.org/10.1016/S0306-4573(99)00047-3
  4. Buckley, C. and Robertson, S. 2008. Proposal for relevance feedback 2008 track. http://groups.google.com/group/trec-relfeed
  5. Collins-Thompson, K., and Callan, J. 2007. Estimation and use of uncertainty in pseudo-relevance feedback. In Proc. 30th ACM SIGIR conference on Research and Development in Information Retrieval, pp.303-310 https://doi.org/10.1145/1277741.1277795
  6. Diaz, F. 2005. Regularizing ad hoc retrieval scores. In Proc. 14th ACM international conference on Information and knowledge management (CIKM), pp.672-679 https://doi.org/10.1145/1099554.1099722
  7. Diaz, F., and Metzler, D. 2006. Improving the Estimation of Relevance Models Using Large External Corpora, In Proc. 29th ACM SIGIR conference on Research and Development in Information Retrieval, pp.154-161 https://doi.org/10.1145/1148170.1148200
  8. Efron, B. 1979. Bootstrap methods: Another look at the jackknife, The Annals of Statistics, 7, pp.1-26 https://doi.org/10.1214/aos/1176344552
  9. Fix, E. and Hodges, L. 1951. Discriminatory analysis: nonparametric discrimination: consistency properties. Technical Report, USAF School of Aviation Medicine, Randolph Field, Texas, Project 21-49-004
  10. Freund, Y. 1990. Boosting a weak learning algorithm by majority. In Proc. 3rd Annual Workshop on Computational Learning Theory
  11. Jardine, N. and van Rijsbergen, C.J. 1971. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, pp.217-240 https://doi.org/10.1016/0020-0271(71)90051-9
  12. Kurland, O., and Lee, L. 2004. Corpus structure, language models, and ad hoc information retrieval. In Proc. 27th ACM SIGIR conference on Research and Development in Information Retrieval, pp.194-201 https://doi.org/10.1145/1008992.1009027
  13. Kurland, O., and Lee, L. 2005. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proc. 28th ACM SIGIR conference on Research and Development in Information Retrieval, pp.19-26 https://doi.org/10.1145/1076034.1076041
  14. Kurland, O., and Lee, L. 2006. Respect my authority! HITS without hyperlinks, utilizing cluster-based language models. In Proc. 29th ACM SIGIR conference on Research and Development in Information Retrieval, pp.83-90 https://doi.org/10.1145/1148170.1148188
  15. Lavrenko, V. and Croft, W.B. 2001. Relevance-based language models. In Proc. 24th ACM SIGIR conference on Research and Development in Information Retrieval, pp.120-127 https://doi.org/10.1145/383952.383972
  16. Lee, K.S., Park, Y.C., and Choi, K.S. 2001. Re-ranking model based on document clusters. Information Processing and Management, 37, pp.1-14 https://doi.org/10.1016/S0306-4573(00)00017-0
  17. Lee, K.S., Kageura, K., and Choi, K.S. 2004. Implicit ambiguity resolution based on cluster analysis in crosslanguage information retrieval, Information Processing and Management, 40, pp.145-159 https://doi.org/10.1016/S0306-4573(03)00028-1
  18. Liu, X., and Croft, W.B. 2004. Cluster-based retrieval using language models. In Proc. 27th ACM SIGIR conference on Research and Development in Information Retrieval, pp. 186-193 https://doi.org/10.1145/1008992.1009026
  19. Lynam, T., Buckley, C., Clarke, C., and Cormack, G. 2004. A multi-system analysis of document and term selection for blind feedback. In Proc. 13th ACM international conference on Information and knowledge management (CIKM), pp. 261-269 https://doi.org/10.1145/1031171.1031229
  20. Metzler, D., and Croft, W. B. 2007. Latent Concept Expansion Using Markov Random Fields, In Proc. 30th ACM SIGIR conference on Research and Development in Information Retrieval, pp.311-318 https://doi.org/10.1145/1277741.1277796
  21. Ponte, J.M., and Croft, W.B. 1998. A language modeling approach to information retrieval. In Proc. 21st ACM SIGIR conference on Research and Development in Information Retrieval, pp.275-281 https://doi.org/10.1145/290941.291008
  22. Robertson, S.E., Walker, S., Beaulieu, M., Gatford, M., and Payne, A. 1996. Okapi at TREC-4. In Proc. 4th Text REtrieval Conference (TREC)
  23. Rocchio, J.J. 1971. Relevance feedback in information retrieval. The SMART retrieval system, Prentice-Hall, pp.316-321
  24. Rosenfeld, R. 2000. Two decades of statistical language modeling: where do we go from here? In Proc. of the IEEE, 88(8), pp.1270-1278 https://doi.org/10.1109/5.880083
  25. Sakai, T., Manabe, T. and Koyama, M. 2005. Flexible pseudo-relevance feedback via selective sampling. ACM Transactions on Asian Language Information Processing (TALIP), 4(2), pp.111-135 https://doi.org/10.1145/1105696.1105699
  26. Salton, G., and Buckley, C. 1990. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4), pp.288-297 https://doi.org/10.1002/(SICI)1097-4571(199006)41:4<288::AID-ASI8>3.0.CO;2-H
  27. Schapire, R. 1990. Strength of weak learnability. Journal of Machine Learning, 5, pp.197-227 https://doi.org/10.1007/BF00116037
  28. Strohman, T., Metzler, D., Turtle, H., and Croft, W.B. 2005. Indri: A language model-based search engine for complex queries. In Proc. International Conference on Intelligence Analysis
  29. Tao, T., and Zhai, C. 2006. Regularized estimation of mixture models for robust pseudo-relevance feedback. In Proc. 29th ACM SIGIR conference on Research and Development in Information Retrieval, pp.162-169 https://doi.org/10.1145/1148170.1148201
  30. TREC. 2008. Call for participation. http://trec.nist.gov/call08.html
  31. Xu, J., and Croft, W.B. 1996. Query expansion using local and global document analysis. In Proc. 19th ACM SIGIR conference on Research and Development in Information Retrieval, pp.4-11 https://doi.org/10.1145/243199.243202
  32. Yang, L., Ji, D., Zhou, G., Nie, Y., and Xiao, G. 2006. Document re-ranking using cluster validation and label propagation. In Proc. 15th ACM international conference on Information and knowledge management (CIKM), pp.690-697 https://doi.org/10.1145/1183614.1183713
  33. Yeung, D.L., Clarke, C.L.A., Cormack, G.V., Lynam, T.R., and Terra, E.L. 2004. Task-specific query expansion. In Proc. 12th Text REtrieval Conference (TREC), pp.810-819
  34. Zhang, B., Li, H., Liu, Y., Ji, L., Xi, W., Fan, W., Chen, Z., and Ma, W.-Y. 2005. Improving web search results using affinity graph. In Proc. 28th ACM SIGIR conference on Research and Development in Information Retrieval, pp.504-511 https://doi.org/10.1145/1076034.1076120
  35. Zhai, C., and Lafferty, J. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), pp.179-214 https://doi.org/10.1145/984321.984322