Language Modeling Approaches to Information Retrieval

  • Banerjee, Protima (College of Information Science & Technology, Drexel University)
  • Han, Hyo-Il (College of Information Science & Technology, Drexel University)
  • Published: 2009.09.30

Abstract

This article surveys recent research on language modeling (sometimes called statistical language modeling) approaches to information retrieval. Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. Its underlying assumption is that human language generation is a random process; the goal is to capture that process with a generative statistical model. In this article, we discuss current research in the application of language modeling to information retrieval, the role of semantics in the language modeling framework, cluster-based language models, the use of language modeling for XML retrieval, and future trends.
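
The generative view described in the abstract can be illustrated with a minimal sketch of query-likelihood ranking: each document is treated as a unigram language model, and documents are ranked by the probability that their model generates the query. The sketch is not taken from the article. The function names (`tokenize`, `query_log_likelihood`, `rank`), the toy documents, the whitespace tokenizer, the choice of linear interpolation (Jelinek-Mercer) smoothing, and the weight `lam=0.5` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Log P(query | document model) under a smoothed unigram model.

    Each term probability is a linear interpolation of the document's
    maximum-likelihood estimate and the collection-wide estimate, so terms
    missing from the document still receive non-zero probability.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return (doc_id, score) pairs sorted from most to least likely."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    coll_counts = Counter(t for toks in tokenized.values() for t in toks)
    coll_len = sum(coll_counts.values())
    scored = [
        (doc_id, query_log_likelihood(query, toks, coll_counts, coll_len, lam))
        for doc_id, toks in tokenized.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
docs = {
    "d1": "language models for speech recognition",
    "d2": "statistical language models for information retrieval",
    "d3": "xml document structure",
}
ranking = rank("information retrieval", docs)
```

Here `ranking` places `d2` first because it is the only document that contains the query terms; the other two documents receive probability mass for those terms only through the collection model.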
