http://dx.doi.org/10.5351/KJAS.2019.32.5.763

A probabilistic information retrieval model by document ranking using term dependencies  

You, Hyun-Jo (Program in Data Science for Humanities, Seoul National University)
Lee, Jung-Jin (Department of Statistics and Actuarial Science, Soongsil University)
Publication Information
The Korean Journal of Applied Statistics / v.32, no.5, 2019, pp. 763-782
Abstract
This paper proposes a probabilistic document ranking model that incorporates term dependencies. Document ranking is a fundamental information retrieval task: sorting the documents in a collection by their relevance to a user query (Qin et al., Information Retrieval Journal, 13, 346-374, 2010). A probabilistic model computes the conditional probability that each document is relevant given the query. Most widely used models assume term independence because computing the joint probabilities of multiple terms is challenging, yet words in natural language texts are clearly highly correlated. In this paper, we assume a multinomial distribution model that calculates the relevance probability of a document while accounting for the dependency structure of words, and propose an information retrieval model that ranks documents by estimating this probability with the maximum entropy method. Ranking simulation experiments under various multinomial settings show better retrieval results than a model that assumes word independence, as do document ranking experiments on the real-world LETOR OHSUMED dataset.
Keywords
information retrieval; document ranking; maximum entropy principle; iterative proportional fitting algorithm;
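The maximum entropy estimation described in the abstract is commonly computed with the iterative proportional fitting (IPF) algorithm named in the keywords. The following is a minimal illustrative sketch, not the paper's implementation: it rescales a joint distribution over two binary term-occurrence variables (the seed table encoding a dependence structure) until its marginals match given target term frequencies. The seed table and marginal values are hypothetical.

```python
import numpy as np

def ipf(table, row_marginal, col_marginal, tol=1e-10, max_iter=1000):
    """Scale `table` so its row and column sums match the target marginals."""
    p = table.astype(float).copy()
    for _ in range(max_iter):
        # Step 1: rescale rows to fit the row marginals.
        p *= (row_marginal / p.sum(axis=1))[:, None]
        # Step 2: rescale columns to fit the column marginals.
        p *= (col_marginal / p.sum(axis=0))[None, :]
        # After the column step, check whether the row sums still fit.
        if np.allclose(p.sum(axis=1), row_marginal, atol=tol):
            break
    return p

# Hypothetical 2x2 co-occurrence table for two query terms
# (rows: term 1 absent/present, columns: term 2 absent/present).
seed = np.array([[0.3, 0.2],
                 [0.2, 0.3]])
fitted = ipf(seed,
             row_marginal=np.array([0.6, 0.4]),
             col_marginal=np.array([0.7, 0.3]))
```

The fitted table is the maximum entropy (minimum cross-entropy) joint distribution relative to the seed that satisfies the marginal constraints, which is why IPF pairs naturally with the maximum entropy principle used in the paper.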
References
1 Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known, Annals of Mathematical Statistics, 11, 427-444.
2 Fienberg, S. E. (1970). An iterative procedure for estimation in contingency tables, Annals of Mathematical Statistics, 41, 907-917.
3 Kantor, P. B. and Lee, J. J. (1998). Testing the maximum entropy principle for information retrieval, Journal of the American Society for Information Science, 49, 557-566.
4 Lee, J. J. (2005). Discriminating analysis of binary data with multinomial distribution by using the iterative cross entropy minimization estimation, The Korean Communications in Statistics, 12, 125-137.
5 Lee, J. J. and Kantor, P. B. (1991). A study of probabilistic information retrieval systems in the case of inconsistent expert judgments, Journal of the American Society for Information Science, 42, 166-172.
6 Lee, J. J. and Park, H. K. (2010). Rule-based classification analysis using entropy distribution, Communications for Statistical Applications and Methods, 17, 527-540.
7 Manning, C. D., Raghavan, P., and Schütze, H. (2012). An Introduction to Information Retrieval, Cambridge University Press. Online publication: https://doi.org/10.1017/CBO9780511809071
8 Min, J. (2017). Utilizing External Resources for Enriching Information Retrieval, Ph.D. Dissertation, Dublin City University. Available at http://doras.dcu.ie/21981/
9 Qin, T., Liu, T.-Y., Xu, J., and Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval Journal, 13, 346-374.
10 Robertson, S. E. (1977). The probability ranking principle in IR, Journal of Documentation, 33, 294-304.
11 Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure, The Annals of Statistics, 23, 1160-1174.
12 Sanderson, M. and Croft, W. B. (2012). The history of information retrieval research, Proceedings of the IEEE, 100, 1444-1451.