http://dx.doi.org/10.5351/KJAS.2019.32.5.763

A probabilistic information retrieval model by document ranking using term dependencies  

You, Hyun-Jo (Program in Data Science for Humanities, Seoul National University)
Lee, Jung-Jin (Department of Statistics and Actuarial Science, Soongsil University)
Publication Information
The Korean Journal of Applied Statistics / v.32, no.5, 2019, pp. 763-782
Abstract
This paper proposes a probabilistic document ranking model that incorporates term dependencies. Document ranking is a fundamental information retrieval task: sorting the documents in a collection by their relevance to a user query (Qin et al., Information Retrieval Journal, 13, 346-374, 2010). A probabilistic model computes the conditional probability that each document is relevant given the query. Most widely used models assume term independence because computing the joint probabilities of multiple terms is challenging, yet words in natural language texts are clearly highly correlated. In this paper, we assume a multinomial distribution model that calculates the relevance probability of a document while accounting for the dependency structure of words, and propose an information retrieval model that ranks documents by estimating this probability with the maximum entropy method. Ranking simulation experiments under various multinomial settings show better retrieval results than a model that assumes word independence, as do document ranking experiments on the real-world LETOR OHSUMED dataset.
Keywords
information retrieval; document ranking; maximum entropy principle; iterative proportional fitting algorithm;
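The maximum entropy estimation described in the abstract is commonly computed with the iterative proportional fitting (IPF) algorithm named in the keywords. The following is a minimal illustrative sketch, not the paper's implementation: it rescales a joint distribution over two binary term-occurrence variables (the seed table encoding a dependence structure) until its marginals match given target term frequencies. The seed table and marginal values are hypothetical.

```python
import numpy as np

def ipf(table, row_marginal, col_marginal, tol=1e-10, max_iter=1000):
    """Scale `table` so its row and column sums match the target marginals."""
    p = table.astype(float).copy()
    for _ in range(max_iter):
        # Step 1: rescale rows to fit the row marginals.
        p *= (row_marginal / p.sum(axis=1))[:, None]
        # Step 2: rescale columns to fit the column marginals.
        p *= (col_marginal / p.sum(axis=0))[None, :]
        # After the column step, check whether the row sums still fit.
        if np.allclose(p.sum(axis=1), row_marginal, atol=tol):
            break
    return p

# Hypothetical 2x2 co-occurrence table for two query terms
# (rows: term 1 absent/present, columns: term 2 absent/present).
seed = np.array([[0.3, 0.2],
                 [0.2, 0.3]])
fitted = ipf(seed,
             row_marginal=np.array([0.6, 0.4]),
             col_marginal=np.array([0.7, 0.3]))
```

The fitted table is the maximum entropy (minimum cross-entropy) joint distribution relative to the seed that satisfies the marginal constraints, which is why IPF pairs naturally with the maximum entropy principle used in the paper.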
References
1 Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known, Annals of Mathematical Statistics, 11, 427-444.
2 Fienberg, S. E. (1970). An iterative procedure for estimation in contingency tables, Annals of Mathematical Statistics, 41, 907-917.
3 Kantor, P. B. and Lee, J. J. (1998). Testing the maximum entropy principle for information retrieval, Journal of the American Society for Information Science, 49, 557-566.
4 Lee, J. J. (2005). Discriminating analysis of binary data with multinomial distribution by using the iterative cross entropy minimization estimation, The Korean Communications in Statistics, 12, 125-137.
5 Lee, J. J. and Kantor, P. B. (1991). A study of probabilistic information retrieval systems in the case of inconsistent expert judgments, Journal of the American Society for Information Science, 42, 166-172.
6 Lee, J. J. and Park, H. K. (2010). Rule-based classification analysis using entropy distribution, Communications for Statistical Applications and Methods, 17, 527-540.
7 Manning, C. D., Raghavan, P., and Schütze, H. (2012). An Introduction to Information Retrieval, Cambridge University Press. Online publication: https://doi.org/10.1017/CBO9780511809071
8 Min, J. (2017). Utilizing External Resources for Enriching Information Retrieval, Ph.D. Dissertation, Dublin City University. Available at http://doras.dcu.ie/21981/
9 Qin, T., Liu, T.-Y., Xu, J., and Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval Journal, 13, 346-374.
10 Robertson, S. E. (1977). The probability ranking principle in IR, Journal of Documentation, 33, 294-304.
11 Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure, The Annals of Statistics, 23, 1160-1174.
12 Sanderson, M. and Croft, W. B. (2012). The history of information retrieval research, Proceedings of the IEEE, 100, 1444-1451.