[KSCI] Korea Science Citation Index Service

Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents

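Jeong-Ho Chang (School of Computer Science and Engineering, Seoul National University)
Byoung-Tak Zhang (School of Computer Science and Engineering, Seoul National University)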

Publication Information

Journal of KIISE: Software and Applications, v.31, no.5, 2004, pp. 595-604

Abstract

Automatic analysis of concepts or semantic relations in text documents enables not only efficient acquisition of relevant information but also comparison of documents at the concept level. We present a multiple-cause model-based approach to text analysis in which latent topics are automatically extracted from document sets and the similarity between documents is measured by semantic kernels constructed from the extracted topics. In our approach, a document is assumed to be generated by various combinations of underlying topics. A topic is defined by a set of words that are related to the same theme or co-occur frequently within a document. In a network representing a multiple-cause model, each topic is identified by a group of words having high connection weights from a latent node. Learning and inference in multiple-cause models require approximation methods, and we use an approximation by Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words in which each set contains theme-specific terms. Using semantic kernels constructed from latent topics extracted by multiple-cause models, we also achieve significant improvements over the basic vector space model in terms of retrieval effectiveness.
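
The two ideas in the abstract, reading a topic off the high-weight words of a latent node and comparing documents through a topic-based kernel, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the toy vocabulary, the weight values, the helper names, and the kernel form (cosine similarity after a linear projection onto the topics). The abstract does not state the exact kernel used in the paper, and the weights here are written by hand rather than learned by a Helmholtz machine.

```python
from math import sqrt

# Hypothetical toy vocabulary and latent-node-to-word weights (not from the paper).
VOCAB = ["election", "vote", "senate", "storm", "flood", "rain"]
TOPIC_WEIGHTS = [
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.0],  # latent node 0: politics-like words
    [0.0, 0.1, 0.0, 0.9, 0.8, 0.7],  # latent node 1: weather-like words
]

def top_words(weights, vocab, k=3):
    """Return the k words with the highest connection weight from one latent node."""
    ranked = sorted(zip(weights, vocab), reverse=True)
    return [word for _, word in ranked[:k]]

def project(doc, topic_weights):
    """Map a term-frequency vector into topic space (one dot product per topic)."""
    return [sum(w * x for w, x in zip(row, doc)) for row in topic_weights]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_kernel(d1, d2, topic_weights):
    """Assumed kernel form: cosine similarity of the documents in topic space."""
    return cosine(project(d1, topic_weights), project(d2, topic_weights))

doc_a = [2, 0, 0, 0, 0, 0]  # mentions only "election"
doc_b = [0, 1, 1, 0, 0, 0]  # mentions "vote" and "senate"
doc_c = [0, 0, 0, 1, 0, 1]  # mentions "storm" and "rain"

print(top_words(TOPIC_WEIGHTS[0], VOCAB))
print(cosine(doc_a, doc_b))                           # basic vector space model
print(semantic_kernel(doc_a, doc_b, TOPIC_WEIGHTS))   # topic-based similarity
print(semantic_kernel(doc_a, doc_c, TOPIC_WEIGHTS))
```

In this toy setting, doc_a and doc_b share no words, so the basic vector space model scores them as unrelated, while the topic-space comparison rates them as close because both load on the same latent node. This is the kind of effect a semantic kernel is meant to capture.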

Keywords

multiple-cause model; Helmholtz machine; latent semantic feature; semantic kernel

