Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents  

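Jeong-Ho Chang (School of Computer Science and Engineering, Seoul National University)
Byoung-Tak Zhang (School of Computer Science and Engineering, Seoul National University)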
Abstract
Automatic analysis of concepts and semantic relations in text documents enables not only efficient acquisition of relevant information but also comparison of documents at the concept level. We present a multiple-cause model-based approach to text analysis in which latent topics are automatically extracted from document sets and the similarity between documents is measured by semantic kernels constructed from the extracted topics. In our approach, a document is assumed to be generated by various combinations of underlying topics. A topic is defined by a set of words that are related to the same theme or that co-occur frequently within a document. In a network representing a multiple-cause model, each topic is identified by a group of words having high connection weights from a latent node. To facilitate learning and inference in multiple-cause models, approximation methods are required, and we use an approximation based on Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words, each containing theme-specific terms. Using semantic kernels constructed from latent topics extracted by multiple-cause models, we also achieve significant improvements in retrieval effectiveness over the basic vector space model.
Keywords
multiple-cause model; Helmholtz machine; latent semantic feature; semantic kernel
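The idea of measuring document similarity through extracted topics can be illustrated with a minimal sketch. The abstract does not give the exact kernel construction, so this assumes the common form K(d1, d2) = d1^T P P^T d2, where P holds the word weights of each topic. The topic weights, the documents, and the function names (`project`, `semantic_kernel`, `bag_of_words_kernel`, `cosine`) are all hypothetical and are not taken from the paper.

```python
import math

def project(doc, topics):
    """Map a sparse term-weight vector onto topic space (one score per topic)."""
    return [sum(w * topic.get(term, 0.0) for term, w in doc.items())
            for topic in topics]

def semantic_kernel(d1, d2, topics):
    """Assumed form K(d1, d2) = d1^T P P^T d2, with P the term-by-topic weights."""
    p1, p2 = project(d1, topics), project(d2, topics)
    return sum(a * b for a, b in zip(p1, p2))

def bag_of_words_kernel(d1, d2):
    """Plain inner product of term vectors, as in the basic vector space model."""
    return sum(w * d2.get(term, 0.0) for term, w in d1.items())

def cosine(d1, d2, kernel):
    """Kernel-induced cosine similarity; returns 0.0 if either norm is zero."""
    denom = math.sqrt(kernel(d1, d1) * kernel(d2, d2))
    return kernel(d1, d2) / denom if denom else 0.0

# Hypothetical topics: each maps words to connection weights from a latent node.
topics = [
    {"election": 0.9, "vote": 0.8, "senate": 0.7},
    {"storm": 0.9, "flood": 0.8, "rain": 0.6},
]

# Hypothetical documents as sparse term-weight vectors.
doc_a = {"election": 1.0, "senate": 1.0}
doc_b = {"vote": 2.0}
doc_c = {"flood": 1.0, "rain": 1.0}

def topic_kernel(x, y):
    return semantic_kernel(x, y, topics)

bow_ab = cosine(doc_a, doc_b, bag_of_words_kernel)  # documents share no terms
sem_ab = cosine(doc_a, doc_b, topic_kernel)         # documents share a topic
sem_ac = cosine(doc_a, doc_c, topic_kernel)         # documents share no topic
print(bow_ab, sem_ab, sem_ac)
```

In this toy setting, `doc_a` and `doc_b` have no words in common, so the basic vector space model treats them as unrelated, while the topic-based kernel scores them as similar because both load on the same topic. This is the kind of concept-level comparison the abstract describes, although the paper's actual weighting may differ.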