[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/JIPS.2008.4.2.067

Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)

Publication Information

Journal of Information Processing Systems / v.4, no.2, 2008, pp. 67-76

Abstract

This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k-means algorithm is modified to operate on string vectors. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area. In text clustering, for example, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. To avoid these problems, we encode full texts into string vectors and adapt the k-means algorithm to them.
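
As a rough illustration of the idea in the abstract, the following Python sketch encodes short texts as string vectors and clusters them with a k-means-style loop. Only the use of string vectors, an inverted index, and a modified k-means algorithm comes from the paper's title and abstract. The function names, the frequency-based encoding, the Jaccard-based word similarity, the position-wise averaging, and the medoid prototype update are all assumptions made for this example, not the author's published method.

```python
from collections import Counter


def encode(text, dim=3):
    """Encode a text as a string vector: its `dim` most frequent words (assumed scheme)."""
    counts = Counter(text.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))[:dim]
    return tuple(ranked + [""] * (dim - len(ranked)))  # pad short texts


def build_inverted_index(texts):
    """Map each word to the set of ids of the texts that contain it."""
    index = {}
    for doc_id, text in enumerate(texts):
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def word_similarity(a, b, index):
    """Assumed measure: Jaccard overlap of the two words' posting sets."""
    if a == "" or b == "":
        return 0.0
    pa, pb = index.get(a, set()), index.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def vector_similarity(u, v, index):
    """Assumed measure: mean of the position-wise word similarities."""
    return sum(word_similarity(a, b, index) for a, b in zip(u, v)) / len(u)


def kmeans_string_vectors(vectors, k, index, max_iter=10):
    """k-means-style loop; each prototype is a cluster medoid (an assumption)."""
    prototypes = list(vectors[:k])  # naive deterministic initialization
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each vector to its most similar prototype.
        new_labels = [
            max(range(k), key=lambda c: vector_similarity(v, prototypes[c], index))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: the new prototype is the member most similar to the others.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                prototypes[c] = max(
                    members,
                    key=lambda m: sum(vector_similarity(m, o, index) for o in members),
                )
    return labels


texts = [
    "stock market stock price",
    "football match football goal",
    "market price market stock",
    "goal match goal football",
]
index = build_inverted_index(texts)
vectors = [encode(t) for t in texts]
labels = kmeans_string_vectors(vectors, k=2, index=index)
print(labels)  # [0, 1, 0, 1]
```

Each text here is represented by three words rather than by a vocabulary-sized numerical vector, which is the kind of saving in dimensionality and sparsity that the abstract refers to. Taking the first k vectors as initial prototypes keeps the example deterministic; a practical implementation would likely use a more careful initialization.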

Keywords

String Vector; K Means Algorithm; Text Clustering;
