http://dx.doi.org/10.3745/JIPS.2008.4.2.067

Inverted Index based Modified Version of K-Means Algorithm for Text Clustering  

Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)
Publication Information
Journal of Information Processing Systems / v.4, no.2, 2008, pp. 67-76
Abstract
This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k-means algorithm is modified to operate on string vectors. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area. In text clustering, for example, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we therefore encode full texts into string vectors and modify the k-means algorithm to be adaptable to string vectors for text clustering.
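The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how string vectors are built or compared, so everything below is an assumption made for illustration rather than the paper's method: a string vector is taken to be the `dim` most frequent words of a text, word-to-word similarity is taken to be the Jaccard overlap of the words' posting sets in an inverted index, and the helper names (`to_string_vector`, `build_inverted_index`, `word_similarity`, `vector_similarity`, `assign`) are hypothetical. Only the assignment step of a k-means-style loop is shown; the prototype update is left out because the abstract gives no detail on it.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def to_string_vector(text, dim=3):
    """Assumed encoding: the `dim` most frequent words, most frequent first."""
    counts = Counter(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:dim]


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def word_similarity(a, b, index):
    """Assumed measure: Jaccard overlap of the two words' posting sets."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def vector_similarity(u, v, index):
    """Assumed measure: average position-wise word similarity."""
    pairs = list(zip(u, v))
    if not pairs:
        return 0.0
    return sum(word_similarity(a, b, index) for a, b in pairs) / len(pairs)


def assign(vectors, prototypes, index):
    """Assignment step: give each string vector to its most similar prototype."""
    return [
        max(range(len(prototypes)),
            key=lambda k: vector_similarity(v, prototypes[k], index))
        for v in vectors
    ]


docs = [
    "stock market stock price market",
    "market price stock price",
    "soccer match soccer goal match",
    "goal match soccer goal",
]
index = build_inverted_index(docs)
vectors = [to_string_vector(d) for d in docs]

# A numerical encoding would need one dimension per vocabulary word,
# while each string vector here keeps only `dim` words.
vocabulary_size = len(index)

# Use the first and third documents as initial prototypes (arbitrary choice).
labels = assign(vectors, [vectors[0], vectors[2]], index)
print(vectors)
print(labels)
```

On a realistic corpus the vocabulary, and hence the numerical dimensionality, grows to many thousands of words with mostly zero entries, whereas the string vector length stays fixed at `dim`; that contrast is what the sketch is meant to show.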
Keywords
String Vector; K Means Algorithm; Text Clustering;