http://dx.doi.org/10.3745/JIPS.04.0164

Combining Distributed Word Representation and Document Distance for Short Text Document Clustering

Kongwudhikunakorn, Supavit (Dept. of Computer Engineering, Faculty of Engineering, Kasetsart University)
Waiyamai, Kitsana (Dept. of Computer Engineering, Faculty of Engineering, Kasetsart University)

Publication Information

Journal of Information Processing Systems / v.16, no.2, 2020, pp. 277-300

Abstract

This paper presents a method for clustering short text documents, such as news headlines, social media statuses, or instant messages. Because these documents are short and sparse, an appropriate technique is required to discover hidden knowledge in them. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded with external knowledge sources in the form of distributed representations. To cluster documents, a K-means partitioning-based clustering technique is applied, in which the similarity of documents is measured by word mover's distance. To validate the effectiveness of the proposed method, experiments were conducted to compare its clustering quality against several leading methods. The proposed method produced document clusters with higher precision, recall, F1-score, and adjusted Rand index on both real-world and standard data sets. Furthermore, manual inspection of the clustering results was conducted to observe the efficacy of the proposed method: the topic of each document cluster is clearly reflected by the members of the cluster.
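
As an illustration of the general idea only (not the authors' implementation), the following minimal sketch clusters a few short texts with a partitioning loop driven by a word-embedding document distance. Several things here are assumptions: the 2-D vectors in `EMB` are hypothetical toy values standing in for pretrained embeddings; the distance is the relaxed word mover's distance, a lower bound on word mover's distance that needs no linear-program solver; and the cluster representative is a medoid rather than a K-means centroid, since the abstract does not state how representatives are updated.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors (assumption); a real system would load
# pretrained distributed representations instead.
EMB = {
    "stocks": (1.0, 0.0), "market": (0.9, 0.1),
    "shares": (0.95, 0.05), "rally": (0.8, 0.2),
    "football": (0.0, 1.0), "match": (0.1, 0.9),
    "goal": (0.05, 0.95), "team": (0.2, 0.8),
}


def nbow(doc):
    """Normalized bag-of-words weights over in-vocabulary tokens.

    Assumes the document contains at least one token found in EMB.
    """
    counts = Counter(w for w in doc.split() if w in EMB)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def one_sided(a, b):
    """Cost of moving each word of `a` entirely to its nearest word in `b`."""
    return sum(
        weight * min(math.dist(EMB[w], EMB[v]) for v in b)
        for w, weight in a.items()
    )


def rwmd(doc_a, doc_b):
    """Relaxed word mover's distance: the larger of the two one-sided bounds."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))


def cluster(docs, seeds, iters=10):
    """Alternate nearest-medoid assignment and medoid re-selection."""
    medoids = list(seeds)
    labels = []
    for _ in range(iters):
        # Assignment step: each document joins its closest medoid.
        labels = [
            min(range(len(medoids)), key=lambda k: rwmd(d, docs[medoids[k]]))
            for d in docs
        ]
        # Update step: the new medoid minimizes total distance to its members.
        new_medoids = []
        for k, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(old)
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(rwmd(docs[i], docs[j]) for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


DOCS = [
    "stocks rally",
    "market shares",
    "football match",
    "team goal",
    "shares rally market",
]
LABELS = cluster(DOCS, seeds=[0, 2])
```

With these toy vectors, the three finance headlines share one label and the two sports headlines share the other, even though the documents in each group have few or no words in common. That is the property that makes an embedding-based distance attractive for short, sparse texts.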

Keywords

Document Clustering; Document Distance; Short Text Documents; Short Text Document Clustering
