http://dx.doi.org/10.7472/jksii.2017.18.2.61

Representative Labels Selection Technique for Document Cluster using WordNet  

Kim, Tae-Hoon (Department of Industrial Engineering, Sungkyunkwan University)
Sohn, Mye (Department of Industrial Engineering, Sungkyunkwan University)
Publication Information
Journal of Internet Computing and Services / v.18, no.2, 2017, pp. 61-73
Abstract
In this paper, we propose a document cluster labeling method that uses the information content of the words in a cluster to help users understand what the cluster implies. We first calculate the weight and frequency of each word; these two measures determine the relative weight of the words within the cluster. As a next step, we identify candidate labels using WordNet: each candidate label is the least common hypernym of words in the cluster. Finally, the representative labels are selected according to the information content and the weight of the words. To evaluate the method, we perform a heuristic experiment using two measures, the suitability of the candidate label ($Suitability_{cl}$) and the appropriacy of the representative label ($Appropriacy_{rl}$). With the proposed method, the suitability of the candidate labels decreases slightly compared with existing methods, but the computational cost is about 20% of that of the conventional methods. We also confirmed that the appropriacy of the representative labels is better than that of the existing methods. As a result, the method is expected to help data analysts interpret document clusters more easily.
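The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy taxonomy `HYPERNYM` stands in for WordNet hypernym links, the word counts are made up, and the scoring rule in `rank_labels` (pair weight multiplied by the label's information content) is an assumption chosen only to show how the two quantities might be combined.

```python
import math

# Hypothetical toy taxonomy (child -> parent), a stand-in for WordNet hypernyms.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def hypernym_path(word):
    """Return [word, parent, grandparent, ..., root] in the toy taxonomy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def least_common_hypernym(a, b):
    """Return the most specific concept on both hypernym paths, or None."""
    ancestors_of_a = set(hypernym_path(a))
    for node in hypernym_path(b):
        if node in ancestors_of_a:
            return node
    return None


def information_content(concept, counts):
    """IC = -log(p), with p the share of word occurrences subsumed by concept."""
    total = sum(counts.values())
    covered = sum(n for w, n in counts.items() if concept in hypernym_path(w))
    return -math.log(covered / total) if covered else float("inf")


def rank_labels(counts):
    """Score candidate labels; the weight * IC rule is an illustrative assumption."""
    words = list(counts)
    total = sum(counts.values())
    scores = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            label = least_common_hypernym(a, b)
            if label is None:
                continue
            pair_weight = (counts[a] + counts[b]) / total
            ic = information_content(label, counts)
            scores[label] = scores.get(label, 0.0) + pair_weight * ic
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up word frequencies for one cluster.
cluster_counts = {"dog": 5, "wolf": 3, "cat": 2}
ranking = rank_labels(cluster_counts)
```

With these made-up counts, the candidate labels are "canine" (from the dog/wolf pair) and "mammal" (from the pairs involving "cat"). Because "mammal" subsumes every word in the cluster, its information content is zero, so the more specific "canine" ranks first under this assumed scoring rule.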
Keywords
Documents Cluster Labeling; Information content; WordNet; Similarity Calculation;