http://dx.doi.org/10.5762/KAIS.2018.19.11.758

Analysis of Massive Scholarly Keywords using Inverted-Index based Bottom-up Clustering  

Oh, Heung-Seon (School of Computer Science and Engineering, KOREATECH)
Jung, Yuchul (Computer Engineering, Kumoh National Institute of Technology)
Publication Information
Journal of the Korea Academia-Industrial cooperation Society / v.19, no.11, 2018, pp. 758-764
Abstract
Digital documents such as patents, scholarly papers, and research reports carry author keywords that summarize their topics. Documents that share the same keywords are likely to describe the same topic. Document clustering groups documents by topic with an unsupervised learning method. Although it is used in a variety of data analyses, its computational complexity makes it difficult to apply to a large number of documents. In that case, keywords offer an efficient way to cluster and connect massive document collections. However, existing bottom-up hierarchical clustering requires a large amount of computation and time to cluster a large number of keywords. This paper proposes an inverted-index based bottom-up clustering method for keywords and analyzes the clustering results on massive keywords extracted from scholarly papers and research reports.
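To illustrate why an inverted index can cut the cost of bottom-up keyword clustering, the following is a minimal sketch and not the paper's algorithm, whose details the abstract does not give. It assumes keywords are split into lowercase terms, that similarity is the Jaccard overlap of term sets, and that clusters are merged single-link above a threshold. The function names, the `threshold` parameter, and the sample keywords are all hypothetical. The point it shows is that only keyword pairs sharing at least one term are ever compared, rather than all pairs.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(keyword):
    """Lowercase a keyword and split it into a set of terms (assumed scheme)."""
    return set(keyword.lower().replace("-", " ").split())


def build_inverted_index(keywords):
    """Map each term to the ids of the keywords that contain it."""
    index = defaultdict(set)
    for kid, keyword in enumerate(keywords):
        for term in tokenize(keyword):
            index[term].add(kid)
    return index


def jaccard(a, b):
    """Jaccard similarity of two non-empty term sets (assumed measure)."""
    return len(a & b) / len(a | b)


def cluster_keywords(keywords, threshold=0.5):
    """Single-link bottom-up clustering restricted to index candidates."""
    tokens = [tokenize(k) for k in keywords]
    index = build_inverted_index(keywords)

    # Candidate pairs: only keywords that share at least one term.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    # Union-find structure that tracks which cluster each keyword is in.
    parent = list(range(len(keywords)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the most similar pairs first, stopping below the threshold.
    scored = sorted(
        ((jaccard(tokens[a], tokens[b]), a, b) for a, b in candidates),
        reverse=True,
    )
    for sim, a, b in scored:
        if sim < threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for kid, keyword in enumerate(keywords):
        clusters[find(kid)].append(keyword)
    return list(clusters.values())


if __name__ == "__main__":
    sample = [
        "keyword clustering",
        "Keyword Clustering",
        "bottom-up clustering",
        "information retrieval",
        "patent analysis",
    ]
    for cluster in cluster_keywords(sample, threshold=0.5):
        print(cluster)
```

In this sketch the number of similarity computations depends on how many keywords share each term, so very frequent terms still produce many candidate pairs; a practical system would likely need to limit or weight such terms.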
Keywords
Keyword clustering; Inverted-index; keyword analysis; bottom-up clustering; information retrieval