http://dx.doi.org/10.3745/KTSDE.2018.7.12.477

A Study on Research Paper Classification Using Keyword Clustering  

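Lee, Yun-Soo (Department of Computer and Information Communication Engineering, Daegu Catholic University)
Pheaktra, They (Department of Computer and Information Communication Engineering, Daegu Catholic University)
Lee, JongHyuk (Department of Big Data Engineering, Daegu Catholic University)
Gil, Joon-Min (School of IT Engineering, Daegu Catholic University)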
Publication Information
KIPS Transactions on Software and Data Engineering / v.7, no.12, 2018, pp. 477-484
Abstract
Advances in computer and information technologies have led to the publication of an enormous number of papers. As new research fields continue to emerge, users have considerable difficulty finding and categorizing the papers that interest them. To alleviate this difficulty, this paper presents a method that groups similar papers by clustering. The method extracts primary keywords from the abstract of each paper using TF-IDF, then applies the K-means clustering algorithm to the resulting TF-IDF values to cluster papers with similar content. To demonstrate the practicality of the proposed method, we use papers from the Future Generation Computer Systems (FGCS) journal as real-world data. On these data, we determine the number of clusters with the elbow method and evaluate clustering performance with the silhouette method.
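
The abstract gives no implementation details, so the following is only a minimal sketch of the pipeline it describes (TF-IDF keyword weighting, K-means clustering, elbow and silhouette checks), written in plain Python. The tokenizer (lowercase whitespace split), the unsmoothed IDF formula, the seeding of centroids with the first k vectors, and the four toy "abstracts" are all assumptions made for illustration; none of them come from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({t for tokens in tokenized for t in tokens})
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # tf = relative term count, idf = log(N / df), no smoothing (assumption).
        vectors.append([(counts[t] / len(tokens)) * math.log(n_docs / df[t])
                        for t in vocab])
    return vocab, vectors


def top_keywords(vocab, vector, n=1):
    """Terms with the highest TF-IDF weight in one document."""
    ranked = sorted(zip(vector, vocab), reverse=True)
    return [term for _, term in ranked[:n]]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]


def kmeans(vectors, k, n_iter=20):
    """Lloyd's algorithm; returns (labels, inertia)."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic seeding (assumption)
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(vectors, centroids)
    # Inertia (within-cluster sum of squares) is the quantity an elbow plot tracks.
    inertia = sum(sq_dist(v, centroids[lab]) for v, lab in zip(vectors, labels))
    return labels, inertia


def silhouette(vectors, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}
        for j, w in enumerate(vectors):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.sqrt(sq_dist(v, w)))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single cluster: score 0
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for paper abstracts (hypothetical data, not from FGCS).
docs = [
    "cloud resource scheduling cloud",
    "neural network training",
    "cloud scheduling workflow",
    "neural network inference",
]
vocab, vectors = tfidf_vectors(docs)
inertias = {k: kmeans(vectors, k)[1] for k in range(1, 5)}              # elbow curve
silhouettes = {k: silhouette(vectors, kmeans(vectors, k)[0]) for k in (2, 3)}
```

Plotting `inertias` against k and looking for the point where the decrease flattens is the elbow check, while `silhouettes` gives a per-k quality score in [-1, 1]. A real corpus would need stop-word removal, a proper tokenizer, and randomized or k-means++ seeding rather than the shortcuts taken here.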
Keywords
Classification Papers; K-Means Clustering; TF-IDF; Map-Reduce