http://dx.doi.org/10.5351/KJAS.2019.32.3.451

Creation and clustering of proximity data for text data analysis  

Jung, Min-Ji (Department of Statistics, Pusan National University)
Shin, Sang Min (Department of Management Information Systems, Dong-A University)
Choi, Yong-Seok (Department of Statistics, Pusan National University)
Publication Information
The Korean Journal of Applied Statistics / v.32, no.3, 2019, pp. 451-462
Abstract
A document-term frequency matrix is a type of data used in text mining. This matrix is often built from the various documents provided by the objects to be analyzed. When analyzing objects with this matrix, researchers generally select as keywords only those terms that are common to the documents belonging to one object, and then use these keywords to analyze the object. However, this approach loses the unique information in individual documents and removes potential keywords that occur frequently in a specific document. In this study, we define data that can overcome this problem as proximity data. We introduce twelve methods for generating proximity data and cluster the objects using two methods: multidimensional scaling and k-means cluster analysis. Finally, we choose the method best suited to clustering the objects.
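As a rough illustration of what proximity data between documents can look like, the following minimal Python sketch builds TF-IDF vectors from tokenized documents and computes a pairwise dissimilarity matrix. The choice of cosine distance, the log(N / df) weighting, the toy documents, and all function names are assumptions made for this example; the abstract does not specify the twelve generation methods, and this sketch should not be read as any one of them. It also treats each document as its own object, which is a simplification.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def proximity_matrix(docs):
    """Return the symmetric matrix of pairwise cosine distances."""
    _, vectors = tfidf_vectors(docs)
    n = len(vectors)
    return [
        [0.0 if i == j else cosine_distance(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy documents, already tokenized by whitespace.
toy_docs = [
    "text mining of speech data".split(),
    "text mining of review data".split(),
    "cluster analysis with scaling".split(),
]
D = proximity_matrix(toy_docs)
```

A matrix such as `D` is the kind of input that multidimensional scaling accepts directly, while k-means would typically be run on coordinates derived from it. In this toy case the first two documents share most of their terms, so they end up closer to each other than either is to the third.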
Keywords
text mining; proximity data; TF-IDF; multidimensional scaling; cluster analysis