Creation and clustering of proximity data for text data analysis

  • Jung, Min-Ji (Department of Statistics, Pusan National University) ;
  • Shin, Sang Min (Department of Management Information Systems, Dong-A University) ;
  • Choi, Yong-Seok (Department of Statistics, Pusan National University)
  • Received : 2019.03.04
  • Accepted : 2019.04.17
  • Published : 2019.06.30

Abstract

A document-term frequency matrix is a type of data commonly used in text mining. It is built from documents provided by the objects to be analyzed. When analyzing objects with this matrix, researchers generally select as keywords only those terms that are common across the documents belonging to an object, and then use these keywords to characterize the object. However, this approach discards the unique information in individual documents and removes potential keywords that occur frequently in only a specific document. In this study, we define data that can overcome this problem as proximity data. We introduce twelve methods for generating proximity data and cluster the objects using two methods, multidimensional scaling and k-means cluster analysis. Finally, we identify which of the twelve methods is best suited to clustering the objects.

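A document-term frequency matrix is a type of data widely used in text mining, and it is built from documents provided by multiple objects. However, most researchers give little weight to object information and tend to focus on how to effectively find the key terms among the common terms that appear across many documents. When keywords are selected from the common terms, important terms that appear only in specific documents are excluded as early as the common-term selection stage, and the unique information carried by individual documents is lost. In this study, we define data that can overcome these problems as proximity data. We then propose, from among twelve methods for generating proximity data, the one best suited to clustering objects. Multidimensional scaling and k-means cluster analysis are used as the clustering algorithms for characterizing the objects.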
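
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy counts, the function names (`tfidf`, `euclidean_distances`, `classical_mds`, `kmeans`), the use of TF-IDF weighting with Euclidean distance as one possible way to generate proximity data, and the farthest-point initialization in k-means are all hypothetical choices. The twelve generation methods compared in the paper are not reproduced here, and each row is simply treated as one object.

```python
import numpy as np

def tfidf(counts):
    """Weight a document-term count matrix by TF-IDF (one common variant)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)      # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # terms in every document get 0
    return counts * idf

def euclidean_distances(x):
    """Pairwise Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(dist, n_components=2):
    """Classical MDS: double-center the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def kmeans(x, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point start
    (an illustrative choice, not the initialization used in the paper)."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical toy counts: 6 objects (rows) by 5 terms (columns).
# The last term occurs in every row, so TF-IDF weights it to zero.
counts = np.array([
    [5, 3, 0, 1, 2],
    [4, 4, 1, 0, 2],
    [5, 2, 0, 0, 2],
    [0, 1, 5, 3, 2],
    [1, 0, 4, 4, 2],
    [0, 0, 5, 2, 2],
])

weighted = tfidf(counts)               # document-term weighted matrix
dist = euclidean_distances(weighted)   # proximity (dissimilarity) data
coords = classical_mds(dist)           # 2-D configuration for plotting
labels = kmeans(coords, k=2)           # cluster membership per object
print(labels)
```

Swapping `euclidean_distances` for another dissimilarity, or `tfidf` for another weighting, changes the proximity data while leaving the MDS and k-means steps untouched, which is the kind of comparison the abstract describes.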

Keywords

Figure 2.1. The process of finding one elbow point. TF-IDF = term frequency-inverse document frequency.

Figure 3.1. Applications of multidimensional scaling using different distance measures. dED = Euclidean distance; dCD = chi-square distance; dWED = weighted Euclidean distance.

Figure 3.2. Applications of multidimensional scaling using different weighting and filtering methods.

Table 2.1. The methods for creating a document-keyword weighted matrix

Table 2.2. Proximity data generated by 12 methods

Table 3.1. The average silhouette statistics according to k

Table 3.2. Real classification of the nineteen government-funded research institutes

Table 3.3. The calculation result of Γ and MIR
