A Methodology for Automatic Multi-Categorization of Single-Categorized Documents

단일 카테고리 문서의 다중 카테고리 자동확장 방법론

  • Hong, Jin-Sung (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (Graduate School of Business IT, Kookmin University) ;
  • Lee, Sangwon (Division of Information and Electric Commerce, Wonkwang University)
  • 홍진성 (국민대학교 비즈니스IT전문대학원) ;
  • 김남규 (국민대학교 비즈니스IT전문대학원) ;
  • 이상원 (원광대학교 경영대학 정보전자상거래학부)
  • Received : 2014.06.15
  • Accepted : 2014.06.23
  • Published : 2014.09.30

Abstract

The rapid growth of social media and Internet use has produced a vast number of documents containing unstructured data and text. Each document is usually assigned a specific category for the convenience of users. In the past, categorization was performed manually. Manual categorization, however, cannot guarantee accuracy and requires a great deal of time and cost. Many studies have therefore pursued automatic categorization to overcome the limitations of manual categorization. Unfortunately, most of these methods assume that a document belongs to exactly one category, so they cannot be applied to complex documents that cover multiple topics. To overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, these methods are limited in that their learning process requires a multi-categorized document set for training, so they cannot be applied to most documents unless multi-categorized training sets are provided.

To remove the requirement of a multi-categorized training set imposed by traditional multi-categorization algorithms, we propose a new methodology that extends the category of a single-categorized document to multiple categories by analyzing the relationships among categories, topics, and documents. First, we find the relationship between documents and topics by using the results of topic analysis on single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate the matching score of each document against multiple categories. The resulting scores imply that a document can be classified into a category if and only if its matching score for that category is higher than a predefined threshold. For example, a document is classified into three categories if three categories have matching scores above the threshold. The main contribution of our study is that the methodology improves the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology.

For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized by theme, and they contain less vulgar language and slang than other typical text documents. We collected news articles published from July 2012 to June 2013. The number of articles varies widely across categories, both because readers have different levels of interest in each category and because events occur with different frequency in each category. To minimize distortion caused by unequal numbers of articles per category, we extracted 3,000 articles from each of the eight categories, for a total of 24,000 articles. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." Using the collected articles, we calculated document/category correspondence scores from the topic/category and document/topic correspondence scores. The document/category correspondence score indicates the degree to which each document corresponds to a given category. As a result, we could present two additional categories for each of 23,089 documents.

Precision, recall, and F-score were 0.605, 0.629, and 0.617, respectively, when only the top 1 predicted category was evaluated, and 0.838, 0.290, and 0.431 when the top 1 to 3 predicted categories were considered. Notably, precision, recall, and F-score varied widely across the eight categories.
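The three steps described in the abstract (document/topic scores, a topic/category correspondence table, and thresholded document/category matching scores) can be illustrated with a minimal sketch. The abstract does not give the actual scoring formulas, so the weighted-sum rule, the function names, the topic labels, and all numbers below are assumptions made purely for illustration.

```python
# Hypothetical sketch: the scoring rule (sum of products), the names, and the
# numbers are assumptions for illustration, not the paper's actual formulas.

def matching_scores(doc_topic, topic_category):
    """Score one document against every category.

    doc_topic: {topic: weight} for a single document (from topic analysis).
    topic_category: {topic: {category: weight}} correspondence table.
    """
    scores = {}
    for topic, d_weight in doc_topic.items():
        for category, c_weight in topic_category.get(topic, {}).items():
            scores[category] = scores.get(category, 0.0) + d_weight * c_weight
    return scores


def extend_categories(scores, original, threshold):
    """Return categories other than the original whose score exceeds the threshold, best first."""
    extra = [(c, s) for c, s in scores.items() if c != original and s > threshold]
    return [c for c, _ in sorted(extra, key=lambda cs: cs[1], reverse=True)]


# Toy topic/category correspondence table (made-up topics and weights).
topic_category = {
    "t_market": {"Economy": 0.8, "Politics": 0.2},
    "t_gadget": {"IT Science": 0.9, "Economy": 0.1},
    "t_match": {"Sports": 1.0},
}
# Toy document/topic scores for one article originally filed under "IT Science".
doc_topic = {"t_market": 0.5, "t_gadget": 0.4, "t_match": 0.1}

scores = matching_scores(doc_topic, topic_category)
extra = extend_categories(scores, original="IT Science", threshold=0.15)
print(scores)
print(extra)
```

In this toy example only "Economy" clears the threshold, so the article would gain one additional category. How the threshold is chosen, and whether scores are normalized before comparison, is not stated in the abstract.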

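To improve users' access to text, these documents are classified into categories according to set criteria. In the past, categorization was performed manually, but this posed difficulties: when classification is left to document authors, its accuracy cannot be guaranteed, and when an administrator handles all classification, it takes a great deal of time and cost. To overcome these limitations, document classification techniques that identify categories automatically have been actively studied. However, most document classification techniques assume that each document belongs to only one category, and so do not match real situations in which a single document covers several topics. To address this, some recent studies have tackled multi-category identification, but most of them generate classification rules by learning from documents that already have multiple categories assigned, so they cannot be applied to identifying multiple categories for existing documents that carry only a single category. To overcome this constraint, this study proposes a methodology that analyzes the relationships among categories, topics, and documents to discover additional topics in single-categorized documents and automatically extend them to multiple categories. In the experiment, categories could be extended for 23,089 of the 24,000 documents whose original category was identified. The accuracy analysis also showed that classification accuracy varies with the characteristics of the category. By additionally identifying and assigning multiple categories to single-categorized documents, this study is expected to greatly improve the usability of existing multi-category document classification algorithms, which require multi-categorized documents during rule learning.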
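The precision, recall, and F-score figures reported in the English abstract can be checked for internal consistency. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state which variant was used, but the reported values agree with this assumption to three decimal places.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F1; an assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
top1 = f_score(0.605, 0.629)   # top 1 predicted category only
top3 = f_score(0.838, 0.290)   # top 1 to 3 predicted categories

print(round(top1, 3), round(top3, 3))  # prints: 0.617 0.431
```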

Cited by

  1. Mapping Categories of Heterogeneous Sources Using Text Analytics vol.22, pp.4, 2016, https://doi.org/10.13088/jiis.2016.22.4.193
  2. Analyzing the Issue Life Cycle by Mapping Inter-Period Issues vol.20, pp.4, 2014, https://doi.org/10.13088/jiis.2014.20.4.25
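  3. Re-exploring the Causes of the Employment Crisis through Topic Analysis of Job Seekers vol.35, pp.1, 2016, https://doi.org/10.29214/damis.2016.35.1.005
  4. A Study on an Automatic Document Classification Model Based on the Korean Standard Industrial Classification vol.24, pp.3, 2018, https://doi.org/10.13088/jiis.2018.24.3.221
  5. A Study on the Effect of Term Dictionary Characteristics on Document Classification Accuracy vol.37, pp.4, 2018, https://doi.org/10.29214/damis.2018.37.4.003
  6. Construction and Experiments of a Cover Letter Classification Model Based on the Doc2Vec Model vol.19, pp.1, 2020, https://doi.org/10.9716/kits.2020.19.1.103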