http://dx.doi.org/10.13088/jiis.2018.24.3.221

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification  

Lee, Jae-Seong (University of Science & Technology)
Jun, Seung-Pyo (Div. of Data Analysis, Korea Institute of Science & Technology Information/University of Science & Technology)
Yoo, Hyoung Sun (Div. of Data Analysis, Korea Institute of Science & Technology Information/University of Science & Technology)
Publication Information
Journal of Intelligence and Information Systems / v.24, no.3, 2018, pp. 221-241
Abstract
As we enter the knowledge society, information is increasingly emphasized as a new form of capital. Classifying information is also becoming more important for efficiently managing digital information, which is being produced at an exponential rate. In this study, we aimed to automatically classify and provide tailored information that can help companies make technology commercialization decisions. To this end, we propose a method for classifying information based on the Korean Standard Industrial Classification (KSIC), which indicates the business characteristics of enterprises. Information and document classification has largely relied on machine learning, but there is not enough training data categorized according to KSIC. This study therefore applied a method based on calculating the similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory text of each KSIC code and calculating its similarity to the document to be classified using the vector space model. International Patent Classification (IPC) data were collected and classified by KSIC, and the methodology was then verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. In the verification, the highest agreement was obtained when the LT method, a variant of the TF-IDF calculation formula, was applied. In that case, the first-ranked KSIC code matched in 53% of cases, and the cumulative match rate through the fifth rank was 76%. These results confirm that the technology, industry, and market information that SMEs need can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support experts' qualitative judgment when creating linkage tables between heterogeneous classification systems.
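The similarity-based ranking described in the abstract can be illustrated with a minimal sketch. It makes several assumptions not stated in the abstract: "LT" is read as log-scaled term frequency multiplied by inverse document frequency (the SMART-style notation), tokenization is a naive whitespace split (Korean text would need a morphological analyzer), and all function names, code labels, and description texts are hypothetical placeholders rather than actual KSIC content or the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace split, for illustration only.
    return text.lower().split()


def lt_weights(tokens, df, n_docs):
    # Assumed reading of "LT": (1 + ln tf) * ln(N / df).
    # Terms that never occur in the code descriptions are ignored.
    tf = Counter(tokens)
    return {
        term: (1 + math.log(count)) * math.log(n_docs / df[term])
        for term, count in tf.items()
        if term in df
    }


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_codes(document, code_descriptions):
    # Build one vector per code description, then rank codes by
    # similarity to the document, highest first.
    tokenized = {c: tokenize(t) for c, t in code_descriptions.items()}
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    n_docs = len(tokenized)
    doc_vec = lt_weights(tokenize(document), df, n_docs)
    scores = [
        (code, cosine(doc_vec, lt_weights(toks, df, n_docs)))
        for code, toks in tokenized.items()
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)


def top_k_match_rate(rankings, expected, k):
    # Share of documents whose expected code appears within the top k ranks.
    hits = sum(
        1
        for doc_id, ranked in rankings.items()
        if expected[doc_id] in [code for code, _ in ranked[:k]]
    )
    return hits / len(rankings)


# Hypothetical toy data, not real KSIC explanatory text.
descriptions = {
    "code_A": "manufacture of semiconductor devices and integrated circuits",
    "code_B": "growing of cereal crops and vegetables",
    "code_C": "manufacture of pharmaceutical preparations and medicines",
}
ranked = rank_codes(
    "method for fabricating a semiconductor integrated circuit", descriptions
)
```

Here `ranked` is a list of `(code, score)` pairs sorted by similarity, and `top_k_match_rate` corresponds to the first-rank and cumulative fifth-rank agreement figures when called with `k=1` and `k=5` against a reference concordance.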
Keywords
Automatic Document Classification; Korea Standard Industry Classification; Text mining; Vector space model; Natural language processing