http://dx.doi.org/10.29214/damis.2018.37.4.003

Analyzing the Effect of Characteristics of Dictionary on the Accuracy of Document Classifiers  

Jung, Haegang (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (School of Management Information Systems, Kookmin University)
Publication Information
Management & Information Systems Review / v.37, no.4, 2018, pp. 41-62
Abstract
As the volume of unstructured data grows through social media, Internet news articles, and blogs, text analysis and research on it are becoming increasingly important. Because text analysis is mostly performed on a specific domain or topic, constructing and applying a domain-specific dictionary has become more important. The quality of a dictionary directly affects the results of unstructured data analysis, and it matters all the more because the dictionary determines the perspective of the analysis. Most studies on text analysis have emphasized the importance of dictionaries for obtaining clean, high-quality results. However, the effect of dictionaries has not been rigorously verified, even though the dictionary is regarded as the most essential factor in text analysis. In this paper, we generate three dictionaries from 39,800 news articles in different ways, define the concept of Intrinsic Rate, and use it to analyze and verify the effect of each dictionary on the accuracy of document classification. The three construction methods are: 1) a batch construction method that builds a dictionary based on the frequency of terms across all documents; 2) a method that extracts terms by category and integrates them; and 3) a method that extracts features for each category and integrates them. To evaluate the quality of the dictionaries, we compared the accuracy of three artificial neural network-based document classifiers. In the experiment, accuracy tended to increase when the Intrinsic Rate was high, which suggests that the accuracy of document classification can be improved by increasing the Intrinsic Rate of the dictionary.
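The difference between the first two construction methods can be illustrated with a minimal sketch. The abstract does not specify how terms are ranked or how many are kept, so the top-k selection by raw frequency, the toy corpus, and the function names `batch_dictionary` and `per_category_dictionary` are all assumptions made for illustration. The third method is omitted because the feature-extraction technique it uses is not described here.

```python
from collections import Counter

# Toy corpus of (category, tokens) pairs. Hypothetical data, not from the paper.
docs = [
    ("sports", ["game", "team", "game", "player"]),
    ("sports", ["team", "game", "player"]),
    ("sports", ["game", "player", "team"]),
    ("economy", ["market", "stock", "market"]),
]


def top_terms(counts, size):
    """Return the `size` most frequent terms, breaking ties alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:size]}


def batch_dictionary(docs, size):
    """Method 1 (assumed form): rank terms by frequency over all documents."""
    counts = Counter(tok for _, toks in docs for tok in toks)
    return top_terms(counts, size)


def per_category_dictionary(docs, size_per_category):
    """Method 2 (assumed form): rank terms within each category, then merge."""
    by_category = {}
    for category, toks in docs:
        by_category.setdefault(category, Counter()).update(toks)
    merged = set()
    for counts in by_category.values():
        merged |= top_terms(counts, size_per_category)
    return merged


batch = batch_dictionary(docs, 3)
merged = per_category_dictionary(docs, 2)
print(sorted(batch))
print(sorted(merged))
```

In this toy corpus the larger category dominates the batch dictionary, so no economy term is selected, while the per-category dictionary keeps terms from both categories. This is only meant to show why the construction method can change which terms a classifier sees; it makes no claim about the results reported in the paper.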
Keywords
Text Mining; Start List; Document Classification; Topic Modeling; Intrinsic Rate
Times Cited By KSCI: 10