[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.6109/jicce.2013.11.4.268

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

Park, So-Young (Department of Game Design and Development, Sangmyung University)
Chang, Juno (Department of Game Design and Development, Sangmyung University)
Kihl, Taesuk (Department of Game Design and Development, Sangmyung University)

Publication Information

Journal of information and communication convergence engineering / v.11, no.4, 2013, pp. 268-273

Abstract

In this paper, we propose a document classification model that uses Web documents as part of the training corpus in order to resolve the imbalance in training corpus size across categories. To retrieve Web documents closely related to each category, the model calculates a matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features with the category title. The model then sends each combined query to the open application programming interface (API) of a Web search engine and receives the snippet results that the engine retrieves. Finally, the model adds these snippets to the training corpus as Web documents. Experimental results show that balancing the training corpus size per category improves performance in some categories with small training sets.
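
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the matching score, so a smoothed document-frequency ratio is used as a stand-in, and the stopping rule (fill each category up to the size of the largest one) is also assumed. The names `matching_scores`, `build_queries`, `balance_corpus`, and the `search_fn` callable that stands in for the Web search engine's open API are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# category title -> list of training documents (plain strings)
Corpus = Dict[str, List[str]]


def matching_scores(corpus: Corpus, category: str) -> Dict[str, float]:
    """Score each word of `category` by how strongly it is tied to that category.

    Assumption: a smoothed document-frequency ratio (inside vs. outside the
    category) stands in for the matching score, which the abstract leaves open.
    """
    inside: Counter = Counter()
    outside: Counter = Counter()
    for cat, docs in corpus.items():
        target = inside if cat == category else outside
        for doc in docs:
            # Count each word once per document (document frequency).
            target.update(set(doc.lower().split()))
    n_in = len(corpus[category])
    n_out = sum(len(docs) for cat, docs in corpus.items() if cat != category)
    return {
        word: ((inside[word] + 1) / (n_in + 2)) / ((outside[word] + 1) / (n_out + 2))
        for word in inside
    }


def build_queries(corpus: Corpus, category: str, top_k: int = 3) -> List[str]:
    """Combine the category title with each of the top_k highest-scoring words."""
    scores = matching_scores(corpus, category)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [f"{category} {word}" for word in ranked[:top_k]]


def balance_corpus(
    corpus: Corpus,
    search_fn: Callable[[str], List[str]],
    top_k: int = 3,
) -> Corpus:
    """Return a copy of `corpus` in which small categories are padded with snippets.

    `search_fn` is a hypothetical wrapper around a search engine API that maps
    a query string to a list of snippet strings.
    """
    target = max(len(docs) for docs in corpus.values())
    balanced = {cat: list(docs) for cat, docs in corpus.items()}
    for cat in corpus:
        if len(balanced[cat]) >= target:
            continue
        for query in build_queries(corpus, cat, top_k):
            for snippet in search_fn(query):
                if len(balanced[cat]) >= target:
                    break
                balanced[cat].append(snippet)
    return balanced
```

Passing a stub such as `lambda q: ["snippet for " + q]` as `search_fn` lets the flow be tried offline; a real deployment would replace it with a call to whichever search API is available, and the balanced corpus would then be fed to the classifier's training step.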

Keywords

Document classification; Query generation; Text processing; Web documents;

Citations & Related Records

Times Cited By KSCI: 4

Reference
Cited By KSCI

1	K. Nyberg, T. Raiko, T. Tiinanen, and E. Hyvonen, "Document classification utilising ontologies and relations between documents," in Proceedings of the 8th Workshop on Mining and Learning with Graphs, Washington, DC, pp. 86-93, 2010.
2	R. K. Ayyasamy, B. Tahayna, S. Alhashmi, S. Eu-Gene, and S. Egerton, "Mining Wikipedia knowledge to improve document indexing and classification," in Proceedings of the 10th International Conference on Information Sciences, Signal Processing and their Applications, Kuala Lumpur, Malaysia, pp. 806-809, 2010.
3	R. Ferreira, F. Freitas, P. Brito, J. Melo, R. Lima, and E. Costa, "RetriBlog: an architecture-centered framework for developing blog crawlers," Expert Systems with Applications, vol. 40, no. 4, pp. 1177-1195, 2013.
4	S. Park, C. W. Kim, and D. U. An, "E-mail classification and category reorganization using dynamic category hierarchy and PCA," Journal of Information and Communication Engineering, vol. 7, no. 3, pp. 351-355, 2009.
5	H. Yun, "Classifying temporal topics with similar patterns on Twitter," Journal of Information and Communication Engineering, vol. 9, no. 3, pp. 295-300, 2011.
6	H. Yun, "Quantifying influence in social networks and news media," Journal of Information and Communication Convergence Engineering, vol. 10, no. 2, pp. 135-140, 2012.
7	B. Baharudin, L. H. Lee, and K. Khan, "A review of machine learning algorithms for text-documents classification," Journal of Advances in Information Technology, vol. 1, no. 1, pp. 4-20, 2010.
8	T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers, "Statistical topic models for multi-label document classification," Machine Learning, vol. 88, no. 1-2, pp. 157-208, 2012.
9	G. Lu, P. Huang, L. He, C. Cu, and X. Li, "A new semantic similarity measuring method based on Web search engines," WSEAS Transactions on Computers, vol. 9, no. 1, pp. 1-10, 2010.
10	Z. Jialei, C. G. Hwang, G. D. Jung, and Y. K. Choi, "A design of K-XMDR search system using topic maps," Journal of Information and Communication Engineering, vol. 9, no. 3, pp. 287-294, 2011.
11	S. Samarawickrama and L. Jayaratne, "Automatic text classification and focused crawling," in Proceedings of the 6th International Conference on Digital Information Management, Melbourne, Australia, pp. 143-148, 2011.
12	A. K. McCallum, MALLET: a machine learning for language toolkit [Internet]. Available: http://mallet.cs.umass.edu.
13	A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
14	J. H. Lim, Y. S. Hwang, S. Y. Park, and H. C. Rim, "Semantic role labeling using maximum entropy model," in Proceedings of the Conference on Computational Natural Language Learning, Boston, MA, pp. 122-125, 2004.
15	H. Tan, T. Zhao, H. Wang, and W. P. Hong, "Identification of Chinese event types based on local feature selection and explicit positive & negative feature combination," International Journal of the Korean Institute of Maritime Information and Communication Sciences, vol. 5, no. 3, pp. 233-238, 2007.
16	Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, pp. 412-420, 1997.
17	S. Y. Park, J. Chang, and T. Kihl, "Application of Web search results for document classification," in Future Information Communication Technology and Applications, Heidelberg, Germany: Springer, pp. 293-298, 2013.
18	K. Seki and J. Mostafa, "An application of text categorization methods to gene ontology annotation," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, pp. 138-145, 2005.
19	T. Kihl, J. Chang, and S. Y. Park, "Application tag system based on experience and pleasure for hedonic searches," in Convergence and Hybrid Information Technology, Heidelberg, Germany: Springer, pp. 342-352, 2012.

4	Eun-Jee Song. (2015) Journal of the Korea Institute of Information and Communication Engineering, "The Sensitivity Analysis for Customer Feedback on Social Media" / 19 (4), 780
3	Jeongman Heo. (2014) Journal of the Korea Society of Computer and Information, "Word Cluster-Based Mobile Application Categorization" / 19 (3), 17
4	Jeong-Man Heo. (2013) Journal of the Korea Society of Computer and Information, "A Two-Step Clustering Method Considering Mobile App Trends" / 20 (4), 17
10	Hyo-Soon Kong. (2013) Journal of the Korea Institute of Information and Communication Engineering, "A Study on the Evaluation of Travel Agencies Using Social Big Data" / 19 (10), 2241
1	(2013) Journal of information and communication convergence engineering, "Keyword Analysis Based Document Compression System" / 16 (1), 48