http://dx.doi.org/10.6109/jicce.2013.11.4.268

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category  

Park, So-Young (Department of Game Design and Development, Sangmyung University)
Chang, Juno (Department of Game Design and Development, Sangmyung University)
Kihl, Taesuk (Department of Game Design and Development, Sangmyung University)
Abstract
In this paper, we propose a document classification model that uses Web documents as part of the training corpus in order to resolve the imbalance in training corpus size across categories. To retrieve Web documents closely related to each category, the model calculates a matching score between each word feature and each category, and generates Web search queries by combining the higher-ranked word features with the category title. The model then sends each combined query to the open application programming interface (API) of a Web search engine and receives the retrieved snippet results. Finally, it adds these snippets to the training corpus as Web documents. Experimental results show that a method that balances the training corpus size per category performs better in some categories with small training sets.
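The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the matching score, so a count-weighted log ratio of P(word | category) to P(word) is assumed here; one query per top-ranked word is assumed; and `search_fn` is a hypothetical callable standing in for the search engine's open API. Capping the added snippets at the size of the largest category is also an assumption.

```python
import math
from collections import Counter


def matching_scores(corpus, category):
    """Score each word seen in `category`.

    Assumed score (not taken from the paper):
    count(w, c) * log(P(w | c) / P(w)).
    `corpus` is a list of (category_title, text) pairs.
    """
    in_cat, overall = Counter(), Counter()
    for label, text in corpus:
        words = text.lower().split()
        overall.update(words)
        if label == category:
            in_cat.update(words)
    n_cat, n_all = sum(in_cat.values()), sum(overall.values())
    return {
        w: c * math.log((c / n_cat) / (overall[w] / n_all))
        for w, c in in_cat.items()
    }


def build_queries(corpus, category, top_k=3):
    """Combine the category title with each of the top_k ranked words."""
    scores = matching_scores(corpus, category)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return [f"{category} {w}" for w in ranked]


def balance(corpus, search_fn, top_k=3):
    """Add search snippets to categories smaller than the largest one.

    `search_fn(query)` is a hypothetical stand-in for the Web search API;
    it must return a list of snippet strings.
    """
    sizes = Counter(label for label, _ in corpus)
    target = max(sizes.values())
    balanced = list(corpus)
    for category, size in sizes.items():
        if size >= target:
            continue
        snippets = [
            (category, snippet)
            for query in build_queries(corpus, category, top_k)
            for snippet in search_fn(query)
        ]
        # Add no more snippets than needed to reach the target size.
        balanced.extend(snippets[: target - size])
    return balanced
```

In practice, `search_fn` would wrap whichever search API is available, and the balanced corpus would then be passed to the classifier's training step.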
Keywords
Document classification; Query generation; Text processing; Web documents