Issues and Empirical Results for Improving Text Classification

  • Received: 2011.02.28
  • Reviewed: 2011.03.28
  • Published: 2011.06.30

Abstract

Automatic text classification has a long history, and many studies have been conducted in this field. In particular, a variety of machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Although much technical progress has been made, there is still room for improvement. This paper discusses three remaining issues for improving text classification: automatic training data generation, noisy data treatment, and term weighting and indexing. Four studies and their empirical results addressing these issues are introduced. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. Second, for effective noisy data treatment, a noisy data reduction method and a text classifier that is robust to noisy data are developed. Finally, the term weighting and indexing technique is revised by reflecting the importance of sentences in term weight calculation using summarization techniques.
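As a rough sketch of the last idea, the example below scales each term occurrence by an estimated importance of the sentence it appears in before counting term frequencies. The importance heuristic (word overlap between a sentence and the whole document), the function names, and the normalization are assumptions made for illustration only; they are not the formulas used in the studies cited below.

```python
# Minimal sketch: sentence-importance-weighted term frequencies.
# The importance heuristic here is an illustrative stand-in for a
# summarization-based score, not the exact method from the cited studies.
from collections import Counter

def sentence_importance(sentences):
    """Score each sentence by its average word overlap with the whole document."""
    doc_counts = Counter(w for s in sentences for w in s.split())
    scores = []
    for s in sentences:
        words = s.split()
        if not words:
            scores.append(0.0)
            continue
        scores.append(sum(doc_counts[w] for w in words) / len(words))
    # Normalize to [0, 1] so the scores act as per-sentence weights.
    top = max(scores) if scores else 1.0
    if top == 0.0:
        top = 1.0
    return [score / top for score in scores]

def weighted_term_freq(document):
    """Term frequencies where each occurrence counts in proportion to
    the importance of the sentence it appears in."""
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    weights = sentence_importance(sentences)
    tf = Counter()
    for sentence, weight in zip(sentences, weights):
        for term in sentence.split():
            tf[term] += weight  # important sentences contribute more
    return tf

if __name__ == "__main__":
    doc = ("Text classification assigns documents to predefined categories. "
           "Machine learning algorithms learn classifiers from labeled documents. "
           "The weather was pleasant that day.")
    print(weighted_term_freq(doc).most_common(5))
```

In a full indexing pipeline, these weighted frequencies would then feed the usual TF-IDF computation in place of raw counts.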

Keywords

References

  1. F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, Mar. 2002. https://doi.org/10.1145/505282.505283
  2. Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, 1997, pp. 412-420.
  3. Y. Ko and J. Seo, "Automatic text categorization by unsupervised learning," Proceedings of the 18th Conference on Computational Linguistics, Saarbrucken, Germany, 2000, pp. 453-459. https://doi.org/10.3115/990820.990886
  4. A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," AAAI/ICML Workshop on Learning for Text Categorization, Madison, WI, 1998, pp. 41-48.
  5. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, "Training algorithms for linear text classifiers," Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996, pp. 298-306. https://doi.org/10.1145/243199.243277
  6. Y. Yang, S. Slattery, and R. Ghani, "A study of approaches to hypertext categorization," Journal of Intelligent Information Systems, vol. 18, no. 2-3, pp. 219-241, Mar. 2002. https://doi.org/10.1023/A:1013685612819
  7. T. Joachims, "Learning to classify text using support vector machines," Ph.D. dissertation, University of Dortmund, Dortmund, Germany, 2001.
  8. Y. Ko and J. Seo, "Text classification from unlabeled documents with bootstrapping and feature projection techniques," Information Processing and Management, vol. 45, no. 1, pp. 70-83, Jan. 2009. https://doi.org/10.1016/j.ipm.2008.07.004
  9. N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 129-136. https://doi.org/10.1145/564376.564401
  10. H. Han, Y. Ko, and J. Seo, "Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification," Information Processing and Management, vol. 43, no. 5, pp. 1281-1293, Sep. 2007. https://doi.org/10.1016/j.ipm.2006.11.003
  11. Y. Ko and J. Seo, "Using the feature projection technique based on a normalized voting method for text classification," Information Processing and Management, vol. 40, no. 2, pp. 191-208, Mar. 2004. https://doi.org/10.1016/S0306-4573(03)00029-3
  12. Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences," Information Processing and Management, vol. 40, no. 1, pp. 65-79, Jan. 2004. https://doi.org/10.1016/S0306-4573(02)00056-0
  13. K. Nigam, A. McCallum, and T. Mitchell, "Semi-supervised text classification using EM," Semi-Supervised Learning, Cambridge, MA: MIT Press, pp. 33-56, 2006.
  14. S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, pp. 45-66, Nov. 2001. https://doi.org/10.1162/153244302760185243
  15. E. Brill, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging," Computational Linguistics, vol. 21, no. 4, pp. 543-565, Dec. 1995.
  16. Y. S. Maarek, D. M. Berry, and G. E. Kaiser, "An information retrieval approach for automatically constructing software libraries," IEEE Transactions on Software Engineering, vol. 17, no. 8, pp. 800-813, Aug. 1991. https://doi.org/10.1109/32.83915
  17. Y. Karov and S. Edelman, "Similarity-based word sense disambiguation," Computational Linguistics, vol. 24, no. 1, pp. 41-59, Mar. 1998.
  18. M. Craven, D. Dipasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, "Learning to construct knowledge bases from the World Wide Web," Artificial Intelligence, vol. 118, no. 1-2, pp. 69-113, Apr. 2000. https://doi.org/10.1016/S0004-3702(00)00004-7
  19. Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, no. 1-2, pp. 69-90, 1999. https://doi.org/10.1023/A:1009982220290
  20. T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN, 1997, pp. 143-151.
  21. T. Joachims, Learning to Classify Text Using Support Vector Machines, Boston: Kluwer Academic Publishers, 2002.
  22. B. Zadrozny and C. Elkan, "Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers," Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, 2001, pp. 609-616.
  23. B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multiclass probability estimates," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, 2002, pp. 694-699. https://doi.org/10.1145/775047.775151
  24. C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, Mar. 2002. https://doi.org/10.1109/72.991427
  25. D. D. Lewis, "Naive (Bayes) at forty: the independence assumption in information retrieval," Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, 1998, pp. 4-15. https://doi.org/10.1007/BFb0026666
  26. C. H. Lee, C. R. Lin, and M. S. Chen, "Sliding-window filtering: an efficient algorithm for incremental mining," Proceedings of the ACM CIKM: 10th International Conference on Information and Knowledge Management, Atlanta, GA, 2001, pp. 263-270. https://doi.org/10.1145/502585.502630
  27. T. M. Mitchell, Machine Learning, New York: McGraw-Hill, 1997.
  28. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977. https://doi.org/10.2307/2984875
  29. T. Joachims, "Text categorization with support vector machines: learning with many relevant features," Machine Learning: ECML-98. Lecture Notes in Computer Science vol. 1398, Heidelberg: Springer Verlag, pp. 137-142, 1998. https://doi.org/10.1007/BFb0026683
  30. G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513-523, 1988. https://doi.org/10.1016/0306-4573(88)90021-0
  31. B. Endres-Niggemeyer, Summarizing Information, New York: Springer, pp. 307-338, 1998.

Cited by

  1. Cross-Lingual Annotation Projection for Weakly-Supervised Relation Extraction, vol. 13, no. 1, 2014, https://doi.org/10.1145/2529994