Issues and Empirical Results for Improving Text Classification

  • Received: 2011.02.28
  • Reviewed: 2011.03.28
  • Published: 2011.06.30

Abstract

Automatic text classification has a long history, and many studies have been conducted in this field. In particular, a variety of machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Although much technical progress has been made, there is still room for improvement. This paper discusses three remaining issues for improving text classification: automatic training data generation, noisy data treatment, and term weighting and indexing. Four studies and their empirical results addressing these issues are introduced. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. Second, for effective noisy data treatment, a noisy data reduction method and a text classifier that is robust to noisy data are developed. Finally, the term weighting and indexing technique is revised by reflecting the importance of sentences in term weight calculation using summarization techniques.
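As a rough sketch of the last idea, the example below scales each term occurrence by an estimated importance of the sentence it appears in before counting term frequencies. The importance heuristic (word overlap between a sentence and the whole document), the function names, and the normalization are assumptions made for illustration only; they are not the formulas used in the studies cited below.

```python
# Minimal sketch: sentence-importance-weighted term frequencies.
# The importance heuristic here is an illustrative stand-in for a
# summarization-based score, not the exact method from the cited studies.
from collections import Counter

def sentence_importance(sentences):
    """Score each sentence by its average word overlap with the whole document."""
    doc_counts = Counter(w for s in sentences for w in s.split())
    scores = []
    for s in sentences:
        words = s.split()
        if not words:
            scores.append(0.0)
            continue
        scores.append(sum(doc_counts[w] for w in words) / len(words))
    # Normalize to [0, 1] so the scores act as per-sentence weights.
    top = max(scores) if scores else 1.0
    if top == 0.0:
        top = 1.0
    return [score / top for score in scores]

def weighted_term_freq(document):
    """Term frequencies where each occurrence counts in proportion to
    the importance of the sentence it appears in."""
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    weights = sentence_importance(sentences)
    tf = Counter()
    for sentence, weight in zip(sentences, weights):
        for term in sentence.split():
            tf[term] += weight  # important sentences contribute more
    return tf

if __name__ == "__main__":
    doc = ("Text classification assigns documents to predefined categories. "
           "Machine learning algorithms learn classifiers from labeled documents. "
           "The weather was pleasant that day.")
    print(weighted_term_freq(doc).most_common(5))
```

In a full indexing pipeline, these weighted frequencies would then feed the usual TF-IDF computation in place of raw counts.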

Keywords

References

  1. F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, Mar. 2002. https://doi.org/10.1145/505282.505283
  2. Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, 1997, pp. 412-420.
  3. Y. Ko and J. Seo, "Automatic text categorization by unsupervised learning," Proceedings of the 18th Conference on Computational Linguistics, Saarbrucken, Germany, 2000, pp. 453-459. https://doi.org/10.3115/990820.990886
  4. A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," AAAI/ICML Workshop on Learning for Text Categorization, Madison, WI, 1998, pp. 41-48.
  5. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, "Training algorithms for linear text classifiers," Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996, pp. 298-306. https://doi.org/10.1145/243199.243277
  6. Y. Yang, S. Slattery, and R. Ghani, "A study of approaches to hypertext categorization," Journal of Intelligent Information Systems, vol. 18, no. 2-3, pp. 219-241, Mar. 2002. https://doi.org/10.1023/A:1013685612819
  7. T. Joachims, "Learning to classify text using support vector machines," Ph.D. dissertation, University of Dortmund, Dortmund, Germany, 2001.
  8. Y. Ko and J. Seo, "Text classification from unlabeled documents with bootstrapping and feature projection techniques," Information Processing and Management, vol. 45, no. 1, pp. 70-83, Jan. 2009. https://doi.org/10.1016/j.ipm.2008.07.004
  9. N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 129-136. https://doi.org/10.1145/564376.564401
  10. H. Han, Y. Ko, and J. Seo, "Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification," Information Processing and Management, vol. 43, no. 5, pp. 1281-1293, Sep. 2007. https://doi.org/10.1016/j.ipm.2006.11.003
  11. Y. Ko and J. Seo, "Using the feature projection technique based on a normalized voting method for text classification," Information Processing and Management, vol. 40, no. 2, pp. 191-208, Mar. 2004. https://doi.org/10.1016/S0306-4573(03)00029-3
  12. Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences," Information Processing and Management, vol. 40, no. 1, pp. 65-79, Jan. 2004. https://doi.org/10.1016/S0306-4573(02)00056-0
  13. K. Nigam, A. McCallum, and T. Mitchell, "Semi-supervised text classification using EM," Semi-Supervised Learning, Cambridge, MA: MIT Press, pp. 33-56, 2006.
  14. S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, pp. 45-66, Nov. 2001. https://doi.org/10.1162/153244302760185243
  15. E. Brill, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging," Computational Linguistics, vol. 21, no. 4, pp. 543-565, Dec. 1995.
  16. Y. S. Maarek, D. M. Berry, and G. E. Kaiser, "An information retrieval approach for automatically constructing software libraries," IEEE Transactions on Software Engineering, vol. 17, no. 8, pp. 800-813, Aug. 1991. https://doi.org/10.1109/32.83915
  17. Y. Karov and S. Edelman, "Similarity-based word sense disambiguation," Computational Linguistics, vol. 24, no. 1, pp. 41-59, Mar. 1998.
  18. M. Craven, D. Dipasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, "Learning to construct knowledge bases from the World Wide Web," Artificial Intelligence, vol. 118, no. 1-2, pp. 69-113, Apr. 2000. https://doi.org/10.1016/S0004-3702(00)00004-7
  19. Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, no. 1-2, pp. 69-90, 1999. https://doi.org/10.1023/A:1009982220290
  20. T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN, 1997, pp. 143-151.
  21. T. Joachims, Learning to Classify Text Using Support Vector Machines, Boston: Kluwer Academic Publishers, 2002.
  22. B. Zadrozny and C. Elkan, "Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers," Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, 2001, pp. 609-616.
  23. B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multiclass probability estimates," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, 2002, pp. 694-699. https://doi.org/10.1145/775047.775151
  24. C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, Mar. 2002. https://doi.org/10.1109/72.991427
  25. D. D. Lewis, "Naive (Bayes) at forty: the independence assumption in information retrieval," Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, 1998, pp. 4-15. https://doi.org/10.1007/BFb0026666
  26. C. H. Lee, C. R. Lin, and M. S. Chen, "Sliding-window filtering: an efficient algorithm for incremental mining," Proceedings of the ACM CIKM: 10th International Conference on Information and Knowledge Management, Atlanta, GA, 2001, pp. 263-270. https://doi.org/10.1145/502585.502630
  27. T. M. Mitchell, Machine Learning, New York: McGraw-Hill, 1997.
  28. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977. https://doi.org/10.2307/2984875
  29. T. Joachims, "Text categorization with support vector machines: learning with many relevant features," Machine Learning: ECML-98. Lecture Notes in Computer Science vol. 1398, Heidelberg: Springer Verlag, pp. 137-142, 1998. https://doi.org/10.1007/BFb0026683
  30. G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513-523, 1988. https://doi.org/10.1016/0306-4573(88)90021-0
  31. B. Endres-Niggemeyer, Summarizing Information, New York: Springer, pp. 307-338, 1998.

Cited by

  1. Cross-Lingual Annotation Projection for Weakly-Supervised Relation Extraction, vol. 13, no. 1, 2014, https://doi.org/10.1145/2529994