Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

  • Received : 2019.04.22
  • Accepted : 2019.09.03
  • Published : 2019.12.31

Abstract

Text classification is a challenging task, especially when dealing with a huge amount of text data. The performance of a classification model can vary depending on the types of words contained in the document corpus and the types of features generated for classification. Rather than proposing a modified version of an existing algorithm or creating a new algorithm, we attempt to modify how the data are used. Classifier performance is usually affected by the quality of the learning data, since the classifier is built from these training data. We assume that data from different domains may have different noise characteristics, which can be exploited in the process of learning the classifier. Therefore, we attempt to enhance the robustness of the classifier by artificially injecting heterogeneous data into the learning process in order to improve classification accuracy. A semi-supervised approach is applied to utilize the heterogeneous data in learning the document classifier. However, the performance of the document classifier may be degraded by unlabeled data. Therefore, we further propose an algorithm that extracts only the documents that contribute to improving the classifier's accuracy.
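The abstract outlines a self-training loop in which pseudo-labeled documents from a heterogeneous unlabeled pool are retained only if they do not hurt classification accuracy. Below is a minimal sketch of that idea, not the authors' exact algorithm: the TF-IDF features, the logistic-regression base classifier, and all names and thresholds are illustrative assumptions.

```python
# Hedged sketch of self-training with accuracy-gated acceptance of
# pseudo-labeled documents. NOT the paper's exact method; the pipeline
# and parameters below are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def self_train(train_texts, train_labels, unlabeled_texts,
               val_texts, val_labels, conf_threshold=0.9, max_rounds=5):
    """Iteratively pseudo-label documents from a (possibly heterogeneous)
    unlabeled pool, keeping an addition only if held-out accuracy does
    not degrade -- one way to 'extract only contributing documents'."""
    texts, labels = list(train_texts), list(train_labels)
    pool = list(unlabeled_texts)

    def fit_and_score(tr_texts, tr_labels):
        # Refit the vectorizer and classifier on the candidate training set.
        vec = TfidfVectorizer(max_features=20000)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(tr_texts), tr_labels)
        acc = accuracy_score(val_labels, clf.predict(vec.transform(val_texts)))
        return vec, clf, acc

    vec, clf, best_acc = fit_and_score(texts, labels)
    for _ in range(max_rounds):
        if not pool:
            break
        proba = clf.predict_proba(vec.transform(pool))
        # Select only confidently pseudo-labeled documents from the pool.
        confident = [(i, proba[i].argmax()) for i in range(len(pool))
                     if proba[i].max() >= conf_threshold]
        if not confident:
            break
        cand_texts = texts + [pool[i] for i, _ in confident]
        cand_labels = labels + [clf.classes_[j] for _, j in confident]
        new_vec, new_clf, new_acc = fit_and_score(cand_texts, cand_labels)
        if new_acc < best_acc:  # reject additions that hurt accuracy
            break
        accepted = {i for i, _ in confident}
        pool = [t for i, t in enumerate(pool) if i not in accepted]
        texts, labels = cand_texts, cand_labels
        vec, clf, best_acc = new_vec, new_clf, new_acc
    return vec, clf, best_acc
```

Gating each round on held-out accuracy is one simple way to approximate the paper's stated goal of keeping only documents that contribute to accuracy improvement; the paper's actual selection criterion may differ.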
