http://dx.doi.org/10.13088/jiis.2018.24.3.021

Improving the Accuracy of Document Classification by Learning Heterogeneity  

Wong, William Xiu Shun (Kookmin University)
Hyun, Yoonjin (Kookmin University)
Kim, Namgyu (Kookmin University)
Publication Information
Journal of Intelligence and Information Systems / v.24, no.3, 2018, pp. 21-44
Abstract
In recent years, the rapid development of Internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. This enormous volume of easily obtained but poorly organized information has drawn the interest of many researchers, and managing it requires methods capable of sorting relevant information into place; text classification addresses this need. Text classification, a challenging task in modern data analysis, assigns a text document to one or more predefined categories or classes. Many techniques are available for text classification, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge: the performance of a text classification model varies with the vocabulary of the corpus and the type of features created for classification. Most previous attempts at improvement have proposed a new algorithm or modified an existing one, and this line of research can be said to have reached its limits. In this study, rather than proposing or modifying an algorithm, we focus on finding a better way to use the data. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, and this noisy data can affect the decisions made by classifiers trained on it. In this study, we consider that data from different domains, that is, heterogeneous data, may carry noise-like characteristics that can be exploited in the classification process. Machine learning algorithms build a classifier under the assumption that the characteristics of the training data and the target data are the same or very similar. For unstructured data such as text, however, the features are determined by the vocabulary of each document, so if the learning data and the target data are written from different viewpoints, their features may differ. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier, artificially injecting noise into the process of constructing it. Because data coming from various sources are likely to be formatted differently, traditional machine learning algorithms struggle with them: they were not designed to recognize multiple data representations at once and combine them into a single generalization. To utilize heterogeneous data in the learning process of the document classifier, we therefore apply semi-supervised learning. Unlabeled data, however, can degrade the performance of the document classifier.
We therefore propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA), which selects only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features with different types of classification models and different types of heterogeneous data, and the most confident classification rules are selected and applied in the final decision making. In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
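As a rough illustration of the mechanism the abstract describes, the Python (scikit-learn) sketch below implements confidence-based ensemble self-training: unlabeled documents from heterogeneous sources are pseudo-labeled, and only those on which multiple classifiers agree with high confidence are promoted into the training set. This is a minimal sketch under stated assumptions, not the paper's actual RSESLA procedure; the function name ensemble_self_train, the 0.9 confidence threshold, the shared TF-IDF features, and the use of Logistic Regression and Naïve Bayes as the two "views" are all illustrative choices, whereas RSESLA additionally manipulates features across models and heterogeneous data when forming its views.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def ensemble_self_train(labeled_docs, labels, unlabeled_docs,
                        threshold=0.9, max_rounds=5):
    """Iteratively pseudo-label unlabeled documents, keeping only those on
    which all base classifiers agree with high confidence (illustrative
    stand-in for RSESLA's rule selection)."""
    vectorizer = TfidfVectorizer(max_features=5000)
    X_lab = vectorizer.fit_transform(labeled_docs)
    X_unl = vectorizer.transform(unlabeled_docs)
    y_lab = np.asarray(labels)

    for _ in range(max_rounds):
        if X_unl.shape[0] == 0:
            break
        # Different model families trained on the same features play the
        # role of the paper's multiple views.
        models = [LogisticRegression(max_iter=1000), MultinomialNB()]
        for m in models:
            m.fit(X_lab, y_lab)
        probas = np.mean([m.predict_proba(X_unl) for m in models], axis=0)
        preds = [m.predict(X_unl) for m in models]
        # Keep only documents where the views agree and the averaged
        # confidence clears the threshold, so that unhelpful (noisy)
        # unlabeled documents cannot degrade the classifier.
        confident = (preds[0] == preds[1]) & (probas.max(axis=1) >= threshold)
        if not confident.any():
            break
        X_lab = vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, preds[0][confident]])
        X_unl = X_unl[~confident]

    final_model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return vectorizer, final_model
```

The agreement-plus-threshold filter is what distinguishes this scheme from plain self-training: a single classifier's overconfident mistakes are vetoed by the other view, which is the intuition behind selecting only documents that contribute to accuracy improvement.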
Keywords
Text Mining; Text Classification; Heterogeneity Learning; Semi-Supervised Learning; Ensemble Learning