http://dx.doi.org/10.3745/JIPS.02.0080

Impact of Instance Selection on kNN-Based Text Categorization  

Barigou, Fatiha (Dept. of Computer Science, University of Oran 1 Ahmed Ben Bella)
Publication Information
Journal of Information Processing Systems / v.14, no.2, 2018, pp. 418-434
Abstract
With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Several machine learning algorithms have been proposed for text categorization. The k-nearest neighbor algorithm (kNN) is known to be one of the best state-of-the-art classifiers for this task. However, kNN suffers from limitations such as the high computational cost of classifying new instances. Instance selection techniques have emerged as highly competitive methods for improving kNN through data reduction. However, previous work has evaluated these approaches only on structured datasets; their performance has not been examined in the text categorization domain, where both the dimensionality and the size of the dataset are very high. Motivated by these observations, this paper investigates and analyzes the impact of instance selection on kNN-based text categorization in terms of several aspects, including classification accuracy, classification efficiency, and data reduction.
Keywords
Classification Accuracy; Classification Efficiency; Data Reduction; Instance Selection; k-Nearest Neighbors; Text Categorization