Selection of An Initial Training Set for Active Learning Using Cluster-Based Sampling  

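Jaeho Kang (강재호; Department of Computer Engineering, Pusan National University)
Kwang Ryel Ryu (류광렬; Department of Computer Engineering, Pusan National University)
Hyuk-Chul Kwon (권혁철; Department of Computer Engineering, Pusan National University)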
Abstract
We propose a method for selecting the initial training examples for active learning so that the learner reaches high accuracy faster, with fewer subsequent queries. The method rests on the assumption that an active learner performs better when its initial training set consists of diverse, typical examples rather than similar or atypical ones. To obtain such a set, we first cluster the examples with the k-means algorithm to find groups of similar examples. From each cluster we then select a representative example, namely the example closest to the cluster's centroid. Once the user has been queried for the categories of these representatives, the labeled representatives serve as the initial training examples. We also suggest using the centroids themselves as initial training examples, each labeled with the category of its cluster's representative example. Experiments on various text data sets show that an active learner starting from an initial training set selected by our method reaches higher accuracy faster than one starting from a randomly chosen initial training set.
Keywords
Active learning; Initial training set selection; Clustering; Text classification