[KSCI] Korea Science Citation Index Service

Selection of An Initial Training Set for Active Learning Using Cluster-Based Sampling

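강재호 (Department of Computer Engineering, Pusan National University)
류광렬 (Department of Computer Engineering, Pusan National University)
권혁철 (Department of Computer Engineering, Pusan National University)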

Publication Information

Journal of KIISE: Software and Applications, v.31, no.7, 2004, pp. 859-868

Abstract

We propose a method for selecting initial training examples for active learning so that the learner reaches high accuracy faster and with fewer subsequent queries. The method rests on the assumption that an active learner performs better when its initial training set consists of diverse and typical examples rather than similar and atypical ones. To obtain such a set, we first cluster the examples with the k-means algorithm to find groups of similar examples. From each cluster we then select a representative example, namely the example closest to the cluster's centroid. These representative examples are labeled by querying the user for their categories and then serve as the initial training examples. We also suggest using the centroids themselves as initial training examples, each labeled with the category of its cluster's representative example. Experiments on various text data sets show that an active learner starting from the initial training set selected by our method reaches higher accuracy faster than one starting from a randomly generated initial training set.
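The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes squared Euclidean distance, a plain Lloyd-style k-means written in NumPy, and small synthetic 2-D vectors in place of text features. The function names (`kmeans`, `select_representatives`), the toy data, and the array `y` standing in for the user's answers to label queries are all hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct examples drawn at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every example to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment that matches the returned centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def select_representatives(X, centroids, assign):
    """Map each non-empty cluster id to the index of its member closest to the centroid."""
    reps = {}
    for j, c in enumerate(centroids):
        idx = np.flatnonzero(assign == j)
        if len(idx) == 0:
            continue
        d = ((X[idx] - c) ** 2).sum(axis=1)
        reps[j] = int(idx[d.argmin()])
    return reps

# Toy pool: three synthetic groups (assumption; the paper uses text data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in [(0, 0), (5, 5), (0, 5)]])
y = np.repeat([0, 1, 2], 20)  # hypothetical categories the user would supply

centroids, assign = kmeans(X, k=3)
reps = select_representatives(X, centroids, assign)
rep_idx = np.array(list(reps.values()))

# Variant 1: the labeled representative examples form the initial training set.
X_init, y_init = X[rep_idx], y[rep_idx]
# Variant 2: the centroids are used instead, each carrying its representative's label.
X_init_centroids = centroids[list(reps.keys())]
y_init_centroids = y[rep_idx]
```

Only the representatives are ever labeled, so the number of initial queries equals the number of non-empty clusters. In the second variant the centroid inherits its label without an extra query, which is an approximation whenever a cluster mixes categories.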

Keywords

Active learning; Initial training set selection; Clustering; Text classification

