Selection of an Initial Training Set for Active Learning Using Cluster-Based Sampling

  • Published: 2004.07.01

Abstract

We propose a method of selecting initial training examples for active learning so that the learner can reach high accuracy with fewer subsequent queries. Our method rests on the assumption that an active learner performs better when its initial training set consists of diverse and typical examples rather than similar and atypical ones. To build such a set, we first cluster the examples with the k-means algorithm to find groups of similar examples. From each cluster we then select a representative example, namely the example closest to the cluster's centroid. After these representatives are labeled by querying the user for their categories, they are used as the initial training examples. We also suggest using the centroids themselves as additional training examples, labeling each with the category of its cluster's representative. Experiments on various text data sets show that an active learner starting from an initial training set selected by our method reaches high accuracy faster than one starting from a randomly selected initial training set.
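As a rough sketch of the selection procedure described above (not the authors' implementation, which the abstract does not specify), the steps might look as follows. The use of scikit-learn's KMeans, the `query_label` oracle, and all function and variable names here are illustrative assumptions.

```python
# Minimal sketch of cluster-based initial training set selection,
# assuming scikit-learn for k-means. Names are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin


def select_initial_training_set(X, k, query_label, use_centroids=False, seed=0):
    """Cluster the unlabeled pool X into k groups, pick the example nearest
    each centroid as that cluster's representative, and ask the oracle
    (query_label, a stand-in for the human annotator) for its category.
    Optionally also return the centroids themselves as virtual examples
    labeled with their representative's category."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)

    # For each centroid, the index of the closest example in the pool.
    rep_idx = pairwise_distances_argmin(km.cluster_centers_, X)

    # Query the user only for the k representative examples.
    rep_labels = np.array([query_label(i) for i in rep_idx])

    X_init, y_init = X[rep_idx], rep_labels
    if use_centroids:
        # Treat each centroid as a virtual example that inherits the
        # category of its cluster's representative.
        X_init = np.vstack([X_init, km.cluster_centers_])
        y_init = np.concatenate([y_init, rep_labels])
    return X_init, y_init


# Example usage with a toy pool; the oracle simply looks up a stored label.
X_pool = np.random.rand(200, 10)
y_pool = np.random.randint(0, 3, size=200)
X0, y0 = select_initial_training_set(
    X_pool, k=5, query_label=lambda i: y_pool[i], use_centroids=True)
```

With `use_centroids=True`, the centroids are appended as virtual examples that inherit their representative's label, enlarging the initial set without any additional queries, which is the second idea proposed in the abstract.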
