Selection of an Initial Training Set for Active Learning Using Cluster-Based Sampling

  • Published: 2004.07.01

Abstract

We propose a method of selecting initial training examples for active learning so that the learner can reach high accuracy with fewer subsequent queries. Our method rests on the assumption that an active learner performs better when its initial training set consists of diverse and typical examples rather than similar and atypical ones. To build such a set, we first cluster the examples with the k-means algorithm to find groups of similar examples. From each cluster we then select a representative example, namely the example closest to the cluster's centroid. After these representatives are labeled by querying the user for their categories, they are used as the initial training examples. We also suggest using the centroids themselves as additional training examples, labeling each with the category of its cluster's representative. Experiments on various text data sets show that an active learner starting from an initial training set selected by our method reaches high accuracy faster than one starting from a randomly selected initial training set.
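As a rough sketch of the selection procedure described above (not the authors' implementation, which the abstract does not specify), the steps might look as follows. The use of scikit-learn's KMeans, the `query_label` oracle, and all function and variable names here are illustrative assumptions.

```python
# Minimal sketch of cluster-based initial training set selection,
# assuming scikit-learn for k-means. Names are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin


def select_initial_training_set(X, k, query_label, use_centroids=False, seed=0):
    """Cluster the unlabeled pool X into k groups, pick the example nearest
    each centroid as that cluster's representative, and ask the oracle
    (query_label, a stand-in for the human annotator) for its category.
    Optionally also return the centroids themselves as virtual examples
    labeled with their representative's category."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)

    # For each centroid, the index of the closest example in the pool.
    rep_idx = pairwise_distances_argmin(km.cluster_centers_, X)

    # Query the user only for the k representative examples.
    rep_labels = np.array([query_label(i) for i in rep_idx])

    X_init, y_init = X[rep_idx], rep_labels
    if use_centroids:
        # Treat each centroid as a virtual example that inherits the
        # category of its cluster's representative.
        X_init = np.vstack([X_init, km.cluster_centers_])
        y_init = np.concatenate([y_init, rep_labels])
    return X_init, y_init


# Example usage with a toy pool; the oracle simply looks up a stored label.
X_pool = np.random.rand(200, 10)
y_pool = np.random.randint(0, 3, size=200)
X0, y0 = select_initial_training_set(
    X_pool, k=5, query_label=lambda i: y_pool[i], use_centroids=True)
```

With `use_centroids=True`, the centroids are appended as virtual examples that inherit their representative's label, enlarging the initial set without any additional queries, which is the second idea proposed in the abstract.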
