An Ensemble Clustering Algorithm Based on Prior Knowledge

  • 고송 (Department of Computer Engineering, Chung-Ang University)
  • 김대원 (Department of Computer Engineering, Chung-Ang University)
  • Published: 2009.02.15

Abstract

Prior knowledge can improve clustering performance, but how much it helps depends on how it is used. In particular, when prior knowledge is used to construct the initial centroids of the cluster groups, the similarity among the prior-knowledge objects must be taken into account. Objects that share a label can still have low similarity to one another, and building a single initial centroid from them causes problems, so such objects should be separated. This paper proposes a method that separates low-similarity prior knowledge so that dissimilar objects no longer collide within one initial centroid. The separated prior knowledge can then be used in several ways, such as different initializations, each of which yields a different clustering result. To apply association rules, the proposed method creates a sufficiently large number of cluster groups, whose initial centroids are constructed from the separated prior knowledge. Combining the resulting clusterings into a single integrated result through an association-rule-based ensemble further improves clustering performance compared with using prior knowledge that has not been separated.
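The separation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how similarity is measured or how the split is made, so everything here is an assumption for illustration: Euclidean distance as the (dis)similarity measure, a single-linkage rule with a hypothetical `max_dist` threshold, and the function names `split_seeds`, `centroid`, and `initial_centroids`. It is not the authors' implementation.

```python
import math
from collections import defaultdict


def split_seeds(points, labels, max_dist):
    """Split same-label seeds into sub-groups of nearby points.

    Assumption: two seeds with the same label belong to the same sub-group
    when a chain of same-label seeds connects them, each step no longer than
    max_dist (single linkage). The paper's actual criterion may differ.
    """
    by_label = defaultdict(list)
    for point, label in zip(points, labels):
        by_label[label].append(point)

    groups = []  # list of (label, [points in this sub-group])
    for label, pts in by_label.items():
        unvisited = set(range(len(pts)))
        while unvisited:
            start = min(unvisited)  # deterministic starting seed
            unvisited.remove(start)
            stack, members = [start], [start]
            while stack:
                i = stack.pop()
                # Collect neighbours first, then mutate the set.
                near = [j for j in unvisited
                        if math.dist(pts[i], pts[j]) <= max_dist]
                for j in near:
                    unvisited.remove(j)
                    stack.append(j)
                    members.append(j)
            groups.append((label, [pts[i] for i in sorted(members)]))
    return groups


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(coords) / n for coords in zip(*pts))


def initial_centroids(points, labels, max_dist):
    """One initial centroid per separated sub-group of labelled seeds."""
    return [(label, centroid(pts))
            for label, pts in split_seeds(points, labels, max_dist)]


if __name__ == "__main__":
    seeds = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (5.0, 5.0)]
    seed_labels = ["A", "A", "A", "B"]
    for label, c in initial_centroids(seeds, seed_labels, max_dist=3.0):
        print(label, c)
```

In this toy input, averaging all three seeds labelled `A` would place the initial centroid near (3.3, 4.0), close to the single `B` seed at (5, 5). Separating the distant `A` seed yields two `A` centroids instead, so the number of initial groups exceeds the number of labels, which is consistent with the abstract's remark about creating enough cluster groups.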
