http://dx.doi.org/10.3745/KIPSTB.2006.13B.4.489

Feature Filtering Methods for Web Documents Clustering  

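Park Heum (Ubitech Co., Ltd. (유비텍(주)))
Kwon Hyuk-Chul (Pusan National University, Division of Electronics, Electrical, Information and Computer Engineering)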
Abstract
Clustering results vary with the dataset, and performance degrades even on web documents that have been manually processed by an indexer. Statistical feature selection methods can identify representative clusters for a feature, but they do not eliminate irrelevant features (i.e., non-obvious features and those that appear in general documents). These irrelevant features should be removed to improve clustering performance. This paper therefore proposes three feature-filtering algorithms that consider feature values per document set, together with the distribution, frequency, and weights of features per document set: (1) a feature-filtering algorithm within a document (FFID), (2) a feature-filtering algorithm within a document matrix (FFIM), and (3) a hybrid method combining FFID and FFIM (HFF). We evaluated clustering performance with feature selection based on term frequency and expanded co-link information, and with feature filtering using FFID, FFIM, and HFF. In our experiments, HFF performed best, and FFIM performed better than FFID.
Keywords
Feature Selection; Feature Filtering; Clustering; Web Document
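
The abstract does not state the filtering rules used by FFID, FFIM, or HFF, so the following is only a generic illustration of the underlying idea: removing features whose distribution across a document set makes them uninformative before clustering. It is a minimal sketch that filters by document frequency; the function name `filter_features` and the thresholds `min_df` and `max_df_ratio` are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def filter_features(docs, min_df=2, max_df_ratio=0.8):
    """Drop terms that occur in too few or too many documents.

    docs: list of token lists, one per document.
    min_df, max_df_ratio: illustrative thresholds (assumed, not from the paper).
    Returns (filtered term-frequency vectors, set of kept terms).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms that are neither too rare nor present in nearly every document.
    keep = {
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }
    # Term-frequency vector per document, restricted to the kept features.
    vectors = [Counter(t for t in doc if t in keep) for doc in docs]
    return vectors, keep


# Toy document set (hypothetical tokens, for illustration only).
docs = [
    ["web", "page", "cluster", "link"],
    ["web", "page", "feature", "filter"],
    ["web", "cluster", "feature", "home"],
    ["web", "link", "filter", "cluster"],
]
vectors, kept = filter_features(docs)
print(sorted(kept))
```

In this toy set, "web" appears in every document and "home" in only one, so both are filtered out; the remaining term-frequency vectors could then be passed to any clustering routine.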