A performance improvement methodology of web document clustering using FDC-TCT

Ko, Suc-Bum; Youn, Sung-Dae

doi:10.3745/KIPSTD.2005.12D.4.637

The KIPS Transactions:PartD (정보처리학회논문지D)

Volume 12D, Issue 4, Serial No. 100 / Pages 637-646 / 2005 / 1598-2866 (pISSN)

Korea Information Processing Society (한국정보처리학회)


FDC-TCT를 이용한 웹 문서 클러스터링 성능 개선 기법

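Ko, Suc-Bum (Department of Computer Science, Graduate School, Pukyong National University);
Youn, Sung-Dae (Department of Computer Science, Pukyong National University)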

Published : 2005.08.01

https://doi.org/10.3745/KIPSTD.2005.12D.4.637

Abstract

Applying existing classification or clustering algorithms raises many problems in document classification tasks that require post-processing, such as classifying the results of a keyword-based web search. Two of these problems are the most serious. The first is that an expert must be involved in selecting the categories. The second is the long processing time that document classification takes. We therefore propose a new web document clustering method that uses the Transitive Closure Tree (TCT) to greatly reduce the number of document similarity computations, and that speeds up processing with minimal loss of precision. We also compare the proposed method with existing algorithms and present experimental results to verify its efficiency.

키워드를 통한 웹 검색 결과의 분류와 같은 후처리가 요구되는 문서 분류 문제에서, 기존의 문서 분류 또는 클러스터링 알고리즘을 적용하는 데에는 많은 문제가 있다 그 중에서 고려해야 할 가장 심각한 두 가지 문제가 있다. 첫째는 전문가가 관여하여 범주를 선정하는 문제이고, 둘째는 문서분류에 소요되는 수행시간이 긴 문제이다. 따라서 본 논문에서는 이행적 폐쇄 트리를 이용하여 문서 유사도 계산 횟수를 크게 줄이고, 정확도의 희생을 최소화하면서 신속한 처리가 가능한 새로운 웹 문서 클러스터링 기법을 제안하다. 또한, 제안된 기법의 효율성을 검증하기 위하여 기존의 알고리즘과 비교 평가 및 분석한다.
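As a rough illustration of how transitive closure can reduce the number of similarity computations, the following Python sketch skips any pair of documents that is already linked through earlier matches. It is not the paper's FDC-TCT algorithm, whose details the abstract does not give. The `cosine` and `cluster` functions, the whitespace tokenizer, the cosine measure, the union-find representation, and the 0.5 threshold are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs: list[str], threshold: float = 0.5):
    """Group documents by transitive closure over 'similar enough' pairs.

    Returns (clusters as lists of document indices, similarity computations made).
    Hypothetical helper for illustration only.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest: one tree per cluster

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    computed = 0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if find(i) == find(j):
                continue  # already linked transitively, so skip the comparison
            computed += 1
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups: dict[int, list[int]] = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), computed


docs = [
    "apple banana cherry",
    "apple banana cherry date",
    "apple banana cherry date egg",
    "x y z",
]
clusters, computed = cluster(docs)
print(clusters, computed)
```

In this toy input, documents 1 and 2 are each linked to document 0, so the pair (1, 2) is never compared. The saving grows with cluster size, at the cost of possibly chaining together documents that are only indirectly similar.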
