
Feature Extraction of Web Document using Association Word Mining  

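고수정 (Department of Computer Science and Engineering, Graduate School, Inha University)
최준혁 (Division of Computer Science, Kimpo College)
이정현 (Department of Computer Science and Engineering, Inha University)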
Abstract
Previous approaches to extracting document features through word association suffer from three problems: profiles must be updated periodically, noun phrases must be handled explicitly, and probabilities must be calculated for index terms. We propose a more effective feature extraction method based on association word mining. Using the Apriori algorithm, the method represents the features of a document not as single words but as association-word vectors. The association words that Apriori extracts from a document depend on the confidence threshold, the support threshold, and the number of words composing each association word, and this paper proposes an effective way to determine all three. Because the method does not use a profile, no profile needs to be updated, and it generates noun phrases automatically from the confidence and support used in the Apriori algorithm, without calculating probabilities for index terms. We apply the proposed method to document classification with a Naive Bayes classifier and compare it with feature selection by information gain and by TF·IDF. We also compare it with document classification methods that use index-term association and word association based on a probability model.
Keywords
feature extraction; association word mining;
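
The role that support, confidence, and the number of composing words play in the abstract can be illustrated with a minimal Apriori sketch. This is not the authors' implementation. It assumes that each transaction is the set of nouns taken from one sentence of a document, that an association word has the form "antecedent words => one consequent word", and that the function name `mine_association_words` and its parameters are hypothetical names chosen for illustration.

```python
from itertools import combinations


def mine_association_words(transactions, min_support, min_confidence, max_words):
    """Return (antecedent, consequent, support, confidence) tuples.

    transactions: iterable of word collections (assumed: one per sentence).
    max_words: largest number of words allowed in one association word.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count(itemset):
        # Number of transactions that contain every word in the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single words.
    words = {w for t in transactions for w in t}
    current = {}
    for w in words:
        c = count(frozenset([w]))
        if c / n >= min_support:
            current[frozenset([w])] = c

    frequent = {}  # maps each frequent itemset to its transaction count
    size = 1
    while current:
        frequent.update(current)
        if size == max_words:
            break
        size += 1
        # Join step: merge pairs of frequent itemsets from the previous level.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        current = {}
        for c in candidates:
            k = count(c)
            if k / n >= min_support:
                current[c] = k

    # Rule generation with a single-word consequent (an assumption of this sketch).
    rules = []
    for itemset, k in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = k / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((tuple(sorted(antecedent)), consequent, k / n, confidence))
    return sorted(rules)


# Toy data, invented for illustration only.
sentences = [
    {"game", "player", "team"},
    {"game", "player"},
    {"game", "team"},
    {"player", "team", "score"},
    {"game", "player", "team"},
]
rules = mine_association_words(sentences, min_support=0.5, min_confidence=0.7, max_words=2)
for antecedent, consequent, support, confidence in rules:
    print(" & ".join(antecedent), "=>", consequent, support, confidence)
```

Raising `min_support` or `min_confidence` shrinks the set of association words, while raising `max_words` allows longer ones. These are the three quantities the abstract says the method must determine.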