A Hangul Document Classification System using Case-based Reasoning

사례기반 추론을 이용한 한글 문서분류 시스템

  • 이재식 (College of Business Administration, Ajou University)
  • 이종운 (Construction Systems Team, Daewoo Information Systems)
  • Published : 2002.06.30

Abstract

In this research, we developed an efficient Hangul document classification system for text mining. By 'efficient' we mean that the system maintains acceptable classification performance while requiring less computing time. Given a query document, the system first retrieves k documents from the document case base using the k-nearest neighbor technique, which is the main algorithm of case-based reasoning. The TFIDF method, the traditional vector model in information retrieval, is then applied to the query document and the k retrieved documents to classify the query document. We call this procedure the 'CB_TFIDF' method. Our results showed that the classification accuracy of CB_TFIDF was similar to that of the traditional TFIDF method, while the average time to classify one document decreased remarkably.
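The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the similarity measure, the term-weighting formula, or the final decision rule, so everything concrete here is an assumption made for illustration: the whitespace `tokenize` placeholder (a real Hangul system would use a morphological analyzer), Jaccard overlap for the k-nearest neighbor retrieval, a smoothed `log(1 + n/df)` inverse document frequency, and a similarity-weighted vote over the retrieved cases. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercase whitespace split.
    return text.lower().split()


def jaccard(a, b):
    # Cheap set-overlap similarity, assumed here for the retrieval stage.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_k(query_tokens, case_base, k):
    # Stage 1: keep only the k cases most similar to the query.
    ranked = sorted(case_base,
                    key=lambda case: jaccard(query_tokens, case[0]),
                    reverse=True)
    return ranked[:k]


def tfidf_vectors(docs):
    # TF-IDF computed over the small working set only (query + k cases).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (assumption) so terms shared by all docs keep weight.
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cb_tfidf_classify(query_text, case_base, k=3):
    # case_base is a list of (token_list, label) pairs.
    if not case_base:
        raise ValueError("case base is empty")
    query_tokens = tokenize(query_text)
    neighbors = retrieve_k(query_tokens, case_base, k)
    # Stage 2: TF-IDF comparison restricted to the retrieved cases.
    vectors = tfidf_vectors([query_tokens] + [tokens for tokens, _ in neighbors])
    query_vec = vectors[0]
    scores = defaultdict(float)
    for vec, (_, label) in zip(vectors[1:], neighbors):
        scores[label] += cosine(query_vec, vec)
    return max(scores, key=scores.get)
```

In this sketch the time saving would come from the second stage: term weights and cosine similarities are computed for only k + 1 documents rather than for the whole case base. Whether the original system obtains its speed-up in exactly this way is not stated in the abstract.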
