Improving Text Categorization with High Quality Bigrams

Lee, Chan-Do; Tan, Chade-Meng; Wang, Yuan-Fang

doi:10.3745/KIPSTB.2002.9B.4.415

The KIPS Transactions:PartB (정보처리학회논문지B)

Volume 9B, Issue 4 / Pages 415-420 / 2002 / 1598-284X (pISSN)

Korea Information Processing Society (한국정보처리학회)


Korean title: 고품질 바이그램을 이용한 문서 범주화 성능 향상 (Improving Text Categorization Performance Using High-Quality Bigrams)

Lee, Chan-Do (Dept. of Computer Information Communication Engineering, Daejeon University);
Tan, Chade-Meng (Dept. of Computer Science, Graduate School, UCSB);
Wang, Yuan-Fang (Dept. of Computer Science, UCSB)

Published: 2002.08.01


Abstract

This paper presents an efficient text categorization algorithm that generates high-quality bigrams by using the information gain metric combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to a Naive Bayes classifier. The experimental results suggest that the bigrams, while small in number, can contribute substantially to improving text categorization. Upon close examination of the results, we conclude that the algorithm's gains come mainly from correctly classifying more positive documents, although it may also cause more negative documents to be classified incorrectly.

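Korean abstract (translated): This paper proposes an efficient text categorization algorithm that uses information gain to generate high-quality bigrams. In the experiments, adding a small number of bigrams to the unigrams and feeding them to a Naive Bayes classifier improved the categorization success rate considerably. Analysis of the results suggests that the proposed algorithm is better at classifying positive documents.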
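The selection step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the threshold values or the number of bigrams kept, so `min_doc_freq` and `top_k` are hypothetical parameters, and all function names are made up for this example. The sketch treats each category as a binary problem (positive vs. negative documents), keeps bigrams that occur in enough documents, ranks them by information gain, and appends the selected bigrams to the unigram features.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs as space-joined strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def entropy(pos, neg):
    """Binary entropy (in bits) of a positive/negative document count."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(doc_sets, labels, feature):
    """Reduction in class entropy from knowing whether `feature` is present."""
    n = len(doc_sets)
    n_pos = sum(labels)
    with_pos = sum(1 for d, y in zip(doc_sets, labels) if feature in d and y)
    with_neg = sum(1 for d, y in zip(doc_sets, labels) if feature in d and not y)
    without_pos = n_pos - with_pos
    without_neg = (n - n_pos) - with_neg
    h_after = 0.0
    for pos, neg in ((with_pos, with_neg), (without_pos, without_neg)):
        if pos + neg:
            h_after += (pos + neg) / n * entropy(pos, neg)
    return entropy(n_pos, n - n_pos) - h_after


def select_bigrams(token_docs, labels, min_doc_freq=2, top_k=2):
    """Keep bigrams above a document-frequency threshold, ranked by gain.

    min_doc_freq and top_k are assumed values, not taken from the paper.
    """
    doc_sets = [set(bigrams(tokens)) for tokens in token_docs]
    doc_freq = Counter(b for s in doc_sets for b in s)
    candidates = [b for b, c in doc_freq.items() if c >= min_doc_freq]
    # Sort by descending gain; break ties alphabetically for determinism.
    candidates.sort(key=lambda b: (-information_gain(doc_sets, labels, b), b))
    return candidates[:top_k]


def features(tokens, selected):
    """Unigrams plus any selected bigrams found in the document."""
    chosen = set(selected)
    return list(tokens) + [b for b in bigrams(tokens) if b in chosen]


# Toy corpus: label 1 = positive for a hypothetical "finance" category.
corpus = [
    "interest rate rises again",
    "bank raises interest rate",
    "interest rate cut expected",
    "football team wins match",
    "football team loses match",
    "public interest in football team",
]
labels = [1, 1, 1, 0, 0, 0]
token_docs = [text.split() for text in corpus]
selected = select_bigrams(token_docs, labels)
print(selected)
print(features("interest rate cut".split(), selected))
```

In this toy corpus the unigram "interest" appears in both classes, while the bigram "interest rate" appears only in positive documents, which is the kind of signal a small set of well-chosen bigrams can add. The resulting feature lists could then be passed to any bag-of-features classifier, such as the Naive Bayes classifier named in the abstract.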

Cited by

Korean Document Classification Using Extended Vector Space Model vol.18B, pp.2, 2011, https://doi.org/10.3745/KIPSTB.2011.18B.2.093
