Journal of Information Technology Applications and Management
Volume 12, Issue 2 / Pages 15-27 / 2005 / 1598-6284 (pISSN) / 2508-1209 (eISSN)
바이그램이 문서범주화 성능에 미치는 영향에 관한 연구
A Study on the Effectiveness of Bigrams in Text Categorization
Abstract
Text categorization systems generally use single words (unigrams) as features. This study investigates a deceptively simple algorithm for improving text categorization, based on an idea previously reported not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams the algorithm finds, while few in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that the algorithm successfully extracted high-quality bigrams and improved the overall quality of the feature set. To find out the role of bigrams, we trained the Na
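The idea of adding a small number of adjacent-word bigrams to a unigram feature set can be illustrated with a minimal sketch. The abstract does not state the selection criteria the algorithm uses, so the document-frequency threshold below (`min_doc_freq`), the whitespace tokenizer, and all function names are assumptions made for illustration only, not the authors' method.

```python
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (an assumption; real systems,
    # especially for Korean, would need proper morphological analysis).
    return text.lower().split()


def adjacent_bigrams(tokens):
    # Pair each token with its immediate successor.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def select_bigrams(docs, min_doc_freq=2):
    # Keep bigrams that occur in at least `min_doc_freq` documents.
    # This threshold is a hypothetical stand-in for the paper's criteria.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(adjacent_bigrams(tokenize(doc))))
    return {bg for bg, n in doc_freq.items() if n >= min_doc_freq}


def featurize(doc, kept_bigrams):
    # Unigram counts, augmented with counts of the selected bigrams only.
    tokens = tokenize(doc)
    feats = Counter(tokens)
    feats.update(bg for bg in adjacent_bigrams(tokens) if bg in kept_bigrams)
    return feats


docs = [
    "interest rates rise again",
    "central bank holds interest rates",
    "wheat exports rise",
]
kept = select_bigrams(docs)
features = featurize(docs[0], kept)
```

In this toy corpus only one bigram survives the filter, which mirrors the point that the selected bigrams are few compared with the unigram vocabulary; the resulting feature counts could then be passed to any standard text classifier.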