A Study on the Effectiveness of Bigrams in Text Categorization

Lee, Chan-Do; Choi, Joon-Young

Journal of Information Technology Applications and Management

Vol. 12, No. 2 / Pages 15-27 / 2005 / 1598-6284 (pISSN) / 2508-1209 (eISSN)

한국데이터전략학회 (Korea Data Strategy Society)

Korean title: 바이그램이 문서범주화 성능에 미치는 영향에 관한 연구

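Lee, Chan-Do (Division of Information, Communication and Internet Engineering, Daejeon University);
Choi, Joon-Young (Hyehwa Medical Center, Daejeon University)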

Published: 2005.06.01


Abstract

Text categorization systems generally use single words (unigrams) as features. This study investigates a deceptively simple algorithm for improving text categorization, based on an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams the algorithm finds, while small in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that it succeeded in extracting high-quality bigrams and increased the overall quality of the features. To determine the role of bigrams, we trained Naïve Bayes classifiers using both unigrams and bigrams as features. Recall values were higher than those obtained with unigrams alone. Break-even points and F1 values improved in most documents, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1%, with the highest at 18.8%, and F1 improved by 1.5%, with the highest at 3.2%. On Korea-web, break-even points increased by 1.0%, with the highest at 4.5%, and F1 improved by 0.4%, with the highest at 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
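The idea of adding adjacent-word bigrams to a unigram feature set can be illustrated with a minimal Python sketch. The abstract does not say how "useful" bigrams are chosen, so the document-frequency threshold used here (`min_doc_freq`) is only an assumed stand-in for the paper's selection criterion, and all function names are hypothetical. The last helper computes F1 from precision and recall; the break-even point reported in the abstract is the value at which precision and recall are equal.

```python
from collections import Counter


def adjacent_bigrams(tokens):
    """Return every pair of adjacent tokens, joined with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def select_bigrams(docs, min_doc_freq=2):
    """Keep bigrams that appear in at least min_doc_freq documents.

    This threshold is an assumption made for illustration; it is not
    the selection criterion described in the paper.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(adjacent_bigrams(tokens)))
    return {bg for bg, n in doc_freq.items() if n >= min_doc_freq}


def featurize(tokens, kept_bigrams):
    """Count unigrams plus only those bigrams that passed selection."""
    feats = Counter(tokens)
    feats.update(bg for bg in adjacent_bigrams(tokens) if bg in kept_bigrams)
    return feats


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


docs = [
    ["interest", "rate", "rises"],
    ["interest", "rate", "falls"],
    ["rate", "cut"],
]
kept = select_bigrams(docs)
features = featurize(docs[0], kept)
```

In this toy corpus only `interest_rate` occurs in two documents, so it is the single bigram added to the unigram counts. The resulting count vectors could then be fed to any Naïve Bayes implementation in place of unigram-only counts.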
