HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

Kim, Jin-Suk; Choe, Ho-Seop; You, Beom-Jong; Seo, Jeong-Hyun; Lee, Suk-Hoon; Ra, Dong-Yul

doi:10.5626/JCSE.2009.3.3.165

Journal of Computing Science and Engineering

Volume 3, Issue 3 / Pages 165-180 / 2009 / 1976-4677 (pISSN) / 2093-8020 (eISSN)

Korean Institute of Information Scientists and Engineers

Kim, Jin-Suk (Department of Information Technology Research, KISTI) ;
Choe, Ho-Seop (Department of Information Technology Research, KISTI) ;
You, Beom-Jong (Department of Information Technology Research, KISTI) ;
Seo, Jeong-Hyun (Department of Cyber Environment Development, KISTI) ;
Lee, Suk-Hoon (Department of Information & Statistics, Chungnam National University) ;
Ra, Dong-Yul (Computer & Telecommunication Engineering Division, Yonsei University)

Published: 2009.09.30

https://doi.org/10.5626/JCSE.2009.3.3.165

Abstract

The HKIB (Hankookilbo) test collections are two archives of Korean newswire stories that have been manually categorized using semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories using a semi-hierarchical, balanced three-level classification scheme in which each news story has exactly one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Yonsei University and KISTI then collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, to rearrange the classification scheme so that it is fully hierarchical but unbalanced, and to assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on the Korean text categorization problem.
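
To make the single-labeling and multi-labeling distinction concrete, the following is a minimal sketch of k-NN text categorization in Python. It is not the classifier benchmarked in the paper. The toy English stories, the category names, the whitespace tokenizer, the cosine similarity measure, and the `k` and `threshold` values are all assumptions chosen for illustration; real Korean text would likely need a morphological analyzer rather than whitespace splitting.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (story text, list of category labels).
TRAINING = [
    ("stock market prices rise", ["economy"]),
    ("bank interest rates fall", ["economy"]),
    ("team wins baseball game", ["sports"]),
    ("baseball club stock sale", ["sports", "economy"]),
]


def vectorize(text):
    """Turn whitespace-separated tokens into a term-frequency vector."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(query, training, k=3):
    """Sum the similarities of the k nearest training stories per category."""
    q = vectorize(query)
    ranked = sorted(
        ((cosine(q, vectorize(text)), labels) for text, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, labels in ranked[:k]:
        for label in labels:
            scores[label] += sim
    return dict(scores)


def single_label(scores):
    """Single-labeling: keep only the best-scoring category."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Multi-labeling: keep every category whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


if __name__ == "__main__":
    scores = knn_scores("baseball game tonight", TRAINING, k=3)
    print(single_label(scores))
    print(multi_label(scores, threshold=0.25))
```

In this sketch the same neighbor scores feed both decision rules: single-labeling takes the top category, while multi-labeling applies a score threshold, so lowering the threshold admits more categories per story.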
