• Title/Summary/Keyword: Hangul test collections

Developing the KRIST Test Collection for Researches in Information Retrieval (정보 검색 연구를 위한 KRIST 테스트 컬렉션의 개발)

  • 이준호
    • Journal of the Korean Society for Information Management / v.12 no.2 / pp.225-232 / 1995
  • Test collections are known to play an important role in information retrieval research. A variety of test collections have been created in other countries and have been heavily used by researchers. Although research interest in Hangul information retrieval has been growing rapidly in Korea, the lack of Hangul test collections makes it difficult to develop retrieval techniques for Hangul texts. This study describes the development of the KRIST test collection, which consists of 13,515 bibliographic records, 30 queries, and a list of the documents relevant to each query.

HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

  • Kim, Jin-Suk;Choe, Ho-Seop;You, Beom-Jong;Seo, Jeong-Hyun;Lee, Suk-Hoon;Ra, Dong-Yul
    • Journal of Computing Science and Engineering / v.3 no.3 / pp.165-180 / 2009
  • The HKIB, or Hankookilbo, test collections are two archives of Korean newswire stories manually categorized with semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories with categories from a semi-hierarchical, balanced three-level classification scheme in which each news story has exactly one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Yonsei University and KISTI then collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, to rearrange the classification scheme so that it is fully hierarchical but unbalanced, and to assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on the Korean text categorization problem.

Construction of a Balanced Test Collection for Evaluation of Information Retrieval System (정보 검색 시스템 평가를 위한 균형 테스트 컬렉션 구축)

  • 맹성현;이석훈;이준호;이응봉;송사광
    • Journal of the Korean Society for Information Management / v.16 no.2 / pp.135-148 / 1999
  • There has been some research in Korea on test collections for the evaluation of information retrieval (IR) systems. The test collections constructed as an outcome of that research have provided a starting point and opportunities to test Korean IR systems in an objective manner. However, they fall well short of standard practice in the broader IR community: they are small and usually unbalanced in terms of the characteristics of the documents and the queries (such as the subject domains). In this article, we describe our research effort to alleviate this problem and the resulting test collection, called HANTEC (Hangul TEst Collection). HANTEC is balanced in terms of subject domains, document lengths, and user types, and currently consists of 120,000 documents divided into three groups: a general area, a social science area, and a science/technology area. The 30 queries in the collection are grouped into the same three areas in one dimension and into three distinct user groups in the other dimension.