• Title/Summary/Keyword: text collection construction and evaluation

Search Result 2, Processing Time 0.018 seconds

Building a text collection for Urdu information retrieval

  • Rasheed, Imran;Banka, Haider;Khan, Hamaid M.
    • ETRI Journal
    • /
    • v.43 no.5
    • /
    • pp.856-868
    • /
    • 2021
  • Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference standards, we attempted to construct an extensive text collection of 85 304 documents from diverse categories covering over 52 topics with relevance judgment sets at 100 pool depth. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.

A Study on the Construction of Financial-Specific Language Model Applicable to the Financial Institutions (금융권에 적용 가능한 금융특화언어모델 구축방안에 관한 연구)

  • Jae Kwon Bae
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.29 no.3
    • /
    • pp.79-87
    • /
    • 2024
  • Recently, the importance of pre-trained language models (PLM) has been emphasized for natural language processing (NLP) such as text classification, sentiment analysis, and question answering. Korean PLM shows high performance in NLP in general-purpose domains, but is weak in domains such as finance, medicine, and law. The main goal of this study is to propose a language model learning process and method to build a financial-specific language model that shows good performance not only in the financial domain but also in general-purpose domains. The five steps of the financial-specific language model are (1) financial data collection and preprocessing, (2) selection of model architecture such as PLM or foundation model, (3) domain data learning and instruction tuning, (4) model verification and evaluation, and (5) model deployment and utilization. Through this, a method for constructing pre-learning data that takes advantage of the characteristics of the financial domain and an efficient LLM training method, adaptive learning and instruction tuning techniques, were presented.