Construction of Text Summarization Corpus in Economics Domain and Baseline Models

  • Received : 2023.04.01
  • Accepted : 2023.11.04
  • Published : 2024.03.31

Abstract

Automated text summarization (ATS) condenses large volumes of text into shorter summaries, reducing the time required to extract information from extensive textual data. ATS systems rely on language resources as datasets, but creating these datasets is a complex and labor-intensive task that requires linguists to annotate the data extensively. Consequently, publicly accessible ATS datasets for languages such as Thai are scarce compared with those for more widely used languages. This study introduces ThEconSum, an ATS architecture designed specifically for the Thai language and built on economy-related data. The evaluation reveals significant remaining tasks and limitations for ATS in the Thai language.

Acknowledgement

This study was conducted with collaborative support from Promes Co., Ltd. (Backyard Group) as part of the Joint Research Project funded by the National Science and Technology Development Agency (NSTDA), Thailand. We also thank PTT Digital Solutions Co., Ltd., Thailand, for generously providing public datasets for Thai ATS. The Program Management Unit for National Competitiveness Enhancement, under the Office of the National Higher Education Science Research and Innovation Policy Council in Thailand, provided financial support for data collection and construction.
