Development of Tourism Information Named Entity Recognition Datasets for the Fine-tune KoBERT-CRF Model

  • Received : 2022.02.24
  • Accepted : 2022.03.02
  • Published : 2022.05.31

Abstract

A smart tourism chatbot is needed as a user interface to efficiently provide tourists with smart tourism services such as recommended travel products, tourist information, personal travel itineraries, and tour guide services. We have developed a smart tourism app and a smart tourism information system that provide these services to tourists. We have also developed a smart tourism chatbot service consisting of the khaiii morpheme analyzer, rule-based intention classification, and a tourism information knowledge base built on the Neo4j graph database. In this paper, we develop the Korean and English smart tourism Named Entity (NE) datasets required to build a named entity recognition (NER) model based on pre-trained language models (PLMs) for the smart tourism chatbot system. We create the tourism information NER datasets by collecting source data through the smart tourism app, the visitJeju website of the Jeju Tourism Organization (JTO), and web searches, and by preprocessing the data with Korean and English tourism information Named Entity dictionaries. We train the KoBERT-CRF NER model on the developed Korean and English tourism information NER datasets. The weighted-average precision, recall, and F1 scores are 0.94, 0.92, and 0.94, respectively, on the Korean and English tourism information NER datasets.
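The dictionary-based preprocessing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the BIO tagging scheme, the greedy longest-match rule, the example dictionary entries, and the labels `LOC` and `FOOD` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical entity dictionary: surface form (as a token tuple) -> entity label.
# Entries and labels are illustrative only; the paper's tag set is not given here.
NE_DICT: Dict[Tuple[str, ...], str] = {
    ("Hallasan",): "LOC",
    ("Seongsan", "Ilchulbong"): "LOC",
    ("black", "pork"): "FOOD",
}


def bio_tag(tokens: List[str], ne_dict: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup against the dictionary."""
    max_len = max((len(key) for key in ne_dict), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = ne_dict.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = "Where can I eat black pork near Seongsan Ilchulbong".split()
    for token, tag in zip(sentence, bio_tag(sentence, NE_DICT)):
        print(f"{token}\t{tag}")
```

Token and tag pairs in this form are the usual input for training a sequence labeling model such as a BERT-CRF tagger. Korean text would first need morpheme or subword tokenization rather than a whitespace split.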

Keywords

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2021R1A2C1093283).

References

  1. JeongWoo Jwa, "Development of Personalized Travel Products for Smart Tour Guidance Services", International Journal of Engineering & Technology, 7 (3.33) 58-61, 2018. DOI: 10.14419/ijet.v7i3.33.18524
  2. Dong-Hyun Kim, Hyeon-Su Im, Jong-Heon Hyeon, Jeong-Woo Jwa, "Development of the Rule-based Smart Tourism Chatbot using Neo4J graph database", International Journal of Internet, Broadcasting and Communication, Vol.13, No.2, pp 179-186, 2021. DOI: 10.7236/IJIBC.2021.13.2.179
  3. Kakao khaiii (Kakao Hangul Analyzer III), https://tech.kakao.com/2018/12/13/khaiii/
  4. Neo4j graph database, https://neo4j.com/
  5. Guendalina Caldarini, Sardar Jaf, Kenneth McGarry, "A Literature Survey of Recent Advances in Chatbots", Information vol.13, no.1, 41, 2022. DOI: 10.3390/info13010041
  6. Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, Jason Weston, "Recipes for Building an Open-Domain Chatbot", EACL 2021, pp. 300-325, 2021. DOI: 10.18653/v1/2021.eacl-main.24
  7. Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang, "Pre-trained Models for Natural Language Processing: A Survey", Science China Technological Sciences 63(10), pp.1872-1897, 2020. DOI: 10.1007/s11431-020-1647-3
  8. Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li, "A Survey on Deep Learning for Named Entity Recognition", IEEE Trans. on Knowledge and Data Eng., pp. 50-70, 2020. DOI: 10.1109/TKDE.2020.2981314
  9. https://github.com/kmounlp/NER
  10. National Institute of the Korean Language NER data, https://corpus.korean.go.kr/
  11. NAVER NLP Challenge 2018 NER data, https://github.com/naver/nlp-challenge/tree/master/missions/ner
  12. Visit Jeju Website, https://www.visitjeju.net/kr
  13. SKT KoBERT, https://github.com/SKTBrain/KoBERT