Development of a Ranking System for Tourist Destination Using BERT-based Semantic Search

  • KangWoo Lee (Smart Governance Research Center, Dong-A University)
  • MyeongSeon Kim (Department of Computer Engineering, Dong-A University)
  • Soon Goo Hong (International School, Duy Tan University)
  • SuGyeong Roh (Department of Management Information Systems, Dong-A University)
  • Received : 2024.08.02
  • Accepted : 2024.08.21
  • Published : 2024.08.30

Abstract

A tourist destination ranking system was designed that uses semantic search to retrieve relevant information with reasonable accuracy. To this end, the process involves collecting data, preprocessing text reviews of tourist spots, and embedding the corpus and queries with SBERT. We calculate the similarity between the query and the corpus embeddings, filter out data below a specified threshold, and then rank the remaining tourist destinations with a count-based algorithm so that the ranking aligns semantically with the query. To assess the efficacy of the ranking algorithm, experiments were conducted with four queries. Furthermore, 58,175 sentences were manually labeled for their semantic relevance to the third query, 'crowdedness'. The human-labeled data for crowdedness produced similar results. Despite challenges including threshold optimization and imbalanced data, this study shows that semantic search is a powerful method for understanding user intent and recommending tourist destinations with less time and cost.

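The purpose of this study is to design a tourist destination ranking system that answers user queries with reasonable accuracy by using a semantic search technique. To this end, text review data on tourist destinations were collected, preprocessed, and embedded with SBERT. Similarity was then measured, data meeting the threshold were retained, and a count-based ranking algorithm was applied to rank tourist destinations in order of semantic similarity to the query. To evaluate the proposed ranking algorithm, experiments were conducted with four queries, and the top five most relevant tourist destinations were derived. For comparison, 58,175 sentences were labeled directly to check whether they were semantically related to the third query, crowdedness. The two sets of results were similar, which verified the efficiency of the ranking algorithm proposed in this study. Despite problems such as threshold optimization and data imbalance, this study showed that a semantic search technique can identify user intent and recommend tourist destinations at low cost and in little time.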
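
The threshold-and-count step described in the abstract can be sketched as follows. This is a minimal illustration of one plausible reading, not the authors' implementation: it assumes that the count for a destination is the number of its review sentences whose cosine similarity to the query meets the threshold. The function names, the threshold values, the destination names, and the toy 2-D vectors (standing in for SBERT sentence embeddings) are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_destinations(query_vec, sentences, threshold=0.5, top_k=5):
    """Rank destinations by how many of their review sentences
    reach the similarity threshold for the query.

    sentences: iterable of (destination, embedding) pairs.
    Returns a list of (destination, count) pairs, highest count first.
    """
    counts = Counter(
        dest for dest, vec in sentences if cosine(query_vec, vec) >= threshold
    )
    return counts.most_common(top_k)


# Hypothetical data: toy vectors in place of real sentence embeddings.
reviews = [
    ("Beach A", [1.0, 0.0]),
    ("Beach A", [0.8, 0.6]),
    ("Temple B", [0.6, 0.8]),
    ("Temple B", [0.0, 1.0]),
    ("Market C", [-1.0, 0.0]),
]
query = [1.0, 0.0]

print(rank_destinations(query, reviews, threshold=0.5))
print(rank_destinations(query, reviews, threshold=0.7))
```

Raising the threshold from 0.5 to 0.7 removes "Temple B" from the ranking in this toy data, which illustrates why the abstract lists threshold optimization as a challenge: the ranking depends directly on where the cutoff is placed.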

Acknowledgement

This study was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018S1A3A2075240).
