Application of Domain-specific Thesaurus to Construction Documents based on Flow Margin of Semantic Similarity

  • Youmin PARK (Department of Industrial and Systems Engineering, Gyeongsang National University)
  • Seonghyeon MOON (Department of Industrial and Systems Engineering, Gyeongsang National University)
  • Jinwoo KIM (Department of Civil & Environmental Engineering, Hanyang University)
  • Seokho CHI (Department of Civil and Environmental Engineering, Seoul National University)
  • Published : 2024.07.29

Abstract

Large Language Models (LLMs) still struggle to comprehend domain-specific expressions in construction documents. Just as humans learn unfamiliar expressions from dictionaries, language models could learn domain-specific expressions from a thesaurus. Numerous prior studies have developed construction thesauri; however, a practical issue remains in using these resources effectively to guide language models. Because a thesaurus primarily records relationships between terms without indicating their relative importance, a language model may struggle to decide which terms to retain and which to replace. This research aims to establish a robust framework for guiding language models with the information in a thesaurus. For instance, a term is associated with its own list of similar terms while also appearing in the lists of other related terms. The relative significance among terms can be determined from similarity scores normalized according to relevance ranks. Consequently, a term with a positive margin of normalized similarity scores (termed a pivot term) can semantically replace other related terms, enabling LLMs to comprehend domain-specific terms through these pivot terms. The outcome of this research is a practical methodology for using domain-specific thesauri to train LLMs and analyze construction documents. Evaluation is ongoing and involves validating the accuracy of a thesaurus-applied language model (e.g., S-BERT) in identifying similarities among construction specification provisions. This outcome could benefit the construction industry by enhancing LLMs' understanding of construction documents and, in turn, improving text mining performance and project management efficiency.
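The pivot-term idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the toy thesaurus `THESAURUS`, the normalization (similarity score divided by relevance rank), and the definition of the margin as the normalized similarity a term receives from other terms' lists minus the normalized similarity in its own list. The function names `rank_normalized`, `flow_margins`, `pivot_map`, and `normalize_text` are likewise hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict

# Hypothetical toy thesaurus (not from the paper):
# head term -> [(similar term, similarity score), ...]
THESAURUS = {
    "rebar": [("reinforcing bar", 0.9), ("steel bar", 0.8)],
    "reinforcing bar": [("rebar", 0.95)],
    "steel bar": [("reinforcing bar", 0.85), ("rebar", 0.6)],
}


def rank_normalized(entries):
    """Assumed normalization: divide each score by its relevance rank (1 = most similar)."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    return [(term, score / rank) for rank, (term, score) in enumerate(ranked, start=1)]


def flow_margins(thesaurus):
    """Assumed margin: normalized similarity received from other lists minus that in the term's own list."""
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for head, entries in thesaurus.items():
        for term, norm_score in rank_normalized(entries):
            outflow[head] += norm_score
            inflow[term] += norm_score
    terms = set(inflow) | set(outflow)
    return {t: inflow[t] - outflow[t] for t in terms}


def pivot_map(thesaurus):
    """Map each non-pivot head term to its most similar pivot term (positive margin)."""
    margins = flow_margins(thesaurus)
    mapping = {}
    for head, entries in thesaurus.items():
        if margins[head] > 0:
            continue  # the head term is itself a pivot and is kept as-is
        pivots = [(t, s) for t, s in entries if margins[t] > 0]
        if pivots:
            mapping[head] = max(pivots, key=lambda p: p[1])[0]
    return mapping


def normalize_text(text, mapping):
    """Replace non-pivot terms with their pivot terms in a single pass."""
    if not mapping:
        return text
    # Longer terms first so multi-word terms are matched before their substrings.
    alternatives = "|".join(re.escape(t) for t in sorted(mapping, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


if __name__ == "__main__":
    mapping = pivot_map(THESAURUS)
    print(mapping)
    print(normalize_text("Place rebar before pouring; each steel bar must be tied.", mapping))
```

In this toy setting only "reinforcing bar" ends up with a positive margin, so the other two terms are rewritten to it before the text is passed to a sentence-embedding model. A different normalization or margin definition would change which terms become pivots, so the sketch should be read as one plausible reading of the abstract rather than the authors' procedure.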

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00241758). This research was supported by the Ministry of Education, Singapore under its Academic Research Fund Tier 1 (RG136/22). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Ministry of Education, Singapore.
