Large Multimodal Model for Context-aware Construction Safety Monitoring

  • Taegeon Kim (Department of Civil and Environmental Engineering, Yonsei University) ;
  • Seokhwan Kim (Department of Civil and Environmental Engineering, Yonsei University) ;
  • Minkyu Koo (Department of Civil and Environmental Engineering, Yonsei University) ;
  • Minwoo Jeong (Department of Civil and Environmental Engineering, Yonsei University) ;
  • Hongjo Kim (Department of Civil and Environmental Engineering, Yonsei University)
  • Published: 2024.07.29

Abstract

Recent advances in construction automation have led to increased use of deep learning-based computer vision technology for construction monitoring. However, monitoring systems based on supervised learning struggle to recognize the complex risk factors present in construction environments, highlighting the need for more adaptable solutions. Large multimodal models, pretrained on extensive image-text datasets, offer a promising alternative, as they can recognize diverse objects and extract semantic information from scenes. This paper proposes a methodology that uses GPT-4V to generate training data containing safety-centric image descriptions and fine-tunes the LLaVA model with the LoRA method. Experimental results on seven construction site hazard scenarios show that the fine-tuned model accurately assesses the safety status depicted in images. These findings underscore the effectiveness of the proposed approach for construction site safety monitoring and illustrate the potential of large multimodal models to tackle domain-specific challenges.
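To make the two-step pipeline concrete, the following sketch illustrates how such a methodology could be assembled: querying GPT-4V for a safety-centric caption of a construction site image via the OpenAI Python SDK, and attaching a LoRA adapter to a LLaVA-style language backbone via the Hugging Face peft library. The prompt wording, model identifier, and LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions, not the authors' actual settings.

```python
import base64
from openai import OpenAI          # OpenAI Python SDK v1.x
from peft import LoraConfig, get_peft_model

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def safety_caption(image_path: str) -> str:
    """Step 1 (illustrative): ask GPT-4V for a safety-centric description of a site image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",        # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe this construction site image, focusing on workers, "
                          "equipment, PPE usage, and any visible hazards.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content


def add_lora_adapter(base_model):
    """Step 2 (illustrative): wrap a LLaVA-style base model with a LoRA adapter."""
    lora_config = LoraConfig(
        r=16,                      # assumed low-rank dimension
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_model, lora_config)
```

In this reading of the abstract, the resulting image-description pairs serve as instruction-tuning data for LLaVA, and only the LoRA adapter weights are updated during fine-tuning.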


Acknowledgement

This research was conducted with the support of the "2023 Yonsei University Future-Leading Research Initiative (No. 2023-22-0114)" and the "National R&D Project for Smart Construction Technology (No. RS-2020-KA156488)" funded by the Korea Agency for Infrastructure Technology Advancement under the Ministry of Land, Infrastructure and Transport, and managed by the Korea Expressway Corporation. This research also used datasets from "The Open AI Dataset Project (AI-Hub, S. Korea)"; all data information can be accessed through AI-Hub (www.aihub.or.kr).
