Large-scale Language-image Model-based Bag-of-Objects Extraction for Visual Place Recognition

  • Seung Won Jung (School of Mechanical Engineering, Korea University of Technology and Education)
  • Byungjae Park (School of Mechanical Engineering, Korea University of Technology and Education)
  • Received : 2024.02.16
  • Accepted : 2024.03.12
  • Published : 2024.03.31

Abstract

We propose a visual place recognition method that represents images using objects as visual words, where the visual words correspond to the various objects found in urban environments. To detect these objects, we implemented a zero-shot detector based on a large-scale language-image model, which can detect diverse urban objects without additional training. When building the histograms of the proposed representation, frequency-based weighting is applied to account for the importance of each object. Experiments on open datasets demonstrate the potential of the proposed method in comparison with an existing method, even under environmental and viewpoint changes.
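
The abstract describes the pipeline only at a high level. For illustration, the sketch below shows one way such a bag-of-objects representation could be built and matched: each image is reduced to the labels returned by a zero-shot detector, the labels are counted into a histogram over an object vocabulary, the histogram is reweighted with an IDF-style frequency term so common objects count less, and places are compared by cosine similarity. The vocabulary, the IDF-style weighting, and all function names are assumptions for this sketch, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence
import math

# Hypothetical object vocabulary; the paper's actual visual words are not listed here.
VOCAB = ["traffic light", "bench", "street sign", "tree", "car", "building"]

def bag_of_objects(labels: Sequence[str], vocab: Sequence[str] = VOCAB) -> List[float]:
    """Count detections of each vocabulary object in one image."""
    counts = Counter(labels)
    return [float(counts[w]) for w in vocab]

def idf_weights(db_hists: Sequence[Sequence[float]]) -> List[float]:
    """Frequency-based weights: objects seen in many database images get less weight."""
    n = len(db_hists)
    weights = []
    for j in range(len(db_hists[0])):
        df = sum(1 for h in db_hists if h[j] > 0)            # document frequency
        weights.append(math.log((n + 1) / (df + 1)) + 1.0)   # smoothed IDF
    return weights

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query_labels: Sequence[str], db_labels: Sequence[Sequence[str]]) -> int:
    """Return the index of the database image whose weighted histogram is closest."""
    db_hists = [bag_of_objects(lbls) for lbls in db_labels]
    w = idf_weights(db_hists)
    weight = lambda h: [c * wi for c, wi in zip(h, w)]
    q = weight(bag_of_objects(query_labels))
    scores = [cosine(q, weight(h)) for h in db_hists]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: label lists stand in for the output of the zero-shot detector.
db = [["car", "car", "tree", "building"], ["bench", "tree", "street sign"]]
print(best_match(["bench", "street sign", "tree"], db))  # -> 1
```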

Acknowledgement

This research was supported by a Korea Science Foundation grant funded by the Korean government (Ministry of Science and ICT) in 2023 (No. 2021R1F1A1057949).
