Acknowledgement
This research was conducted as part of the Artificial Intelligence Convergence Innovation Human Resources Development program funded by the Ministry of Science and ICT and the Institute of Information & Communications Technology Planning & Evaluation (IITP-2023-RS-2023-00256629), and was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. RS-2022-00165919).