Integrated Search | Korea Science

3D Cross-Modal Retrieval Using Noisy Center Loss and SimSiam for Small Batch Training

Yeon-Seung Choo; Boeun Kim; Hyun-Sik Kim; Yong-Suk Park
- KSII Transactions on Internet and Information Systems (TIIS), Vol. 18, No. 3, pp. 670-684, 2024
3D Cross-Modal Retrieval (3DCMR) is the task of retrieving 3D objects across modalities such as images, meshes, and point clouds. One of the most prominent methods used for 3DCMR is the Cross-Modal Center Loss Function (CLF), which applies the conventional center loss strategy to 3D cross-modal search and retrieval. Since CLF is based on center loss, its center features are also susceptible to subtle changes in hyperparameters and external inferences. For instance, performance degradation is observed when the batch size is too small. Furthermore, the Mean Squared Error (MSE) used in CLF cannot adapt to changes in batch size and, because it relies on a simple Euclidean distance between multi-modal features, is vulnerable to the data variations that occur during actual inference. To address the problems that arise from small batch training, we propose a Noisy Center Loss (NCL) method to estimate the optimal center features. In addition, we apply the simple Siamese representation learning method (SimSiam) during optimal center feature estimation to compare projected features, making the proposed method robust to changes in batch size and variations in data. As a result, the proposed approach demonstrates improved performance on the ModelNet40 dataset compared to conventional methods.
https://doi.org/10.3837/tiis.2024.03.008
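
The abstract above builds on two standard ingredients: the conventional center loss and the SimSiam similarity term. The following is a minimal sketch of those two building blocks only, not of the authors' Noisy Center Loss, whose details the abstract does not give. The function names, the batch averaging, and the plain-list feature representation are assumptions made for illustration. The stop-gradient that SimSiam applies to the target features has no effect in a forward-only computation and is therefore only noted in a comment.

```python
import math


def center_loss(features, labels, centers):
    """Conventional center loss: half the squared Euclidean distance between
    each feature and the center of its class, averaged over the batch
    (the averaging is an assumption of this sketch)."""
    total = 0.0
    for feat, label in zip(features, labels):
        center = centers[label]
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(feat, center))
    return total / len(features)


def negative_cosine(p, z):
    """Negative cosine similarity between a predicted feature p and a target
    feature z. In SimSiam, z is treated as a constant (stop-gradient)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def simsiam_loss(p1, z1, p2, z2):
    """Symmetric SimSiam loss for two views: each view's predictor output
    (p) is compared with the other view's projected feature (z)."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

Because the cosine term normalizes both vectors, it depends only on feature direction, whereas the squared Euclidean term in the center loss also depends on feature magnitude.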

Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval

Liu, Zhi; Cai, Jincen; Zhang, Mengmeng
- KSII Transactions on Internet and Information Systems (TIIS), Vol. 16, No. 7, pp. 2407-2424, 2022
Recently, the Transformer has made great progress in video retrieval tasks due to its high representation capability. In the Transformer structure, the cascaded self-attention modules are capable of capturing long-distance feature dependencies. However, local feature details are likely to deteriorate. In addition, increasing the depth of the structure is likely to produce learning bias in the learned features. In this paper, an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut) is proposed. A Multi-head Conv-Self-Attention module is introduced to model the local dependencies and improve the efficiency of local feature extraction. Meanwhile, an augmented shortcuts module based on a dual identity matrix is applied to enhance the conduction of input features and mitigate the learning bias. The proposed model is tested on the MSRVTT, LSMDC and Activity-Net benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, a gain of about 2.3% MdR and 6.1% MnR is obtained over recently proposed multimodal-based methods.
https://doi.org/10.3837/tiis.2022.07.016
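
The MdR and MnR figures quoted in the abstract above are the median and mean rank of the correct item in a retrieval result list, where lower is better. The following is a minimal sketch of how these two metrics are commonly computed from a query-by-item similarity matrix. It assumes that the correct item for query i is item i, which is an illustrative convention and not something the abstract states. The function names are hypothetical.

```python
from statistics import mean, median


def ranks_of_matches(similarity):
    """For each query row, return the 1-based rank of the correct item.
    Assumes the correct item for query i sits at column i."""
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of items scored strictly higher than the match.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    return ranks


def median_rank(similarity):
    """MdR: median rank of the correct item (lower is better)."""
    return median(ranks_of_matches(similarity))


def mean_rank(similarity):
    """MnR: mean rank of the correct item (lower is better)."""
    return mean(ranks_of_matches(similarity))
```

The median is insensitive to a few badly ranked queries, while the mean is pulled up by them, which is why the two are usually reported together.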
