• Title/Summary/Keyword: Multi-Modal Transformer

Search Result 12, Processing Time 0.026 seconds

Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval

  • Liu, Zhi;Cai, Jincen;Zhang, Mengmeng
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.7
    • /
    • pp.2407-2424
    • /
    • 2022
  • Recently, Transformer has made great progress in video retrieval tasks due to its high representation capability. For the structure of a Transformer, the cascaded self-attention modules are capable of capturing long-distance feature dependencies. However, the local feature details are likely to have deteriorated. In addition, increasing the depth of the structure is likely to produce learning bias in the learned features. In this paper, an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut) is proposed. A Multi-head Conv-Self-Attention module is introduced to model the local dependencies and improve the efficiency of local features extraction. Meanwhile, the augmented shortcuts module based on a dual identity matrix is applied to enhance the conduction of input features, and mitigate the learning bias. The proposed model is tested on MSRVTT, LSMDC and Activity-Net benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, a gain of about 2.3% MdR and 6.1% MnR is obtained over recently proposed multimodal-based methods.

FakedBits- Detecting Fake Information on Social Platforms using Multi-Modal Features

  • Dilip Kumar, Sharma;Bhuvanesh, Singh;Saurabh, Agarwal;Hyunsung, Kim;Raj, Sharma
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.1
    • /
    • pp.51-73
    • /
    • 2023
  • Social media play a significant role in communicating information across the globe, connecting with loved ones, getting the news, communicating ideas, etc. However, a group of people uses social media to spread fake information, which has a bad impact on society. Therefore, minimizing fake news and its detection are the two primary challenges that need to be addressed. This paper presents a multi-modal deep learning technique to address the above challenges. The proposed modal can use and process visual and textual features. Therefore, it has the ability to detect fake information from visual and textual data. We used EfficientNetB0 and a sentence transformer, respectively, for detecting counterfeit images and for textural learning. Feature embedding is performed at individual channels, whilst fusion is done at the last classification layer. The late fusion is applied intentionally to mitigate the noisy data that are generated by multi-modalities. Extensive experiments are conducted, and performance is evaluated against state-of-the-art methods. Three real-world benchmark datasets, such as MediaEval (Twitter), Weibo, and Fakeddit, are used for experimentation. Result reveals that the proposed modal outperformed the state-of-the-art methods and achieved an accuracy of 86.48%, 82.50%, and 88.80%, respectively, for MediaEval (Twitter), Weibo, and Fakeddit datasets.

Generating Radiology Reports via Multi-feature Optimization Transformer

  • Rui Wang;Rong Hua
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.10
    • /
    • pp.2768-2787
    • /
    • 2023
  • As an important research direction of the application of computer science in the medical field, the automatic generation technology of radiology report has attracted wide attention in the academic community. Because the proportion of normal regions in radiology images is much larger than that of abnormal regions, words describing diseases are often masked by other words, resulting in significant feature loss during the calculation process, which affects the quality of generated reports. In addition, the huge difference between visual features and semantic features causes traditional multi-modal fusion method to fail to generate long narrative structures consisting of multiple sentences, which are required for medical reports. To address these challenges, we propose a multi-feature optimization Transformer (MFOT) for generating radiology reports. In detail, a multi-dimensional mapping attention (MDMA) module is designed to encode the visual grid features from different dimensions to reduce the loss of primary features in the encoding process; a feature pre-fusion (FP) module is constructed to enhance the interaction ability between multi-modal features, so as to generate a reasonably structured radiology report; a detail enhanced attention (DEA) module is proposed to enhance the extraction and utilization of key features and reduce the loss of key features. In conclusion, we evaluate the performance of our proposed model against prevailing mainstream models by utilizing widely-recognized radiology report datasets, namely IU X-Ray and MIMIC-CXR. The experimental outcomes demonstrate that our model achieves SOTA performance on both datasets, compared with the base model, the average improvement of six key indicators is 19.9% and 18.0% respectively. These findings substantiate the efficacy of our model in the domain of automated radiology report generation.

Incorporating BERT-based NLP and Transformer for An Ensemble Model and its Application to Personal Credit Prediction

  • Sophot Ky;Ju-Hong Lee;Kwangtek Na
    • Smart Media Journal
    • /
    • v.13 no.4
    • /
    • pp.9-15
    • /
    • 2024
  • Tree-based algorithms have been the dominant methods used build a prediction model for tabular data. This also includes personal credit data. However, they are limited to compatibility with categorical and numerical data only, and also do not capture information of the relationship between other features. In this work, we proposed an ensemble model using the Transformer architecture that includes text features and harness the self-attention mechanism to tackle the feature relationships limitation. We describe a text formatter module, that converts the original tabular data into sentence data that is fed into FinBERT along with other text features. Furthermore, we employed FT-Transformer that train with the original tabular data. We evaluate this multi-modal approach with two popular tree-based algorithms known as, Random Forest and Extreme Gradient Boosting, XGBoost and TabTransformer. Our proposed method shows superior Default Recall, F1 score and AUC results across two public data sets. Our results are significant for financial institutions to reduce the risk of financial loss regarding defaulters.

An Ensemble Model for Credit Default Discrimination: Incorporating BERT-based NLP and Transformer

  • Sophot Ky;Ju-Hong Lee
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.624-626
    • /
    • 2023
  • Credit scoring is a technique used by financial institutions to assess the creditworthiness of potential borrowers. This involves evaluating a borrower's credit history to predict the likelihood of defaulting on a loan. This paper presents an ensemble of two Transformer based models within a framework for discriminating the default risk of loan applications in the field of credit scoring. The first model is FinBERT, a pretrained NLP model to analyze sentiment of financial text. The second model is FT-Transformer, a simple adaptation of the Transformer architecture for the tabular domain. Both models are trained on the same underlying data set, with the only difference being the representation of the data. This multi-modal approach allows us to leverage the unique capabilities of each model and potentially uncover insights that may not be apparent when using a single model alone. We compare our model with two famous ensemble-based models, Random Forest and Extreme Gradient Boosting.

Multimodal Sentiment Analysis Using Review Data and Product Information (리뷰 데이터와 제품 정보를 이용한 멀티모달 감성분석)

  • Hwang, Hohyun;Lee, Kyeongchan;Yu, Jinyi;Lee, Younghoon
    • The Journal of Society for e-Business Studies
    • /
    • v.27 no.1
    • /
    • pp.15-28
    • /
    • 2022
  • Due to recent expansion of online market such as clothing, utilizing customer review has become a major marketing measure. User review has been used as a tool of analyzing sentiment of customers. Sentiment analysis can be largely classified with machine learning-based and lexicon-based method. Machine learning-based method is a learning classification model referring review and labels. As research of sentiment analysis has been developed, multi-modal models learned by images and video data in reviews has been studied. Characteristics of words in reviews are differentiated depending on products' and customers' categories. In this paper, sentiment is analyzed via considering review data and metadata of products and users. Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Self Attention-based Multi-head Attention models and Bidirectional Encoder Representation from Transformer (BERT) are used in this study. Same Multi-Layer Perceptron (MLP) model is used upon every products information. This paper suggests a multi-modal sentiment analysis model that simultaneously considers user reviews and product meta-information.

A Study on Performance Improvement of GVQA Model Using Transformer (트랜스포머를 이용한 GVQA 모델의 성능 개선에 관한 연구)

  • Park, Sung-Wook;Kim, Jun-Yeong;Park, Jun;Lee, Han-Sung;Jung, Se-Hoon;Sim, Cun-Bo
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2021.11a
    • /
    • pp.749-752
    • /
    • 2021
  • 오늘날 인공지능(Artificial Intelligence, AI) 분야에서 가장 구현하기 어려운 분야 중 하나는 추론이다. 근래 추론 분야에서 영상과 언어가 결합한 다중 모드(Multi-modal) 환경에서 영상 기반의 질의 응답(Visual Question Answering, VQA) 과업에 대한 AI 모델이 발표됐다. 얼마 지나지 않아 VQA 모델의 성능을 개선한 GVQA(Grounded Visual Question Answering) 모델도 발표됐다. 하지만 아직 GVQA 모델도 완벽한 성능을 내진 못한다. 본 논문에서는 GVQA 모델의 성능 개선을 위해 VCC(Visual Concept Classifier) 모델을 ViT-G(Vision Transformer-Giant)/14로 변경하고, ACP(Answer Cluster Predictor) 모델을 GPT(Generative Pretrained Transformer)-3으로 변경한다. 이와 같은 방법들은 성능을 개선하는 데 큰 도움이 될 수 있다고 사료된다.

Improved Transformer Model for Multimodal Fashion Recommendation Conversation System (멀티모달 패션 추천 대화 시스템을 위한 개선된 트랜스포머 모델)

  • Park, Yeong Joon;Jo, Byeong Cheol;Lee, Kyoung Uk;Kim, Kyung Sun
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.1
    • /
    • pp.138-147
    • /
    • 2022
  • Recently, chatbots have been applied in various fields and have shown good results, and many attempts to use chatbots in shopping mall product recommendation services are being conducted on e-commerce platforms. In this paper, for a conversation system that recommends a fashion that a user wants based on conversation between the user and the system and fashion image information, a transformer model that is currently performing well in various AI fields such as natural language processing, voice recognition, and image recognition. We propose a multimodal-based improved transformer model that is improved to increase the accuracy of recommendation by using dialogue (text) and fashion (image) information together for data preprocessing and data representation. We also propose a method to improve accuracy through data improvement by analyzing the data. The proposed system has a recommendation accuracy score of 0.6563 WKT (Weighted Kendall's tau), which significantly improved the existing system's 0.3372 WKT by 0.3191 WKT or more.

Investigation of facto~ in square-type piezoelectric transformer using ATILA simulation (ATILA 시뮬레이션을 이용한 스퀘어타입 압전변압기의 펙터연구)

  • Vo, Vietthang;Kim, In-Sung;Joo, Hyeon-Kyu;Jeong, Soon-Jong;Kim, Min-Soo;Song, Jae-Sung
    • Proceedings of the Korean Institute of Electrical and Electronic Material Engineers Conference
    • /
    • 2010.06a
    • /
    • pp.327-327
    • /
    • 2010
  • In this paper, an investigation of factors affecting piezoelectric transformers is presented by ATILA software. These transformers are multi-layer piezoelectric transformers in square shape $28\;{\times}\;28\;mm$ and operate in first vibration mode for step-down function. The piezoelectric transformers were modeled in 3D-dimension and analyzed using finite element method in ATILA software, a popular software in piezoelectric analysis. Modal and harmonic modules were used in this process. Effective factors to the properties of piezoelectric transformers including different input electrode patterns, directions of polarization, sizes of connective comer, number of layers were examined on the simulated model using input voltage of 20 V and load resistance of $100\;{\Omega}$. Moreover, thermal analysis was also obtained with conditions of input voltage of 5 V and no-load.

  • PDF

Multi-Modal based ViT Model for Video Data Emotion Classification (영상 데이터 감정 분류를 위한 멀티 모달 기반의 ViT 모델)

  • Yerim Kim;Dong-Gyu Lee;Seo-Yeong Ahn;Jee-Hyun Kim
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.01a
    • /
    • pp.9-12
    • /
    • 2023
  • 최근 영상 콘텐츠를 통해 영상물의 메시지뿐 아니라 메시지의 형식을 통해 전달된 감정이 시청하는 사람의 심리 상태에 영향을 주고 있다. 이에 따라, 영상 콘텐츠의 감정을 분류하는 연구가 활발히 진행되고 있고 본 논문에서는 대중적인 영상 스트리밍 플랫폼 중 하나인 유튜브 영상을 7가지의 감정 카테고리로 분류하는 여러 개의 영상 데이터 중 각 영상 데이터에서 오디오와 이미지 데이터를 각각 추출하여 학습에 이용하는 멀티 모달 방식 기반의 영상 감정 분류 모델을 제안한다. 사전 학습된 VGG(Visual Geometry Group)모델과 ViT(Vision Transformer) 모델을 오디오 분류 모델과 이미지 분류 모델에 이용하여 학습하고 본 논문에서 제안하는 병합 방법을 이용하여 병합 후 비교하였다. 본 논문에서는 기존 영상 데이터 감정 분류 방식과 다르게 영상 속에서 화자를 인식하지 않고 감정을 분류하여 최고 48%의 정확도를 얻었다.

  • PDF