• Title/Summary/Keyword: vision transformer

Performance Evaluation of Efficient Vision Transformers on Embedded Edge Platforms (임베디드 엣지 플랫폼에서의 경량 비전 트랜스포머 성능 평가)

  • Minha Lee;Seongjae Lee;Taehyoun Kim
    • IEMEK Journal of Embedded Systems and Applications / v.18 no.3 / pp.89-100 / 2023
  • Recently, on-device artificial intelligence (AI) solutions using mobile devices and embedded edge devices have emerged in various fields, such as computer vision, to address network traffic burdens, low-energy operation, and security problems. Although vision transformer deep learning models have outperformed conventional convolutional neural network (CNN) models in computer vision, they require more computation and more parameters than CNN models and are therefore not directly applicable to embedded edge devices with limited hardware resources. Many researchers have proposed model compression methods or lightweight architectures for vision transformers; however, only a few studies have evaluated how such compression techniques affect performance. To address this gap, this paper presents a performance evaluation of vision transformers on embedded platforms. We investigated the behavior of three vision transformers: DeiT, LeViT, and MobileViT. Each model was evaluated in terms of accuracy and inference time on edge devices using the ImageNet dataset. We assessed the effect of quantization on latency improvement and accuracy degradation by profiling the proportion of response time occupied by major operations, and we also evaluated each model on GPU- and EdgeTPU-based edge devices. In our experiments, LeViT performed best on CPU-based edge devices, DeiT-small showed the largest performance improvement on GPU-based edge devices, and only the MobileViT models showed a performance improvement on EdgeTPU. Summarizing the profiling results, the degree of performance improvement of each vision transformer model depended strongly on the proportion of operations that could be optimized on the target edge device. In summary, to apply vision transformers to on-device AI solutions, both a proper composition of operations and optimizations specific to the target edge device must be considered.
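
As a rough illustration of the kind of evaluation described above, the following minimal sketch applies PyTorch post-training dynamic quantization to a DeiT model from timm and measures CPU latency before and after. The model name, input size, and repetition count are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: dynamic int8 quantization of a DeiT model plus a simple
# CPU latency measurement, loosely following the evaluation described above.
import time

import timm
import torch

model = timm.create_model("deit_small_patch16_224", pretrained=True)
model.eval()

# Dynamic quantization converts Linear layers (the bulk of a ViT) to int8 on CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

dummy = torch.randn(1, 3, 224, 224)

def mean_latency_ms(m, x, runs=50):
    with torch.no_grad():
        m(x)  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs * 1e3

print(f"fp32: {mean_latency_ms(model, dummy):.1f} ms")
print(f"int8: {mean_latency_ms(quantized, dummy):.1f} ms")
```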

Predicting Accident Vulnerable Situation and Extracting Scenarios of Automated Vehicle Using Vision Transformer Method Based on Vision Data (Vision Transformer를 활용한 비전 데이터 기반 자율주행자동차 사고 취약상황 예측 및 시나리오 도출)

  • Lee, Woo seop;Kang, Min hee;Yoon, Young;Hwang, Kee yeon
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.21 no.5 / pp.233-252 / 2022
  • Recently, various studies have been conducted to improve automated vehicle (AV) safety ahead of AV commercialization. In particular, scenario-based methods are directly related to essential safety assessments. However, existing scenarios lack objectivity and explainability because of limited data and their reliance on expert intervention. Therefore, this paper presents extended scenarios for AV safety assessment using real traffic accident data and a vision transformer (ViT) applied as an explainable artificial intelligence (XAI) method. The best ViT model achieved 94% accuracy, and the scenarios were presented together with attention maps. This work provides a new framework for AV safety assessment that alleviates the shortage of existing scenarios.
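
The attention-map explanations mentioned above can be illustrated with a minimal sketch that reads the CLS-token attention weights out of a single self-attention layer. The patch grid, embedding width, and random inputs are assumptions for illustration only, not the paper's model.

```python
# Hypothetical sketch: extracting a CLS-token attention map from one
# self-attention layer and reshaping it into a patch-grid heatmap.
import torch
import torch.nn as nn

embed_dim, num_patches = 192, 14 * 14                 # 224x224 image, 16x16 patches
tokens = torch.randn(1, 1 + num_patches, embed_dim)   # [CLS] + patch tokens

attn = nn.MultiheadAttention(embed_dim, num_heads=3, batch_first=True)
_, weights = attn(tokens, tokens, tokens,
                  need_weights=True, average_attn_weights=True)  # (1, N+1, N+1)

# Attention paid by the CLS token to each image patch, as a 14x14 heatmap.
cls_to_patches = weights[0, 0, 1:].reshape(14, 14)
print(cls_to_patches.shape)                           # torch.Size([14, 14])
```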

The Analysis of Color Vision Defects Mechanism for the Electric Circuits (전기적 회로에 의한 색각이상 mechanism 해석)

  • Park, Sang-An;Kim, Yong-Geun
    • Journal of Korean Ophthalmic Optics Society / v.6 no.1 / pp.81-85 / 2001
  • Color vision is composed of the wavelength absorption of the three R, G, B cone photoreceptors and the r-g and y-b channels of an opponent process. A color vision defect mechanism was constructed as an electric circuit made up of a photocell, a relay switch, and a transformer. This circuit modeled color vision defects well, reproducing the y-b chromatic valence function in the case of an R or G cone defect and the r-g chromatic valence function in the case of a B cone defect.

Unsupervised Transfer Learning for Plant Anomaly Recognition

  • Xu, Mingle;Yoon, Sook;Lee, Jaesu;Park, Dong Sun
    • Smart Media Journal / v.11 no.4 / pp.30-37 / 2022
  • Disease threatens plant growth, and recognizing the type of disease is essential for applying a remedy. In recent years, deep learning has brought significant improvement to this task; however, a large volume of labeled images is required to obtain decent performance, and annotated images are difficult and expensive to obtain in the agricultural field. Designing an efficient and effective strategy with few labeled data is therefore one of the challenges in this area. Transfer learning, which carries knowledge from a source domain to a target domain, has been borrowed to address this issue and has shown comparable results. However, current transfer learning strategies can be regarded as supervised methods, since they assume that many labeled images are available in the source domain. In contrast, unsupervised transfer learning, which uses only images from the source domain, is more convenient because collecting images is much easier than annotating them. In this paper, we leverage unsupervised transfer learning to perform plant disease recognition and achieve better performance than supervised transfer learning in many cases. In addition, a vision transformer, which has a larger model capacity than convolutional networks, is utilized to obtain a better pretrained feature space. With vision transformer-based unsupervised transfer learning, we achieve better results than current works on two datasets; in particular, we obtain 97.3% accuracy with only 30 training images per class on the Plant Village dataset. We hope that our work encourages the community to pay attention to vision transformer-based unsupervised transfer learning in the agricultural field when few labeled images are available.
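
A minimal sketch of the general idea follows, assuming a self-supervised (DINO) ViT checkpoint from timm stands in for pretraining on unlabeled source images, with a frozen backbone and a small linear head fine-tuned on the few labeled target images. The weight tag, class count, and one-step training loop are illustrative, not the authors' pipeline.

```python
# Hypothetical sketch: ViT-based unsupervised transfer learning via a
# self-supervised checkpoint plus a small supervised head on the target data.
import timm
import torch
import torch.nn as nn

backbone = timm.create_model(
    "vit_small_patch16_224.dino",   # self-supervised weights (no source labels)
    pretrained=True,
    num_classes=0,                  # strip the classifier, keep the feature space
)
for p in backbone.parameters():
    p.requires_grad = False         # freeze the pretrained feature extractor

head = nn.Linear(backbone.num_features, 38)     # e.g. Plant Village class count
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)            # stand-in for few labeled images
labels = torch.randint(0, 38, (8,))

features = backbone(images)                     # pooled ViT features
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
optimizer.step()
```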

Deep Clustering Based on Vision Transformer(ViT) for Images (이미지에 대한 비전 트랜스포머(ViT) 기반 딥 클러스터링)

  • Hyesoo Shin;Sara Yu;Ki Yong Lee
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.363-365 / 2023
  • In this paper, we propose a deep clustering method based on the Vision Transformer (ViT), which emerged as research applied the attention mechanism to image processing, in order to overcome ViT's limitations. ViT converts the patches of an input image into vectors and learns from them using only Transformers, without convolutional neural networks (CNNs); it therefore imposes no restriction on the input image size and shows high performance, but it is difficult to train on small datasets. The proposed deep clustering method first passes the input images through an embedding model to extract embedding vectors and performs clustering, then updates the embedding vectors to reflect the clustering result, and repeats this process to improve the clusters. Experiments confirmed that this improves the ViT model's ability to capture general patterns and yields more accurate clustering results.
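
The embed-cluster-update loop described above could look roughly like the following sketch, which alternates k-means on ViT embeddings with a pseudo-label update of the embedder. The model name, cluster count, and single-step update are assumptions for illustration only.

```python
# Hypothetical sketch of iterative deep clustering with a ViT embedder:
# embed -> k-means -> refine the embedder with the cluster assignments.
import timm
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

vit = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)
head = nn.Linear(vit.num_features, 10)            # 10 clusters as pseudo-classes
optimizer = torch.optim.Adam(list(vit.parameters()) + list(head.parameters()), lr=1e-4)

images = torch.randn(64, 3, 224, 224)             # stand-in for an unlabeled dataset

for _ in range(3):                                # embed -> cluster -> update
    with torch.no_grad():
        embeddings = vit(images)                  # (64, embed_dim)
    assignments = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings.numpy())
    pseudo_labels = torch.tensor(assignments, dtype=torch.long)

    # Refine the embedder so it better separates the current clusters.
    loss = nn.functional.cross_entropy(head(vit(images)), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```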

Evaluating Chest Abnormalities Detection: YOLOv7 and Detection Transformer with CycleGAN Data Augmentation

  • Yoshua Kaleb Purwanto;Suk-Ho Lee;Dae-Ki Kang
    • International journal of advanced smart convergence / v.13 no.2 / pp.195-204 / 2024
  • In this paper, we investigate the comparative performance of two leading object detection architectures, YOLOv7 and Detection Transformer (DETR), across varying levels of data augmentation using CycleGAN. Our experiments focus on chest scan images within the context of biomedical informatics, specifically targeting the detection of abnormalities. The study reveals that YOLOv7 consistently outperforms DETR across all levels of augmented data, maintaining better performance even with 75% augmented data. Additionally, YOLOv7 demonstrates significantly faster convergence, requiring approximately 30 epochs compared to DETR's 300 epochs. These findings underscore the superiority of YOLOv7 for object detection tasks, especially in scenarios with limited data and when rapid convergence is essential. Our results provide valuable insights for researchers and practitioners in the field of computer vision, highlighting the effectiveness of YOLOv7 and the importance of data augmentation in improving model performance and efficiency.
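
One possible reading of the augmentation levels used above is sketched below: real chest-scan images are mixed with a chosen proportion of pre-generated CycleGAN images before training either detector. The directory names and the 75% ratio are assumptions, not the authors' setup.

```python
# Hypothetical sketch: build a training set that adds a fixed proportion of
# CycleGAN-generated images on top of the real chest-scan images.
import random

from torch.utils.data import ConcatDataset, Subset
from torchvision.datasets import ImageFolder
from torchvision.transforms import ToTensor

real = ImageFolder("data/chest_real", transform=ToTensor())           # assumed path
synthetic = ImageFolder("data/chest_cyclegan", transform=ToTensor())  # assumed path

augment_ratio = 0.75                                  # "75% augmented data" level
n_extra = int(len(real) * augment_ratio)
picked = random.sample(range(len(synthetic)), min(n_extra, len(synthetic)))

train_set = ConcatDataset([real, Subset(synthetic, picked)])
print(len(real), len(train_set))
```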

U-net with vision transformer encoder for polyp segmentation in colonoscopy images (비전 트랜스포머 인코더가 포함된 U-net을 이용한 대장 내시경 이미지의 폴립 분할)

  • Ayana, Gelan;Choe, Se-woon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.97-99 / 2022
  • For the early identification and treatment of colorectal cancer, accurate polyp segmentation is crucial. However, polyp segmentation is a challenging task, and the majority of current approaches struggle with two issues. First, the position, size, and shape of individual polyps vary greatly (intra-class inconsistency). Second, there is a significant degree of similarity between polyps and their surroundings under certain circumstances, such as motion blur and light reflection (inter-class indistinction). U-net, which is composed of convolutional neural networks as encoder and decoder, is considered the standard for tackling this task. We propose an updated U-net architecture that replaces the encoder with a vision transformer network for polyp segmentation. The proposed architecture performed better than the standard U-net architecture for the task of polyp segmentation.
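
A much-simplified sketch of a U-net-style segmenter with a ViT encoder is shown below; skip connections are omitted, and the model name and decoder widths are assumptions rather than the architecture proposed in the paper.

```python
# Hypothetical sketch: ViT patch tokens reshaped into a feature map, followed by
# a small upsampling decoder that predicts a 1-channel polyp mask.
import timm
import torch
import torch.nn as nn

class ViTUNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = timm.create_model("vit_tiny_patch16_224", pretrained=True)
        dim = self.encoder.embed_dim                       # 192
        self.decoder = nn.Sequential(                      # 14x14 -> 224x224
            nn.ConvTranspose2d(dim, 64, kernel_size=4, stride=4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 16, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),  # polyp mask logits
        )

    def forward(self, x):
        tokens = self.encoder.forward_features(x)          # (B, 1 + 196, 192)
        patches = tokens[:, 1:]                            # drop the CLS token
        grid = patches.transpose(1, 2).reshape(x.size(0), -1, 14, 14)
        return self.decoder(grid)

mask = ViTUNetSketch()(torch.randn(1, 3, 224, 224))
print(mask.shape)                                          # torch.Size([1, 1, 224, 224])
```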

Comparative Analysis of VT-ADL Model Performance Based on Variations in the Loss Function (Loss Function 변화에 따른 VT-ADL 모델 성능 비교 분석)

  • Namjung Kim;Changjoon Park;Junhwi Park;Jaehyun Lee;Jeonghwan Gwak
    • Proceedings of the Korean Society of Computer Information Conference / 2024.01a / pp.41-43 / 2024
  • This study focuses on the Vision Transformer-based Anomaly Detection and Localization (VT-ADL) model and comparatively analyzes how changing the loss function affects anomaly detection and localization performance on the MVTec dataset. The original loss function was replaced with a VAE loss, a combination of KL divergence and log-likelihood loss, and the resulting performance change was examined in depth. The experiments showed that switching to the VAE loss noticeably improves the anomaly detection ability of the VT-ADL model, with an improvement of about 5% in PRO-score over the baseline. These results suggest that optimizing the loss function can have a significant impact on the overall performance of the VT-ADL model. The study also highlights the importance of loss-function selection for anomaly detection and localization with Vision Transformer-based models and is expected to provide a useful reference for future related work.
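
The VAE loss referred to above, a log-likelihood (reconstruction) term plus a KL-divergence term, can be written as in the following sketch; the beta weighting and tensor shapes are illustrative assumptions rather than values from the paper.

```python
# Hypothetical sketch of a VAE-style loss: reconstruction term plus the
# closed-form KL divergence between N(mu, sigma^2) and the standard normal prior.
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, beta=1.0):
    # Gaussian log-likelihood term (up to constants) as a summed squared error.
    recon_term = F.mse_loss(recon, target, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form.
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + beta * kl_term

recon = torch.rand(4, 3, 64, 64)
target = torch.rand(4, 3, 64, 64)
mu, logvar = torch.zeros(4, 128), torch.zeros(4, 128)
print(vae_loss(recon, target, mu, logvar))
```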

The Detection of Multi-class Vehicles using Swin Transformer (Swin Transformer를 이용한 항공사진에서 다중클래스 차량 검출)

  • Lee, Ki-chun;Jeong, Yu-seok;Lee, Chang-woo
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.112-114 / 2021
  • To understand urban conditions, the number of means of transportation and the traffic flow are essential factors to identify. This paper improves on the detection capabilities shown in previous studies, which learned various vehicle types with Mask R-CNN, by introducing the Swin Transformer, a widely used transformer model that has shown higher performance than existing convolutional neural networks, to detect specific types of vehicles in urban aerial images.
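
For the Mask R-CNN baseline side of such a study, a standard torchvision recipe is to replace the pretrained heads for the vehicle classes of interest, as sketched below. The class list is assumed, and swapping the backbone for a Swin Transformer as in the paper is not shown here.

```python
# Hypothetical sketch: fine-tuning a pretrained torchvision Mask R-CNN for a
# small set of vehicle classes by replacing its box and mask heads.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 4  # assumed split: background + car, bus, truck

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the new class count.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask-prediction head likewise.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
```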

A Dual-Structured Self-Attention for improving the Performance of Vision Transformers (비전 트랜스포머 성능향상을 위한 이중 구조 셀프 어텐션)

  • Kwang-Yeob Lee;Hwang-Hee Moon;Tae-Ryong Park
    • Journal of IKEEE / v.27 no.3 / pp.251-257 / 2023
  • In this paper, we propose a dual-structured self-attention method that compensates for the weak local-feature extraction of the vision transformer's self-attention. Vision transformers, which are more computationally efficient than convolutional neural networks in object classification, object segmentation, and video recognition, are relatively weak at extracting local features. To solve this problem, many studies build on windows or shifted windows, but these methods weaken the advantages of self-attention-based transformers by increasing computational complexity through multiple levels of encoders. This paper proposes a dual-structured self-attention that combines self-attention with a neighborhood network to improve the locality inductive bias over existing methods. The neighborhood network, which extracts local context information, has much lower computational complexity than the window structure. CIFAR-10 and CIFAR-100 were used to compare the proposed dual-structured self-attention transformer with the existing transformer, and the experiments showed improvements of 0.63% and 1.57% in Top-1 accuracy, respectively.
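
One way to picture a dual-structured block of this kind is sketched below: a global self-attention branch in parallel with a lightweight neighborhood branch (here a depthwise 3x3 convolution over the patch grid), fused by a residual sum. This is an interpretation of the idea, not the authors' exact design; the dimensions and grid size are assumptions.

```python
# Hypothetical sketch of a dual-structured self-attention block: global
# multi-head attention plus a depthwise-convolution "neighborhood" branch.
import torch
import torch.nn as nn

class DualSelfAttentionBlock(nn.Module):
    def __init__(self, dim=192, heads=3, grid=14):
        super().__init__()
        self.grid = grid
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Neighborhood branch: mixes each token with its 3x3 spatial neighbors.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens):                      # (B, N, dim), N = grid * grid
        x = self.norm(tokens)
        global_out, _ = self.attn(x, x, x)          # global self-attention branch
        b, n, d = x.shape
        grid = x.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        local_out = self.local(grid).flatten(2).transpose(1, 2)
        return tokens + global_out + local_out      # residual fusion of both branches

out = DualSelfAttentionBlock()(torch.randn(2, 14 * 14, 192))
print(out.shape)                                    # torch.Size([2, 196, 192])
```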