• 제목/요약/키워드: SIMD Processor

검색결과 65건 처리시간 0.019초

Improved Disparity Map Computation on Stereoscopic Streaming Video with Multi-core Parallel Implementation

  • Kim, Cheong Ghil;Choi, Yong Soo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제9권2호
    • /
    • pp.728-741
    • /
    • 2015
  • Stereo vision has become an important technical issue in the field of 3D imaging, machine vision, robotics, image analysis, and so on. The depth map extraction from stereo video is a key technology of stereoscopic 3D video requiring stereo correspondence algorithms. This is the matching process of the similarity measure for each disparity value, followed by an aggregation and optimization step. Since it requires a lot of computational power, there are significant speed-performance advantages when exploiting parallel processing available on processors. In this situation, multi-core CPU may allow many parallel programming technologies to be realized in users computing devices. This paper proposes parallel implementations for calculating disparity map using a shared memory programming and exploiting the streaming SIMD extension technology. By doing so, we can take advantage both of the hardware and software features of multi-core processor. For the performance evaluation, we implemented a parallel SAD algorithm with OpenMP and SSE2. Their processing speeds are compared with non parallel version on stereoscopic streaming video. The experimental results show that both technologies have a significant effect on the performance and achieve great improvements on processing speed.

선형 어레이 SliM-II 이미지 프로세서 칩 (A linear array SliM-II image processor chip)

  • 장현만;선우명훈
    • 전자공학회논문지C
    • /
    • 제35C권2호
    • /
    • pp.29-35
    • /
    • 1998
  • This paper describes architectures and design of a SIMD type parallel image processing chip called SliM-II. The chiphas a linear array of 64 processing elements (PEs), operates at 30 MHz in the worst case simulation and gives at least 1.92 GIPS. In contrast to existing array processors, such as IMAP, MGAP-2, VIP, etc., each PE has a multiplier that is quite effective for convolution, template matching, etc. The instruction set can execute an ALU operation, data I/O, and inter-PE communication simulataneously in a single instruction cycle. In addition, during the ALU/multiplier operation, SliM-II provides parallel move between the register file and on-chip memory as in DSP chips, SliM-II can greatly reduce the inter-PE communication overhead, due to the idea a sliding, which is a technique of overlapping inter-PE communication with computation. Moreover, the bandwidth of data I/O and inter-PE communication increases due to bit-parallel data paths. We used the COMPASS$^{TM}$ 3.3 V 0.6.$\mu$m standrd cell library (v8r4.10). The total number of transistors is about 1.5 muillions, the core size is 13.2 * 13.0 mm$^{2}$ and the package type is 208 pin PQ2 (Power Quad 2). The performance evaluation shows that, compared to a existing array processors, a proposed architeture gives a significant improvement for algorithms requiring multiplications.s.

  • PDF

고성능 내장형 마이크로프로세서를 위한 분기예측기의 설계 및 성능평가 (Branch Predictor Design and Its Performance Evaluation for A High Performance Embedded Microprocessor)

  • 이상혁;김일관;최린
    • 대한전자공학회:학술대회논문집
    • /
    • 대한전자공학회 2002년도 하계종합학술대회 논문집(2)
    • /
    • pp.129-132
    • /
    • 2002
  • AE64000 is the 64-bit high-performance microprocessor that ADC Co. Ltd. is developing for an embedded environment. It has a 5-stage pipeline and uses Havard architecture with a separated instruction and data caches. It also provides SIMD-like DSP and FP operation by enabling the 8/16/32/64-bit MAC operation on 64-bit registers. AE64000 processor implements the EISC ISA and uses the instruction folding mechanism (Instruction Folding Unit) that effectively deals with LERI instruction in EISC ISA. But this unit makes branch prediction behavior difficult. In this paper, we designs a branch predictor optimized for AE64000 Pipeline and develops a AES4000 simulator that has cycle-level precision to validate the performance of the designed branch predictor. We makes TAC(Target address cache) and BPT(branch prediction table) seperated for effective branch prediction and uses the BPT(removed indexed) that has no address tags.

  • PDF

래스터화 알고리즘을 위한 최적의 매니코어 프로세서 구조 탐색 (Architecture Exploration of Optimal Many-Core Processors for a Vector-based Rasterization Algorithm)

  • 손동구;김철홍;김종면
    • 대한임베디드공학회논문지
    • /
    • 제9권1호
    • /
    • pp.17-24
    • /
    • 2014
  • In this paper, we implement and evaluate the performance of a vector-based rasterization algorithm for 3D graphics by using a SIMD (single instruction multiple data) many-core processor architecture. In addition, we evaluate the impact of a data-per-processing elements (DPE) ratio that is defined as the amount of data directly mapped to each processing element (PE) within many-core in terms of performance, energy efficiency, and area efficiency. For the experiment, we utilize seven different PE configurations by varying the DPE ratio (or the number PEs), which are implemented in the same 130 nm CMOS technology with a 500 MHz clock frequency. Experimental results indicate that the optimal PE configuration is achieved as the DPE ratio is in the range from 16,384 to 256 (or the number of PEs is in the range from 16 and 1,024), which meets the requirements of mobile devices in terms of the optimal performance and efficiency.

다수 혹은 긴 워드 연산을 위한 레지스터 파일 확장을 통한 대칭 및 비대칭 암호화 알고리즘의 가속화 (Accelerating Symmetric and Asymmetric Cryptographic Algorithms with Register File Extension for Multi-words or Long-word Operation)

  • 이상훈;최린
    • 전자공학회논문지CI
    • /
    • 제43권2호
    • /
    • pp.1-11
    • /
    • 2006
  • 본 연구에서는 대칭 및 비대칭 암호화 알고리즘을 가속화하기 위해, 다수 혹은 긴 워드 연산을 위한 레지스터 파일 확장 구조 (Register File Extension for Multi-words or Long-word Operation: RFEMLO)라는 새로운 레지스터 파일 구조를 제안한다. 암호화 알고리즘은 긴 워드 피연산자에 대한 명령어를 통하여 가속화 할 수 있다는 점에 착안하여, RFEMLO는 하나의 레지스터 명을 통해 여러 개의 레지스터에 접근할 수 있도록 하여 여러 연산자에 대해 동일한 연산을 수행할 수 있도록 하거나, 여러 개의 레지스터를 하나의 데이터로 사용할 수 있게 한다. RFEMLO는 긴 워드 피연산자에 대한 명령어 집합의 추가와 이를 지원하는 기능 유닛을 추가함으로서 범용 프로세서에 적용할 수 있다. 제안된 하드웨어 구조와 명령어 집합의 효율성을 평가하기 위해 Simplescalar/ARM 3.0을 사용하여 대칭 및 비대칭의 다양한 암호화 알고리즘에 적용하였다. 실험 결과, RFEMLO을 적용한 순차적 파이프라인을 가진 프로세서에서 대칭 암호화 알고리즘의 경우 $40%{\sim}160%$의 성능 향상을, 비대칭 암호화 알고리즘의 경우 $150%{\sim}230%$의 높은 성능향상을 얻을 수 있었다. RFEMLO의 적용을 통한 성능 항상은 이슈 폭의 증가를 이용한 슈퍼스칼라 구현에 따른 성능 향상과 비교할 때, 훨씬 적은 하드웨어 비용으로 효과적인 성능 향상을 얻을 수 있음을 확인하였으며 슈퍼스칼라 프로세서에 RFEMLO를 적용하는 경우에도 대칭 암호화 알고리즘에서는 최대 83.6%, 비대칭 암호화 알고리즘에서는 최대 138.6%의 추가적인 성능향상을 얻을 수 있었다.