Search | Korea Science

An Efficient 4$\times$4 Integer Transform Algorithm on SIMD (SIMD 기반의 효율적인 4$\times$4 정수변환 방법)

유상준;오승준;안창범
- Proceedings of the Korean Information Science Society Conference
- /
- 2004.10a
- /
- pp.55-57
- /
- 2004
DCT(Discrete Cosine Transform)는 현존하는 블록기반 영상 압축 코딩기법의 핵심이 되는 부분이다. 많은 고속 방법이 제안되었으며, 최근 들어 SIMD 병렬구조를 이용한 고속방법들이 제안되고 있다. 본 논문에서는 SIMD명령어를 가지는 프로세서에서 4$\times$4 정수변환의 속도를 최적화하기 위한 알고리즘을 제안한다. 본 논문에서 제안하는 알고리즘은 128비트 SIMD영령어로 확장이 가능하며 비슷한 구조를 가지는 Hadamard 변환에서 적용할 수 있다. 제안하는 방법을 펜티엄4 2.4G에서 구현할 경우 H.264 참조 부호화기의 4$\times$4 정수변환 방법보다 64비트 SIMD 명령어를 사용할 경우 4.34배 128-bit SIMD 명령어를 사용할 경우 6.77배의 성능을 얻을 수 있다.
PDF

Design of a scalable general-purpose parallel associative processor using content-addressable memory (Content-Addressable Memory를 이용한 확장 가능한 범용 병렬 Associative Processor 설계)

Park, Tae-Geun
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.43 no.2 s.344
- /
- pp.51-59
- /
- 2006
Von Neumann architecture suffers from the interface between the central processing unit and the memory, which is called 'Von Neumann bottleneck' In this paper, we propose a scalable general-purpose associative processor (AP) based on content-addressable memory (CAM) which solves this problem and is suitable for the search-oriented applications. We propose an efficient instruction set and a structural scalability to extend for larger applications. We define twelve instructions and provide some reduced instructions to speed up which execute two instructions in a single instruction cycle. The proposed AP performs in a bit-serial, word-parallel fashion and can be considered as a 32-bit general-purpose parallel processor with a massively parallel SIMD structure. We design and simulate a maximum/minumum search greater-than/less-than search, and parallel addition to verify the proposed architecture. The algorithms are executed in a constant time O(k) regardless of the number of input data.
PDF KSCI

Design of a RISC Processor with an Efficient Processing Unit for Multimedia Data (효율적인 멀티미디어데이터 처리를 위한 RISC Processor의 설계)

조태헌;남기훈;김명환;이광엽
- Proceedings of the IEEK Conference
- /
- 2003.07b
- /
- pp.867-870
- /
- 2003
본 논문은 멀티미디어 데이터 처리를 위한 효율적인 RISC 프로세서 유닛의 설계를 목표로 Vector 프로세서의 SIMD(Single Instruction Multiple Data) 개념을 바탕으로 고정된 연산기 데이터 비트 수에 비해 상대적으로 작은 비트수의 데이터 연산의 부분 병렬화를 통하여 멀티미디어 데이터 연산의 기본이 되는 곱셈누적(MAC : Multiply and Accumulate) 연산의 성능을 향상 시킨다. 또한 기존의 MMX나 VIS 등과 같은 범용 프로세서들의 부분 병렬화를 위해 전 처리 과정의 필요충분조건인 데이터의 연속성을 위해 서로 다른 길이의 데이터 흑은 비트 수가 작은 멀티미디어의 데이터를 하나의 데이터로 재처리 하는 재정렬 혹은 Packing/Unpacking 과정이 성능 전체적인 성능 저하에 작용하게 되므로 본 논문에서는 기존의 프로세서의 연산기 구조를 재이용하여 병렬 곱셈을 위한 연산기 구조를 구현하고 이를 위한 데이터 정렬 연산 구조를 제안한다.
PDF

On the Parallel Merging Algorithm for Linear Lists (선형 리스트상의 병렬 병합 알고리즘)

민용식
- The Journal of the Acoustical Society of Korea
- /
- v.10 no.5
- /
- pp.20-26
- /
- 1991
본 논문은 병렬 병합 알고리즘 중에서 선형 리스트상에서의 병합 시키기 위한 알고리즘을 제시 하고자 하는데 있어서 다음 두가지에 그 주된 목적이 있다. 1) 병렬 병합 알고리즘을 위한 ADT 를 제시하고 있다. 2) 선형 리스트상에서 병합되어지는 MKLP 알고리즘을 제시코저 한다. 이때 mklp 방 법은 P(1<=P<=N)개의 프로세서를 이용하고 그리고 SIMD-SM(EREW-PRAM) 모델 상에 구현이 되는 병렬 병합 알고리즘을 구현코저 한다.
PDF

An Efficient Technique for Processing of Spatial Data Using GPU (GPU를 사용한 효율적인 공간 데이터 처리)

Lee, Jae-Il;Oh, Byoung-Woo
- Spatial Information Research
- /
- v.17 no.3
- /
- pp.371-379
- /
- 2009
Recently, GPU (Graphics Processing Unit) has been improved rapidly on the need of speed for gaming. As a result, GPU contains multiple ALU (Arithmetic Logic Unit) for parallel processing of a lot of graphics data, such as transform, ray tracing, etc. Therefore, this paper proposed a technique for parallel processing of spatial data using GPU. Spatial data consists of multiple coordinates, and each coordinate contains value of x and y axis. To display spatial data graphics operations have to be processed to large amount of coordinates. Because the graphics operation is identical and coordinates are multiple data, SIMD (Single Instruction Multiple Data) parallel processing of GPU can be used for processing of spatial data to improve performance. This paper implemented SIMD parallel processing of spatial data using two kinds of SDK (Software Development Kit). CUDA and ATI Stream are used for NVIDIA and ATI GPU respectively. Experiments that measure time of calculation for graphics operations are carried out to observe enhancement of performance. Experimental result is reported that proposed method can enhance performance up to 1,162% for graphics operations. The proposed method that uses parallel processing with GPU for spatial data can be generally used to enhance performance for applications which deal with large amount of spatial data.
PDF

H.264/AVC Decoder Parallelization Methods for Real-time Full-HD Image Processing (Full-HD 영상의 실시간 처리를 위한 H.264/AVC 디코더 병렬화 기법)

Yoo, Hosun;Kim, Ilseung;Kim, Taeho;Jeon, Jeehyun;Jeong, Jechang
- Proceedings of the Korean Society of Broadcast Engineers Conference
- /
- 2012.07a
- /
- pp.453-456
- /
- 2012
최근 멀티코어 프로세서의 사용이 증가함에 따라 영상처리나 대용량 처리가 필요한 기술과 같은 다양한 분야에 OpenMP, SIMD 등과 같은 다양한 병렬화 기법들이 적용되고 있다. 특히, 영상처리 분야에서 Full-HD, UHD, 3D TV 등과 같이 높은 복잡도를 갖는 컨텐츠들의 수요가 높아짐에 따라 기존의 싱글코어 기반의 코덱에 병렬화를 적용하는 여러가지 기법들이 제안되어왔다. 본 논문은 기존의 OpenMP와 SIMD와 같은 병렬처리 기법을 H.264/AVC 코덱의 참조 소프트웨어 JM 18.2의 디코더에 적용함으로써 Full-HD영상을 실시간으로 디코딩하는 기법을 제안한다. 실험결과는 평균 38.338 fps의 프레임 율을 보이며 병렬처리시 평균 2배 이상 프레임 율이 증가함으로써 Full-HD 영상의 실시간 처리가 가능하다는 것을 보여준다.
PDF

Parallel Computation of Elliptic Partial Differential Equation on MP-2 (MP-2에서의 타원형 편미분 방정식 병렬계산)

Kim, Hyoung-Joong;Lee, Yong-Ho
- Journal of Industrial Technology
- /
- v.14
- /
- pp.19-28
- /
- 1994
We can get a tridiagonal block Toeplitz linear system by the finite difference approximation of 2-D Poisson equation. To exploit the nice property of this linear equation, we transform the equation into a Lyapunov equation and apply DST (discrete sine transform) to get diagonal matrix based Lyapunov equation. DST can be performed using FFT, which enables high-speed computaion. All the computations are performed on an SIMD parallel computer, the MasPar MP-2 with 4,096 processing elements. In this paper, parallel algorithm, mapping method of the algorithm onto the MP-2, and timing results are presented.
PDF

Implementation of Parallel Volume Rendering Using the Sequential Shear-Warp Algorithm (순차 Shear-Warp 알고리즘을 이용한 병렬볼륨렌더링의 구현)

Kim, Eung-Kon
- The Transactions of the Korea Information Processing Society
- /
- v.5 no.6
- /
- pp.1620-1632
- /
- 1998
This paper presents a fast parallel algorithm for volume rendering and its implementation using C language and MPI MasPar Programming Language) on the 4,096 processor MasPar MP-2 machine. This parallel algorithm is a parallelization hased on the Lacroute' s sequential shear - warp algorithm currently acknowledged to be the fastest sequential volume rendering algorithm. This algorithm reduces communication overheads by using the sheared space partition scheme and the load balancing technique using load estimates from the previous iteration, and the number of voxels to be processed by using the run-length encoded volume data structure.Actual performance is 3 to 4 frames/second on the human hrain scan dataset of $128\times128\times128$ voxels. Because of the scalability of this algorithm, performance of ]2-16 frames/sc.'cond is expected on the 16,384 processor MasPar MP-2 machine. It is expected that implementation on more current SIMD or MIMD architectures would provide 3O~60 frames/second on large volumes.
PDF

Architecture of a scalable general-purpose associative processor and its applications (확장 가능한 범용 Associative Processor 구조 및 응용)

Yun, Jae-Bok;Kim, Ju-Young;Kim, Jin-Wook;Park, Tae-Geun
- Proceedings of the IEEK Conference
- /
- 2005.11a
- /
- pp.1141-1144
- /
- 2005
일반 컴퓨터에서 중앙처리장치와 메모리 사이의 병목 현상인 "Von Neumann Bottleneck"을 보이는데 본 논문에서는 이러한 문제점을 해소하고 검색위주의 응용분야에서 우수한 성능을 보이는 확장 가능한 범용 Associative Processor(AP) 구조를 제안하였다. 본 연구에서는 Associative computing을 효율적으로 수행할 수 있는 명령어 세트를 제안하였으며 다양하고 대용량 응용분야에도 적용할 수 있도록 구조를 확장 가능하게 설계함으로써 유연한 구조를 갖는다. 12 가지의 명령어가 정의되었으며 프로그램이 효율적으로 수행될 수 있도록 명령어 셋을 구성하고 연속된 명령어를 하나의 명령어로 구현함으로써 처리시간을 단축하였다. 제안된 프로세서는 bit-serial, word-parallel로 동작하며 대용량 병렬 SIMD 구조를 갖는 32 비트 범용 병렬 프로세서로 동작한다. 포괄적인 검증을 위하여 명령어 단위의 검증 뿐 아니라 최대/최소 검색, 이상/이하 검색, 병렬 덧셈 등의 기본적인 병렬 알고리즘을 검증하였으며 알고리즘은 처리 데이터의 개수와는 무관한 상수의 복잡도 O(k)를 갖으며 데이터의 비트 수만큼의 이터레이션을 갖는다.
PDF

Implementation of an Optimal Many-core Processor for Beamforming Algorithm of Mobile Ultrasound Image Signals (모바일 초음파 영상신호의 빔포밍 기법을 위한 최적의 매니코어 프로세서 구현)

Choi, Byong-Kook;Kim, Jong-Myon
- Journal of the Korea Society of Computer and Information
- /
- v.16 no.8
- /
- pp.119-128
- /
- 2011
This paper introduces design space exploration of many-core processors that meet high performance and low power required by the beamforming algorithm of image signals of mobile ultrasound. For the design space exploration of the many-core processor, we mapped different number of ultrasound image data to each processing element of many-core, and then determined an optimal many-core processor architecture in terms of execution time, energy efficiency and area efficiency. Experimental results indicate that PE=4096 and 1024 provide the highest energy efficiency and area efficiency, respectively. In addition, PE=4096 achieves 46x and 10x better than TI DSP C6416, which is widely used for ultrasound image devices, in terms of energy efficiency and area efficiency, respectively.
https://doi.org/10.9708/jksci.2011.16.8.119 인용 PDF KSCI

Search Result 42, Processing Time 0.029 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)