• Title/Summary/Keyword: 병렬 연산 처리

Search Result 550, Processing Time 0.029 seconds

A Study on Performance Improvement of Distributed Computing Framework using GPU (GPU를 활용한 분산 컴퓨팅 프레임워크 성능 개선 연구)

  • Song, Ju-young;Kong, Yong-joon;Shim, Tak-kil;Shin, Eui-seob;Seong, Kee-kin
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.04a
    • /
    • pp.499-502
    • /
    • 2012
  • 빅 데이터 분석의 시대가 도래하면서 대용량 데이터의 특성과 계산 집약적 연산의 특성을 동시에 가지는 문제 해결에 대한 요구가 늘어나고 있다. 대용량 데이터 처리의 경우 각종 분산 파일 시스템과 분산/병렬 컴퓨팅 기술들이 이미 많이 사용되고 있으며, 계산 집약적 연산 처리의 경우에도 GPGPU 활용 기술의 발달로 보편화되는 추세에 있다. 하지만 대용량 데이터와 계산 집약적 연산 이 두 가지 특성을 모두 가지는 문제를 처리하기 위해서는 많은 제약 사항들을 해결해야 하는데, 본 논문에서는 이에 대한 대안으로 분산 컴퓨팅 프레임워크인 Hadoop MapReduce와 Nvidia의 GPU 병렬 컴퓨팅 아키텍처인 CUDA 흘 연동하는 방안을 제시하고, 이를 밀집행렬(dense matrix) 연산에 적용했을 때 얻을 수 있는 성능 개선 효과에 대해 소개하고자 한다.

Efficient DSP Architecture for Viterbi Algorithm (비터비 알고리즘의 효율적인 연산을 위한 DSP 구조 설계)

  • Park Weon heum;Sunwoo Myung hoon;Oh Seong keun
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.30 no.3A
    • /
    • pp.217-225
    • /
    • 2005
  • This paper presents specialized DSP instructions and their architecture for the Viterbi algorithm used in various wireless communication standards. The proposed architecture can significantly reduce the Trace Back (TB) latency. The proposed instructions perform the Add Compare Select (ACS) and TB operations in parallel and the architecture has special hardware, called the Offset Calculation Unit (OCU), which automatically calculates data addresses for the trellis butterfly computations. Logic synthesis has been Performed using the Samsung SEC 0.18 μm standard cell library. OCU consists of 1,460 gates and the maximum delay of OCU is about 5.75 ns. The BER performance of the ACS-TB parallel method increases about 0.00022dB at 6dB Eb/No compared with the typical TB method, which is negligible. When the constraint length K is 5, the proposed DSP architecture can reduce the decoding cycles about 17% compared with the Carmel DSP and about 45% compared with 7MS320c15x.

A Design of High Performance Operation Intra Predictor for H.264/AVC Decoder (H.264/AVC 복호기를 위한 고성능 연산처리 인트라 예측기 설계)

  • Jin, Xianzhe;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.16 no.11
    • /
    • pp.2503-2510
    • /
    • 2012
  • This paper proposes a parallel operation intra predictor for H.264/AVC decoder. In previous intra predictor design, common operation units were designed for 17 prediction modes in order to compute more effectively. However, it was designed by analyzing the equation applied to one pixel. So, there are four operation units for computing 16 pixels in a $4{\times}4$ block and they need four cycles. In this paper, the proposed intra predictor contains T3(Three Type Transform) operation unit for parallel operation. It divides 17 modes into 3 types to calculate 16 pixels of a $4{\times}4$ block in only one cycle and needs 16 cycles minimum in 16x16 block. As the result of the experiment, in terms of processing cycle, the performance of proposed intra predictor is 58.95% higher than the previous one.

An Efficient Technique for Processing of Spatial Data Using GPU (GPU를 사용한 효율적인 공간 데이터 처리)

  • Lee, Jae-Il;Oh, Byoung-Woo
    • Spatial Information Research
    • /
    • v.17 no.3
    • /
    • pp.371-379
    • /
    • 2009
  • Recently, GPU (Graphics Processing Unit) has been improved rapidly on the need of speed for gaming. As a result, GPU contains multiple ALU (Arithmetic Logic Unit) for parallel processing of a lot of graphics data, such as transform, ray tracing, etc. Therefore, this paper proposed a technique for parallel processing of spatial data using GPU. Spatial data consists of multiple coordinates, and each coordinate contains value of x and y axis. To display spatial data graphics operations have to be processed to large amount of coordinates. Because the graphics operation is identical and coordinates are multiple data, SIMD (Single Instruction Multiple Data) parallel processing of GPU can be used for processing of spatial data to improve performance. This paper implemented SIMD parallel processing of spatial data using two kinds of SDK (Software Development Kit). CUDA and ATI Stream are used for NVIDIA and ATI GPU respectively. Experiments that measure time of calculation for graphics operations are carried out to observe enhancement of performance. Experimental result is reported that proposed method can enhance performance up to 1,162% for graphics operations. The proposed method that uses parallel processing with GPU for spatial data can be generally used to enhance performance for applications which deal with large amount of spatial data.

  • PDF

공개키 암호 체계와 Shor 알고리듬

  • 이순칠
    • Review of KIISC
    • /
    • v.14 no.3
    • /
    • pp.1-7
    • /
    • 2004
  • 양자알고리듬들 중 쇼의 알고리듬은 공개키 암호체계의 근간을 이루는 소인수분해를 고전알고리듬보다 훨씬 빨리 처리할 수 있다. 고전컴퓨터로 N자리 수를 소인수분해 하는데 걸리는 시간은 exp$[(InN)^{1/3}(In In N)^{2/3})]$에 비례하지만 쇼의 양자풀이법을 사용하면 약$(InN)^3$ 보다 적은 시간이 걸린다. 이 알고리듬의 핵심은 양자계의 중첩이라는 성질을 이용해서 푸리에 변환을 모든 데이터에 대해 병렬적으로 동시에 처리함으로서 주기를 빠르게 찾는다는 것이다. 이러한 양자전산의 이점은 모든 연산이 중첩된 상태에 독립적으로 작용한다는 자연계의 선형성에서 비롯된다. 고전컴퓨터에서도 병렬처리를 하지만 양자적 병렬처리를 고전컴퓨터의 병렬처리로 대신할 수는 없다. N비트로 나타내지는$2^N$ 개의 숫자에 대해 동시에 병렬처리 하는데 양자컴퓨터는 한대면 되지만 고전컴퓨터는 $2^N$대가 필요하므로 비트수가 증가하면 필요한 고전컴퓨터의 수가 비현실적으로 증가하기 때문이다. 이 알고리듬의 수행으로 얻어지는 결과는 확정적인 것이 아니며 확률적으로 율은 당을 얻는다. 어떤 수가 약수가 되는지 아닌지는 금방 확인해 볼 수 있으므로 서너 번 이와 같은 시행착오 과정을 거쳐 옳은 답을 얻는다 해도 문제가 되지는 않는다.

Design of Parallel Decimal Floating-Point Arithmetic Unit for High-speed Operations (고속 연산을 위한 병렬 구조의 십진 부동소수점 연산 장치 설계)

  • Yun, Hyoung-Kie;Moon, Dai-Tchul
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.17 no.12
    • /
    • pp.2921-2926
    • /
    • 2013
  • In this paper, a decimal floating-point arithmetic unit(DFP) was proposed and redesigned to support high speed arithmetic operation employed parallel processing technique. The basic architecture of the proposed DFP was based on the L.K.Wang's DFP and improved it enabling high speed operation by parallel processing for two operands with same size of exponent. The proposed DFP was synthesized as a target device of xc2vp30-7ff896 using Xilinx ISE and verified by simulation using Flowrian tool of System Centroid co. Compared to L.K.Wang's DFP and reference [6]'s method, the proposed DFP improved data processing speed about 8.4% and 3% respectively in case of same input data.

Warp-based Emotion-adaptive Real-Time Transforming Technique of Character's Facial Expression (워핑 기반의 감정 적응형 실시간 캐릭터 표정변환 기법)

  • Bae, Dong-Hee;Kim, Jin-Mo;Yun, Do-Kyung;Cho, Hyung-Je
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2011.06a
    • /
    • pp.434-437
    • /
    • 2011
  • 최근 단일 프로세서의 성능 개선이 한계에 이르고, 이에 따라 데이터 병렬 처리를 통한 시스템 성능 개선에 관한 연구가 활발히 진행되고 있다. 또한 이러한 변화로 인해 영상처리 분야에서도 대규모 연산의 병렬 컴퓨팅 수행에 관한 연구가 꾸준히 진행되고 있으며 하드웨어 또한 발전하여 실시간 시스템에 영상처리 분야가 많이 활용되고 있다. 본 논문에서는 캐릭터의 감정 상태에 따른 표정을 영상처리 분야에서 많이 사용되고 있는 이미지 워핑 기법을 적용하여 변화시킨다. 인간이 표현할 수 있는 기본적인 감정에 따른 표정을 데이터베이스로 정리하여 캐릭터에게 임의의 감정값이 주어지면 그에 맞는 표정을 데이터베이스에서 선택하여 사용자가 설정한 프레임만큼 워핑을 수행한다. 하지만 매 프레임에 대해 정해져 있는 제어선에 따라 움직이는 픽셀들의 워핑 연산은 그 계산량이 너무 많아 실시간으로 처리하기에 여러 가지 제약이 뒤따른다. 따라서 이를 실시간으로 처리하기 위해 NVIDIA의 CUDA를 활용한 데이터 병렬처리를 수행하여 실시간 처리가 가능하게 하는 방법을 제안하고, 실험을 통해 그 유용성을 제시한다.

An Advanced Parallel Join Algorithm for Managing Data Skew on Hypercube Systems (하이퍼큐브 시스템에서 데이타 비대칭성을 고려한 향상된 병렬 결합 알고리즘)

  • 원영선;홍만표
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.30 no.3_4
    • /
    • pp.117-129
    • /
    • 2003
  • In this paper, we propose advanced parallel join algorithm to efficiently process join operation on hypercube systems. This algorithm uses a broadcasting method in processing relation R which is compatible with hypercube structure. Hence, we can present optimized parallel join algorithm for that hypercube structure. The proposed algorithm has a complete solution of two essential problems - load balancing problem and data skew problem - in parallelization of join operation. In order to solve these problems, we made good use of the characteristics of clustering effect in the algorithm. As a result of this, performance is improved on the whole system than existing algorithms. Moreover. new algorithm has an advantage that can implement non-equijoin operation easily which is difficult to be implemented in hash based algorithm. Finally, according to the cost model analysis. this algorithm showed better performance than existing parallel join algorithms.

Uniform Load Distribution Using Sampling-Based Cost Estimation in Parallel Join (병렬 조인에서 샘플링 기반 비용 예측 기법을 이용한 균등 부하 분산)

  • Park, Ung-Gyu
    • The Transactions of the Korea Information Processing Society
    • /
    • v.6 no.6
    • /
    • pp.1468-1480
    • /
    • 1999
  • In database systems, join operations are the most complex and time consuming ones which limit performance of such system. Many parallel join algorithms have been proposed for the systems. However, they did not consider data skew, such as attribute value skew (AVS) and join product skew (JPS). In the skewness environments, performance of framework for a uniform load distribution and an efficient parallel join algorithm using the framework to handle AVS and JPS. In our algorithm, we estimate data distributions of input and output relations of join operations using the sampling methodology and evaluate join cost for the estimated data distributions. Finally, using the histogram equalization method we distribute data among nodes to achieve good load balancing among nodes in the local joining phase. For performance comparison, we present simulation model of our algorithm and other join algorithms and present the result of some simulation experiments. The results indicate that our algorithm outperforms other algorithms in the skewed case.

  • PDF

AS B-tree: A study on the enhancement of the insertion performance of B-tree on SSD (AS B-트리: SSD를 사용한 B-트리에서 삽입 성능 향상에 관한 연구)

  • Kim, Sung-Ho;Roh, Hong-Chan;Lee, Dae-Wook;Park, Sang-Hyun
    • The KIPS Transactions:PartD
    • /
    • v.18D no.3
    • /
    • pp.157-168
    • /
    • 2011
  • Recently flash memory has been being utilized as a main storage device in mobile devices, and flashSSDs are getting popularity as a major storage device in laptop and desktop computers, and even in enterprise-level server machines. Unlike HDDs, on flash memory, the overwrite operation is not able to be performed unless it is preceded by the erase operation to the same block. To address this, FTL(Flash memory Translation Layer) is employed on flash memory. Even though the modified data block is overwritten to the same logical address, FTL writes the updated data block to the different physical address from the previous one, mapping the logical address to the new physical address. This enables flash memory to avoid the high block-erase cost. A flashSSD has an array of NAND flash memory packages so it can access one or more flash memory packages in parallel at once. To take advantage of the internal parallelism of flashSSDs, it is beneficial for DBMSs to request I/O operations on sequential logical addresses. However, the B-tree structure, which is a representative index scheme of current relational DBMSs, produces excessive I/O operations in random order when its node structures are updated. Therefore, the original b-tree is not favorable to SSD. In this paper, we propose AS(Always Sequential) B-tree that writes the updated node contiguously to the previously written node in the logical address for every update operation. In the experiments, AS B-tree enhanced 21% of B-tree's insertion performance.