Integrated Search | Korea Science

Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance

곽재창
- 전기전자학회논문지 / Vol. 19, No. 3 / pp.455-459 / 2015
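This paper proposes a Dispatch Unit and an Operand Selection Unit for a SIMT-architecture GP-GPU that supports acceleration of general-purpose computation as well as graphics processing. If all operand information used by an instruction issued from the Warp Scheduler is decoded, unnecessary operand loads occur and place a load on the registers. To solve this problem, we propose a pre-decoding method that first decodes only the operand information, which reduces operand loads and the load on the registers. The operand information produced by the proposed Dispatch Unit is passed to an Operand Selection Unit that applies a method for preventing register bank conflicts, improving overall processing performance. Using Modelsim 10.0b, we measured the total number of clock cycles needed to process 10,000 random instructions issued from the Warp Scheduler. Applying the proposed Dispatch Unit with pre-decoding and the proposed Operand Selection Unit increased processing efficiency by about 11% and 24%, respectively, over existing methods.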
https://doi.org/10.7471/ikeee.2015.19.3.455

Consecutive Operand-Caching Method for Multiprecision Multiplication, Revisited

Seo, Hwajeong;Kim, Howon
- Journal of information and communication convergence engineering / Vol. 13, No. 1 / pp.27-35 / 2015
Multiprecision multiplication is the most expensive operation in public key-based cryptography. Therefore, many multiplication methods have been studied intensively for several decades. At the Workshop on Cryptographic Hardware and Embedded Systems 2011 (CHES2011), a novel multiplication method called 'operand caching' was proposed. This method reduces the number of required load instructions by caching the operands. However, it does not provide full operand caching when changing the row of partial products. To overcome this problem, a novel method called 'consecutive operand caching' was proposed at the Workshop on Information Security Applications 2012 (WISA2012). It divides a multiplication structure into partial products and reconstructs them to share common operands between previous and next partial products. However, there is still room for improvement; therefore, we propose a finely designed operand-caching mode to minimize useless memory accesses when the first row is changed. As a result, we reduce the number of memory access instructions and boost the speed of the overall multiprecision multiplication for public key cryptography.
https://doi.org/10.6109/jicce.2015.13.1.027

Performance Improvement of Operand Fetching with the Operand Reference Prediction Cache (ORPC)

김흥준;조경산
- 한국정보처리학회논문지 / Vol. 5, No. 6 / pp.1652-1659 / 1998
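To reduce operand reference latency and the bandwidth demand on the data cache, this paper proposes the Operand Reference Prediction Cache (ORPC), a structure that predicts operand values and address translation information at the instruction fetch stage and verifies the prediction early to minimize the performance loss caused by mispredictions. The prediction accuracy and performance improvement of three proposed ORPC organizations (ORPC1, ORPC2, ORPC3) were analyzed through trace-driven simulation of six benchmark programs. With 512 entries, ORPC2 and ORPC3 predict the correct operand for 45.3% of operand load references on average, reducing operand load time and the bandwidth demand on the data cache. In addition, ORPC3 provides address translation information for 98.1% of all operand references, taking over the function of the data TLB.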

Multi-Operand Radix-2 Signed-Digit Adder using Current Mode MOSFET Circuits

Sakamoto, Masahiro;Hamano, Daisuke;Higuchi, Yuuichi;Kiriya, Takechika;Morisue, Mititada
- 대한전자공학회:학술대회논문집 / 대한전자공학회 2000년도 ITC-CSCC -1 / pp.167-170 / 2000
This paper describes a novel multi-operand radix-2 signed-digit (SD) adder. The novel multi-operand addition algorithm eliminates the carry propagation chain by dividing the input operands into an even-place part and an odd-place part and adding each part separately. A multi-operand adder using this algorithm can add six operands in parallel and is faster than the ordinary binary tree of SD adders. A hardware model of the proposed adder, implemented in current-mode MOSFET circuit technology, is shown. SPICE simulations were carried out to verify the function of the proposed circuit.

An Improved Load Operand Referencing Scheme Using a Hybrid Predictor

최승교;조경산
- 한국정보처리학회논문지 / Vol. 7, No. 7 / pp.2196-2203 / 2000
As processor operating frequencies increase and processors execute multiple instructions per cycle, processor performance becomes more dependent on load operand referencing latency and data dependencies. To reduce operand fetch latency and to increase ILP by breaking data dependencies, we propose a value-address hybrid predictor that uses a reasonably sized prediction buffer, and we analyse the performance improvement it provides. In extensive simulation of 5 benchmark programs, the proposed hybrid prediction scheme accurately predicts 62.72% of all loads, which is 12.64% higher than the value prediction scheme, and shows its cost-effectiveness compared to the address prediction scheme. In addition, we analyse the performance improvement achieved by stride management and by the history of previous predictions.

Implementation of the noise eliminating operators of binary image

홍희경;조동섭
- 대한전기학회:학술대회논문집 / 대한전기학회 1988년도 전기.전자공학 학술대회 논문집 / pp.636-639 / 1988
This paper proposes operators that eliminate noise from a binary image. The image is read by a scanner, and an operand is selected according to the size of the input image. Noise in the input image is eliminated through dilation and erosion, elementary vector operations performed with the selected operand.
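As a rough illustration of noise removal by erosion and dilation, the sketch below applies a morphological opening (erosion followed by dilation) to a small binary image. The square 3x3 element, the function names, and the treatment of pixels outside the image as 0 are assumptions made for this example; the abstract does not specify how the paper selects its operand or orders the operations.

```python
def erode(img, k=3):
    """Binary erosion with a k x k square element (assumed shape).

    A pixel stays 1 only if every pixel under the element is 1.
    Pixels outside the image are treated as 0.
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)))
    return out


def dilate(img, k=3):
    """Binary dilation: a pixel becomes 1 if any pixel under the element is 1."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)))
    return out


def remove_noise(img, k=3):
    """Opening: erosion deletes specks smaller than the element,
    dilation then restores the surviving shapes."""
    return dilate(erode(img, k), k)


# A 3x3 block plus one isolated noise pixel at row 5, column 5.
clean = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(7)]
         for y in range(7)]
noisy = [row[:] for row in clean]
noisy[5][5] = 1
```

Here `remove_noise(noisy)` returns the image without the isolated pixel, because the speck is smaller than the element and does not survive erosion.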

Design of Partial Product Accumulator using Multi-Operand Decimal CSA and Improved Decimal CLA

이양;박태신;김강희;최상방
- 전자공학회논문지 / Vol. 53, No. 11 / pp.56-65 / 2016
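This paper proposes a tree structure that uses multi-operand decimal CSAs and an improved decimal CLA to reduce the area and delay of the reduction stage of a parallel decimal multiplier and thereby improve its performance. The proposed partial product reduction tree uses multi-operand decimal CSAs to reduce the decimal partial products quickly. In each CSA, restricting the range of the inputs to the recoder yields the simplest recoder logic. Because each CSA adds range-restricted decimal numbers at a specific position in a specific tree architecture, the partial product reduction stage can be carried out efficiently. In addition, the logic of the decimal CLA is improved so that the BCD result is obtained quickly. To evaluate its performance, the proposed decimal partial product reduction stage was synthesized with Design Compiler using SMIC's 180 nm CMOS process library. Compared with a reduction stage built with the conventional method, the proposed stage reduces delay by about 15.6% and area by about 16.2%. We also confirmed that the total delay and total area decrease even though the delay and area of the decimal CLA increase.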
https://doi.org/10.5573/ieie.2016.53.11.056

The Security Analysis of Previous CRT-RSA Scheme on Modified Opcode and Operand Attack

허순행;이형섭;이현승;최동현;원동호;김승주
- 정보보호학회논문지 / Vol. 19, No. 6 / pp.185-190 / 2009
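As CRT-RSA has come into widespread use, its security has become an important issue. Since Bellcore researchers showed in 1996 that CRT-RSA is vulnerable to fault injection attacks, many countermeasures have been proposed. The first was proposed by Shamir in 1999 and is based on an error-checking technique. After Shamir's countermeasure was introduced, many further countermeasures using error checking were proposed. However, Joey et al. showed in 2001 that Shamir's countermeasure is vulnerable to the modified operand attack, and Yen et al. showed in 2003 that the error-checking technique is also vulnerable to the modified opcode attack. Yen et al. therefore proposed a new countermeasure that uses an error-propagation technique instead of error checking, but in 2007 Yen and Kim showed that this countermeasure is not secure either. More recently, Kim et al. proposed a new countermeasure that strengthens the one by Yen et al., and Ha et al. also proposed a countermeasure based on error propagation. However, the existing countermeasures, including those of Kim et al. and Ha et al., have not been proven secure against the modified opcode attack. This paper therefore analyzes the security of the countermeasures proposed so far against both the modified operand attack and the modified opcode attack.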
https://doi.org/10.13089/JKIISC.2009.19.6.185
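To make the fault-injection weakness of unprotected CRT-RSA concrete, the sketch below signs a message with CRT and then shows how a single faulty signature can expose a prime factor through one gcd computation, in the style of the 1996 Bellcore result mentioned in the abstract. The toy key (p = 61, q = 53, e = 17, d = 2753), the function names, and the "off by one" fault model are assumptions chosen for illustration. The code does not model the operand or opcode modification attacks analyzed in the paper, nor any of the countermeasures. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
from math import gcd


def crt_sign(m, p, q, d, fault_in_q=False):
    """CRT-RSA signature of m with no countermeasure.

    If fault_in_q is set, the mod-q half is corrupted to model an
    injected fault (hypothetical fault model: the result is off by one).
    """
    sp = pow(m, d % (p - 1), p)   # half exponentiation modulo p
    sq = pow(m, d % (q - 1), q)   # half exponentiation modulo q
    if fault_in_q:
        sq = (sq + 1) % q
    q_inv = pow(q, -1, p)         # q^-1 mod p (Python 3.8+)
    # Garner recombination: result is sp mod p and sq mod q.
    return sq + q * (((sp - sq) * q_inv) % p)


def factor_from_faulty_signature(m, faulty_sig, e, n):
    """The faulty signature is still correct mod p but wrong mod q,
    so faulty_sig^e - m is a multiple of p only."""
    return gcd((pow(faulty_sig, e, n) - m) % n, n)


# Toy parameters, far too small for real use.
p, q, e, d = 61, 53, 17, 2753
n = p * q
m = 65

good = crt_sign(m, p, q, d)
bad = crt_sign(m, p, q, d, fault_in_q=True)
recovered = factor_from_faulty_signature(m, bad, e, n)
```

With these values `recovered` equals `p`, which is why countermeasures either check the result before releasing it or arrange for a fault to spread through the whole output.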

The CORDIC Circuit of Redundant Signed Binary Number

김승열;김용대;한선경;유영갑
- 전자공학회논문지CI / Vol. 40, No. 6 / pp.1-8 / 2003
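We propose a CORDIC circuit based on redundant signed numbers that has no global carry propagation. This number format uses a new recoding scheme similar to Booth recoding and effectively solves the carry propagation problem in addition and subtraction. The design adopts a pipelined architecture that computes trigonometric functions with a constant scale factor. The operating time of this CORDIC circuit is constant regardless of the chosen operand bit width.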
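For readers unfamiliar with CORDIC, the sketch below shows the conventional rotation-mode iteration in ordinary floating-point arithmetic. It is not the redundant signed-number circuit of the paper; the function name and iteration count are assumptions. It does show why the scale factor is constant: every iteration rotates by plus or minus atan(2^-i), never by zero, so the accumulated gain does not depend on the input angle.

```python
import math


def cordic_sin_cos(theta, n=32):
    """Rotation-mode CORDIC; returns (sin(theta), cos(theta)).

    Converges for |theta| up to roughly 1.74 rad, the sum of all
    elementary rotation angles.
    """
    angles = [math.atan(2.0 ** -i) for i in range(n)]

    # Each step always rotates (d is +1 or -1), so the gain is the same
    # for every input and can be compensated once, up front.
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2^-i are plain shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


s, c = cordic_sin_cos(math.pi / 6)
```

Each iteration needs only additions, subtractions, and shifts, so the speed of the adder dominates the latency. That is the motivation for a carry-propagation-free number format.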

DSP Performance Maximization with Multisample Technique

Lee, Hosun;Lawrence K.W. Law;Youngyearl Han
- 대한전자공학회:학술대회논문집 / 대한전자공학회 2000년도 제13회 신호처리 합동 학술대회 논문집 / pp.471-474 / 2000
In this paper, we present a multisample DSP coding technique for the StarCore SC140 DSP. Multisample programming is a pipelining technique that exploits reuse of operands, both coefficients and variables, within a kernel. A coefficient or operand is loaded once from memory, and the value may then be used by multiple ALUs. It is possible to evaluate one intermediate product from each of four output sample calculations in parallel. Parallelization is therefore achieved by processing multiple samples in parallel rather than multiple intermediate products belonging to only one sample. Decreasing the number of memory moves per sample increases algorithm performance. In this paper, the multisample technique has been implemented in an FIR filter calculation using the Motorola StarCore DSP development tool.
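The idea of computing several output samples per coefficient load can be sketched in plain Python. This is only a model of the access pattern, under the assumption of four accumulators standing in for four ALUs. The function names and the load counter are hypothetical, and the sketch counts coefficient loads only, not the reuse of input samples that the abstract also mentions.

```python
def fir_reference(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[m] = 0 for m < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_multisample(x, h, block=4):
    """Compute `block` output samples together.

    Each coefficient is fetched once per block and reused by every
    accumulator, instead of once per output sample.
    Returns the outputs and the number of coefficient loads.
    """
    y = [0] * len(x)
    coeff_loads = 0
    for n0 in range(0, len(x), block):
        acc = [0] * block              # one accumulator per output sample
        for k, hk in enumerate(h):     # coefficient loaded once here ...
            coeff_loads += 1
            for j in range(block):     # ... and shared by all accumulators
                n = n0 + j
                if n < len(x) and n - k >= 0:
                    acc[j] += hk * x[n - k]
        for j in range(block):
            if n0 + j < len(x):
                y[n0 + j] = acc[j]
    return y, coeff_loads


x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [1, -2, 3]
y, loads = fir_multisample(x, h)
```

For these 8 samples and 3 taps the multisample loop performs 6 coefficient loads, whereas a one-sample-at-a-time loop would fetch a coefficient 24 times (once per tap per output).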
