통합 검색 | Korea Science

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
- ETRI Journal
- /
- 제30권4호
- /
- pp.576-586
- /
- 2008
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.
PDF

연상 메모리의 자동설계에 관한 연구 (A Study on the Automatic Design of Content Addressable Memory)

김종선;백인천;박노경;차균현
- 한국통신학회논문지
- /
- 제15권10호
- /
- pp.857-867
- /
- 1990
CAM은 RAM이나 PLA 처럼 규칙적인 구조를 갖고 있으므로 CAM 자동설계 프로그램을 제작하기 용이하다. 본 프로그램은 CIF 형태로 그 결과가 출력되고 수정 작업이나 결과를 화면에 보기 위해 IBM/PC 상에서 디스플레이 프로그램을 개발하였다. CIF 파저는 YACC와 LEX로 제작하였고, 플롯팅을 위해서는 ROLAND XY 플롯터를 사용하였다. 위의 과정을 하나의 메뉴안에서 선택에 따라 수행하도록 Full-Down 메뉴를 사용하여 종합하였다.
PDF

TTL 시뮬레이터의 콤파일러 설계에 관한 연구 (A Study on the design of the compiler for a TTL simulator)

신철재;김용득
- 대한전자공학회논문지
- /
- 제14권2호
- /
- pp.17-27
- /
- 1977
본 논문에서는 특수 고적의 소형 전자계산기를 단일 뻐스선 방법에 의한 TTL 집적회로로 설계하였으며, 각 지령어의 길이를 16비트르 구성함으로써 더욱 간편한 콤파일러를 연구하였다. 또한 본 실험에서 160nsec의 기본 주기를 택함으로써, TTL IC의 동작시간에 따른 최적치와 기억소자의 악세스 타임이 같게 되었고, 구성 회로도 간단하게 만들어졌으며, 모두 프로그램도 만족하게 처리되었다.
PDF

스크래치패드 메모리를 위한 데이터 관리 기법 리뷰 (A Review of Data Management Techniques for Scratchpad Memory)

조두산
- 문화기술의 융합
- /
- 제9권1호
- /
- pp.771-776
- /
- 2023
스크래치패드 메모리는 소프트웨어 제어 온칩 메모리로서 기존의 캐시 메모리의 단점을 완화할 수 있게 설계되어 이용되고 있다. 기존의 캐시 메모리는 태그 관련 하드웨어 제어 로직이 있어 캐시 미스를 사용자가 직접 제어할 수 없으며, 사이즈가 크고 에너지 소모량이 상대적으로 많다. 스크래치패드 메모리는 이러한 하드웨어 오버헤드를 제거하였기 때문에 사이즈, 에너지 소모량에서 장점이 있으나 데이터 관리를 소프트웨어가 해야하는 부담이 존재한다. 본 연구에서는 스크래치패드 메모리의 데이터 관리 기법들을 분류하여 살펴보고 그 장점을 극대화할 수 있는 방안에 대하여 논의하였다.
https://doi.org/10.17703/JCCT.2023.9.1.771 인용 PDF

마이크로컨트롤러 환경에서 타깃 바이너리 파일 분석을 통한 최대 스택 메모리 사용량 예측 기법 (Maximum Stack Memory Usage Estimation Through Target Binary File Analysis in Microcontroller Environment)

최기호;김성섭;박대진;조정훈
- 대한임베디드공학회논문지
- /
- 제12권3호
- /
- pp.159-167
- /
- 2017
Software safety is a key issue in embedded system of automotive and aviation industries. Various software testing approaches have been proposed to achieve software safety like ISO26262 Part 6 in automotive environment. In spite of one of the classic and basic approaches, stack memory is hard to estimating exactly because of uncertainty of target code generated by compiler and complex nested interrupt. In this paper, we propose an approach of analyzing the maximum stack usage statically from target binary code rather than the source code that also allows nested interrupts for determining the exact stack memory size. In our approach, determining maximum stack usage is divided into three steps: data extraction from ELF file, construction of call graph, and consideration of nested interrupt configurations for determining required stack size from the ISR (Interrupt Service Routine). Experimental results of the estimation of the maximum stack usage shows proposed approach is helpful for optimizing stack memory size and checking the stability of the program in the embedded system that especially supports nested interrupts.
https://doi.org/10.14372/IEMEK.2017.12.3.159 인용 PDF KSCI

진보된 캘린더 큐 스케줄러 설계방법론 (Advanced Calendar Queue Scheduler Design Methodology)

김진실;정원영;이정희;이용석
- 한국통신학회논문지
- /
- 제34권12B호
- /
- pp.1380-1386
- /
- 2009
본 논문에서는 홈 네트워크에서 멀티미디어와 타이밍 트래픽을 처리하기 위해 디자인 된 CQS(Calendar Queue Scheduler)를 제안한다. VoIP, VOD, IPTV, 최선형(Beat-efforts) 트래픽 등 가택으로 유입되는 다양한 속성을 지닌 트래픽의 증가로 가택 내 QoS(Quality of Service) 관리의 필요성이 논의되고 있다. 이러한 제한된 환경에서 성공적으로 QoS를 보장하기 위해서는 각 애플리케이션이나 서비스 단위로 그룹을 형성하여 관리하는 것이 효과적이다. 본 연구에서는 단대단(end-to-end) QoS 측면에서 수신측 말단에 해당하는 홈 게이트웨이를 목표로 제한된 자원내에서 멀티미디어 및 타이밍 트래픽 처리와 큐 사이즈를 최적화시킨 CQS아키텍처를 하드웨어로 제안하였다. 또한, 각각의 모듈과 각각의 메모리에 대한 면적을 시뮬레이션하였다. Synopsys Design Compiler를 사용하여 Magnachip 0.18 CMOS 라이브러리로 합성하였을 때 각 모듈의 면적은 NAND($2{\times}1$) 게이트(11.09)를 기준으로 하였다. Memory의 비중이 전체 CQS에서 85.38%를 나타내고 있음을 알 수 있었다. 각 메모리 사이즈의 크기를 CACTI 5.3(단위는 mm^2)을 통하여 추출하였다. 메모리의 entry가 증가함에 따라 메모리 area의 증가 폭은 점점 더 증가하므로, 1 year 에 해당하는 day size의 결정이 전체 CQS 면적에 절대적인 영향을 미치게 된다. 본 논문에서 CQS를 하드웨어로 설계할 때 각 모듈의 설계 방법론과 각 모듈의 동작에 대하여 논하였다.
PDF KSCI

32 Bit RISC Core modeling using SystemC

최홍미;박성모
- 대한전자공학회:학술대회논문집
- /
- 대한전자공학회 2002년도 하계종합학술대회 논문집(2)
- /
- pp.325-328
- /
- 2002
In this paper, we present a SystemC model of a 32-Bit RISC core wi)ich is based on the ARMTTDMI architecture. The RISC core model was first modeled in C for architecture verification and then refined down to a level that allows concurrent behavior lot hardware timing using the SystcmC class library. It was driven in timed functional level that uses handshake protocol. It was compiled using standard C++ compiler. The functional simulation result was verified by comparing the contents of memory, the result of execution with the result from the ARMulator of ADS(Arm Developer Suite).
PDF

다중스레드 구조에서 함수 언어 루프의 효과적 실행 (The Efficient Execution of Functional Language Loops on the Multithreaded Architectures)

하상호
- 한국정보처리학회논문지
- /
- 제7권3호
- /
- pp.962-970
- /
- 2000
Multithreading is attractive in that it can tolerate memory latency and synchronization by effectively overlapping communication with computation. While several compiler techniques have been developed to produce multithreaded codes from functional languages programs, there still remains a lot of works to implement loops effectively. Executing lops in a style of multithreading usually causes some overheads, which can reduce severely the effect of multirheading. This paper suggests several methods in terms of architectures or compilers which can optimize loop execution by multithreading. We then simulate and analyze them for the matrix multiplication program.
PDF

A Study on Effect of Code Distribution and Data Replication for Multicore Computing Architectures

Cho, Doosan
- International Journal of Advanced Culture Technology
- /
- 제9권4호
- /
- pp.282-287
- /
- 2021
A multicore system must be able to take full advantage of the program's instruction and data parallelism. This study introduces the data replication technique as a support technique to maximize the program's instruction and data parallelism. Instruction level parallelism can be limited by data dependency. In this case, if data is replicated to each processor core and used, instruction level parallelism can be used to the maximum. The technique proposed in this study can maximize the performance improvement effect when applied to scientific applications such as matrix multiplication operation.
https://doi.org/10.17703/IJACT.2021.9.4.282 인용 PDF KSCI

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Huangfu, Yijie;Zhang, Wei
- Journal of Computing Science and Engineering
- /
- 제11권2호
- /
- pp.69-77
- /
- 2017
Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically have considerably less temporal/spatial locality. Moreover, the L1 data cache is used by many threads that access a data size typically considerably larger than the L1 cache, making it critical to bypass L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompiling programs. Moreover, we introduce a hybrid method that integrates static profiling information and hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and the hybrid method can achieve performance comparable to state-of-the-art compiler-based bypassing with considerably less profiling cost.
https://doi.org/10.5626/JCSE.2017.11.2.69 인용 PDF KSCI

검색결과 82건 처리시간 0.021초

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

자세히 찾기

이미지 검색 (β)