Search | Korea Science

A new warp scheduling technique for improving the performance of GPUs by utilizing MSHR information (GPU 성능 향상을 위한 MSHR 정보 기반 워프 스케줄링 기법)

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- The Journal of Korean Institute of Next Generation Computing / v.13 no.3 / pp.72-83 / 2017
GPUs provide high throughput by executing many warps in parallel to hide latency. The MSHRs (Miss Status Holding Registers) of the L1 data cache track cache miss requests until the required data is returned from lower-level memory. In recent GPUs, excessive requests for cache resources cause cache resource reservation failures, which leave GPU resources underutilized. In this paper, we propose a new warp scheduling technique that reduces stall cycles under MSHR resource shortage. The cache miss rate of each warp is predicted based on the observation that each warp shows a similar cache miss rate over long periods. When the MSHR is full, warps with low miss rates or computation-intensive warps are given high priority for issue. Our proposal improves GPU performance by using cache resources more efficiently, based on cache miss rate prediction and monitoring of the MSHR entries. According to our experimental results, the proposed scheduling technique reduces reservation fail cycles by 25.7% and increases IPC by 6.2% compared to the loose round-robin scheduler.
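
The scheduling idea in this abstract can be illustrated with a small Python sketch. The abstract does not describe the predictor or the hardware interface, so everything here is an assumption made for illustration: the `Warp` record, the hit/miss counters used as the miss-rate predictor, and the `pick_warp` function are hypothetical names, not the authors' design.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    wid: int          # assumed to equal the warp's index in the warp list
    hits: int = 0
    misses: int = 0
    ready: bool = True

    def predicted_miss_rate(self) -> float:
        # Assumption: predict the future miss rate from the warp's own history.
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def pick_warp(warps, last_issued, mshr_used, mshr_size):
    """Return the id of the next warp to issue, or None if none is ready."""
    n = len(warps)
    # Loose round-robin order: start just after the warp issued last.
    order = [warps[(last_issued + 1 + i) % n] for i in range(n)]
    ready = [w for w in order if w.ready]
    if not ready:
        return None
    if mshr_used >= mshr_size:
        # MSHR full: prefer the warp least likely to miss.
        # min() keeps the round-robin order among ties.
        return min(ready, key=lambda w: w.predicted_miss_rate()).wid
    return ready[0].wid
```

With free MSHR entries the sketch behaves like plain loose round-robin; only when all entries are occupied does the predicted miss rate decide the order.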

Performance and Power Consumption Improvement of Embedded RISC Core (임베디드 RISC 코어의 성능 및 전력 개선)

Jung, Hong-Kyun;Ryoo, Kwang-Ki
- Journal of the Korea Institute of Information and Communication Engineering / v.14 no.2 / pp.453-461 / 2010
This paper presents a branch prediction algorithm and a 4-way set-associative cache to improve the performance of an embedded RISC core, and a clock-gating algorithm using ODC (Observability Don't Care) operations to reduce the core's power consumption. The branch prediction algorithm is built around a BTB (Branch Target Buffer), and the 4-way set-associative cache has a lower miss rate than a direct-mapped cache. A pseudo-LRU policy, one of the line replacement policies, is used to decrease the number of bits that store the LRU value. The clock-gating algorithm reduces dynamic power consumption. Performance and dynamic power estimates show that the performance of the OpenRISC core with the proposed architecture improves by about 29%, and the dynamic power of the core using the Chartered 0.18 µm technology library is reduced by 16%.
https://doi.org/10.6109/jkiice.2010.14.2.453
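
As an illustration of why a pseudo-LRU policy saves state bits, the sketch below models one 4-way set with the common tree-based pseudo-LRU scheme, which needs 3 bits per set, whereas encoding an exact LRU order of four ways needs at least 5. The abstract does not say which pseudo-LRU variant the paper uses, so the tree scheme and the `TreePLRUSet` class are assumptions.

```python
class TreePLRUSet:
    """One 4-way cache set tracked with 3 tree pseudo-LRU bits."""

    def __init__(self):
        # bits[0]: root (0 -> victim is in ways 0-1, 1 -> victim is in ways 2-3)
        # bits[1]: selects between way 0 and way 1
        # bits[2]: selects between way 2 and way 3
        self.bits = [0, 0, 0]

    def touch(self, way: int) -> None:
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1          # next victim should come from ways 2-3
            self.bits[1] = 1 - way    # way 0 -> point at way 1, and vice versa
        else:
            self.bits[0] = 0          # next victim should come from ways 0-1
            self.bits[2] = 3 - way    # way 2 -> point at way 3, and vice versa

    def victim(self) -> int:
        """Follow the bits from the root to the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]
```

After touching ways 0, 2, 1, 3 in that order, `victim()` selects way 0, which matches exact LRU for this sequence; in general the tree only approximates the true LRU order.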

A Research of Extension Buffer Cache Management used Nand- flash based SSD (Nand-Flash 기반의 SSD를 이용한 확장 버퍼 캐쉬 관리 기법 연구)

Oh, Kyung-Hwan;Bong, Sun-Jong;Kim, Kyung-Tae;Youn, Hee-Young
- Proceedings of the Korean Society of Computer Information Conference / 2014.07a / pp.235-236 / 2014
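As flash memory technology has advanced and NAND flash-based SSDs have become commercially available, research on reducing I/O time has been under way. This paper proposes a model that uses an SSD as an extended buffer cache between main memory and the storage device, classifies the pages evicted from main memory, and groups pages with similar characteristics into blocks. With this model, the SSD, which is used in units of blocks, is utilized efficiently, improving read and write performance and reducing I/O time, thereby demonstrating an overall performance improvement.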

Sharing Pattern Analysis of an OLTP Application

Lee, Kangwoo;Kim, Hiecheol
- Journal of Korea Society of Industrial Information Systems / v.7 no.5 / pp.121-128 / 2002
Although multiprocessor systems have been widely used in recent years to run commercial workloads, their data sharing patterns have rarely been explored because of several difficulties. In this paper, we perform an in-depth sharing pattern analysis of a representative OLTP application, the TPC-B benchmark, running on a cache-coherent shared-memory multiprocessor system. In addition, to illustrate the effect of these patterns on performance, the number of cache misses was measured for various numbers of processors, cache sizes, and cache block sizes. From these measurements, we found that the shared data in TPC-B largely exhibits sharing characteristics quite different from those in scientific applications.

Formal Verification of RACE Protocol Using VIS (VIS를 이용한 RACE 포로토콜의 정형검증)

Um, Hyun-Sun;Choi, JIn-Young;Han, Woo-Jong;Ki, An-Do;Shim, Kyu-Hyun
- The Transactions of the Korea Information Processing Society / v.7 no.7 / pp.2219-2228 / 2000
Caches in a multiprocessing environment introduce the cache coherence problem. When multiple processors maintain locally cached copies of a unique shared-memory location, any local modification of the location can result in a globally inconsistent view of memory. Cache coherence protocols are essential for operating a shared-memory multiprocessor system efficiently and correctly. Since random testing and simulation are not enough to validate the correctness of protocols, it is necessary to develop efficient and reliable verification methods. In this paper, we present our experience in using VIS (Verification Interacting with Synthesis), a formal-methods tool, to analyze a number of properties of a cache coherence protocol, RACE (Remote Access Cache coherent Enforcement).
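
The coherence problem described in this abstract can be reproduced with a toy model. The sketch below is not the RACE protocol and is not related to VIS; the `Cache` class, the write-through memory, and the `run` helper are hypothetical simplifications that only show how a stale copy arises and how a write-invalidate step removes it.

```python
class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}          # address -> locally cached value
        self.peers = []          # other caches, used only when invalidating

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, invalidate):
        self.lines[addr] = value
        self.memory[addr] = value            # write-through, for simplicity
        if invalidate:
            for peer in self.peers:          # write-invalidate: drop stale copies
                peer.lines.pop(addr, None)


def run(invalidate):
    """Processor A writes a shared location; return what processor B then reads."""
    memory = {0x10: 1}
    a, b = Cache(memory), Cache(memory)
    a.peers, b.peers = [b], [a]
    a.read(0x10)
    b.read(0x10)                             # both caches now hold the value 1
    a.write(0x10, 2, invalidate)
    return b.read(0x10)
```

Without invalidation, `run(False)` returns the stale value 1 even though memory holds 2; with invalidation, `run(True)` returns 2.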

Performance Analysis of the Parallel CUPID Code for Various Parallel Programming Models in Symmetric Multi-Processing System (Symmetric Multi-Processing 시스템에서 다양한 병렬 기법 모델을 적용한 병렬 CUPID 코드의 성능분석)

Jeon, Byoung Jin;Lee, Jae Ryong;Yoon, Han Young;Choi, Hyoung Gwon
- Transactions of the Korean Society of Mechanical Engineers B / v.38 no.1 / pp.71-79 / 2014
The parallelization of the bi-conjugate gradient solver for the pressure equation of the CUPID (component unstructured program for interfacial dynamics) code, which was developed for analyzing the components of a pressurized water-cooled reactor, was studied on a symmetric multi-processing system. The parallel performance was investigated for three typical parallel programming models (MPI, OpenMP, Hybrid) by solving incompressible backward-facing step flow at various grid resolutions. It was confirmed that parallel performance was low when the problem size was small or when the memory requirement of each thread was considerably larger than the cache memory. Furthermore, it was shown that MPI performed better than OpenMP regardless of the problem size, and that Hybrid performed best when the number of threads was relatively small.
https://doi.org/10.3795/KSME-B.2014.38.1.071

Analysis on the Performance and Temperature of the 3D Quad-core Processor according to Cache Organization (캐쉬 구성에 따른 3차원 쿼드코어 프로세서의 성능 및 온도 분석)

Son, Dong-Oh;Ahn, Jin-Woo;Choi, Hong-Jun;Kim, Jong-Myon;Kim, Cheol-Hong
- Journal of the Korea Society of Computer and Information / v.17 no.6 / pp.1-11 / 2012
As process technology scales down, multi-core processors face serious problems such as increased interconnection delay, high power consumption, and thermal problems. To solve these problems in 2D multi-core processors, researchers have focused on the 3D multi-core processor architecture. Compared to the 2D multi-core processor, the 3D multi-core processor decreases interconnection delay by significantly reducing wire length, since cores on different layers are connected using vertical through-silicon vias (TSVs). However, the power density in the 3D multi-core processor increases dramatically compared to that in the 2D multi-core processor, because multiple cores are stacked vertically. Increased power density causes thermal problems, resulting in high cooling cost and a negative impact on reliability. Therefore, temperature should be considered together with performance in designing 3D multi-core processors. In this work, we analyze the temperature of the cache in quad-core processors while varying the cache organization. Then, we propose a low-temperature cache organization to overcome the thermal problems. Our evaluation shows that the peak temperature of the instruction cache is lower than the threshold, while the peak temperature of the data cache is higher than the threshold when the cache is composed of many ways. According to the results, our proposed cache organization not only efficiently reduces the peak temperature but also reduces the performance degradation for 3D quad-core processors.
https://doi.org/10.9708/jksci.2012.17.6.001

NAND Flash Memory System Management for Lifetime Extension (낸드 플래시 메모리 시스템의 Lifetime 증대를 위한 관리 방법 설계)

Park, Yi-Hyun;Lee, Jae-Bin;Kim, Geon-Myung;Lim, Seung-Ho
- Proceedings of the Korea Information Processing Society Conference / 2019.05a / pp.23-25 / 2019
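NAND flash memory is a device for mass storage in computer systems, and the main driver of its capacity growth has been higher integration density, achieved by increasing the number of bits stored per memory cell. However, this higher density brings more errors, which sharply degrades reliability, the most important property of a storage device, and shortens the device's lifetime. To extend the lifetime of NAND flash storage, an approach has previously been applied that gradually converts part of the data area into ECC area according to the P/E cycle count. This approach creates various overheads, because the shrinking data area causes a mismatch between the host and flash data management sizes handled inside the storage device. In this study, for the approach of extending lifetime by converting data area into ECC area according to P/E cycles, we designed a cache management structure and a mapping management structure to reduce the overhead. When applied to NAND flash-based storage, this design is expected to mitigate, to some degree, the degradation in ordinary read and write performance that can occur when the ECC is extended into the data area to increase lifetime.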
https://doi.org/10.3745/PKIPS.y2019m05a.23

A Prefetch Architecture with Efficient Branch Prediction for a 64-bit 4-way Superscalar Microprocessor (64비트 4-way 수퍼스칼라 마이크로프로세서의 효율적인 분기 예측을 수행하는 프리페치 구조)

문상국;문병인;이용환;이용석
- The Journal of Korean Institute of Communications and Information Sciences / v.25 no.11B / pp.1939-1947 / 2000
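This paper presents a method for efficient instruction fetch that, instead of using the full branch target address, indexes with a small number of bits in the cache memory and can deliver up to four instructions to the next pipeline stage within one clock cycle. The prefetch unit is divided into three major parts: a predecode block, which partially decodes instructions in advance with respect to branches; an instruction cache block extended with a next target address (NTA) table; and a prefetch block, which controls the whole unit and manages virtual addresses. The instruction set follows SPARC (Scalable Processor ARChitecture) V9, and the design was described and verified at the functional level in Verilog-HDL (Hardware Description Language). The implemented prefetch unit can deliver up to four instructions per cycle to the next pipeline stage under correct prediction, even when branches are present in the instruction stream. In addition, with the same number of register bits, the NTA-based method can store about twice as many branch instruction addresses as a method using a BTB (Branch Target Buffer).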

A Distributed VOD Server Based on Virtual Interface Architecture and Interval Cache (버추얼 인터페이스 아키텍처 및 인터벌 캐쉬에 기반한 분산 VOD 서버)

Oh, Soo-Cheol;Chung, Sang-Hwa
- Journal of KIISE:Computer Systems and Theory / v.33 no.10 / pp.734-745 / 2006
This paper presents a PC cluster-based distributed VOD server that minimizes the load on the interconnection network by adopting the VIA communication protocol and the interval cache algorithm. Video data is distributed across the disks of the distributed VOD server, and each server node receives the data through the interconnection network and sends it to clients. The load on the interconnection network increases because of the large amount of video data transferred. This paper develops a distributed VOD file system based on VIA to minimize the cost of using the interconnection network when accessing remote disks. VIA is a user-level communication protocol that removes the overhead of TCP/IP. This paper also improves the performance of the interconnection network by expanding the maximum transfer size of VIA. In addition, the interval cache reduces traffic on the interconnection network by caching, in main memory, the video data transferred from the disks of remote server nodes. Experiments on a four-node PC cluster showed a maximum performance improvement of 21.3% compared with a distributed VOD server without VIA and the interval cache.
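
The interval cache mentioned in this abstract builds on a general idea: when two streams play the same video, the data between the leading and the following stream can be kept in memory so that the follower never touches the disk, and short intervals are cached first because they serve a stream for the least memory. The sketch below shows only that selection step. How the paper applies it to remote disks is not stated in the abstract, so the `choose_intervals` function, its block-based positions, and its arguments are assumptions.

```python
def choose_intervals(streams, cache_blocks):
    """Pick which streams to serve from the interval cache.

    streams: list of (stream_id, video_id, position_in_blocks).
    Returns the set of stream ids that read from memory instead of disk.
    """
    by_video = {}
    for sid, video, pos in streams:
        by_video.setdefault(video, []).append((pos, sid))

    intervals = []  # (size_in_blocks, following_stream_id)
    for members in by_video.values():
        members.sort(reverse=True)           # stream furthest ahead first
        for (lead_pos, _), (follow_pos, follow_sid) in zip(members, members[1:]):
            intervals.append((lead_pos - follow_pos, follow_sid))

    cached, used = set(), 0
    for size, follow_sid in sorted(intervals):   # smallest interval first
        if used + size > cache_blocks:
            break                                # larger intervals cannot fit either
        cached.add(follow_sid)
        used += size
    return cached
```

For example, with streams at blocks 100, 90, and 40 of one video and at blocks 50 and 45 of another, a 20-block cache holds the 5-block and 10-block intervals and leaves the 50-block interval on disk.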

