• Title/Summary/Keyword: CPU time (CPU 시간)


Analysis of Worst Case DMA Response Time in Fixed-Priority Bus Arbitration Protocol (고정우선순위 버스 프로토콜 환경에서 DMA I/O 요구의 최악 응답시간 분석)

  • Hahn, Joo-Sun;Ha, Rhan;Min, Sang-Lyul
    • Proceedings of the Korean Information Science Society Conference / 1999.10c / pp.21-23 / 1999
  • In a fixed-priority bus protocol that assigns the highest priority to the CPU, DMA transfers are delayed whenever the bus requests of the CPU and a DMA controller collide. This paper proposes a technique for analyzing the worst case response time of DMA I/O requests in an environment where the CPU and multiple DMA controllers share the system bus. The proposed analysis consists of three steps. The first step derives the worst case bus usage pattern of each task running on the CPU. The second step combines these per-task patterns into the worst case bus usage pattern of the CPU as a whole. The final, third step derives the bus capacity available to the DMA controllers from the CPU's worst case bus usage pattern and computes the worst case response time of DMA I/O requests. Simulation results show that the proposed analysis yields safe response times within a 20% margin of overestimation for typical DMA transfer sizes.

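To make the third step concrete, here is a minimal sketch of how a fixed-point iteration can bound the response time of a DMA request when the CPU always wins arbitration. It is a sketch under stated assumptions, not the paper's model: `cpu_busy` and its burst/period parameters are hypothetical stand-ins for the worst case bus usage pattern produced by the first two steps.

```python
# Minimal sketch of a step-3 style fixed-point iteration: a DMA transfer of
# c_dma bus cycles is delayed by every cycle the higher-priority CPU holds
# the bus. `cpu_busy(w)` is assumed to give an upper bound on the CPU's bus
# cycles in any window of w cycles (the output of the first two steps).

def wcrt_dma(c_dma, cpu_busy, limit=10**6):
    """Iterate R = c_dma + cpu_busy(R) to a fixed point (or give up at limit)."""
    r = c_dma
    while r <= limit:
        r_next = c_dma + cpu_busy(r)
        if r_next == r:          # fixed point reached: a safe WCRT bound
            return r
        r = r_next
    return None                  # CPU bus demand too high for a bound <= limit

# Hypothetical periodic CPU pattern: at most `burst` bus cycles per `period`.
cpu_busy = lambda w, burst=40, period=100: -(-w // period) * burst  # ceil div
print(wcrt_dma(300, cpu_busy))   # -> 500 cycles under these assumptions
```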

Event Routing Scheme to Improve I/O Latency of SMP VM (SMP 가상 머신의 I/O 지연 시간 감소를 위한 이벤트 라우팅 기법)

  • Shin, Jungsub;Kim, Hagyoung
    • Journal of KIISE / v.42 no.11 / pp.1322-1331 / 2015
  • According to the hypervisor scheduler, a vCPU (virtual CPU) operates in one of two states: the running state and the stop state. When a vCPU is in the stop state, incoming events are delayed until that vCPU's state changes to the running state. The latency in handling such events sent to the vCPU is regarded as the I/O latency. Since an SMP (symmetric multiprocessing) VM (virtual machine) incorporates multiple vCPUs, the event latency on an SMP VM can vary according to the specific vCPU that receives the event. In this paper, we propose a new scheme named event routing that sends events according to the operation state of each vCPU to reduce the event latency on an SMP VM. We implemented the proposed event routing scheme in the Xen ARM hypervisor and confirmed the reduction of I/O latency by measuring the network RTT (round trip time) and the TCP bandwidth under a variety of testing conditions. The network RTT decreases by up to 94% and the TCP bandwidth increases by up to 35% when compared to native Xen ARM.
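
The routing decision itself is simple enough to sketch. The following is a toy Python model, not the Xen ARM implementation: `VCPU`, `route_event`, and the state names are invented for illustration. The essence is to prefer a vCPU that is already running, so the event does not wait for a scheduler wake-up.

```python
# A toy model of the routing decision (invented names, not the Xen code):
# deliver each event to a vCPU that is currently running, falling back to a
# default target when every vCPU of the VM is stopped.

RUNNING, STOPPED = "running", "stopped"

class VCPU:
    def __init__(self, vid):
        self.vid, self.state, self.pending = vid, STOPPED, []

def route_event(vcpus, event, default=0):
    """Prefer a running vCPU so the event avoids a scheduling delay."""
    target = next((v for v in vcpus if v.state == RUNNING), vcpus[default])
    target.pending.append(event)
    return target.vid

vcpus = [VCPU(i) for i in range(4)]
vcpus[2].state = RUNNING              # the hypervisor scheduler runs vCPU 2
print(route_event(vcpus, "net-rx"))   # -> 2, not a stopped vCPU
```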

Worst Case Timing Analysis for DMA I/O Requests in Real-time Systems (실시간 시스템의 DMA I/O 요구를 위한 최악 시간 분석)

  • Hahn Joosun;Ha Rhan;Min Sang Lyul
    • Journal of KIISE: Computer Systems and Theory / v.32 no.4 / pp.148-159 / 2005
  • We propose a technique for finding the worst case response time (WCRT) of a DMA request, which is needed in the schedulability analysis of a whole real-time system. The technique consists of three steps. In the first step, we find the worst case bus usage pattern of each CPU task. In the second step, we combine the worst case bus usage patterns of the CPU tasks to construct the worst case bus usage pattern of the CPU; this step considers not only the bus requests made by CPU tasks individually but also those due to preemptions among the CPU tasks. Finally, in the third step, we use the worst case bus usage pattern of the CPU to derive the WCRT of DMA requests, assuming the fixed-priority bus arbitration protocol. Experimental results show that the overestimation of the DMA response time by the proposed technique is within 20% for most DMA request sizes and that the percentage overestimation decreases as the DMA request size increases.
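
As a rough illustration of the second step, the sketch below bounds the CPU's bus demand in a window by summing per-task demands and charging extra bus cycles for each preemption. The periodic task model, the preemption-count bound, and `b_preempt` are simplifying assumptions, not the paper's exact formulation.

```python
# A hypothetical step-2 style combination: in any window of w cycles, each
# task with period t_i contributes at most ceil(w/t_i) jobs of b_i bus
# cycles, and every preemption is charged `b_preempt` extra bus cycles for
# the state (e.g. cache contents) the preempted task must reload.

from math import ceil

def cpu_busy(w, tasks, b_preempt):
    """Upper bound on the CPU's bus cycles in any window of length w.

    tasks: list of (t_i, b_i) = (period, worst case bus cycles per job)."""
    jobs = [ceil(w / t) for t, _ in tasks]
    direct = sum(n * b for n, (_, b) in zip(jobs, tasks))
    preemptions = sum(jobs) - max(jobs)   # crude bound on preemption count
    return direct + preemptions * b_preempt

tasks = [(100, 20), (250, 60)]            # hypothetical task set
print(cpu_busy(300, tasks, b_preempt=5))  # 180 direct + 2 preemptions * 5
```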

Efficient Collaboration Method Between CPU and GPU for Generating All Possible Cases in Combination (조합에서 모든 경우의 수를 만들기 위한 CPU와 GPU의 효율적 협업 방법)

  • Son, Ki-Bong;Son, Min-Young;Kim, Young-Hak
    • KIPS Transactions on Computer and Communication Systems / v.7 no.9 / pp.219-226 / 2018
  • One systematic way to generate all possible cases of a combination is to construct a combination tree, whose time complexity is O(2^n). A combination tree is used for various purposes, such as the graph isomorphism problem and the initial model for calculating frequent item sets. However, algorithms that must search all cases of a combination are difficult to use in practice because of this high time complexity. Nevertheless, as the amount of data grows and more studies seek to exploit it, the need to search all possible cases keeps increasing. Recently, as the GPU environment has become widespread and easily accessible, various attempts have been made to reduce computation time by parallelizing algorithms that have high time complexity in a serial environment. Because the usual method of generating all cases of a combination is sequential and the sizes of its sub-tasks are heavily skewed, it is not well suited to parallel implementation; the efficiency of a parallel algorithm is maximized when all threads receive tasks of similar size. In this paper, we propose a method for efficient collaboration between the CPU and the GPU to parallelize the problem of generating all cases. To evaluate the performance of the proposed algorithm, we analyze its theoretical time complexity and compare its running time with other algorithms in CPU and GPU environments. Experimental results show that the proposed CPU-GPU collaboration algorithm maintains a balance between the execution times of the CPU and the GPU compared to previous algorithms, and that the improvement in execution time grows remarkably as the number of elements increases.
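
The balancing idea can be sketched briefly. In the hypothetical split below (illustrative only, not the paper's exact scheme), the CPU enumerates short prefixes of each combination and each GPU thread expands one prefix, so the work arrives as many small suffix subtasks rather than a few heavily skewed branches.

```python
# A hypothetical CPU/GPU split (illustrative, not the paper's exact scheme):
# the CPU emits fixed-length prefixes; each "GPU thread" expands one prefix
# into full k-combinations, so no single thread owns a giant skewed branch.

from itertools import combinations

def cpu_prefixes(n, k, depth):
    """CPU side: every length-`depth` prefix of a k-combination of range(n)."""
    return list(combinations(range(n), depth))

def gpu_expand(prefix, n, k):
    """Stand-in for one GPU thread: complete `prefix` to full combinations."""
    rest = range(prefix[-1] + 1, n)
    return [prefix + tail for tail in combinations(rest, k - len(prefix))]

n, k, depth = 8, 4, 2
work = cpu_prefixes(n, k, depth)                       # 28 small subtasks
total = sum(len(gpu_expand(p, n, k)) for p in work)
print(total == len(list(combinations(range(n), k))))   # True: all 70 found
```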

Latency Evaluation of CPU Idle Time Based Interrupt Processing on Pfair Multi-Core Scheduler (Pfair 멀티코어 스케줄러에서 CPU 유휴시간 기반의 인터럽트 처리 기법의 지연시간 평가)

  • Park, Sangsoo
    • Proceedings of the Korea Information Processing Society Conference / 2014.04a / pp.31-32 / 2014
  • Because a multi-core system can execute multiple instructions simultaneously, a single system can handle interrupts raised by external events while tasks are running. In a real-time system, where each task is subject to timing constraints, execution on the CPU cores must be controlled by the scheduler. This paper analyzes interrupt handling latency by quantitatively evaluating the per-core idle time of the Pfair multi-core scheduler, which is known to be optimal.
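
The quantity being evaluated can be approximated with a small sketch. Under Pfair, a task of weight w = e/p is granted close to w quanta per time unit, so the guaranteed idle core-quanta over a horizon are what remains for interrupt handling. The task set below is hypothetical, not the paper's testbed, and the model deliberately ignores per-quantum placement.

```python
# A minimal sketch of Pfair idle time (hypothetical task set): Pfair grants
# a task of weight w = e/p almost exactly w quanta per time unit, so the
# core-quanta left over on `cores` cores can absorb interrupt handling
# without disturbing the real-time tasks.

from fractions import Fraction

def idle_quanta(tasks, cores, horizon):
    """Idle core-quanta over `horizon` quanta on `cores` cores.

    tasks: list of (e, p); each task needs e quanta every p quanta."""
    demand = sum(Fraction(e, p) for e, p in tasks) * horizon
    return cores * horizon - float(demand)

tasks = [(2, 5), (1, 4), (3, 10)]    # utilizations 0.40 + 0.25 + 0.30 = 0.95
print(idle_quanta(tasks, cores=1, horizon=20))  # 1.0 quantum left for IRQs
```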

Evaluation of the Data Migration between CPU Memory and GPU Memory for a NVIDIA Pascal GPU Using Unified Memory (통합 메모리를 사용하는 NVIDIA 파스칼 GPU에서의 CPU 메모리와 GPU 메모리 간 데이터 통신 분석)

  • Shin, Philkyue;Hong, Seongsoo
    • Proceedings of the Korean Society of Computer Information Conference / 2018.07a / pp.7-10 / 2018
  • Unified memory is a software runtime environment that performs data migration between CPU memory and GPU memory implicitly and transparently to the developer, making CPU memory and GPU memory appear as one unified memory. Despite its advantages, unified memory is not yet widely used, because the implicit data migration is known to incur a large overhead. However, no study has yet analyzed how this data migration actually takes place and where the overhead arises. Targeting a GPU based on Pascal, one of NVIDIA's latest GPU microarchitectures, we experimentally analyze the conditions under which data migration occurs when unified memory is used and the effect of the migration on the execution time of GPU applications. The experiments show that the overhead of unified memory has two causes. First, with unified memory, whenever the CPU or the GPU accesses data, the data is migrated to CPU or GPU memory and previously migrated data is evicted; data that would have been reused is therefore evicted as well, causing additional migrations whose latency is added to the execution time of the GPU application. Second, with unified memory, data transfers and kernels cannot run concurrently even when they are assigned to different streams, so the execution time of the GPU application increases by the time of the transfers and kernels that would otherwise have overlapped.

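The two causes combine into a simple back-of-the-envelope cost model. The numbers and function names below are illustrative assumptions, not measurements from the paper: re-migrations of evicted-but-reused pages add latency, and the lack of copy/kernel overlap serializes the remaining transfers.

```python
# Back-of-the-envelope model of the two overheads (illustrative numbers,
# not measurements): evicted-but-reused pages trigger extra migrations, and
# migrations cannot overlap with kernel execution.

def unified_memory_time(kernel_ms, copy_ms, remigrations, page_ms):
    """No copy/kernel overlap, plus one extra migration per evicted reuse."""
    return kernel_ms + copy_ms + remigrations * page_ms

def explicit_copy_time(kernel_ms, copy_ms):
    """Explicit async copies on another stream can hide behind the kernel."""
    return max(kernel_ms, copy_ms)

print(unified_memory_time(10.0, 4.0, 50, 0.02))  # 15.0 ms, fully serialized
print(explicit_copy_time(10.0, 4.0))             # 10.0 ms, copy overlapped
```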

Fast and Efficient Implementation of Neural Networks using CUDA and OpenMP (CUDA와 OpenMP를 이용한 빠르고 효율적인 신경망 구현)

  • Park, An-Jin;Jang, Hong-Hoon;Jung, Kee-Chul
    • Journal of KIISE: Software and Applications / v.36 no.4 / pp.253-260 / 2009
  • Many algorithms for computer vision and pattern recognition have recently been implemented on the GPU (graphics processing unit) for faster computation. However, such implementations have two problems. First, the programmer must master the fundamentals of graphics shading languages, which require prior knowledge of computer graphics. Second, in a job that needs much cooperation between the CPU and the GPU, which is usual in image processing and pattern recognition, unlike in pure graphics work, the CPU must generate raw feature data for GPU processing as fast as possible to utilize GPU performance effectively. This paper proposes a faster and more efficient implementation of neural networks on both the GPU and a multi-core CPU. To solve the first problem, we use CUDA (compute unified device architecture), which can be programmed easily thanks to its simple C-like style, instead of graphics shading languages. Moreover, OpenMP (Open Multi-Processing) is used to process multiple data concurrently on the multi-core CPU, which keeps the GPU effectively supplied with data. In the experiments, we implemented a neural network-based text extraction system using the proposed architecture, and the computation was about 15 times faster than an implementation on the GPU alone without OpenMP.
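
The division of labor can be sketched with Python stand-ins; the paper's actual code is CUDA plus OpenMP, so the thread pool below merely plays OpenMP's role and `gpu_forward` plays the GPU's. Several CPU threads prepare feature batches concurrently while a single consumer runs the network on each ready batch.

```python
# Python stand-ins for the pipeline: a thread pool plays OpenMP's role on
# the multi-core CPU, and `gpu_forward` plays the GPU running the network.

from concurrent.futures import ThreadPoolExecutor

def extract_features(image):
    """CPU stage: raw feature data for one image (placeholder computation)."""
    return [pixel / 255.0 for pixel in image]

def gpu_forward(batch):
    """GPU stand-in: one forward pass over a prepared batch."""
    return [sum(features) for features in batch]

images = [[10, 200, 30], [40, 50, 255], [7, 8, 9], [100, 110, 120]]
with ThreadPoolExecutor(max_workers=4) as pool:       # OpenMP-style workers
    batch = list(pool.map(extract_features, images))  # parallel CPU stage
print(gpu_forward(batch))                             # GPU consumes the batch
```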

Molecular Docking System using Parallel GPU (병렬 GPU를 이용한 분자 도킹 시스템)

  • Park, Sung-Jun
    • The Journal of the Korea Contents Association / v.8 no.12 / pp.441-448 / 2008
  • A molecular docking system needs a large amount of computation and requires super-computing power. Since such experiments take a long time, they are usually conducted in a distributed or grid environment. Recently, research on using parallel GPUs, whose performance is far higher than that of CPUs, for scientific computing has been very active. CUDA is a technology that makes parallel GPU programming possible. This study proposes a molecular docking system using CUDA, together with an algorithm that parallelizes the energy minimization computation. To verify the approach, this study compares the time required for molecular docking on a general CPU with the time and performance of the proposed parallel GPU-based molecular docking.
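
The shape of the parallelized work can be sketched as follows. The scoring function here is a hypothetical Lennard-Jones-like placeholder, not the paper's docking energy: what matters is that the same evaluation applies independently to many candidate poses, which maps naturally onto one GPU thread per pose.

```python
# Hypothetical pose scoring (a Lennard-Jones-like placeholder, not the
# paper's energy function): every row of `poses` is scored independently,
# exactly the per-thread work a GPU kernel would receive.

import numpy as np

def energies(ligand, poses):
    """Score all poses at once; ligand: (n_atoms, 3), poses: (n_poses, 3)."""
    moved = ligand[None, :, :] + poses[:, None, :]   # (n_poses, n_atoms, 3)
    dist = np.linalg.norm(moved, axis=2)             # atom distances to site
    return (1.0 / dist**12 - 1.0 / dist**6).sum(axis=1)

ligand = np.random.rand(20, 3) + 1.0                 # 20-atom ligand
poses = np.random.rand(1000, 3)                      # 1000 rigid translations
print("best pose:", int(np.argmin(energies(ligand, poses))))
```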

The development of parallel computation method for the fire-driven-flow in the subway station (도시철도역사에서 화재유동에 대한 병렬계산방법연구)

  • Jang, Yong-Jun;Lee, Chang-Hyun;Kim, Hag-Beom;Park, Won-Hee
    • Proceedings of the KSR Conference / 2008.06a / pp.1809-1815 / 2008
  • This study simulated the fire-driven flow of an underground station using a parallel processing method. The fire analysis program FDS (Fire Dynamics Simulator), which uses LES (Large Eddy Simulation), was run on a 6-node parallel cluster, each node holding two 3.0 GHz CPUs. The simulation model was based on the Kwangju-geumnan underground subway station, and the total simulated time was set to 600 s. First, the whole underground passage was decomposed into one mesh (1-Mesh) and eight meshes (8-Mesh) to compare single-CPU and multi-CPU computation. With about 15×10^6 grid points, more than a single CPU can handle, the fire-driven flow from the center of the platform and from the train itself was analyzed. The results showed almost no difference between the single-CPU and multi-CPU solutions. A 3×10^6-grid-point case was employed to test the computing time: the 2-CPU and 7-CPU computations were two times and five times faster than the 1-CPU computation, respectively. This study confirmed that the limits of a single CPU can be overcome by parallel computation.


Manufacture of Dismantling Apparatus for Waste CPU Chip and Performance Evaluation (폐 CPU 칩의 해체장치 제작 및 성능 평가)

  • Joe, Aram;Park, Seungsoo;Kim, Boram;Park, Jaikoo
    • Resources Recycling / v.25 no.6 / pp.3-12 / 2016
  • In this study, Au distribution in F-PGA chip and W-BGA chip were examined to recover Au effectively from CPU chips. The result showed that 80.8% and 89.8% of Au exist in terminal of F-PGA chip and bare die of W-BGA chip, respectively. Based on the fact that Au exists in specific parts of the chips, an CPU chip dismantling apparatus was developed. The experimental variables were roller rotating speed, heat temperature of IR heater and heating time. Terminals of F-PGA chips were completely recovered under the temperature of $300^{\circ}C$ and the residence time of 90 s. Bare dies of W-BGA chips were completely recovered as well under the temperature of $300^{\circ}C$, the roller rotating rate of 90 rpm and the residence time of 90 s.