• Title/Summary/Keyword: memory bottleneck

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun;Chung, Wonyoung;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences / v.37B no.9 / pp.795-805 / 2012
  • This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst bottleneck in multiprocessors with distributed memory architectures. In conventional systems, the pipelined broadcast algorithm takes advantage of the maximum bandwidth of the communication bus. However, because pipelined broadcast sends the data divided into many parts, unnecessary synchronization processes are repeated. In this paper, an MPI unit for a pipeline chain algorithm based on circuit switching, which removes the redundant synchronization, was designed, and the proposed architecture was evaluated by modeling it in SystemC. The performance of the proposed architecture for broadcast communication improved by up to 3.3 times over systems using the conventional pipelined broadcast algorithm, and it can use almost the maximum bandwidth of the transmission bus. It was then implemented in Verilog HDL, synthesized with a TSMC 0.18 um library, and implemented in a chip. The synthesized design occupies 4,700 gates (2-input NAND gate equivalents), and its utilization of the total area is 2.4%. The proposed architecture improves the total performance of an MPSoC while occupying a relatively small area.
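
The cost of repeated synchronization in a pipelined chain broadcast can be illustrated with a simple timing model. This is a rough sketch, not the paper's model: it assumes a linear chain of nodes, a fixed per-segment synchronization overhead, and a circuit-switched variant that pays that overhead only once. All function and parameter names are hypothetical.

```python
def pipelined_chain_time(num_nodes, msg_bytes, num_segments, bandwidth, sync_overhead):
    """Chain broadcast in which every segment pays the sync overhead at each step."""
    # The last segment leaves the root at step num_segments and then needs
    # num_nodes - 2 further steps to reach the end of the chain.
    steps = num_segments + num_nodes - 2
    per_segment = sync_overhead + msg_bytes / num_segments / bandwidth
    return steps * per_segment


def single_sync_chain_time(num_nodes, msg_bytes, num_segments, bandwidth, sync_overhead):
    """Same pipeline, but the sync overhead is paid once up front (assumption)."""
    steps = num_segments + num_nodes - 2
    return sync_overhead + steps * (msg_bytes / num_segments / bandwidth)


if __name__ == "__main__":
    args = dict(num_nodes=4, msg_bytes=1024, num_segments=8,
                bandwidth=1.0, sync_overhead=16)
    print(pipelined_chain_time(**args))    # per-segment synchronization
    print(single_sync_chain_time(**args))  # one-time synchronization
```

In this model the gap between the two variants grows with the number of segments, which is consistent with the abstract's point that finer segmentation repeats the synchronization cost.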

Program Execution Speed Improvement using Executable Compression Method on Embedded Systems (임베디드 시스템에서 실행 가능 압축 기법을 이용한 프로그램 초기 실행 속도 향상)

  • Jeon, Chang-Kyu;Lew, Kyeung-Seek;Kim, Yong-Deak
    • Journal of the Institute of Electronics Engineers of Korea CI / v.49 no.1 / pp.23-28 / 2012
  • The performance of secondary storage improves very slowly compared with that of main memory and processors. To execute an application, data is loaded from secondary storage into memory, and this load is a bottleneck. In this paper, we propose an executable compression method to speed up the initial loading time of an application, and we examine its performance. We implemented two applications: a compressor for executable binary files and a decoder for compressed executable files on the embedded system. We ran speed tests on six test binary files. One file showed decreased performance, but the others showed improved performance; the average improvement in initial loading time was almost 29%. The compression level differed with the characteristics of each file, and performance depended on the compressed file size and the decompression time. An optimized compression algorithm is therefore needed for executable binary files.
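
The trade-off between compressed size and decompression time can be written as a small calculation. This is an illustrative sketch only: it assumes load time is read time plus decompression time, and the function names, units, and sample numbers are hypothetical rather than taken from the paper.

```python
def load_time(size_mb, read_mb_per_s, decompress_s=0.0):
    """Time to read a file from storage plus any decompression time."""
    return size_mb / read_mb_per_s + decompress_s


def loading_improvement(original_mb, compressed_mb, read_mb_per_s, decompress_s):
    """Fractional reduction in initial loading time (negative means slower)."""
    plain = load_time(original_mb, read_mb_per_s)
    packed = load_time(compressed_mb, read_mb_per_s, decompress_s)
    return (plain - packed) / plain


if __name__ == "__main__":
    # A file that compresses well loads faster.
    print(loading_improvement(8.0, 4.0, 4.0, 0.5))
    # A file that barely compresses loads slower once decompression is added.
    print(loading_improvement(8.0, 7.0, 4.0, 0.5))
```

The second case mirrors the abstract's observation that one of the six test files became slower: the saved read time did not cover the decompression time.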

A New Architecture of High-Performance Digital Hologram Generator based on Independent Calculation of a Holographic Pixel (독립적 홀로그램 화소 연산 방식의 고성능 디지털 홀로그램 생성기의 하드웨어 구조)

  • Lee, Yoon-Huyk;Seo, Young-Ho;Choi, Hyun-Jun;Kim, Dong-Wook
    • Journal of Broadcast Engineering / v.16 no.3 / pp.403-415 / 2011
  • In this paper, we propose a hardware architecture to generate digital holograms at high speed. It uses a modified computer-generated hologram (CGH) algorithm and adopts pipeline-based hardware to remove the memory bottleneck. Rather than generating a hologram by accumulating intermediate holograms, it generates each pixel of the final hologram independently and uses a CGH algorithm suited to that method. Based on this CGH algorithm, we propose an architecture for the digital hologram generator that consists of an input interface part, a calculating part, and a normalizing part. The hardware can decrease memory usage because it repeatedly uses the object light sources stored in its internal buffer. It is also parallelized by vertically adding unit cells. It can generate 86 frames of HD digital hologram per second for 1K light sources.
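
Per-pixel hologram generation can be sketched as follows. The abstract does not give the modified CGH algorithm, so this example assumes the common point-source formula, in which each pixel's intensity is the sum over light sources of A·cos(2πr/λ), with r the distance from the source to the pixel. Function names and the tuple layout of a light source are hypothetical.

```python
import math


def hologram_pixel(px, py, sources, wavelength):
    """Intensity of one hologram pixel, computed from all light sources.

    Each source is (x, y, z, amplitude). The pixel needs no intermediate
    hologram: it only reads the buffered source list.
    """
    k = 2.0 * math.pi / wavelength  # wave number
    total = 0.0
    for sx, sy, sz, amplitude in sources:
        r = math.sqrt((px - sx) ** 2 + (py - sy) ** 2 + sz ** 2)
        total += amplitude * math.cos(k * r)
    return total


def hologram(width, height, pitch, sources, wavelength):
    """Build the hologram row by row; every pixel is independent of the others."""
    return [[hologram_pixel(x * pitch, y * pitch, sources, wavelength)
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    sources = [(0.0, 0.0, 3.0, 2.0)]
    print(hologram(2, 1, 4.0, sources, 1.0))
```

Because each pixel reads only the source list, the inner loop can be replicated across hardware units without sharing partial results, which is the property the abstract relies on to avoid the memory bottleneck.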

A Low-Power Texture Mapping Technique for Mobile 3D Graphics (모바일 3D 그래픽스를 위한 저전력 텍스쳐 맵핑 기법)

  • Kim, Hyun-Hee;Kim, Ji-Hong
    • Journal of the Korea Society of Computer and Information / v.14 no.2 / pp.45-57 / 2009
  • Texture mapping is a technique used to add realism to an image in 3D graphics. However, it becomes the bottleneck of the 3D graphics pipeline because it requires large processing power and high memory bandwidth. A texture cache is used to reduce memory latency in texture mapping. As portable devices become smaller and are power-constrained, it is important to reduce the area and power consumption of the texture cache. In this paper, we propose using a small texture cache to reduce both. Furthermore, we propose prefetching and a victim cache to keep performance comparable to that of large texture caches. Simulation results show that the proposed small texture cache of 1 to 2 KB can reduce area and power consumption by up to 70% and 60%, respectively, compared with a conventional 16 KB cache, while maintaining performance.

Data Congestion Control Using Drones in Clustered Heterogeneous Wireless Sensor Network (클러스터된 이기종 무선 센서 네트워크에서의 드론을 이용한 데이터 혼잡 제어)

  • Kim, Tae-Rim;Song, Jong-Gyu;Im, Hyun-Jae;Kim, Bum-Su
    • Journal of the Korea Academia-Industrial cooperation Society / v.21 no.7 / pp.12-19 / 2020
  • The clustered heterogeneous wireless sensor network is composed of sensor nodes and cluster heads, which are hierarchically organized for different objectives. In such a network, node resources must be managed carefully to enhance network performance under memory and battery capacity constraints. For instance, if events of interest occur frequently in the vicinity of particular sensor nodes, those nodes might receive massive amounts of data. Data congestion can happen due to a memory bottleneck or link disconnection at cluster heads because the remaining memory space fills with that data. In this paper, we utilize drones as mobile sinks to resolve data congestion, and we model the network, sensor nodes, and cluster heads. We also design a cost function and a congestion indicator to calculate the degree of congestion. We then propose a data congestion map index and a data congestion mapping scheme to deploy drones at optimal points. Using a control variable, we explore the relationship between the degree of congestion and the number of drones to be deployed, as well as the number of drones needed to stay below a certain degree of congestion and within communication range. Furthermore, we show that our algorithm outperforms previous work by a minimum of 20% in terms of memory overflow.

Accelerating Self-Similarity-Based Image Super-Resolution Using OpenCL

  • Jun, Jae-Hee;Choi, Ji-Hoon;Lee, Dae-Yeol;Jeong, Seyoon;Cho, Suk-Hee;Kim, Hui-Yong;Kim, Jong-Ok
    • IEIE Transactions on Smart Processing and Computing / v.4 no.1 / pp.10-15 / 2015
  • This paper proposes a parallel implementation of a self-similarity-based image super-resolution (SR) algorithm using OpenCL. The SR algorithm requires a tremendous amount of computation to search for similar patches. This becomes a bottleneck for real-time conversion from an FHD image to UHD. Therefore, it is imperative to accelerate the processing speed of SR algorithms. For parallelization, the SR process is divided into several kernels, and memory optimization is performed. In addition, two GPUs are used for further acceleration. The experimental results show that a GPGPU implementation can achieve a speedup of over 140 times compared with a single-core CPU. Furthermore, it was confirmed experimentally that using two GPUs scales the speedup proportionally, up to 277 times.

A Parallel Loop Scheduling Algorithm on Multiprocessor System Environments (다중프로세서 시스템 환경에서 병렬 루프 스케쥴링 알고리즘)

  • 이영규;박두순
    • Journal of Korea Multimedia Society / v.3 no.3 / pp.309-319 / 2000
  • The purpose of parallel scheduling in a multiprocessor environment is to schedule with minimum synchronization overhead and to balance the load of a parallel application program. Each processor calculates a chunk of iterations and is allocated that chunk to execute in parallel. In doing so, processors frequently access mutually exclusive global memory, which imposes considerable scheduling overhead and a bottleneck. In addition, when the distribution of the parallel iterations in the chunks allocated to the processors differs, the differing execution times of the chunks cause load imbalance and degrade overall scheduling performance. In this paper, we investigate the problems of conventional algorithms with the aim of achieving minimum scheduling overhead and load balance. We then present a new parallel loop scheduling algorithm that considers data locality and processor affinity.
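
The link between chunk size and scheduling overhead can be shown with one well-known conventional scheme, guided self-scheduling, in which each dispatch hands out the remaining iterations divided by the number of processors, rounded up. The abstract does not say which conventional algorithms the paper examines, so this choice is an assumption, and the function name is hypothetical.

```python
def guided_chunks(num_iterations, num_processors):
    """Chunk sizes handed out by guided self-scheduling.

    Every chunk corresponds to one access to the shared loop counter,
    so the length of the result is the number of synchronizations.
    """
    chunks = []
    remaining = num_iterations
    while remaining > 0:
        size = -(-remaining // num_processors)  # integer ceiling division
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    chunks = guided_chunks(20, 4)
    print(chunks)       # large chunks first, single iterations at the end
    print(len(chunks))  # shared-counter accesses, versus 20 for chunks of size 1
```

Fewer, larger chunks reduce contention on the shared counter, while the small chunks at the end help even out the load; neither property accounts for data locality or processor affinity, which is the gap the paper's algorithm targets.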

A Hybrid Active Queue Management for Stability and Fast Adaptation

  • Joo Chang-Hee;Bahk Sae-Woong;Lumetta Steven S.
    • Journal of Communications and Networks / v.8 no.1 / pp.93-105 / 2006
  • The domination of the Internet by TCP-based services has spawned many efforts to provide high network utilization with low loss and delay in a simple and scalable manner. Active queue management (AQM) algorithms attempt to achieve these goals by regulating queues at bottleneck links to provide useful feedback to TCP sources. While many AQM algorithms have been proposed, most suffer from instability, require careful configuration of nonintuitive control parameters, or are not practical because of slow response to dynamic traffic changes. In this paper, we propose a new AQM algorithm, hybrid random early detection (HRED), that combines the more effective elements of recent algorithms with a random early detection (RED) core. HRED maps instantaneous queue length to a drop probability, automatically adjusting the slope and intercept of the mapping function to account for changes in traffic load and to keep queue length within the desired operating range. We demonstrate that straightforward selection of HRED parameters results in stable operation under steady load and rapid adaptation to changes in load. Simulation and implementation tests confirm this stability and indicate that the overall performance of HRED is substantially better than that of earlier AQM algorithms. Finally, HRED control parameters provide several intuitive approaches to trading between required memory, queue stability, and response time.
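
The mapping from instantaneous queue length to drop probability can be sketched as a clamped linear function. This shows only the mapping step with fixed parameters; the abstract does not describe how HRED adjusts the slope and intercept, so that adaptation is omitted. The function name, parameter names, and sample values are hypothetical.

```python
def drop_probability(queue_len, slope, intercept):
    """Linear map from queue length to a drop probability, clamped to [0, 1]."""
    p = slope * queue_len + intercept
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    slope, intercept = 1 / 64, -0.25  # illustrative values only
    for q in (0, 16, 48, 100):
        print(q, drop_probability(q, slope, intercept))
```

Raising the slope makes the queue react more sharply to growth, and shifting the intercept moves the queue length at which drops begin, which is why adjusting the two together can hold the queue within a target range.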

Hierarchical Binary Search Tree (HBST) for Packet Classification (패킷 분류를 위한 계층 이진 검색 트리)

  • Chu, Ha-Neul;Lim, Hye-Sook
    • The Journal of Korean Institute of Communications and Information Sciences / v.32 no.3B / pp.143-152 / 2007
  • In order to provide new value-added services such as policy-based routing and quality of service in next-generation networks, Internet routers need to classify packets into flows for different treatment; this is called packet classification. Since packet classification must be performed at wire speed for every packet arriving at several hundred gigabits per second, it becomes a bottleneck in Internet routers. High-speed packet classification algorithms are therefore required. In this paper, we propose an efficient packet classification architecture based on a hierarchical binary search tree. The proposed architecture hierarchically connects binary search trees that have no empty nodes, and hence reduces the memory requirement and improves search performance.

- Development of an Algorithm for a Re-entrant Safety Parallel Machine Problem Using Roll out Algorithm - (Roll out 알고리듬을 이용한 반복 작업을 하는 안전병렬기계 알고리듬 개발)

  • Baek Jong Kwan;Kim Hyung Jun
    • Journal of the Korea Safety Management & Science / v.6 no.4 / pp.155-170 / 2004
  • Among semiconductor IC chips, unlike memory chips, the majority of application-specific IC (ASIC) products are produced to customer order, and meeting the customer-specified due date is a critical issue. However, anyone who understands the nature of semiconductor manufacturing readily sees how difficult it is to meet specific production due dates. Because products are multi-layered, a semiconductor product (called a device) enters the fabrication process (FAB) repeatedly, as many times as it has layers, and the fabrication processes of individual layers are composed of similar but not identical unit processes. The unit process called photolithography is the only process that every layer must pass through. This re-entrant feature of the FAB makes it difficult to predict and plan the due date of an ordered batch of devices. The parallel machine problem in the photo process, which is the bottleneck process, is solved with a restricted roll out algorithm. The roll out algorithm solves the problem by embedding it within a dynamic programming framework. The restricted roll out algorithm is a roll out algorithm that restricts the alternative states to decrease solving time and improve the result. Simulation tests under the same conditions as real FAB facilities show the effectiveness of the developed algorithm.
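
The general roll out idea can be sketched on a simplified problem. This is not the paper's restricted, re-entrant formulation: it assumes identical parallel machines, jobs given as (processing time, due date) pairs, total tardiness as the objective, and an earliest-due-date base heuristic. All names are hypothetical.

```python
def greedy_tardiness(jobs, loads):
    """Base heuristic: earliest due date first, each job on the least-loaded machine."""
    loads = list(loads)
    total = 0
    for proc, due in sorted(jobs, key=lambda job: job[1]):
        m = min(range(len(loads)), key=loads.__getitem__)
        loads[m] += proc
        total += max(0, loads[m] - due)
    return total


def rollout_schedule(jobs, num_machines):
    """Fix one job at a time, scoring each candidate by finishing with the base heuristic."""
    loads = [0] * num_machines
    remaining = list(jobs)
    order, tardiness = [], 0
    while remaining:
        m = min(range(num_machines), key=loads.__getitem__)  # next free machine
        best_cost, best_index = None, None
        for i, (proc, due) in enumerate(remaining):
            trial = list(loads)
            trial[m] += proc
            rest = remaining[:i] + remaining[i + 1:]
            cost = max(0, trial[m] - due) + greedy_tardiness(rest, trial)
            if best_cost is None or cost < best_cost:
                best_cost, best_index = cost, i
        proc, due = remaining.pop(best_index)
        loads[m] += proc
        tardiness += max(0, loads[m] - due)
        order.append((proc, due, m))
    return order, tardiness


if __name__ == "__main__":
    jobs = [(3, 3), (2, 2), (4, 6), (1, 4)]
    print(greedy_tardiness(jobs, [0, 0]))
    print(rollout_schedule(jobs, 2))
```

Each step evaluates every remaining job as the next choice, so the cost grows quickly with the number of jobs; restricting the set of candidate states, as the abstract describes, is one way to cut that cost.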