• Title/Summary/Keyword: pipelined memory


Bounding Worst-Case DRAM Performance on Multicore Processors

  • Ding, Yiqiang;Wu, Lan;Zhang, Wei
    • Journal of Computing Science and Engineering, v.7 no.1, pp.53-66, 2013
  • Bounding the worst-case DRAM performance of a real-time application is a challenging problem that is critical for computing the worst-case execution time (WCET), especially on multicore processors, where the DRAM memory is usually shared by all of the cores. Typically, DRAM commands from consecutive DRAM accesses can be pipelined on the DRAM device, depending on the spatial locality of the data they fetch. By considering the effect of DRAM command pipelining, we propose a basic approach to bounding the worst-case DRAM performance. An enhanced approach is proposed to reduce the overestimation caused by invalid DRAM access sequences, by checking the timing order of the co-running applications on a dual-core processor. Compared with the conservative approach, which assumes that no DRAM command pipelining exists, our experimental results show that the basic approach can bound the WCET more tightly, by 15.73% on average. The experimental results also indicate that the enhanced approach can further improve the tightness of the WCET bound, by 4.23% on average, as compared to the basic approach.
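
  The gain from DRAM command pipelining can be illustrated with a toy timing model. The sketch below is not the paper's analysis; it assumes that every access pays the full activate + column-access + precharge cost when command pipelining is ignored, while an access to the same row as the previous one pays only the column-access latency when pipelining is considered. The timing parameters are placeholder values, not the paper's DRAM model.

    # Toy model of worst-case DRAM latency with and without command pipelining.
    # Timing parameters are placeholders, not taken from the paper.
    T_ACT, T_CAS, T_PRE = 15, 15, 15   # activate / column-access / precharge (ns)

    def worst_case_latency(accesses, pipelined):
        """accesses: list of DRAM row ids in worst-case program order."""
        total, open_row = 0, None
        for row in accesses:
            if pipelined and row == open_row:
                total += T_CAS                    # row hit: column command only
            else:
                total += T_ACT + T_CAS + T_PRE    # conservative: full command cycle
                open_row = row
        return total

    seq = [0, 0, 0, 1, 1, 2]                      # consecutive accesses, some sharing a row
    print(worst_case_latency(seq, pipelined=False))  # conservative bound
    print(worst_case_latency(seq, pipelined=True))   # tighter bound with pipelining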

Hardware Design and Application of Block-cipher Algorithm KASUMI (블록암호화 알고리듬 KASUMI의 하드웨어 설계 및 응용)

  • Choi, Hyun-Jun;Seo, Young-Ho;Moon, Sung-Sik;Kim, Dong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering, v.15 no.1, pp.63-70, 2011
  • In this paper, the KASUMI cipher algorithm is implemented in hardware. KASUMI was designed in a technology-independent manner so that it can be used in ASIC or core-based designs. The hardware can compute both the confidentiality and the integrity algorithms, and a pipelined KASUMI core is used as the main operator to achieve a high operating frequency. The proposed block cipher was mapped onto an Altera EPXA10F1020C1 device and used 22% of the logic elements (LEs) and 10% of the memory elements. The implemented hardware (FPGA) operates stably at 36.35 MHz. Accordingly, the implemented hardware is expected to serve as an effective solution for end-to-end security, which is considered one of the important problems.
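
  As a rough software illustration of what a pipelined block-cipher core buys, the sketch below models an 8-stage round pipeline in which each stage register holds the intermediate state of a different block, so one block completes per clock once the pipeline is full. The round function is a placeholder, not KASUMI's actual FL/FO/FI structure, and the key schedule is omitted.

    # Software model of an 8-stage pipelined Feistel core (one block per cycle at steady state).
    # round_fn is a placeholder; it is NOT the real KASUMI round (FL/FO/FI omitted).
    ROUNDS = 8

    def round_fn(left, right, round_key):
        # placeholder Feistel round: real hardware would implement FL/FO/FI here
        return right, left ^ ((right * round_key + 0x9E37) & 0xFFFF)

    def pipelined_encrypt(blocks, round_keys):
        stages = [None] * ROUNDS                       # one pipeline register per round
        out = []
        feed = iter(blocks)
        for _ in range(len(blocks) + ROUNDS):          # run until the pipeline drains
            if stages[-1] is not None:
                out.append(stages[-1])
            # shift the pipeline: each stage applies its own round to the block it holds
            for i in range(ROUNDS - 1, 0, -1):
                stages[i] = round_fn(*stages[i - 1], round_keys[i]) if stages[i - 1] else None
            nxt = next(feed, None)
            stages[0] = round_fn(*nxt, round_keys[0]) if nxt else None
        return out

    keys = list(range(1, ROUNDS + 1))
    print(pipelined_encrypt([(0x0123, 0x4567), (0x89AB, 0xCDEF)], keys))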

An Architecture for IEEE 802.11n LDPC Decoder Supporting Multi Block Lengths (다중 블록길이를 지원하는 IEEE 802.11n LDPC 복호기 구조)

  • Na, Young-Heon;Shin, Kyung-Wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2010.05a, pp.798-801, 2010
  • This paper describes an efficient architecture for an LDPC (Low-Density Parity-Check) decoder that supports the three block lengths (648, 1,296, 1,944) of the IEEE 802.11n standard. To minimize hardware complexity, the min-sum algorithm and a block-serial layered structure are adopted in the DFU (Decoding Function Unit), which is the main functional block of the LDPC decoder. The optimized H-ROM structure for multiple block lengths reduces the ROM size by 42% as compared to the conventional method. Also, a pipelined memory read/write scheme for inter-layer DFU operations is proposed for optimized operation of the LDPC decoder.
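
  For context, the min-sum approximation referred to above replaces the sum-product check-node update with a sign product and a minimum of magnitudes. The sketch below is a generic software version of that update, not the paper's block-serial layered DFU datapath.

    # Generic min-sum check-node update (software reference, not the paper's DFU).
    def min_sum_check_update(llrs):
        """llrs: LLR messages from the variable nodes connected to one check node.
        Returns the message sent back to each connected variable node."""
        out = []
        for i in range(len(llrs)):
            others = llrs[:i] + llrs[i+1:]           # exclude the target variable node
            sign = 1
            for v in others:
                if v < 0:
                    sign = -sign
            out.append(sign * min(abs(v) for v in others))
        return out

    print(min_sum_check_update([2.5, -1.0, 0.5, 3.0]))   # -> [-0.5, 0.5, -1.0, -0.5]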


Trace-Back Viterbi Decoder with Sequential State Transition Control (순서적 역방향 상태천이 제어에 의한 역추적 비터비 디코더)

  • 정차근
    • Journal of the Institute of Electronics Engineers of Korea TC, v.40 no.11, pp.51-62, 2003
  • This paper presents a novel survivor-memory management and decoding technique with sequential backward state-transition control for the trace-back Viterbi decoder. The Viterbi algorithm is a maximum-likelihood decoding scheme that estimates the likelihood of encoder states for channel error detection and correction. The scheme is applied to a broad range of digital communication problems such as intersymbol-interference removal and channel equalization. In order to achieve an area-efficient VLSI chip design with high throughput for the Viterbi decoder, in which recursive operations are implied, further work is required to obtain a simple, systematic, parallel ACS architecture and survivor-memory management. As a solution to this problem, this paper addresses a progressive decoding algorithm with sequential backward state-transition control in the trace-back Viterbi decoder. Compared to conventional trace-back decoding techniques, the total required memory can be greatly reduced with the proposed method. Furthermore, the proposed method can be implemented as a simple pipelined structure with a systolic-array-type architecture. No peripheral logic circuit for the control of memory access needs to be implemented, and the memory-access bandwidth can be reduced. Therefore, the proposed method offers high area efficiency and low power consumption with high throughput. Finally, examples of decoding results for received data with channel noise, together with an application result, are provided to evaluate the efficiency of the proposed method.
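
  To make the trace-back idea concrete, the sketch below shows the conventional survivor-memory readback that the paper improves on: at each trellis step the ACS unit records one decision bit per state, and decoding walks those decisions backward from the final state. This is a generic illustration (newest input bit taken as the MSB of the state), not the paper's sequential backward state-transition control.

    # Generic trace-back over a survivor memory (illustrative, not the paper's scheme).
    # decisions[t][state] = 0/1 decision bit stored by the ACS unit at trellis step t.
    def trace_back(decisions, final_state, n_state_bits):
        mask = (1 << n_state_bits) - 1
        state, bits = final_state, []
        for t in range(len(decisions) - 1, -1, -1):
            d = decisions[t][state]
            bits.append(state >> (n_state_bits - 1))      # decoded bit = MSB of the current state
            state = ((state << 1) | d) & mask             # step back to the winning predecessor
        return bits[::-1]

    # Minimal demo: record the decisions along a known path and recover the input bits.
    K, mask = 3, (1 << 2) - 1                      # constraint length 3 -> 4 states, 2 state bits
    bits_in = [1, 0, 1, 1, 0, 0]
    state, decisions = 0, []
    for b in bits_in:
        nxt = ((b << 1) | (state >> 1)) & mask     # newest bit enters at the MSB
        row = [0, 0, 0, 0]
        row[nxt] = state & 1                       # decision bit = the bit shifted out of the state
        decisions.append(row)
        state = nxt
    print(trace_back(decisions, state, 2))         # -> [1, 0, 1, 1, 0, 0]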

Performance Analysis of Implementation on Image Processing Algorithm for Multi-Access Memory System Including 16 Processing Elements (16개의 처리기를 가진 다중접근기억장치를 위한 영상처리 알고리즘의 구현에 대한 성능평가)

  • Lee, You-Jin;Kim, Jea-Hee;Park, Jong-Won
    • Journal of the Institute of Electronics Engineers of Korea CI, v.49 no.3, pp.8-14, 2012
  • Improving the speed of image processing is in great demand with the spread of high-quality visual media and massive image applications such as 3D TV, movies, and AR (augmented reality). A SIMD computer attached to a host computer can accelerate various image processing and massive data operations. MAMS is a multi-access memory system which, together with multiple processing elements (PEs), is well suited for building a high-performance pipelined SIMD machine. MAMS supports simultaneous access to pq data elements within a horizontal, vertical, or block subarray with a constant interval at an arbitrary position in an M×N array of data elements, where the number of memory modules (MMs), m, is a prime number greater than pq. MAMS-PP4, the first realization of the MAMS architecture, consists of four PEs in a single chip and five MMs. This paper presents the implementation of image processing algorithms and a performance analysis for MAMS-PP16, which consists of 16 PEs with 17 MMs, as an extension of the prior work, MAMS-PP4. The newly designed MAMS-PP16 has a 64-bit instruction format and an application-specific instruction set. The authors developed a simulator of the MAMS-PP16 system on which the implemented algorithms can be executed, and the performance analysis was carried out by running the implemented image processing algorithms on this simulator. The results verify the consistent response of MAMS-PP16 on the pyramid operation in image processing compared with a Pentium-based serial processor: executing the pyramid operation on MAMS-PP16 yields consistent processing times, whereas the serial processor shows irregular response times.
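
  The key property claimed above, conflict-free simultaneous access to pq elements when the number of memory modules m is a prime greater than pq, can be checked for a candidate module-assignment function with a few lines of code. The sketch below uses a simple linear skewing function as a stand-in; the actual MAMS address function is not reproduced here.

    # Check that every pxq block subarray maps its elements to distinct memory modules.
    # The module-assignment function below is a simple linear skewing used for illustration;
    # it is NOT the actual MAMS mapping from the paper.
    p, q, m = 4, 4, 17          # pq = 16 elements per access, m = 17 (prime > pq) modules
    M, N = 32, 32               # size of the data array

    def module_of(i, j):
        return (i * q + j) % m  # placeholder skewing function

    def block_is_conflict_free(top, left):
        mods = {module_of(top + di, left + dj) for di in range(p) for dj in range(q)}
        return len(mods) == p * q   # all pq elements land in different modules

    print(all(block_is_conflict_free(i, j)
              for i in range(M - p + 1) for j in range(N - q + 1)))   # True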

Optimized Hardware Design of Deblocking Filter for H.264/AVC (H.264/AVC를 위한 디블록킹 필터의 최적화된 하드웨어 설계)

  • Jung, Youn-Jin;Ryoo, Kwang-Ki
    • Journal of the Institute of Electronics Engineers of Korea SD, v.47 no.1, pp.20-27, 2010
  • This paper describes the design of a 5-stage pipelined de-blocking filter with a power-reduction scheme and proposes an efficient memory architecture and filtering order for a high-performance H.264/AVC decoder. In general, the de-blocking filter removes block-boundary artifacts and enhances image quality. Nevertheless, the filter has the disadvantage of requiring many memory accesses and repeated operations, because the filter operation is performed four times for each edge. This paper therefore proposes an optimized filtering order and an efficient hardware architecture to reduce memory accesses and the total number of filtering cycles. In the proposed filter, parallel processing is possible thanks to a 5-stage pipeline consisting of memory read, threshold decision, pre-calculation, filter operation, and write-back. It also reduces power consumption by using a clock-gating scheme that disables unnecessary clock switching. In addition, the total number of filtering cycles is decreased by the new filtering order. The proposed filter was designed in Verilog-HDL and functionally verified with the whole H.264/AVC decoder using the ModelSim 6.2g simulator, with input vectors consisting of QCIF images generated by the JM9.4 reference encoder software. The experimental results show that the filter reduces the total number of filtering cycles by about 20% and requires only a small transposition buffer.
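
  The "threshold decision" stage mentioned above corresponds to the standard H.264/AVC edge-filtering decision: a line of samples across a block edge is filtered only if the sample differences fall below the α and β thresholds derived from the quantization parameter. The sketch below shows that decision in isolation, with α and β passed in directly rather than looked up from the standard's QP-indexed tables.

    # H.264/AVC de-blocking filter decision for one line of samples across an edge.
    # p1, p0 are the samples on one side of the edge, q0, q1 on the other side.
    # alpha and beta would normally come from QP-indexed tables in the standard;
    # here they are passed in directly to keep the sketch self-contained.
    def filter_this_line(p1, p0, q0, q1, alpha, beta):
        return (abs(p0 - q0) < alpha and
                abs(p1 - p0) < beta and
                abs(q1 - q0) < beta)

    print(filter_this_line(80, 82, 95, 94, alpha=17, beta=6))   # True: small, blocking-like step
    print(filter_this_line(20, 30, 90, 95, alpha=17, beta=6))   # False: a real image edge is kept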

A New Hardware Design for Generating Digital Holographic Video based on Natural Scene (실사기반 디지털 홀로그래픽 비디오의 실시간 생성을 위한 하드웨어의 설계)

  • Lee, Yoon-Hyuk;Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics and Information Engineers, v.49 no.11, pp.86-94, 2012
  • In this paper, we propose a hardware architecture for a high-speed CGH (computer-generated hologram) generation processor that, in particular, reduces the number of memory accesses to avoid a bottleneck in the memory access operation. For this, we use three main schemes. The first is pixel-by-pixel calculation rather than light-source-by-source calculation. The second is a parallel calculation scheme derived by modifying the previous recursive calculation scheme. The last is a fully pipelined calculation scheme with precisely structured timing scheduling obtained by adjusting the hardware. The proposed hardware is structured to calculate a row of the CGH in parallel, and each hologram pixel in a row is calculated independently. It consists of an input interface, an initial parameter calculator, hologram pixel calculators, a line buffer, and a memory controller. The implemented hardware, which calculates a row of a 1,920×1,080 CGH in parallel, uses 168,960 LUTs, 153,944 registers, and 19,212 DSP blocks in an Altera FPGA environment and operates stably at 198 MHz. Thanks to the three schemes, the time spent accessing external memory is reduced to about 1/20,000 of that of previous designs at the same calculation speed.
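
  To illustrate the pixel-by-pixel strategy, the sketch below computes each hologram pixel by accumulating contributions from all object light sources with a simple point-source phase model. This is a generic software model of CGH accumulation, not the paper's recursive/parallel hardware formulation; the wavelength, pixel pitch, and object points are made-up values.

    # Pixel-by-pixel CGH accumulation with a point-source model (illustrative only;
    # wavelength, pitch, and object points are placeholder values, and the paper's
    # recursive/parallel reformulation of this sum is not reproduced here).
    import math

    WAVELENGTH = 532e-9                  # assumed green laser (m)
    PITCH = 8e-6                         # assumed hologram pixel pitch (m)
    K = 2 * math.pi / WAVELENGTH         # wave number

    # object points: (x, y, z, amplitude) of the captured scene
    points = [(0.0, 0.0, 0.2, 1.0), (1e-3, -5e-4, 0.25, 0.7)]

    def cgh_row(row_index, width):
        """Compute one hologram row; each pixel is independent of the others,
        which is what allows a whole row to be computed in parallel in hardware."""
        y = row_index * PITCH
        row = []
        for col in range(width):
            x = col * PITCH
            acc = 0.0
            for (px, py, pz, amp) in points:             # accumulate every light source
                r = math.sqrt((x - px) ** 2 + (y - py) ** 2 + pz ** 2)
                acc += amp * math.cos(K * r)
            row.append(acc)
        return row

    print(cgh_row(0, 4))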

High Throughput Parallel Design of 2-D 8×8 Integer Transforms for H.264/AVC (H.264/AVC 를 위한 높은 처리량의 2-D 8×8 integer transforms 병렬 구조 설계)

  • Sharma, Meeturani;Tiwari, Honey;Cho, Yong-Beom
    • Journal of the Institute of Electronics Engineers of Korea SD, v.49 no.8, pp.27-34, 2012
  • In this paper, the implementation of a high-throughput two-dimensional (2-D) 8×8 forward and inverse integer DCT transform for H.264 is presented. The forward and inverse transforms are expressed using simple shift and addition operations. Matrix decomposition and matrix operations such as the Kronecker product and direct sum are used to reduce the computational complexity. The proposed design uses integer computations and does not use a transpose memory, so the resource consumption is also reduced. The maximum operating frequency of the proposed pipelined architecture is 1.184 GHz, which achieves a 25.27 Gpixels/sec throughput rate at a hardware cost of 44,864 gates. The high throughput and low hardware cost make the proposed design useful for real-time H.264/AVC high-definition processing.
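
  The Kronecker-product decomposition mentioned above rests on the identity that a separable 2-D transform Y = A·X·Aᵀ can be rewritten as vec(Y) = (A ⊗ A)·vec(X), which is what allows the row and column passes to be restructured without a transpose memory. The NumPy sketch below verifies that identity with a small random integer matrix; it does not use the actual H.264/AVC 8×8 transform coefficients.

    # Verify vec(A @ X @ A.T) == kron(A, A) @ vec(X) for a separable 2-D transform.
    # A is a random integer matrix here, not the H.264/AVC 8x8 integer-transform matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(-8, 9, size=(8, 8))     # stand-in transform matrix
    X = rng.integers(0, 256, size=(8, 8))    # 8x8 block of residual samples

    two_d = A @ X @ A.T                                        # usual separable (row/column) form
    kron_form = (np.kron(A, A) @ X.reshape(-1)).reshape(8, 8)  # Kronecker-product form

    print(np.array_equal(two_d, kron_form))  # True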

Retiming for SoC Using Single-Phase Clocked Latches (싱글 페이즈 클락드 래치를 이용한 SoC 리타이밍)

  • Kim Moon-Su;Rim Chong-Suck
    • Journal of the Institute of Electronics Engineers of Korea SD, v.43 no.9 s.351, pp.1-9, 2006
  • In System-on-Chip (SoC) design, the global wires are critical to performance, so they need to be pipelined using flip-flops or latches. Since the timing constraint of a latch is more flexible than that of a flip-flop, a latch-based design can provide a better solution for the clock period. Retiming is an optimization technique that repositions the memory elements in a circuit to reduce the clock period. Traditionally, retiming is applied to a gate-level netlist, whereas retiming for SoC is applied to a macro-level netlist. In this paper, we extend previous work on retiming for SoC with flip-flops to retiming for SoC with single-phase clocked latches: we formulate a MILP for retiming with single-phase clocked latches and apply a fixpoint computation to solve it. Experimental results show that retiming for SoC using latches reduces the clock period of the circuits by 10 percent on average compared with retiming for SoC using flip-flops.
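
  For reference, classic (flip-flop) retiming assigns an integer lag r(v) to every node and moves registers so that each edge u→v ends up with w_r(u,v) = w(u,v) + r(v) − r(u) registers, which must stay non-negative; the paper's latch-based MILP adds latch timing constraints on top of this idea. The sketch below only checks that basic legality condition for a candidate retiming of a toy macro-level graph.

    # Legality check for a candidate retiming on a small macro-level graph (flip-flop
    # formulation only; the latch timing constraints of the paper's MILP are not modeled).
    edges = {            # (u, v): number of pipeline registers currently on the wire u -> v
        ("core_a", "bus"): 1,
        ("bus", "core_b"): 0,
        ("core_b", "core_a"): 2,
    }
    retiming = {"core_a": 0, "bus": 1, "core_b": 1}   # candidate lag r(v) per node

    def retimed_weights(edges, r):
        return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}

    new_w = retimed_weights(edges, retiming)
    print(new_w)                                       # registers per wire after retiming
    print(all(w >= 0 for w in new_w.values()))         # legality: no negative register counts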

Theoretical Performance Analysis of a 12 Mbps, r=1/2, k=7 Viterbi Decoder and Its FPGA Implementation for Real-Time Performance Evaluation (12Mbps, r=1/2, k=7 비터비 디코더의 이론적 성능분석 및 실시간 성능검증을 위한 FPGA구현)

  • Jeon, Gwang-Ho;Choe, Chang-Ho;Jeong, Hae-Won;Im, Myeong-Seop
    • Journal of the Institute of Electronics Engineers of Korea SC, v.39 no.1, pp.66-75, 2002
  • For the theoretical performance analysis of a Viterbi decoder for wireless LAN with a 12 Mbps data rate, code rate 1/2, and constraint length 7, as defined in IEEE 802.11a, the transfer function is derived using Cramer's rule, and the first-event error probability and bit error probability are derived under AWGN. In the design, the input symbols are quantized into 16 levels for 4-bit soft decision, and the register-exchange method is used instead of the memory-based trace-back method for survivor management, which enables a majority decision at the final decoding stage. In the implementation, the Viterbi decoder, based on a parallel architecture with a pipelined scheme for processing the 12 Mbps high-speed data rate, and an AWGN generator are implemented using FPGA chips, and the decoder's performance is then verified in real time.
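
  The register-exchange survivor management mentioned above keeps, for every trellis state, the full sequence of decoded bits leading to that state; at each ACS step the winning predecessor's register is copied and the new bit appended, so no backward trace through memory is needed. The sketch below shows one such update step in software; it is a generic illustration, not the implemented FPGA datapath.

    # One register-exchange update step (generic illustration, not the FPGA datapath).
    # survivors[s] is the decoded bit sequence of the best path ending in state s;
    # acs[s] = (winning_predecessor_state, decoded_bit) comes from the add-compare-select stage.
    def register_exchange_step(survivors, acs):
        return {s: survivors[pred] + [bit] for s, (pred, bit) in acs.items()}

    # toy 4-state example for one trellis step
    survivors = {0: [0, 1], 1: [1, 1], 2: [0, 0], 3: [1, 0]}
    acs = {0: (0, 0), 1: (2, 0), 2: (1, 1), 3: (3, 1)}
    print(register_exchange_step(survivors, acs))
    # after enough steps, the oldest bits of all registers agree (or a majority decision is taken),
    # so the decoded output can be read directly without a trace-back pass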