Search | Korea Science

Low-latency SAO Architecture and its SIMD Optimization for HEVC Decoder

Kim, Yong-Hwan;Kim, Dong-Hyeok;Yi, Joo-Young;Kim, Je-Woo
- IEIE Transactions on Smart Processing and Computing / v.3 no.1 / pp.1-9 / 2014
This paper proposes a low-latency Sample Adaptive Offset (SAO) filter architecture and its Single Instruction Multiple Data (SIMD) optimization scheme to achieve fast High Efficiency Video Coding (HEVC) decoding in a multi-core environment. According to the HEVC standard and its Test Model (HM), SAO operation is performed only at the picture level. Most real-time decoders, however, execute their sub-modules on a Coding Tree Unit (CTU) basis to reduce latency and memory bandwidth. The proposed low-latency SAO architecture has the following advantages over picture-based SAO: 1) significantly lower memory requirements, and 2) a low-latency property that enables efficient pipelined multi-core decoding. In addition, SIMD optimization of SAO filtering can reduce the SAO filtering time significantly. The simulation results showed that the proposed low-latency SAO architecture, with significantly less memory usage, achieves a decoding time similar to that of picture-based SAO in single-core decoding. Furthermore, the SIMD optimization scheme reduces the SAO filtering time by approximately 509% and increases the total decoding speed by approximately 7% compared to the existing look-up table approach of HM.
https://doi.org/10.5573/IEIESPC.2014.3.1.1
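
To make the SAO filtering step concrete, here is a minimal sketch of HEVC-style edge-offset classification for a single row of samples in the horizontal direction. It is not the paper's implementation: the function names, the one-row scope, and the example offsets are assumptions chosen for illustration. The category is derived from two sign comparisons rather than branches, which is the kind of arithmetic that maps naturally onto SIMD lanes.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def sao_edge_offset_row(row, offsets, bit_depth=8):
    """Apply horizontal edge-offset SAO to one row (illustrative sketch).

    row      -- list of reconstructed sample values
    offsets  -- four offsets for edge categories 1..4 (hypothetical values)
    Border samples are left untouched because they lack a neighbor.
    """
    max_val = (1 << bit_depth) - 1
    out = list(row)
    for x in range(1, len(row) - 1):
        c = row[x]
        # edge_idx is 0..4: 0 = local minimum, 4 = local maximum, 2 = no edge
        edge_idx = 2 + sign(c - row[x - 1]) + sign(c - row[x + 1])
        # Remap so that category 0 means "no offset" (HEVC-style remapping)
        category = (1, 2, 0, 3, 4)[edge_idx]
        if category:
            out[x] = min(max_val, max(0, c + offsets[category - 1]))
    return out


filtered = sao_edge_offset_row([10, 8, 10, 10, 12, 10], [2, 1, -1, -2])
```

All comparisons read from the unfiltered input row, so each output sample is independent of the others and the loop could be processed several samples at a time.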

Integrated Parallelization of Video Decoding on Multi-core Systems (멀티코어 시스템에서의 통합된 비디오 디코딩 병렬화)

Hong, Jung-Hyun;Kim, Won-Jin;Chung, Ki-Seok
- Journal of the Institute of Electronics Engineers of Korea SD / v.49 no.7 / pp.39-49 / 2012
Demand for high-resolution video services has led to active study of high-speed video processing. In particular, the widespread deployment of multi-core systems has accelerated research on high-resolution video processing based on the parallelization of multimedia software. Previously proposed parallelization approaches could improve decoding performance. However, some of these methods did not consider entropy decoding, and others parallelized only part of the decoding process. Therefore, we consider parallel entropy decoding integrated with the other parallel video decoding processes on a multi-core system. We propose a novel parallel decoding method called Integrated Parallelization (IP), along with a method for optimizing the parallelization of video decoding on a multi-core system with many cores. We parallelized the KTA 2.7 decoder with the proposed technique on an Intel i7 quad-core platform with Intel Hyper-Threading technology and multi-thread scheduling. We achieved up to a 70% performance improvement using the IP method.

KI-HABS: Key Information Guided Hierarchical Abstractive Summarization

Zhang, Mengli;Zhou, Gang;Yu, Wanting;Liu, Wenfen
- KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.12 / pp.4275-4291 / 2021
With the unprecedented growth of textual information on the Internet, an efficient automatic summarization system has become an urgent need. Recently, neural network models based on the encoder-decoder with an attention mechanism have demonstrated powerful capabilities in the sentence summarization task. However, when summarizing paragraphs or longer documents, these models fail to mine the core information in the input text, which leads to information loss and repetition. In this paper, we propose an abstractive document summarization method, denoted KI-HABS, that applies guidance signals from key sentences to the encoder of a hierarchical encoder-decoder architecture. Specifically, we first train an extractor, a hierarchical bidirectional GRU, to extract key sentences from the input document. Then, we encode the key sentences into a key information representation at the sentence level. Finally, we adopt selective encoding strategies guided by the key information representation to filter source information, which establishes a connection between the key sentences and the document. We use the CNN/Daily Mail and Gigaword datasets to evaluate our model. The experimental results demonstrate that our method generates more informative and concise summaries, achieving better performance than the competitive models.
https://doi.org/10.3837/tiis.2021.12.001

Time-Series Forecasting Based on Multi-Layer Attention Architecture

Na Wang;Xianglian Zhao
- KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.1 / pp.1-14 / 2024
Time-series forecasting is used extensively in the real world. Recent research has shown that Transformers, which have a self-attention mechanism at their core, exhibit better performance on such problems. However, most existing Transformer models for time-series prediction use the traditional encoder-decoder architecture, which is complex and leads to low processing efficiency, thus limiting the ability to mine deep time dependencies by increasing model depth. In addition, the quadratic computational complexity of the self-attention mechanism increases computational overhead and reduces processing efficiency. To address these issues, the paper designs an efficient multi-layer attention-based time-series forecasting model. This model has the following characteristics: (i) It abandons the traditional encoder-decoder Transformer architecture and builds a time-series prediction model on a multi-layer attention mechanism, improving the model's ability to mine deep time dependencies. (ii) A cross-attention module is designed to enhance information exchange between the historical and predicted sequences. (iii) A recently proposed sparse attention mechanism is applied to reduce computational overhead and improve processing efficiency. Experiments on multiple datasets show that the model significantly outperforms current advanced Transformer methods for time-series forecasting, including LogTrans, Reformer, and Informer.
https://doi.org/10.3837/tiis.2024.01.001
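
The cost that the abstract attributes to self-attention can be seen in a plain scaled dot-product attention routine. The sketch below is generic textbook attention in pure Python, not the sparse mechanism or the model from the paper; the function name and the toy inputs are assumptions. Every query is scored against every key, so the work grows with the product of the two sequence lengths, which is quadratic when a sequence attends to itself. Passing queries from one sequence and keys and values from another gives the cross-attention pattern.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One score per (query, key) pair: len(queries) * len(keys) in total
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the value vectors
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Cross-attention style call: one "prediction" query against two "history" keys
result = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]])
```

With identical keys the softmax weights are uniform, so the result is simply the average of the value vectors.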

A Design of Parameterized Viterbi Decoder using Hardware Sharing (하드웨어 공유를 이용한 파라미터화된 비터비 복호기 설계)

Park, Sang-Deok;Jeon, Heung-Woo;Shin, Kyung-Wook
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2008.05a / pp.93-96 / 2008
This paper describes an efficient design of a multi-standard Viterbi decoder that supports multiple constraint lengths and code rates. The Viterbi decoder is parameterized for code rates 1/2 and 1/3 and constraint lengths 7 and 9, and thus has four operating modes. To achieve low hardware complexity and low power, an efficient architecture based on hardware sharing techniques is devised. In addition, optimizing the ACCS (Accumulate-Subtract) circuit for the one-point trace-back algorithm reduces its area by about 35% compared to the fully parallel ACCS circuit. The parameterized Viterbi decoder core has 79,818 gates and 25,600 bits of memory, and the estimated throughput is about 105 Mbps at a 70 MHz clock frequency.
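
The four operating modes follow directly from combining the two code rates with the two constraint lengths. The short sketch below enumerates them and adds the trellis size for each, using the standard relation that a convolutional code with constraint length K has 2^(K-1) states. The names and the dictionary layout are assumptions for illustration and say nothing about how the paper's hardware is organized.

```python
from itertools import product

CODE_RATES = ("1/2", "1/3")   # from the abstract
CONSTRAINT_LENGTHS = (7, 9)   # from the abstract


def operating_modes():
    """List every (rate, constraint length) mode with its trellis size."""
    modes = []
    for rate, k in product(CODE_RATES, CONSTRAINT_LENGTHS):
        modes.append({
            "rate": rate,
            "constraint_length": k,
            "trellis_states": 2 ** (k - 1),
            # A rate 1/n code emits n coded bits per input bit
            "coded_bits_per_input_bit": int(rate.split("/")[1]),
        })
    return modes


modes = operating_modes()
```

The K = 9 modes need 256 states against 64 for K = 7, which suggests why a datapath sized for the larger trellis is a natural candidate for sharing across modes.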

Hardware design of Reed-solomon decoder for DMB mobile terminals (DMB 휴대용 단말기를 위한 Reed-Solomon 복호기의 설계)

Ryu Tae-Gyu;Jeong Yong-Jin
- Journal of the Institute of Electronics Engineers of Korea SD / v.43 no.4 s.346 / pp.38-48 / 2006
In this paper, we develop a hardware architecture for a Reed-Solomon RS(255,239) decoder for DMB mobile terminals. DMB provides multimedia broadcasting service to mobile terminals, so the decoder should be small for low power consumption and have a short decoding delay for real-time processing. We modified the Euclid algorithm to apply it to key equation solving, which is the most complicated part of RS decoding. We also designed a small finite field divider to avoid the use of a large inverse-ROM table; it consumes 17 clocks. After synthesis with Synopsys on the Samsung STD130 $0.18\,\mu\mathrm{m}$ standard cell library, the Euclid block had 30,228 gates and consumed 288 clocks, a 25% reduction in area compared to other existing designs. The size of the entire RS decoder was about 45,000 gates.
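
The RS(255,239) notation already fixes the basic code parameters, which the small sketch below works out. It uses only the standard Reed-Solomon relations (n - k parity symbols, up to half that many correctable symbol errors) and does not model the decoder itself; the function name and the returned keys are assumptions for illustration.

```python
def rs_parameters(n, k):
    """Derive basic parameters of an RS(n, k) code from its block sizes."""
    parity = n - k
    return {
        "parity_symbols": parity,
        # An RS code corrects up to floor((n - k) / 2) symbol errors per block
        "correctable_symbol_errors": parity // 2,
        "code_rate": k / n,
    }


params = rs_parameters(255, 239)
```

For RS(255,239) this gives 16 parity symbols and up to 8 correctable symbol errors per 255-symbol block. The block length 255 = 2^8 - 1 corresponds to 8-bit symbols over GF(2^8).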

Memory Access Reduction Scheme for H.264/AVC Decoder Motion Compensation (H.264/AVC 디코더의 움직임 보상을 위한 메모리 접근 감소 기법)

Park, Kyoung-Oh;Hong, You-Pyo
- The Journal of Korean Institute of Communications and Information Sciences / v.34 no.4C / pp.349-354 / 2009
This paper proposes a new motion compensation scheme that reduces the frequency of external memory access, one of the major bottlenecks in real-time decoding. Most H.264/AVC decoders store reference pictures in external memory because of their large size, and reference blocks are read into the decoder core as needed during decoding. If reference data is accessed separately for each reference block in decoding order, the required memory bandwidth can be unacceptable for real-time decoding. This paper presents a memory access scheme for motion compensation that analyzes the reference data access pattern of each macroblock in order to read as much reference data as possible with fewer memory accesses. Experimental results show that the proposed motion compensation scheme improves the memory bandwidth requirement by approximately 30%.

A Parallelization Technique with Integrated Multi-Threading for Video Decoding on Multi-core Systems

Hong, Jung-Hyun;Kim, Won-Jin;Chung, Ki-Seok
- KSII Transactions on Internet and Information Systems (TIIS) / v.7 no.10 / pp.2479-2496 / 2013
Increasing demand for Full High-Definition (FHD) video and Ultra High-Definition (UHD) video services has led to active research on high-speed video processing. Widespread deployment of multi-core systems has accelerated studies on high-resolution video processing based on parallelization of multimedia software. Although parallelizing a specific decoding step may partially improve decoding performance, such partial parallelization may not yield a sufficient overall improvement. In particular, entropy decoding has often been considered separately from other decoding steps because it cannot be parallelized easily. In this paper, we propose a parallelization technique called Integrated Multi-Threaded Parallelization (IMTP), which considers parallelization of the entropy decoding step together with the other decoding steps in an integrated fashion. We used the Simultaneous Multi-Threading (SMT) technique with appropriate thread scheduling techniques to achieve the best performance for the entire decoding process. For H.264/AVC videos, the proposed IMTP method achieves a speedup of up to 3.35 times in total decoding time over a conventional decoding technique.
https://doi.org/10.3837/tiis.2013.10.009

Hardware Implementation of Transform and Quantization for H.264/JVT (하드웨어 기반의 H.264/JVT 변환 및 양자화 구현)

임영훈;정용진
- Proceedings of the IEEK Conference / 2003.11a / pp.83-86 / 2003
In this paper, we propose a new hardware architecture for the integer transform and quantization operations of the new video coding standard H.264/JVT. We describe the algorithm used to derive the hardware architecture, emphasizing the importance of area for low cost and low power consumption. The proposed architecture has been verified on a PCI-interfaced emulation board using an Altera APEX-II FPGA and also by ASIC synthesis using the Samsung 0.18 $\mu\mathrm{m}$ CMOS cell library. The ASIC synthesis result shows that the proposed hardware can operate at 100 MHz, processing more than 1,300 QCIF video frames per second. The hardware will be used as a core module in implementing a complete H.264 video encoder/decoder ASIC for real-time multimedia applications.

Implementation of DSP Embedded ASIC for Multimedia Communication (멀티미디어 통신용 Vocoder 개발용 DSP Embedded ASIC 개발)

성유나
- Proceedings of the Acoustical Society of Korea Conference / 1998.08a / pp.165-168 / 1998
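The proposed CSD17C00 chip, developed by C&S Technology, is a general-purpose implementation for speech signal processing. It contains a 16-bit, 40 MIPS DSP Group OAK DSP core, together with five peripherals (miscellaneous logic, serial port, host interface, timer, and compander) and general-purpose I/O ports. As a first step, the performance of the CSD17C00 chip was evaluated. The application program runs at 28 MIPS, has a program ROM size of 8.85 Kwords, and requires 10 Kwords of data ROM and 4 Kwords of data RAM. The CSD17C00 chip is general-purpose enough for developing vocoders for multimedia communication, and its vocoder software development environment and hardware structure are designed to offer greater convenience and rationality than other general-purpose DSPs. It is therefore expected to be applicable to a variety of products, such as vocoders for multimedia communication, Internet phone co-processors, digital recorders, and MPEG audio encoders and decoders.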
