# A Study on the Design of DCT Module using Distributed Arithmetic Method

Dong Hyun Yang, Dae Sung Ku, Phil Jung Kim, Jung Hyun Yon, Sang Duk Kim, Jung Yeun Hwang, Rae Sung Jeong, and Jong Bin Kim

\* Dept. of Electronic Engineering, Chosun University, 375 Susuck-dong, Dong-gu, Kwangju, 501-759. Tel: +82-62-230-7861, Fax: +82-62-232-3369, E-mail: hanaro2000y@hotmail.com

\*\* Dept. of Internet Software Engineering, Sunghwa College, 224 Wolpyong-lee, Sungjeonmyn, Kangjin-gun, Chunnam, 527-812, Korea. Tel: +82-61-430-5239, Fax: +82-62-232-3369, E-mail: philjung@daum.net

\*\*\* Dept. of Electronic Engineering, Donga College, 224 Dokchon-lee, Haksanmyn, Youngam-gun, Chunnam, 526-872, Korea. Tel: +82-61-470-1776, Fax: +82-62-232-3369, E-mail: ksd4109@naver.com

Abstract: Many techniques are used in the image compression field today, such as the DCT, the wavelet transform, and quantization, but the basic image compression methods are based on the DCT. The DCT is the representative efficient technique for information compression and is regarded as superior to other transform methods. It is widely applied in digital signal processing and is the basic algorithm of MPEG and JPEG, the image compression standards adopted by the international standardization groups. A DCT is generally built around multipliers, which are the main arithmetic blocks and carry most of the computational load. Multipliers, however, occupy a large area when implemented in hardware and limit the processing speed. In this paper, we design a high-speed DCT hardware module that combines the row-column separation method and the Chen algorithm with distributed arithmetic, using ROM tables instead of multipliers.

# 1. INTRODUCTION

The development of information and communication technology in modern society has created a need for digital video applications [1].
Image and moving-picture compression technology therefore plays a very important role. In particular, interest is growing in moving-picture information that appeals to human sight and hearing. Its use is expanding to a range of application fields such as digital storage media, HDTV (High Definition Television), distribution of image databases, remote sensing, and unmanned reconnaissance. Image compression technology has advanced together with the standardization efforts of ITU-T and ISO, which produced JPEG for still images, MPEG for moving pictures, H.261 for video conferencing, and JPEG-2000 and H.264 for digital broadcasting.

A DCT is generally built around multipliers, which are the main arithmetic blocks and carry most of the computational load. Multipliers, however, occupy a large area when implemented in hardware and slow down the processing speed [2].

Distributed Arithmetic (DA) is an efficient method for computing inner products when one of the input vectors is fixed [3][4]. It uses look-up tables and accumulators instead of multipliers for computing inner products and has been widely used in many DSP applications such as DFT, DCT, convolution, and digital filters. In particular, there has been great interest in implementing the DCT with distributed arithmetic and in reducing the ROM size required in such implementations, since DA-based DCT architectures are known to have very regular structures suitable for VLSI implementation [5][6][7]. Most DA-based DCT implementations use the original DCT algorithm [5] or the even-odd frequency decomposition of the DCT algorithm [6][7].

In this paper, we design a high-speed DCT hardware module that combines the row-column separation method and the Chen algorithm with distributed arithmetic, using ROM tables instead of multipliers. Row-column separation performs the DCT/IDCT on each row and then applies it to the columns of the result.
This reduces the number of ROM tables but requires additional adders and subtracters. Because the speed of the adder/subtracter affects DCT/IDCT operation, we designed a new adder that combines the area efficiency of the ripple-carry adder with the speed of the carry-select adder to improve the arithmetic speed. In addition, row-start and column-start blocks were designed with pipelining for the processing of image data, so that the DCT operating speed is improved.

# 2. THEORETICAL BACKGROUND

The DCT algorithm transforms blocks of image data into the transform domain. It is regarded as the most effective coding technique for image compression. Methods for implementing the 2D DCT fall into two broad classes. The first is the RCA (Row Column Algorithm) method, which uses two 1D DCTs and a row-column transposition. Its advantage is a simple structure, because the 2D DCT/IDCT is computed with the 1D DCT/IDCT structure, and high speed is possible when it is combined with distributed arithmetic. The second is the NRCA (No Row Column Algorithm) method, which derives a fast algorithm directly from the 2D DCT equation, as in [8].

Within the RCA method, the DCT can be implemented either with a fast algorithm or with distributed arithmetic. Representative fast algorithms include the Lee and Chen algorithms. Fast algorithms need fewer multiplications than distributed arithmetic, but this is mainly an advantage in software. In hardware, the use of fixed-length registers causes carry errors, the butterfly operations make the structure complex, and the area is large. Distributed arithmetic, by contrast, is a very efficient way to compute the inner products between the input data and the fixed coefficients of the DCT.
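The idea of replacing multipliers with a look-up table can be illustrated with a minimal Python sketch. It is not the authors' VHDL design: the function names `build_rom` and `da_inner_product` are hypothetical, the inputs are treated as plain integers processed LSB first (rather than the fractional format used later in the paper), and the sample coefficients are simply the single-bit entries of the "0 even" column of Table 1.

```python
def build_rom(coeffs):
    """Precompute the 2**K table: the word at address a is the sum of
    coeffs[k] for every bit k that is set in a."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << K)]


def da_inner_product(coeffs, xs, bits=16):
    """Return sum(c * x) using only table look-ups, shifts, and add/subtract.

    Assumption: every x is an integer that fits in `bits`-bit two's complement.
    """
    rom = build_rom(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]          # two's-complement bit patterns
    acc = 0
    for n in range(bits):                   # bit position, LSB first
        addr = 0
        for k, w in enumerate(words):       # take one bit from each input word
            addr |= ((w >> n) & 1) << k
        term = rom[addr] << n               # weight the ROM word by 2**n
        if n == bits - 1:
            acc -= term                     # sign bits: subtract instead of add
        else:
            acc += term
    return acc


coeffs = [783, 1448, 1892, 1448]            # illustrative fixed coefficients
xs = [3, -5, 7, -128]                       # illustrative input samples
print(da_inner_product(coeffs, xs))         # look-up-table result
print(sum(c * x for c, x in zip(coeffs, xs)))  # direct multiply-accumulate
```

The two printed values should agree. The sketch needs one table of 2^K words and one add or subtract per bit position, which is the property that makes the method attractive in hardware.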
Distributed arithmetic needs no multiplier, because multiplication is carried out with registers and an accumulator; the structure is therefore simple and regular, and the hardware is smaller. Distributed arithmetic is thus the more efficient approach when the DCT is implemented in hardware.

The N-point DCT $z_k$ of a real data sequence $x_n$ ($n = 0, 1, ..., N-1$) is defined by

$$z_{k} = \frac{2}{N}c(k)\sum_{n=0}^{N-1}x_{n}\cos\frac{\pi(2n+1)k}{2N}, \quad 0 \le k \le N-1 \tag{1}$$

where $c(0) = 1/\sqrt{2}$ and $c(k) = 1$ for $k = 1, 2, ..., N-1$. The constant scale factor $2/N$ is neglected below without loss of generality. The DCT given in (1) can then be written as

$$z = T(N)x \tag{2}$$

where $z = [z_0, z_1, ..., z_{N-1}]^T$, $x = [x_0, x_1, ..., x_{N-1}]^T$, and $T(N)$ is an $N \times N$ matrix whose $(k, n)$th component is $T(N)_{(k,n)} = c(k) \cos \frac{\pi (2n+1)k}{2N}$.

It has been shown that the DCT in (2) can be computed in a recursive fashion with proper permutations of the input and output sequences. Then (2) can be decomposed as

$$\begin{bmatrix} z_{e} \\ z_{o} \end{bmatrix} = \begin{bmatrix} T(\frac{N}{2}) & T(\frac{N}{2}) \\ D(\frac{N}{2}) & -D(\frac{N}{2}) \end{bmatrix} \begin{bmatrix} x_{p} \\ \tilde{I} x_{r} \end{bmatrix} = \begin{bmatrix} T(\frac{N}{2})(x_{p} + \tilde{I} x_{r}) \\ D(\frac{N}{2})(x_{p} - \tilde{I} x_{r}) \end{bmatrix} \tag{3}$$

where $z_e = [z_0, z_2, ..., z_{N-2}]^T$ and $z_o = [z_1, z_3, ..., z_{N-1}]^T$ are the even- and odd-indexed outputs. For (3) to agree with (2), $x_p$ and $x_r$ must be the first and second halves of $x$, $\tilde{I}$ the matrix that reverses the order of a vector, and $D(\frac{N}{2})$ the $\frac{N}{2} \times \frac{N}{2}$ matrix with components $\cos\frac{\pi(2n+1)(2k+1)}{2N}$.

The 2D DCT/IDCT performs a 1D DCT/IDCT along the row (or column) direction, transposes the resulting matrix, and then performs a 1D DCT/IDCT again along the other direction, as can be seen in Fig. 1.

Fig. 1. 2D DCT/IDCT block diagram

This requires two 1D DCT/IDCT modules and a transposition memory in place of a matrix transposition. While one 1D DCT/IDCT module transforms the columns (or rows) of the input matrix and stores the result in the transposition memory, the other module transforms the rows (or columns) of the matrix held in the transposition memory and outputs the result, as can be seen in Fig. 1.
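The even-odd decomposition in (3) and the row-column scheme of Fig. 1 can be checked numerically with a short Python sketch. This is a floating-point illustration under stated assumptions, not the fixed-point hardware described in this paper: the function names are hypothetical, the scale factor 2/N is omitted as in the text, and the reversed second half of the input is folded into the butterfly sums and differences.

```python
import math


def dct_matrix(N):
    """T(N)[k][n] = c(k) * cos(pi * (2n + 1) * k / (2N)), without the 2/N factor."""
    return [[(1 / math.sqrt(2) if k == 0 else 1.0)
             * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
             for n in range(N)]
            for k in range(N)]


def dct_1d(x):
    """Direct matrix-vector product z = T(N) x, as in (2)."""
    return [sum(t * v for t, v in zip(row, x)) for row in dct_matrix(len(x))]


def dct_1d_even_odd(x):
    """Same transform via (3): butterfly first, then two half-size products."""
    N = len(x)
    h = N // 2
    T = dct_matrix(N)
    s = [x[n] + x[N - 1 - n] for n in range(h)]   # x_p + reversed x_r
    d = [x[n] - x[N - 1 - n] for n in range(h)]   # x_p - reversed x_r
    z = []
    for k in range(N):
        half = s if k % 2 == 0 else d             # even rows use sums, odd rows differences
        z.append(sum(T[k][n] * half[n] for n in range(h)))
    return z


def transpose(m):
    return [list(r) for r in zip(*m)]


def dct_2d(block):
    """Row-column method: 1D DCT on rows, transpose, 1D DCT again, transpose back."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in transpose(rows)]
    return transpose(cols)


x = [1, 2, 3, 4, 5, 6, 7, 8]
diff = max(abs(a - b) for a, b in zip(dct_1d(x), dct_1d_even_odd(x)))
print("max difference between direct and even-odd forms:", diff)
```

The even-odd form halves the length of every inner product, which is why each ROM table in a distributed-arithmetic design needs only N/2 address bits at the cost of the extra adders and subtracters in the butterfly stage.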
In the row-column method, the performance and hardware size of the 2D DCT/IDCT processor are determined by the 1D DCT/IDCT.

### 3. DCT/IDCT MODULE DESIGN

The 2D DCT/IDCT performs a 1D DCT/IDCT and stores the result in the transpose buffer. Reading the buffer with rows and columns exchanged and running the 1D DCT/IDCT again yields the 2D DCT/IDCT result. The 1D DCT/IDCT is implemented with distributed arithmetic, using data stored in ROM tables instead of multipliers. Each row of the transform can be computed with distributed arithmetic because it is an inner product of the form

$$y = c^T x = \sum_{k=1}^K c_k x_k \tag{4}$$

where $c = [c_1, c_2, ..., c_K]$ is a fixed coefficient vector and $x = [x_1, x_2, ..., x_K]$ is an input vector [3][4]. Suppose that each $x_k$ is represented in N-bit 2's complement form (N here denotes the word length) as follows:

$$x_{k} = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

$$y = \sum_{k=1}^{K} c_{k} \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \tag{5}$$

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} c_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} c_k (-b_{k0}) \tag{6}$$

In (6), each $b_{kn}$ is 0 or 1, so the bracketed sum in the first term on the right-hand side can take only $2^K$ possible values. The $b_{k0}$ in the second term are the sign bits (MSBs), so y can be computed if all $2\times 2^K$ values are stored. The K bits form the address of a ROM table that holds these $2\times 2^K$ values, and the calculation is completed by N-1 shifts and successive additions. The sign bits double the size of the ROM table in this scheme; the ROM size is reduced to $2^K$ if the accumulator subtracts instead of adds when the MSBs form the address. Figure 2 shows the inner product computed with distributed arithmetic for the case K = 4.

Fig. 2. Inner product using the distributed arithmetic method

The K bits form the address of the ROM table; the ROM word is output to the adder and added to the previously accumulated value. For the next bit position, the ROM output is shifted by one bit and added. This process repeats, and when the MSBs form the ROM address, the adder performs a subtraction and produces the final result.

#### 3.1. Butterfly design

The butterfly consists of six MUXes and one adder/subtractor. Input data are applied in parallel through the registers of the pre-processor. During DCT operation the input array goes through the butterfly stage, whereas during IDCT operation the applied data are passed on directly. After the data have passed through the ROM tables and the butterfly stage, the result goes to the post-processor. Fig. 3 shows the block diagram, in which the operation performed is chosen by the MUX selection signals.

Fig. 3. Butterfly block diagram.

#### 3.2. ROM tables and adder design

The 16-bit data entering the input buffer are separated into even and odd parts. The ROM tables are addressed with one bit from each of four data words at a time. Table 1 lists the ROM table coefficients for the even and odd parts.

Table 1. DCT ROM table coefficients

| ROM address (binary) | 0 even | 1 even | 2 even | 3 even | 0 odd | 1 odd | 2 odd | 3 odd |
|------|--------|--------|--------|--------|-------|-------|-------|-------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 783 | -1892 | 1892 | -783 | 399 | -1137 | 1702 | -2008 |
| 10 | 1448 | -1448 | -1448 | 1448 | 1137 | -2008 | 399 | 1702 |
| 11 | 2231 | -3340 | 444 | 665 | 1536 | -3145 | 2101 | -306 |
| 100 | 1892 | 783 | -783 | -1892 | 1702 | -399 | -2008 | -1137 |
| 101 | 2675 | -1109 | 1109 | -2675 | 2101 | -1536 | -306 | -3145 |
| 110 | 3340 | -665 | -2231 | -444 | 2839 | -2407 | -1609 | 565 |
| 111 | 4123 | -2557 | -339 | -1227 | 3238 | -3544 | 93 | -1443 |
| 1000 | 1448 | 1448 | 1448 | 1448 | 2008 | 1702 | 1137 | 399 |
| 1001 | 2231 | -444 | 3340 | 665 | 2407 | 565 | 2839 | -1609 |
| 1010 | 2896 | 0 | 0 | 2896 | 3145 | -306 | 1536 | 2101 |
| 1011 | 3679 | -1892 | 1892 | 2113 | 3544 | -1443 | 3238 | 93 |
| 1100 | 3340 | 2231 | 665 | -444 | 3710 | 1303 | -871 | -738 |
| 1101 | 4123 | 339 | 2557 | -1227 | 4109 | 166 | 831 | -2746 |
| 1110 | 4788 | 783 | -783 | 1004 | 4847 | -705 | -472 | 964 |
| 1111 | 5571 | -1109 | 1109 | 221 | 5246 | -1842 | 1230 | -1044 |

The ROM tables are built from these data, and their outputs are combined in the distributed-arithmetic accumulator to produce the 1D DCT/IDCT. To raise the processing speed, the two 1D stages and the image data passing between them are pipelined, and a row-start block and a column-start block are added. When the row state of a 1D stage is '1', the row-start signal is issued; when it is '0', the column-start signal is issued. When the column state of one 1D stage is "001000", the column start of the other 1D stage is triggered, and the Go signal, when it is '1' and in the event state, starts the row of the other 1D stage.

The proposed adder uses eight 4-bit ripple-carry adders in its first stage, combining the area advantage of the ripple-carry adder with the speed advantage of the carry-select adder; the following stage connects seven half adders that increment the result when the carry from the first stage is '1'. The row-start and column-start blocks are likewise pipelined to raise the processing speed.

## 4. EXPERIMENT AND CONSIDERATION

In this paper, we designed a high-speed adder and a pipelined controller for the distributed arithmetic method in order to speed up image data processing. ROM tables were used instead of multipliers to obtain a high-speed DCT/IDCT module. Because the speed of the adder/subtracter affects DCT/IDCT operation, we designed a new adder that combines the area efficiency of the ripple-carry adder with the speed of the carry-select adder to improve the arithmetic speed. VHDL coding, simulation, and synthesis of the whole circuit were carried out with Aldec-6.1 and the Synopsys Design Analyzer tool. The target library was Altera FLEX-10KE, so that the design can be taken to the FPGA stage. Figs. 4 and 5 show the ROM table synthesis block diagram and the DCT/IDCT controller block diagram, and Fig. 6 shows the synthesis block diagram of the DCT/IDCT. The timing of the module is shown in Fig. 7.

Fig. 4. ROM table synthesis block diagram

Fig. 5. DCT/IDCT controller block diagram

Fig. 6.
Synthesis block diagram of DCT/IDCT

Fig. 7. Whole timing of DCT/IDCT module

## 5. CONCLUSION

In this paper, we designed a DCT/IDCT module. We proposed a distributed arithmetic technique and a pipelining method that perform high-speed addition in the DCT/IDCT module in order to speed up image data processing. The module can be applied to the imaging field, to H.264, and to video equipment. Distributed arithmetic requires an accumulator, and the arithmetic speed of the accumulator limits the performance of the whole system. As a solution, we designed a new adder that blends the ripple-carry adder, chosen for its area efficiency, with the carry-select adder, chosen for its fast addition. The area cost increased about 1.5 times, but the arithmetic speed improved 3 to 6 times compared with the existing adder. Pipelining improved the speed by about 50%. Because the proposed high-speed DCT/IDCT module is described in VHDL, design time can be shortened, and the module can be provided in IP form for digital broadcasting and other applications in the image compression field.

## References

- [1] N.I. Cho and S.U. Lee, "DCT Algorithms for VLSI Parallel Implementation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-38, no. 1, pp. 121-127, 1990.
- [2] M. Sun, T. Chen, and A. M. Gottlieb, "VLSI implementation of a 16x16 discrete cosine transform," IEEE Trans. Circuits and Systems, vol. 36, no. 4, pp. 610-617, 1989.
- [3] S.A. White, "Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 6, pp. 4-19, July 1989.
- [4] W.P. Burleson and L.L. Scharf, "A VLSI Design Methodology for Distributed Arithmetic," J. VLSI Signal Processing, vol. 2, pp. 235-252, 1991.
- [5] M. Sheu, J. Lee, J. Wang, A. Suen, and L. Liu, "A High Throughput-Rate Architecture for 8 × 8 2-D DCT," Proc. IEEE Int'l Symp. Circuits and Systems, vol. 3, pp. 1587-1590, 1993.
- [6] K. Kim, S. Jang, S. Kwon, and K. Son, "An Improvement of VLSI Architecture for 2-Dimensional Discrete Cosine Transform and Its Inverse," Proc. SPIE, vol. 2727, pp. 1017-1026, 1996.
- [7] S.I. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, "A 100-MHz 2-D Discrete Cosine Transform Core Processor," IEEE J. Solid-State Circuits, vol. 27, pp. 492-499, 1992.
- [8] N. Ahmed, T. Natarajan, and K.R. Rao, "Discrete Cosine Transform," IEEE Trans. Comput., vol. C-23, pp. 90-93, Jan. 1974.