# Systolic Array Implementation for 2-D IIR Digital Filter and Design of PE Cell

Nho Kyung Park\*, Dai Tchul Moon\*, Kyun Hyon Tchah\*\*

박 노 경\*, 문 대 철\*, 차 균 현\*\*

<sup>\*</sup>Dept. of Information & Telecommunication Eng., Hoseo Univ.
<sup>\*\*</sup>Dept. of Electronic Eng., Korea Univ.

## ABSTRACT

This paper presents a realization method for the 2-D IIR digital filter, derived by applying a systolization procedure to the SFG (Signal Flow Graph). Partial systolic arrays are first realized in 1-D form, and the complete systolic array is then implemented by cascading them. Cascading the partial systolic arrays reduces the number of storage elements used to delay the input signal. The 1-D systolic array is derived by designing the DG (Dependence Graph) through a local communication approach and then mapping it to an SFG. The derived structure is very simple and has high throughput, because a new output is obtained every sampling period as each new input sample is supplied, and broadcasting of the input signal is eliminated. Because the systolic array is regular, modular, locally interconnected and highly synchronized in its multiprocessing, it is well suited to VLSI implementation. The PE design is based on the modified Booth's algorithm and Ling's algorithm. A synthesis method for designing highly parallel algorithms in VLSI is also presented.

## SUMMARY

A method for realizing a 2-D IIR digital filter as a systolic array structure is presented. The systolic array is implemented by partially realizing it as 1-D IIR digital filters and then cascading them. Cascading the partially realized systolic arrays reduces the number of elements used to delay the input signal. The 1-D systolic array is derived by designing the DG through the local communication approach and then mapping it to an SFG. The derived structure is very simple and has a very high data throughput, producing a new output every sampling period as input samples are supplied. Realized as a systolic array, the 2-D IIR digital filter is regular and has the properties of modularity, local interconnection and highly synchronized multiprocessing, so it is very suitable for VLSI realization. In addition, the multiplier of the PE cell is designed on the basis of the modified Booth's algorithm and Ling's algorithm so that it can perform highly parallel processing.

## I. INTRODUCTION

Two-dimensional digital filters are widely used in digital image processing, for example in noise filtering, feature enhancement, video communication, biomedical picture processing and air reconnaissance. This paper is concerned with the development of algorithms for the systolic realization of 2-D IIR digital filters. Systolic array architectures have been proposed by Kung<sup>(2)</sup>. In the systolic concept, VLSI devices consist of arrays of interconnected processing cells with a high degree of modularity, regular structure and highly synchronized multiprocessing, which makes them amenable to VLSI design.<sup>(2)(5)</sup> Other forms of realization for 2-D digital filters have been presented in references (1)(3)(5)(6). In the following sections, an algorithm is presented for the systolic realization of a 2-D digital filter directly from the SFG of a locally recursive algorithm. The approach yields a structure with the maximum possible data rate, i.e., a new input sample is supplied and a new output sample is obtained every sampling period. The derived structure is simple, modular, regular and expandable.

In fact, a systolic array can be considered as an SFG array in combination with pipelining and retiming. The design of systolic arrays can be divided into the following three stages:

1. Derive a localized DG from the algorithm.
2. Map the DG to an SFG array.
3. Transform the SFG to a systolic array (i.e., systolization).

## II. 2-D IIR DIGITAL FILTER

A 2-D IIR digital filter is given by the transfer function

$$H(z_1, z_2) = \frac{\sum_{i=0}^{N} \sum_{j=0}^{N} a_{ij} z_1^{-i} z_2^{-j}}{1 + \sum_{i=0}^{N} \sum_{\substack{j=0\\ i+j\neq 0}}^{N} b_{ij} z_1^{-i} z_2^{-j}} \tag{1}$$

where $N$ is the order of the filter and $a_{ij}$, $b_{ij}$ are the filter coefficients.
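In the spatial domain, the transfer function of Eq. (1) corresponds to a recursive difference equation in which each output sample depends on input samples and on previously computed output samples. The Python sketch below evaluates that recursion directly. It is only an illustration of the filter's behaviour, not the systolic structure developed in this paper; the function name `iir2d_direct`, the zero boundary conditions and the integer test coefficients are assumptions made for the example.

```python
def iir2d_direct(x, a, b):
    """Directly evaluate the 2-D recursion implied by Eq. (1).

    x    : input image as a list of rows
    a, b : (N+1) x (N+1) coefficient tables; b[0][0] is never used
    Samples outside the image are taken as zero (an assumption).
    """
    rows, cols = len(x), len(x[0])
    N = len(a) - 1
    y = [[0] * cols for _ in range(rows)]
    for n in range(rows):
        for m in range(cols):
            acc = 0
            for i in range(N + 1):
                for j in range(N + 1):
                    if n - i < 0 or m - j < 0:
                        continue  # zero boundary condition
                    acc += a[i][j] * x[n - i][m - j]
                    if i + j != 0:  # the (0, 0) feedback term is excluded
                        acc -= b[i][j] * y[n - i][m - j]
            y[n][m] = acc
    return y


# Hypothetical first-order (N = 1) example with integer coefficients.
a = [[1, 2], [3, 4]]
b = [[0, 1], [1, -1]]
x = [[1, 2], [0, 1]]
y = iir2d_direct(x, a, b)
```

For this small input the result is `[[1, 3], [2, 7]]`. Every output sample needs all $(N+1)^2$ coefficient pairs, which is the workload the systolic array distributes over its PEs.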
To demonstrate the design procedure, the input-output relation is written as the recursive equation

$$y(n,m) = \sum_{i=0}^{N} \sum_{j=0}^{N} a_{ij}\, x(n-i, m-j) - \sum_{i=0}^{N} \sum_{\substack{j=0\\ i+j\neq 0}}^{N} b_{ij}\, y(n-i, m-j) \tag{2}$$

which can be written in the form

$$\begin{aligned}
y(n,m) = {} & \sum_{j=0}^{N} a_{0j}\, x(n, m-j) - \sum_{j=0}^{N} b_{0j}\, y(n, m-j) \\
& + \sum_{j=0}^{N} a_{1j}\, x(n-1, m-j) - \sum_{j=0}^{N} b_{1j}\, y(n-1, m-j) \\
& + \cdots \\
& + \sum_{j=0}^{N} a_{Nj}\, x(n-N, m-j) - \sum_{j=0}^{N} b_{Nj}\, y(n-N, m-j) \\
= {} & \sum_{i=0}^{N} y_i(n, m)
\end{aligned} \tag{3}$$

In Eq. (3) we assume $b_{00} = 0$. Writing the feedback terms with a positive sign, i.e., absorbing the minus sign of Eq. (3) into the coefficients $b_{ij}$, each partial sum $y_i(n, m)$ becomes

$$y_i(n, m) = \sum_{j=0}^{N} a_{ij}\, x(n-i, m-j) + \sum_{j=0}^{N} b_{ij}\, y(n-i, m-j) \tag{4}$$

Eq. (4) has the same form for every value of the index $i$. Each partial sum can therefore be treated as a 1-D filter and realized as a partial systolic array, so Eq. (4) is rewritten as the 1-D IIR filter

$$y(n) = \sum_{i=0}^{N} a_{i}\, x(n-i) + \sum_{i=0}^{N} b_{i}\, y(n-i) \tag{5}$$

where $y(n) = y_i(n,m)$, $a_i = a_{ij}$, $b_i = b_{ij}$, $x(n-i) = x(n-i, m-j)$ and $y(n-i) = y(n-i, m-j)$.

## III. DESIGN OF SYSTOLIC ARRAY

Eq. (5) derived above is mapped onto a systolic architecture for a 1-D IIR digital filter. The 2-D digital filter can then be designed readily by connecting the mapped 1-D IIR digital filters in parallel, as depicted in Fig. 5.

### 3.1 Implementation of 1-D IIR digital filter

To design the DG (Dependence Graph) from the recursive algorithm, Eq. (5) is first converted to a single assignment code such as Eq. (6). The DG of Eq. (5) is shown in Fig. 1, where both the inputs and the coefficients of the filter are broadcast throughout the index space. Signal broadcast can be eliminated by using a signal flow graph approach; the SFG can be derived from the designed DG.<sup>(6)(9)</sup> Eq. (6) is an expression with global data dependences and is not a locally recursive algorithm.
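The relation between the direct recursion of Eq. (5) and its single assignment form, Eq. (6), can be illustrated in a few lines of Python. The sketch below rests on stated assumptions: the function names, the zero initial conditions and the integer coefficients are hypothetical, the feedback terms are added with a positive sign as in Eq. (5), and $b_0$ is taken as 0. In the second function every partial result $y_n^k$ is written exactly once, which is the property a dependence graph relies on.

```python
def iir1d_direct(x, a, b):
    """Eq. (5): y(n) = sum_k a[k] x(n-k) + sum_k b[k] y(n-k), with b[0] = 0."""
    N = len(a) - 1
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(N + 1):
            if n - k >= 0:
                acc += a[k] * x[n - k]
                if k > 0:  # y(n) itself is not fed back
                    acc += b[k] * y[n - k]
        y.append(acc)
    return y


def iir1d_single_assignment(x, a, b):
    """Eq. (6): y_n^k = y_n^(k+1) + a_k x_(n-k) + b_k y_(n-k), y_n^(N+1) = 0."""
    N = len(a) - 1
    y = []
    for n in range(len(x)):
        partial = {N + 1: 0}            # boundary value y_n^(N+1) = 0
        for k in range(N, -1, -1):      # each y_n^k is assigned exactly once
            xk = x[n - k] if n - k >= 0 else 0
            yk = y[n - k] if k > 0 and n - k >= 0 else 0
            partial[k] = partial[k + 1] + a[k] * xk + b[k] * yk
        y.append(partial[0])            # y(n) = y_n^0
    return y


# Hypothetical second-order example with integer coefficients.
a = [1, 2, 1]
b = [0, 1, -1]
x = [1, 0, 0, 2, 1]
y_direct = iir1d_direct(x, a, b)
y_chain = iir1d_single_assignment(x, a, b)
```

Both functions return `[1, 3, 3, 2, 4]` for this input. The chain of partial sums in the second function still reads `x[n - k]` and `y[n - k]` from arbitrary distances, which is the global dependence that localization removes.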
The preliminary DG can be readily sketched as depicted in Fig. 1(a). The node of Fig. 1(b) is a detailed node with associative summation and broadcast contours. The direction of the summation arcs in Fig. 1(a) is reversible, i.e., the reversed direction is also possible. The first step in DG design is to convert the algorithm to a single assignment code:

$$y_n^{k} = y_n^{k+1} + a_k \cdot x_{n-k} + b_k \cdot y_{n-k} \tag{6}$$

where $y(n) = y_n^0$, $y_n^{N+1} = 0$ and $k = 0, 1, 2, \dots, N$.

Fig. 1 Preliminary DG of Eq. (5)

The DG is shown in Fig. 1, where both the filter coefficients ($a_i$, $b_i$) and the input signal are still broadcast. In order to eliminate the input signal broadcast, we modify Eq. (6) as

$$y_n^k = y_n^{k+1} + a_n^k \cdot x_n^k + b_n^k \cdot B_n^k \tag{7}$$

where $a_n^k = a_{n-1}^k$, $a_{-1}^k = a_k$, $b_n^k = b_{n-1}^k$, $b_{-1}^k = b_k$, $b_n^0 = 0$, $B_n^k = B_{n-1}^{k-1}$, $B_n^0 = y_n^0$ and $y(n) = y_n^0$. The input is propagated in the same way as $B_n^k$, i.e., $x_n^k = x_{n-1}^{k-1}$ with $x_n^0 = x(n)$, so that $x_n^k = x_{n-k}$ and $B_n^k = y_{n-k}$ as required by Eq. (6).

Only local communication is required to send the data from node to node. The corresponding DG of Eq. (7) is shown in Fig. 2 and is completely localized. Obviously, this localized DG is preferable to the spiral version.

To determine a valid array structure for a locally recursive algorithm, one design method is to designate one PE for each node in the DG. However, this leads to very inefficient utilization of the PEs, because each PE is active for only a small fraction of the computation time. To improve PE utilization, it is often desirable to map the nodes of the DG onto a smaller number of PEs. To achieve this, the DG is mapped to an SFG, a form that includes both functional and structural descriptions. The SFG consists of processing nodes, communication edges and delays. It is more specific than the DG, i.e., it is closer to hardware-level design. Therefore, the SFG also dictates the type of systolic array that will be obtained. Two steps are involved in mapping a DG to an SFG array. The first step is the processor assignment.
A projection method may be applied, in which nodes of the DG along a straight line are assigned to a common PE. A linear projection is represented by a projection vector $\vec{d}$. The second step is the scheduling. The projection must be accompanied by a scheduling scheme, which specifies the sequence of the operations in all the PEs. A scheduling function represents a mapping from the N-dimensional index space of the DG onto a 1-D schedule (time) space. The schedule can be represented by a column schedule vector $\vec{s}$, pointing in the normal direction of the equitemporal hyperplanes.

Fig. 2 Completely localized DG

Given the DG of Fig. 2, the projection vector $\vec{d}$ determines the array structure: all nodes lying on a straight line parallel to $\vec{d}$ are assigned to one PE. For the hyperplanes of Fig. 2 to represent a permissible linear schedule, the schedule normal vector $\vec{s}$ has to satisfy the following two conditions:<sup>(6)(9)(13)</sup>

$$\vec{s} \cdot \vec{e} \ge 0 \quad \forall\, \vec{e} \tag{8}$$

$$\vec{s} \cdot \vec{d} > 0 \tag{9}$$

where $\vec{e}$ represents any dependence arc in the DG of the algorithm. The first condition means that causality is enforced in a permissible schedule, i.e., all the dependence arcs flow in the same direction across the hyperplanes. The second condition implies that nodes on an equitemporal hyperplane are not projected to the same PE, i.e., the hyperplanes are not parallel to the projection vector $\vec{d}$. The hyperplanes in Fig. 2 represent different time instants. The signal flow graph based on the DG of Fig. 2 is shown in Fig. 3 for the 1-D IIR digital filter.

Fig. 3 Block diagram of the SFG

When the inequality in Eq. (8) is strict for every arc, every edge of the resulting SFG has one or more delay elements, which satisfies the temporal locality condition. The SFG array in Fig. 3 is spatially localized but not temporally localized.
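The two schedule conditions can be checked mechanically. The Python sketch below treats a schedule as permissible when no dependence arc runs backwards in time ($\vec{s} \cdot \vec{e} \ge 0$) and the projection direction does not lie inside an equitemporal hyperplane ($\vec{s} \cdot \vec{d} > 0$), and it reports the delay $\vec{s} \cdot \vec{e}$ that each arc receives. The arc set, the projection vector and the candidate schedules are assumptions made for illustration: the arcs are read off a localized recursion of the form of Eq. (7) in the index space $(n, k)$ and are not taken from Fig. 2.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def is_permissible(s, d, arcs):
    """Causality (s.e >= 0 on every arc) and a projection direction
    that is not contained in an equitemporal hyperplane (s.d > 0)."""
    return all(dot(s, e) >= 0 for e in arcs.values()) and dot(s, d) > 0


def edge_delays(s, arcs):
    """Number of delay elements each arc receives under schedule s."""
    return {name: dot(s, e) for name, e in arcs.items()}


# Assumed dependence arcs (destination index minus source index) in (n, k).
ARCS = {
    "partial sum y": (0, -1),  # y_n^k needs y_n^(k+1)
    "input x": (1, 1),         # x_n^k needs x_(n-1)^(k-1)
    "feedback B": (1, 1),      # B_n^k needs B_(n-1)^(k-1)
    "coefficients": (1, 0),    # a_n^k, b_n^k need a_(n-1)^k, b_(n-1)^k
}
D = (1, 0)  # assumed projection along n: one PE per tap index k

for s in [(1, 0), (2, -1), (0, 1)]:
    print(s, is_permissible(s, D, ARCS), edge_delays(s, ARCS))
```

Under these assumptions, $\vec{s} = (1, 0)$ is permissible but gives the partial-sum arc zero delay, so the array is spatially but not temporally local. $\vec{s} = (2, -1)$ gives every arc at least one delay, and $\vec{s} = (0, 1)$ violates causality.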
Since the timing given by the schedule vector of the DG differs from the timing of the designed systolic array, retiming is needed to map the SFG onto a systolic array.<sup>(3)(4)(5)(6)</sup> An SFG is meaningful only when it is computable, that is, when there exists no zero-delay loop or cycle. Therefore, all inputs can be propagated to the PEs through localized arcs during a sampling period, where the retiming is derived from the node computation procedure of the SFG so as to keep a high throughput rate. Because the filter coefficients are stored in the PEs, no transmission line is needed to supply them for the computation, and only one storage element is needed for the filter coefficients. Each input signal needs a delay element (i.e., for the computation of inputs and coefficients). The resulting systolic array for the 1-D IIR digital filter is shown in Fig. 4.

Fig. 4 Systolic array structure of 1-D IIR digital filter

A PE cell consists of a delay element, a multiplier, an adder and a storage element. In each PE of Fig. 4, the multiplication forms the partial products for a particular output sample, and the addition sums the relevant partial products to produce the output. This structure is designed for a high throughput rate, minimal computation time and a minimal pipelining period.

### 3.2 Systolic Array of 2-D IIR digital filter

In this paper, the systolic realization of the 2-D IIR digital filter is designed by using the realized 1-D IIR digital filters of Fig. 4 in parallel. The structure consists of parallel 1-D IIR digital filters. The final realization of the 2-D IIR digital filter of Eq. (3) is given in Fig. 5. The derived structure is simple, modular and expandable, and it also simplifies the design of the control unit required to run the system.

Fig. 5 Systolic array of 2-D IIR digital filter

## IV. DESIGN OF PE CELL

As shown in Fig. 6, the device consists of a Booth decoder, a switching circuit, an addition block and an accumulator (ACC). Each PE multiplies the 16-bit input sample by the 16-bit coefficient and adds the result of the previous PE computation.

Fig. 6 Block diagram of PE

The partial arithmetic results are registered for pipelining after each PE. Each module of the multiplier was designed with a level of parallelism to reduce the computation time. In the modified Booth's algorithm for multiplication, the multiplier is organized into substrings of three bits each, with adjacent groups sharing a common bit, as shown in Table 1. A multiplier based on the modified Booth's algorithm requires fewer clock cycles (m+n-1+N/2c) and N/2 cells, each of which contains a ripple-carry adder and some gates, where N is restricted to be even. The Booth decoder is shown in Fig. 7. In order to minimize the summation time, the addition of all the partial products is started simultaneously in the adder array.

Fig. 7 Booth decoder (a) Block diagram of logic (b) Result of simulation

Table 1. Five level multiplier recoding

| Original multiplier $Y_{i+1}$ | Original multiplier $Y_i$ | Original multiplier $Y_{i-1}$ | Recoded multiplier $Y_i' = -2Y_{i+1} + Y_i + Y_{i-1}$ | Action |
|---|---|---|---|---|
| 0 | 0 | 0 | +0 | add 0 |
| 0 | 0 | 1 | +1 | add X |
| 0 | 1 | 0 | +1 | add X |
| 0 | 1 | 1 | +2 | add 2X |
| 1 | 0 | 0 | -2 | add -2X |
| 1 | 0 | 1 | -1 | add -X |
| 1 | 1 | 0 | -1 | add -X |
| 1 | 1 | 1 | -0 | add 0 |

X: multiplicand, Y: multiplier

Accordingly, the addition block and the CLA (carry look-ahead adder) are designed on the basis of the CSA (Carry Save Adder) and Ling's approach. Ling's approach is based on the propagation of a composite term in place of the conventional look-ahead carry.
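The two arithmetic ideas used in the PE, the radix-4 recoding of the multiplier as in Table 1 and the propagation of Ling's composite term, can be checked with a small software model. The Python sketch below is only a behavioural illustration and says nothing about the gate-level design; the function names, the bit-serial loop and the default 16-bit width are assumptions. The composite term is taken in its standard form $H_i = g_i + t_{i-1} H_{i-1}$, with $t_i = a_i + b_i$ and $g_i = a_i b_i$.

```python
def booth_digits(y, bits=16):
    """Recode a two's-complement multiplier into digits in {-2, ..., +2}.

    Each digit is -2*Y[i+1] + Y[i] + Y[i-1] for i = 0, 2, 4, ...,
    with Y[-1] = 0; `bits` must be even.
    """
    u = y & ((1 << bits) - 1)                      # two's-complement bit pattern
    bit = lambda k: (u >> k) & 1 if k >= 0 else 0
    return [-2 * bit(i + 1) + bit(i) + bit(i - 1) for i in range(0, bits, 2)]


def booth_multiply(x, y, bits=16):
    """Sum the partial products d * X * 4**k selected by the recoded digits."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_digits(y, bits)))


def ling_add(a, b, bits=16):
    """Add two unsigned words by propagating Ling's composite term H_i."""
    below = 0                                      # t[i-1] & H[i-1]; zero below bit 0
    total = 0
    for i in range(bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        t, g = ai | bi, ai & bi
        h = g | below                              # H_i = g_i + t_(i-1) H_(i-1)
        s = (t ^ h) | (g & below)                  # sum bit S_i
        total |= s << i
        below = t & h                              # equals the ordinary carry out of bit i
    return total, below                            # (sum mod 2**bits, carry out)


# Hypothetical usage: multiply a sample by a coefficient, then accumulate.
product = booth_multiply(-1234, 5678)
acc, carry = ling_add(40000, 30000)
```

A 16-bit multiplier yields only eight recoded digits, so eight partial products are summed instead of sixteen. The loop in `ling_add` evaluates the recurrence one bit at a time, whereas a look-ahead circuit expands it into parallel expressions.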
Ling's approach gives an adder that is faster and less expensive.<sup>(7)</sup> Ling's adder implements

$$\begin{aligned}
t_i &= a_i + b_i, \qquad g_i = a_i b_i \\
H_i &= g_i + t_{i-1} H_{i-1} \\
S_i &= t_i \oplus H_i + g_i\, t_{i-1} H_{i-1}
\end{aligned} \tag{10}$$

The basic circuit of the CLA given by Eq. (10) is shown in Fig. 8.

Fig. 8 Basic circuit of CLA

The comparison of the conventional CLA and Ling's CLA is shown in Table 2.

Table 2. Comparison of conventional CLA and Ling's CLA

| Conventional CLA | Ling's CLA |
|---|---|
| $C_0 = C_0$ | $H_0 = g_0$ |
| $C_1 = g_1 + P_1 C_0$ | $H_1 = g_1 + g_0$ |
| $C_2 = g_2 + P_2 g_1 + P_2 P_1 C_0$ | $H_2 = g_2 + g_1 + t_1 g_0$ |
| $C_3 = g_3 + P_3 g_2 + P_3 P_2 g_1 + P_3 P_2 P_1 C_0$ | $H_3 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$ |
| $C_4 = g_4 + P_4 g_3 + P_4 P_3 g_2 + P_4 P_3 P_2 g_1 + P_4 P_3 P_2 P_1 C_0$ | $H_4 = g_4 + g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0$ |

The layout of the PE cell is shown in Fig. 9. The layout has been made regular, which has helped to reduce the device size and enhance the device speed.

Fig. 9 Layout of PE Cell

## V. CONCLUSION

In this paper, a realization method for the 2-D IIR digital filter has been presented, derived by applying a systolization procedure to the SFG of the 2-D IIR digital filter. The derived structure is very simple, regular and modular, with highly synchronized multiprocessing, and is therefore well suited to VLSI implementation. The PE cell design is based on the modified Booth's algorithm and Ling's algorithm so as to reduce the number of gates, improve speed and achieve high hardware efficiency.

## References

1. M. A. Sid-Ahmed, "A Systolic Realization for 2-D Digital Filters," IEEE Trans. Signal Processing, vol. 37, no. 4, pp. 560-565, Apr. 1989.
2. S. Y. Kung, "On Supercomputing with Systolic/Wavefront Array Processors," Invited paper, Proceedings of the IEEE, vol. 72, no. 7.
3. M. A. Sid-Ahmed, "Serial Architectures for the Implementation of 2-D Digital Filters and for Template Matching in Digital Images," IEEE Trans. Signal Processing, vol. 38, no. 5, pp. 853-857, May 1990.
4. Chun-Hsien Chou, "VLSI Architectures for High-Speed and Flexible Two-Dimensional Digital Filters," IEEE Trans. Signal Processing, vol. 39, no. 11, Nov. 1991.
5. S. Y. Kung and J. N. Hwang, "Systolic Array Designs for Kalman Filtering," IEEE Trans. Signal Processing, vol. 39, no. 1, Jan. 1991.
6. S. Y. Kung, VLSI Array Processors, Prentice-Hall, Inc., 1988.
7. D. T. Moon, "A Study on the IC Implementation of High Speed Multiplier for Real Time Digital Processing," Korean Institute of Communication Sciences, vol. 15, no. 7, pp. 628-637, 1990.
8. D. T. Moon, "A Study on the Design of 2-Dimension Convolution with Systolic Array and One-Chip IC," Korean Institute of Communication Sciences, vol. 15, no. 10, pp. 819-828, 1990.
9. F. El-Guibaly, S. Sunder, and A. Antoniou, "Systolic Implementation of FIR Filters," IEEE, Signal Processing V: Theories and Applications, 1990.
10. I-Chen Wu, "A Fast 1-D Serial-Parallel Systolic Multiplier," IEEE Trans. Computers, vol. C-36, no. 10, pp. 1234-1247, Oct. 1987.
11. R. W. Doran, "Variants of an Improved Carry Look-Ahead Adder," IEEE Trans. Computers, vol. 37, no. 9, pp. 1110-1113, Sep. 1988.
12. Chun-Hsien Chou, "VLSI Architectures for High-Speed and Flexible Two-Dimensional Digital Filter," IEEE Trans. Signal Processing, vol. 39, no. 11, pp. 2515-2523, May 1990.
13. J. V. McCanny, J. G. McWhirter, and S. Y. Kung, "The Use of Data Dependence Graphs in the Design of Bit-Level Systolic Arrays," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 5, pp. 787-793, May 1990.

#### ▲Nho Kyung Park

Birthdate: Jan. 7, 1958

- 1984 B.A.: Korea University, Dept. of Electronics Eng.
- 1986 M.A.: Korea University, Dept. of Electronics, Graduate
- 1990 Ph.D.: Korea University, VLSI/Communication part, Dept. of Electronics, Graduate
- 1988~Present: Assistant Professor, Dept. of Information & Communication Eng., Hoseo University
- Research Interests: CAD and ASIC Design for Communication, Analog/Digital circuit test

#### ▲Dai-Tchul Moon

- 1978. 2: Department of Electronics Engineering, Soongsil University (B.S.)
- 1981. 8: Department of Electronics Engineering, Korea University (M.S.)
- 1988. 2: Department of Electronics Engineering, Korea University (Ph.D.)
- 1984. 3~Present: Associate Professor, Department of Information & Telecommunication Engineering, Hoseo University
- His areas of interest include VLSI design, array processing, VLSI signal processing and CAD.

#### ▲Kyun Hyon Tchah

Birthdate: Mar. 26, 1939

- 1965 B.A.: Seoul University, Dept. of Electrical Eng.
- 1967 M.A.: Illinois University, Dept. of Electronics, Graduate
- 1976 Ph.D.: Seoul University, Dept. of Electronics, Graduate
- 1977~Present: Professor, Dept. of Electronics Eng., Korea University
- Research Interests: CAD and Communication systems