# Efficient Systolic Architectures for Discrete Wavelet Transforms

<sup>1</sup>Potti Chendi Alekya, <sup>2</sup>K. R. Surendra, <sup>3</sup>Dr. N. Pushpalatha, <sup>4</sup>Dr. Irala Suneetha

<sup>1</sup>PG Scholar, Department of ECE, Annamacharya Institute of Technology and Sciences, Tirupati

<sup>2</sup>Assistant Professor, Department of ECE, Annamacharya Institute of Technology and Sciences, Tirupati

<sup>3</sup>Associate Professor, Department of ECE, Annamacharya Institute of Technology and Sciences, Tirupati

<sup>4</sup>Professor & HOD, Department of ECE, Annamacharya Institute of Technology and Sciences, Tirupati

#### ABSTRACT

This work presents a VLSI implementation of the Discrete Wavelet Transform (DWT) using a systolic architecture. The architecture consists of an input delay unit, a filter, a register bank, and a control unit. It computes the high pass and low pass coefficients using a single set of multipliers and adders. The architecture has been simulated and implemented in VLSI. Owing to the FBRA scheme, its hardware utilization efficiency is higher than that of the reference design. The systolic nature of the architecture, which runs at a clock speed of 115.9 MHz, is advantageous in optimizing area, time, and power. The architecture is simple, modular, and cascadable for computation of one- or multi-dimensional DWT.

Keywords: DWT, six-tap FIR filter, systolic array architecture, decomposition, FBRA

# I. INTRODUCTION

In recent years, there has been a growing need to address bandwidth limitations in communication networks. The advent of broadband networks (ISDN, ATM, etc.) and of compression standards such as JPEG and MPEG is an attempt to overcome these limitations. With the growing use of digital still and moving images, a large amount of disk space is required for storage and manipulation. Image compression is therefore very important for reducing storage requirements.

Redundancies in a video sequence can be removed by using the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT). The DCT suffers from blockiness and mosquito noise, which result in poor subjective quality of reconstructed images at high compression. Wavelet techniques can represent real-life non-stationary signals and are a powerful means of achieving compression. Many applications require a dedicated design and implementation of the DWT in order to meet real-time requirements. For this implementation, a systolic array (DWT-SA) architecture is used. The proposed VLSI architecture computes both high pass and low pass frequency coefficients in a single clock cycle and thus has efficient hardware utilization. The user is required to input only the data stream and the high pass and low pass filter coefficients.

The rest of this paper is organized as follows. Section II describes the Discrete Wavelet Transform, Section III discusses the basic principle of systolic arrays, Section IV presents the proposed systolic array architecture, Section V reports the results of the proposed method, and Section VI concludes the paper.

# II. DISCRETE WAVELET TRANSFORM

A wavelet is a small wave whose energy is concentrated in time. The properties of wavelets allow both time and frequency analysis of signals. The Discrete Wavelet Transform (DWT), which is based on sub-band coding, provides a fast computation of the wavelet transform. It is easy to implement and reduces the computation time and resources required.

A schematic of a three-stage DWT decomposition is shown in Fig. 1. In Fig. 1, the signal is denoted by the sequence a[n], where n is an integer. The low pass filter is denoted by L1, while the high pass filter is denoted by H1. At each level, the high pass filter produces the detail information, b[n], while the low pass filter associated with the scaling function produces the coarse approximations, c[n], and so on. The filtering and decimation process is continued until the desired level is reached.



Fig 1. Three stage DWT decomposition using pyramid algorithm.

The maximum number of levels depends on the length of the signal. The DWT of the original signal is then obtained by concatenating all the coefficients, starting from the last level of decomposition.
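The filtering, decimation, and concatenation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `analysis_step` and `pyramid_dwt` are hypothetical, periodic extension is assumed at the signal boundary, and two-tap Haar filters are used purely to keep the example short (the architecture in this paper uses a six-tap filter).

```python
def analysis_step(signal, lowpass, highpass):
    """One decomposition level: filter with both filters, keep every second output."""
    n = len(signal)
    approx, detail = [], []
    for i in range(1, n, 2):              # decimation by two
        lo = hi = 0.0
        for k in range(len(lowpass)):
            x = signal[(i - k) % n]       # periodic extension (an assumption)
            lo += lowpass[k] * x
            hi += highpass[k] * x
        approx.append(lo)                 # coarse approximation
        detail.append(hi)                 # detail information
    return approx, detail


def pyramid_dwt(signal, lowpass, highpass, levels):
    """Concatenate coefficients, starting from the last level of decomposition."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = analysis_step(approx, lowpass, highpass)
        details.append(detail)
    out = list(approx)
    for detail in reversed(details):
        out.extend(detail)
    return out


# Haar filters, chosen only for brevity; not the six-tap filter of the paper.
s = 2 ** -0.5
haar_low = [s, s]
haar_high = [s, -s]
coeffs = pyramid_dwt([1, 2, 3, 4, 5, 6, 7, 8], haar_low, haar_high, levels=3)
```

For the eight-point input, three levels leave one approximation coefficient followed by one, two, and four detail coefficients, so the output has the same length as the input.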

# III. SYSTOLIC ARRAY

A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. Each processor at each step takes in data from one or more neighbors (e.g. North and West), processes it and, in the next step, outputs results in the opposite direction (South and East).

Systolic arrays are a specialized form of parallel computing in which the processors are connected by short wires. An example of a two-dimensional systolic array is given in Fig. 2 below.



Fig 2. Example of a two-dimensional systolic array

The array shown above takes its inputs in parallel, processes them in parallel, and outputs the result. Unlike other forms of parallelism, systolic arrays do not lose speed because of their interconnections. The cells, i.e. the Processing Elements (PEs), compute data and store it independently of each other. Each cell (PE) is an independent processor with some registers and an Arithmetic and Logic Unit (ALU).

The cells (Processing Elements) share information with their neighbors, after performing the needed operations on the data.



Fig 3. Example of Systolic Array Processing

An example of systolic array processing is shown in Fig. 3 above. Each cell takes inputs from the top and the left, multiplies the two numbers, and stores the result in the local register inside the Processing Element. After 9 clock pulses the results are stored in the processing elements. Full-search block matching requires $N^2$ subtractions, $N^2$ magnitude operations, and $N^2$ magnitude accumulations. Systolic arrays are well suited to these operations because of their regularity, modularity, and local communication.
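The cell behaviour described above (take one input from the left and one from the top, multiply, keep the result locally, pass the inputs on) can be simulated in software. The sketch below is an assumption-laden illustration, not the array of Fig. 3: it uses a square array that accumulates a matrix product, feeds the inputs in a skewed order, and uses the hypothetical name `systolic_matmul`. The cycle count of this particular schedule is 3n - 2 and need not match the figure.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A x B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # local register inside each PE
    a_reg = [[0] * n for _ in range(n)]   # value each PE passes to its right neighbour
    b_reg = [[0] * n for _ in range(n)]   # value each PE passes to its lower neighbour
    cycles = 3 * n - 2                    # last PE receives its last operands at step 3n - 3
    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                # left boundary: row i is delayed by i cycles
                    k = t - i
                    a_in = A[i][k] if 0 <= k < n else 0
                else:                     # otherwise read the left neighbour's register
                    a_in = a_reg[i][j - 1]
                if i == 0:                # top boundary: column j is delayed by j cycles
                    k = t - j
                    b_in = B[k][j] if 0 <= k < n else 0
                else:                     # otherwise read the upper neighbour's register
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply and accumulate locally
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b       # all PEs update on the same clock edge
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each PE communicates only with its immediate neighbours, which is the local-communication property mentioned in the text.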

A systolic array (SA) is extremely fast but expensive and difficult to build. It combines processor pipelining and parallelism, which together minimize the computation time.

*Processor Pipelining:* Ideally, at least one new instruction completes every clock cycle.

*Parallelism:* Multiple jobs are performed simultaneously.

A high-performance, special-purpose computer system is needed to meet the demands of a specific application, and the imbalance between I/O and computation is a notable problem in such systems. The systolic architecture concept maps high-level computation onto hardware structures. A systolic system works like an automobile assembly line. Because of its regularity, a systolic system is easy to implement and easy to reconfigure.

The term "systolic" was first used in this context by H.T. Kung, then at CMU; it refers to the "pumping" action of a heart. Systolic architectures can result in cost-effective, high-performance special-purpose systems for a wide range of problems.

# IV. THE PROPOSED SYSTOLIC ARRAY ARCHITECTURE

The proposed systolic array (DWT-SA) architecture is an improved architecture. Here, only one set of multipliers and adders has been employed. The multiplier and adder set performs all necessary computations to generate all high pass and low pass coefficients.

#### A. DWT-SA ARCHITECTURE

The design of the DWT-SA is based on a computation schedule derived from Eq. 3a to 3n, which result from applying the pyramid algorithm for eight data points (N = 8) to the six-tap filter. We note that Eq. 1a and 1b represent the high pass and low pass components of the six-tap FIR filter. The proposed DWT-SA architecture is shown in Fig. 4. It comprises four basic units: Input Delay, Filter, Register Bank, and Control Unit.

#### B. FILTER UNIT (FU)

The Filter Unit (FU) proposed for this architecture is a six-tap non-recursive FIR digital filter whose transfer functions for the high pass and low pass components are shown in Eq. 1. This non-recursive structure makes a systolic implementation of the DWT possible.
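For illustration only, a six-tap non-recursive FIR filter pair can be written in the following general form. The symbols $h_k$ and $g_k$ are placeholder names, assumed here, for the low pass and high pass coefficients; the notation of Eq. 1 may differ.

$$
H(z) = \sum_{k=0}^{5} h_k z^{-k}, \qquad G(z) = \sum_{k=0}^{5} g_k z^{-k}
$$

$$
y_L[n] = \sum_{k=0}^{5} h_k \, x[n-k], \qquad y_H[n] = \sum_{k=0}^{5} g_k \, x[n-k]
$$

Because each output depends only on the current input and the five previous inputs, with no feedback term, the sum maps naturally onto a chain of identical multiply-accumulate cells.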

#### 1. FILTER CELL (FC)

The two signed operands of a multiplication may both be positive, both be negative, or have opposite signs. To handle all of these cases, the proposed filter cell includes invert and XOR operations, as shown in Fig. 6.
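One common way to handle signed operands with invert and XOR operations is to multiply magnitudes and derive the sign of the product from the XOR of the operand sign bits. The sketch below illustrates that general technique only; it is an assumption about how such a cell could work, not a description of the circuit in Fig. 6. The names `to_twos` and `signed_multiply` and the 8-bit word width are hypothetical.

```python
def to_twos(value, width=8):
    """Encode a Python integer as a width-bit two's complement pattern."""
    return value & ((1 << width) - 1)


def signed_multiply(a, b, width=8):
    """Multiply two width-bit two's complement patterns; return a signed integer."""
    mask = (1 << width) - 1
    sign_a = (a >> (width - 1)) & 1
    sign_b = (b >> (width - 1)) & 1
    # Invert all bits and add one to recover the magnitude of a negative operand.
    mag_a = ((a ^ mask) + 1) & mask if sign_a else a
    mag_b = ((b ^ mask) + 1) & mask if sign_b else b
    product = mag_a * mag_b               # unsigned magnitude multiplication
    # The product is negative only when exactly one operand is negative (XOR).
    return -product if sign_a ^ sign_b else product


mixed = signed_multiply(to_twos(-3), to_twos(5))
both_negative = signed_multiply(to_twos(-3), to_twos(-4))
both_positive = signed_multiply(to_twos(7), to_twos(6))
```

With this arrangement the multiplier itself only ever sees non-negative magnitudes, and the sign logic reduces to a single XOR.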



Fig 4. DWT-SA architecture





Fig 5. The systolic architecture of a six-tap filter.

#### C. STORAGE UNIT

Two storage units are used in the proposed architecture: the Input Delay and the Register Bank. The data registers used in these storage units are constructed from standard D latches. The structure of each storage unit is presented below.

#### 1. Input Delay Unit (ID)

As shown in Fig. 4, five delays are connected serially. At every clock cycle each delay passes its contents to its right neighbor, so that only the five most recent past values are retained. The input of the delay is applied to the switch.
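The behaviour of the five serially connected delays can be modelled as a shift register. The sketch below is a software illustration under assumptions: the class name `InputDelay` and the helper `six_tap_output` are hypothetical, and the all-ones tap values are placeholders rather than the filter coefficients used in the design.

```python
from collections import deque


class InputDelay:
    """Five serially connected delays holding the five most recent past samples."""

    def __init__(self, stages=5):
        self.regs = deque([0] * stages, maxlen=stages)

    def clock(self, sample):
        """Advance one clock cycle; return the current sample and the retained past values."""
        window = [sample] + list(self.regs)
        # Each delay passes its contents to its right neighbour; the oldest value is dropped.
        self.regs.appendleft(sample)
        return window


def six_tap_output(window, taps):
    """Weighted sum of the current sample and the five delayed samples."""
    return sum(h * x for h, x in zip(taps, window))


delay = InputDelay()
windows = [delay.clock(x) for x in range(1, 8)]
placeholder_taps = [1, 1, 1, 1, 1, 1]     # illustrative values only
y = six_tap_output(windows[5], placeholder_taps)
```

After the seventh sample has been clocked in, the first sample is no longer available, which reflects the statement that only five past values are retained.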

#### 2. Register Bank Unit (RB)

Several registers are required for storage of the intermediate partial results. The RB is implemented with 26 serially connected data registers.

#### D. CONTROL UNIT (CU)

The proposed DWT-SA architecture computes N coefficients in N clock cycles and achieves real time operation by executing computations of higher octave coefficients in between the first octave coefficient computations. The first octave computations are scheduled every N/4 clock cycles, while the second and third octaves are scheduled every N/2 and every N clock cycles, respectively.
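One interleaving that is consistent with these periods for N = 8 can be generated from the position of the lowest set bit of the cycle index. The sketch below is an assumed schedule for illustration, not necessarily the exact schedule used by the Control Unit; the function name `octave_schedule` is hypothetical.

```python
def octave_schedule(n_cycles, octaves):
    """Assign each clock cycle 1..n_cycles to an octave, or None for an idle slot."""
    schedule = []
    for t in range(1, n_cycles + 1):
        trailing_zeros = (t & -t).bit_length() - 1   # position of the lowest set bit
        octave = trailing_zeros + 1                  # odd cycles map to the first octave
        schedule.append(octave if octave <= octaves else None)
    return schedule


slots = octave_schedule(8, 3)
```

For N = 8 this gives the first octave every 2 cycles (N/4), the second every 4 cycles (N/2), and the third once per 8 cycles (N), with one idle slot per period. The higher octave computations fall in the gaps between first octave computations, as the text describes.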

#### 1. Register Allocation

The next step in designing the DWT-SA architecture is the design of the Control Unit (CU) and the Register Bank (RB). These two components synchronize the availability of operands. The Forward Register Allocation (FRA) method uses a set of registers that are allocated to intermediate data on a first-come, first-served basis. It does not reassign a register to another operand once its contents have been accessed.

The Forward-Backward Register Allocation (FBRA) scheme is similar, except that once a stored operand has been used, its register is reallocated to another operand. The FRA method is simpler and requires less control circuitry, but it results in less efficient register utilization. In either scheme, the coefficient computations are periodic, and hence each register containing a specific variable is reserved for the same variable in the next period.
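The difference in register utilization between the two schemes can be illustrated with a simplified software model. This is a sketch based only on the description above: each operand is represented by a hypothetical (produced, last used) cycle pair, the lifetimes listed are invented for the example, and the function names are assumptions. It does not model the actual register bank of the design.

```python
def fra_register_count(lifetimes):
    """FRA model: every operand in a period gets its own register, never reassigned."""
    return len(lifetimes)


def fbra_register_count(lifetimes):
    """FBRA model: a register is reallocated once its operand has been used."""
    free = []        # registers available for reuse
    busy = []        # (last_use_cycle, register) pairs currently holding live operands
    next_reg = 0
    for produced, last_use in sorted(lifetimes):
        still_busy = []
        for end, reg in busy:
            if end < produced:            # operand fully consumed before this cycle
                free.append(reg)
            else:
                still_busy.append((end, reg))
        busy = still_busy
        if free:
            reg = free.pop()              # reuse a released register
        else:
            reg = next_reg                # otherwise bring a new register into use
            next_reg += 1
        busy.append((last_use, reg))
    return next_reg


# Hypothetical operand lifetimes within one period: (cycle produced, cycle last used).
example_lifetimes = [(1, 3), (2, 4), (4, 6), (5, 7), (7, 8)]
fra = fra_register_count(example_lifetimes)
fbra = fbra_register_count(example_lifetimes)
```

In this invented example the reuse of released registers lowers the register count, at the cost of the extra bookkeeping that corresponds to the additional control circuitry mentioned in the text.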

#### 2. Complete Design of CU

The complete design of the Control Unit for the DWT-SA architecture is shown in Fig. 7. The control unit uses a switch, a decoder, and four FSMs, each of which is described by a state diagram. The FSMs drive the switching action: the CU directs data from either the Input Delay (ID) or the Register Bank (RB) to the Filter Unit (FU). Asserting a particular select line sets the switch, and the data selected by the switch is applied to the filter cell.



Fig 7. The Control Unit (CU)

The step-by-step process of the entire project is shown in the flow chart of Fig. 8, and the flow chart for the DWT design is shown in Fig. 9. The first step is to give the input image to MATLAB, which checks the image size and whether the image is 2D or 3D, converts the image to 2D, resizes it to 256 x 256, and then converts the data to hexadecimal form.
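The preprocessing step was carried out in MATLAB; the sketch below only mirrors the described sequence (convert to 2D, resize, write hexadecimal values) in Python for illustration. The function names, the luminance weights, the nearest-neighbour resize, and the two-digit-per-line text layout are all assumptions, since the paper does not specify them. A 4 x 4 target size is used to keep the example small, where the described flow uses 256 x 256.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to one 8-bit value (assumed luminance weights)."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def resize_nearest(image, size):
    """Resize a 2D image to size x size by nearest-neighbour sampling (an assumption)."""
    rows, cols = len(image), len(image[0])
    return [[image[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


def to_hex_lines(image):
    """Flatten a 2D 8-bit image row by row into two-digit hexadecimal strings."""
    return [format(value, "02X") for row in image for value in row]


# Tiny 2 x 2 colour image standing in for the real input.
colour = [[(255, 255, 255), (0, 0, 0)],
          [(255, 0, 0), (0, 0, 255)]]
gray = [[to_gray(p) for p in row] for row in colour]   # 3D (colour) to 2D
hex_lines = to_hex_lines(resize_nearest(gray, 4))      # the paper's flow uses 256
```

A text file with one hexadecimal value per line is a format that a Verilog test bench can typically load into a memory array, which is one plausible reason for this conversion.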





Fig 8. Flow chart of the entire project



Fig 9. Flow chart of proposed architecture

# V. SIMULATION RESULTS

One of the most popular applications of the DWT is video processing. At a frame rate of 30 frames/s, the video processor must process a complete frame in less than 33 ms. For the DWT computations on a monochrome 256 $\times$ 256 frame, the proposed architecture achieves a minimum delay of 8.42 ns and a clock rate of 115.902 MHz.
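The real-time margin can be checked with a short calculation. The sketch below assumes one input sample per clock cycle, in line with the statement in Section IV that N coefficients are computed in N clock cycles, and it counts a single one-dimensional pass over all samples of the frame. Both are assumptions made for illustration; a two-dimensional transform would need additional passes.

```python
CLOCK_HZ = 115.902e6           # maximum clock rate reported for the design
FRAME_SAMPLES = 256 * 256      # samples in one monochrome frame
FRAME_BUDGET_S = 1 / 30        # time available per frame at 30 frames/s


def pass_time_seconds(samples, clock_hz):
    """Time for one pass, assuming one sample is consumed per clock cycle."""
    return samples / clock_hz


one_pass = pass_time_seconds(FRAME_SAMPLES, CLOCK_HZ)
margin = FRAME_BUDGET_S / one_pass   # how many such passes fit in one frame period
```

Under these assumptions one pass takes roughly 0.57 ms, which leaves a large margin relative to the 33 ms frame period.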



Fig 10. Input Images to MATLAB

The input image shown in Fig. 10 is fed to MATLAB, which checks it, converts the color image into a monochrome image of the standard size 256 x 256, converts the pixel data to hexadecimal numbers, and stores the data in an image text file. This data is used to interface MATLAB with ModelSim, where the DWT is coded in Verilog. After compiling the code, we simulate the designed DWT. The ModelSim simulation results of the DWT for the image in Fig. 10 are shown in Fig. 11.


Fig 11. ModelSim simulation output of the implemented DWT

After the ModelSim output of the implemented SA-DWT is interfaced back to MATLAB, the resulting output image is as shown in Fig. 12.



Fig 12. Output from MATLAB after performing SA-DWT.

#### 1. SYNTHESIS RESULTS

Synthesis of the design using Xilinx ISE shows that the minimum delay of the architecture is 8.42 ns and its maximum frequency is 115.902 MHz, as shown in Fig. 13. The synthesis report shows that the architecture requires more area but achieves a higher speed. Fig. 14 and Fig. 15 show the RTL schematic and its inner view, obtained from Xilinx ISE after synthesis of the implemented DWT.

#### 2. COMPARISON OF LIFTING-BASED DWT WITH SA-DWT

Finally, Table 1 compares the lifting-based DWT with the systolic array DWT. The proposed architecture has higher hardware utilization efficiency and is faster, but the SA-DWT occupies more area and the design is more complex.



Fig 13. Synthesis report of design



Fig 14. RTL schematic view of design



Fig 15. RTL schematic inner view

Table 1. Comparison of lifting-based DWT with SA-DWT

| S.No | Parameter                    | Lifting-DWT | SA-DWT     |
|------|------------------------------|-------------|------------|
| 1    | Minimum Delay                | 19.75 ns    | 8.42 ns    |
| 2    | Maximum Frequency            | 50.62 MHz   | 115.90 MHz |
| 3    | No. of Slices Used           | 692         | 2412       |
| 4    | No. of Slice Flip-flops Used | 293         | 154        |
| 5    | No. of Bonded IOBs Used      | 105         | 259        |
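The trade-off in Table 1 can be expressed as ratios. The short calculation below uses only the values in the table; the dictionary keys and variable names are hypothetical.

```python
lifting = {"delay_ns": 19.75, "fmax_mhz": 50.62, "slices": 692,
           "flip_flops": 293, "iobs": 105}
sa_dwt = {"delay_ns": 8.42, "fmax_mhz": 115.90, "slices": 2412,
          "flip_flops": 154, "iobs": 259}

# Ratio of SA-DWT to lifting-based DWT for each parameter in Table 1.
ratios = {key: sa_dwt[key] / lifting[key] for key in lifting}

# Delay is better when smaller, so the speed-up is the inverse of the delay ratio.
delay_speedup = lifting["delay_ns"] / sa_dwt["delay_ns"]
slice_increase = ratios["slices"]
```

By these figures the SA-DWT has about 2.3 times lower minimum delay while using about 3.5 times as many slices, which quantifies the speed and area trade-off stated in the text.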

# VI. CONCLUSION

A systolic VLSI architecture for computing one dimensional DWT in real time has been presented. The architecture is simple, modular, cascadable, and has been implemented in VLSI. The implementation employs only one multiplier per filter cell, and hence results in a considerably smaller chip area. The DWT-SA architecture does not use any external or internal memory modules to store the intermediate results and therefore avoids the delays caused by access, read, write and refresh timing.

The architecture has been simulated in VLSI and has higher hardware utilization efficiency than the reference design. Synthesis of the design using Xilinx ISE shows that the minimum delay of the architecture is 8.42 ns and its maximum frequency is 115.902 MHz.
