# Review on efficient implementation of DWT architecture for image processing

Ms. Geetha K Assistant Professor: Dept of ECE Presidency University Bangalore, India

Mr. Shrikanth K Reddy Assistant Professor: Dept of ECE Presidency University Bangalore, India

#### **ABSTRACT**

The discrete wavelet transform (DWT) is used in the JPEG 2000 standard for image compression. The DWT reduces the correlation between pixels, thereby improving the performance of the compression algorithm. Over the past decade, research has focused on reducing the hardware requirement and increasing the speed of DWT computation on FPGAs. This paper presents recent schemes used in the computation of the DWT and discusses their performance.

Keywords: DWT, FPGA, Pipeline

## I. INTRODUCTION

The discrete wavelet transform (DWT) has gained wide popularity due to its excellent decorrelation property. Many modern image and video compression systems embody the DWT as the transform stage. It is widely recognized that the 9/7 filters are among the best filters for DWT-based image compression. In fact, the JPEG2000 image coding standard employs the 9/7 filters as the default wavelet filters for lossy compression. Several recent publications describe efficient implementation of JPEG2000 encoders and decoders [1].

Many research works have addressed the problem of reducing DWT complexity. The issue has been investigated mainly from two perspectives: 1) reducing the memory access overhead and 2) reducing the DWT computational complexity.

Lifting and convolution are the two main computing approaches for the discrete wavelet transform. While conventional lifting-based architectures require fewer arithmetic operations than the convolution-based approach, they sometimes have long critical paths. If $T_a$ and $T_m$ are the delays of the adder and multiplier, respectively, then the critical path of the lifting-based architecture for the (9, 7) filter is $(4 \times T_m + 8 \times T_a)$, while that of the convolution implementation is $(T_m + 2 \times T_a)$. In addition, to preserve proper precision, the intermediate variables are wider in lifting-based computing. As a result, the lifting multiplier and adder delays are longer than the convolution ones. Hence convolution is the better method for reducing the delays in the computation of the DWT [2]. The traditional advantages of lifting implementations, however, are:

1. Lifting leads to a speed-up when compared to the standard implementation.
2. Lifting allows for an in-place implementation of the fast wavelet transform, a feature similar to the Fast Fourier Transform. This means the wavelet transform can be calculated without allocating auxiliary memory.
3. All operations within one lifting step can be done entirely in parallel, while the only sequential part is the order of the lifting operations.
4. Using lifting, it is particularly easy to build non-linear wavelet transforms. Typical examples are wavelet transforms that map integers to integers. Such transforms are important for hardware implementation and for lossless image coding.
5. Using lifting and integer-to-integer transforms, it is possible to combine biorthogonal wavelets with scalar quantization and still keep cubic quantization cells, which are optimal as in the orthogonal case. In a multiple description setting, it has been shown that this generalization to biorthogonality allows for substantial improvements.
6. Lifting allows for adaptive wavelet transforms [3].

At present, efficient implementation of the DWT using 9/7 filters in resource-constrained hand-held devices, with the capability for real-time processing of computation-intensive multimedia applications, is a necessary challenge.
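As a quick check on the critical-path expressions quoted in this section, write $T_m$ and $T_a$ for the multiplier and adder delays and assume, purely for illustration, that both architectures are built from components with the same delays. The ratio of the two critical paths is then independent of the actual delay values:

$$\frac{4T_m + 8T_a}{T_m + 2T_a} = \frac{4\,(T_m + 2T_a)}{T_m + 2T_a} = 4$$

Under that assumption the lifting critical path is four times the convolution one. Since the lifting multipliers and adders are themselves slower because of the wider intermediate variables, the gap in practice would be expected to be at least this large.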

Conventionally, programmable DSP chips are used to implement DWT algorithms for low-rate applications, and VLSI application-specific integrated circuits (ASICs) for higher rates. FPGAs are programmable logic devices that provide sufficient logic resources to support a large parallel distributed architecture [2]. Mugdha M. Dewasthale et al. [4] listed the performance limitations of conventional programmable DSPs compared with FPGAs:

1. Fixed, inflexible architecture
2. Limited number of MAC units
3. Fixed data width
4. Serial processing limits data throughput
5. Time-shared MAC unit
6. Multiple DSPs required to meet bandwidth needs

The performance advantages of FPGAs are:

1. Flexible architecture
2. Distributed DSP resources (LUTs, registers, multipliers, and memory)
3. Parallel processing maximizes data throughput
4. Support for any level of parallelism
5. Optimal performance/cost tradeoff
6. Support for serial processing as well

Amin Jarrah et al. [5] proposed an efficient implementation of the DWT on an FPGA platform for all its dimensions (1D, 2D and 3D). The DWT implementation was compared on a Pentium III processor, a DSP, and the optimized FPGA. The experimental results showed that the FPGA implementation gives higher performance than the other platforms, owing to the highly parallel and pipelined architecture of the FPGA implementation of the DWT.

This paper presents various algorithms used in the literature for the implementation of DWT architectures on FPGA and compares their performance with respect to resource utilization.

## II. DISCRETE WAVELET TRANSFORMS

The basic DWT can be realized by a convolution-based implementation that uses FIR filters to perform the transform. The input discrete signal x(n) is filtered by a low-pass filter (h) and a high-pass filter (g) at each transform level. The two output streams are then sub-sampled by simply dropping alternate output samples in each stream to produce the low-pass sub-band YL and the high-pass sub-band YH. The associated equations can be written as (1) [2]. Fig. 1 shows the signal analysis in the one-dimensional (1-D) discrete wavelet transform.

$$YL(n) = \sum_{i=0}^{\frac{N}{2}-1} h(2n-i)\,x(i), \qquad YH(n) = \sum_{i=0}^{\frac{N}{2}-1} g(2n-i)\,x(i) \qquad (1)$$



Fig. 1: 1-D discrete wavelet transform
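The filter-then-drop structure described in this section can be sketched in a few lines of Python. This is only an illustrative software model: the function name `dwt1d_level`, the zero-padding at the signal borders, and the toy averaging/differencing taps in the usage lines are assumptions made here, not part of any reviewed architecture. Each output is the sum of `taps[2n - i] * x[i]`, i.e. the full convolution sampled at even indices.

```python
def dwt1d_level(x, h, g):
    """One level of convolution-based 1-D DWT analysis.

    x: input samples; h: low-pass taps; g: high-pass taps.
    Returns (YL, YH), where each output n is sum_i taps[2n - i] * x[i].
    Samples outside x are treated as zero (an assumption of this sketch).
    """
    def filter_and_downsample(taps):
        full_len = len(x) + len(taps) - 1
        out = []
        # Only even positions k = 2n of the full convolution are kept,
        # which is the same as filtering and dropping alternate samples.
        for k in range(0, full_len, 2):
            acc = 0.0
            for i, xi in enumerate(x):
                j = k - i  # tap index 2n - i
                if 0 <= j < len(taps):
                    acc += taps[j] * xi
            out.append(acc)
        return out

    return filter_and_downsample(h), filter_and_downsample(g)


# Toy usage with hypothetical 2-tap averaging / differencing filters.
yl, yh = dwt1d_level([1, 2, 3, 4], [0.5, 0.5], [0.5, -0.5])
print(yl, yh)
```

A practical design would replace the zero-padding with a boundary extension suited to the chosen wavelet and would use the actual analysis filter coefficients.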

## III. LIFTING BASED DWT

The main feature of the lifting-based discrete wavelet transform scheme is to break up the high-pass and low-pass wavelet filters into a sequence of smaller filters that in turn can be converted into a sequence of upper and lower triangular matrices [6]. The basic idea behind the lifting scheme is to use data correlation to remove the redundancy. The lifting algorithm can be computed in three main phases, namely: the split phase, the predict phase and the update phase, as illustrated in Fig 2.

Fig. 2: Split, predict and update phases of the lifting-based DWT
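The three lifting phases can be illustrated with a minimal Python sketch. The choice of the reversible LeGall 5/3 integer filter, the symmetric boundary extension, the even-length restriction, and the function names `lift53_forward` and `lift53_inverse` are all assumptions made for this example. Because every step only adds or subtracts a value computed from the other half of the samples, the inverse simply undoes the steps in reverse order.

```python
def lift53_forward(x):
    """One level of integer 5/3 lifting on an even-length list of ints."""
    assert len(x) % 2 == 0, "this sketch handles even-length inputs only"
    # Split phase: separate even-indexed and odd-indexed samples.
    even, odd = x[0::2], x[1::2]
    n = len(even)
    # Predict phase: detail = odd sample minus floor-average of its even
    # neighbours (the right neighbour is mirrored at the border).
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    # Update phase: approximation = even sample plus a rounded quarter of
    # the neighbouring details (the left neighbour is mirrored at the border).
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by reversing the update and predict steps."""
    n = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    x = [0] * (2 * n)
    x[0::2] = even  # merge: interleave the two halves again
    x[1::2] = odd
    return x


samples = [10, 12, 14, 13, 9, 7, 8, 20]
approx, detail = lift53_forward(samples)
assert lift53_inverse(approx, detail) == samples  # integer-exact round trip
```

The round trip is exact in integer arithmetic, which is the integer-to-integer property that makes lifting attractive for lossless coding.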



## IV. RECENT WORK

Mohammed Bahoura et al. [7] note that, to optimize time and resource consumption, many FPGA architectures for the wavelet transform have been proposed, based mainly on convolution and lifting schemes. However, these architectures do not take into account the group delays of the finite impulse response (FIR) filters used to compute the DWT. When the signal is processed sample by sample rather than frame by frame, the group delays of the different filter paths can noticeably affect processing in the forward and inverse wavelet transform schemes, as in signal denoising techniques. Hence they proposed a real-time architecture for forward/inverse wavelet transforms that takes into account the group delays of the filters used. The scheme was realized for the Daubechies 5 wavelet and implemented on FPGA using the Xilinx System Generator tool. The maximum frequency achieved is 9.759 MHz.

Mohammed Bahoura et al. [8] proposed a pipelined real-time architecture for forward/inverse wavelet transforms that takes the filter group delays into account. The required resources and the reconstruction error of this architecture were evaluated and compared with those of the conventional one. The architectures were implemented on FPGA using Xilinx System Generator and the XUP Virtex-II Pro development board. Compared with the conventional method, the pipelined method achieves a frequency of about 163.470 MHz.

Juan Li et al. [9] proposed a novel semi-cache parallel hardware structure for the discrete wavelet transform, which halves the on-chip resource usage compared with previously proposed structures. In addition, the DSP48E1 block was used to ensure data accuracy and processing speed. In the experimental part, each sub-module and the whole architecture were simulated, and the correctness of the architecture was then verified on a Xilinx Virtex-6 development board. To assess the real-time behavior, images of different sizes were processed on both the OMAP3530 platform and the FPGA; the processing times demonstrated the advantages of the FPGA in real-time processing. The Mallat algorithm structure designed in that paper can serve as the basis for subsequent multi-level wavelet decomposition and as a reference for image wavelet reconstruction, which only needs to follow the reverse structure of the algorithm. A shortcoming of the design is that each subsequent stage requires a clock frequency half that of the preceding wavelet transform unit. For multilevel wavelet processing, the clock frequency requirement therefore cannot be too high.

C.S. Avinash et al. [10] note that the discrete wavelet transform (DWT) is used in signal compression and enhancement applications and can be implemented using recursive finite impulse response (FIR) filtering. Their paper describes the implementation of the DWT using a Distributed Arithmetic Architecture (DAA) on an FPGA device, together with a novel idea for implementing the decimator function. The low-pass FIR filters (LPF) and high-pass FIR filters (HPF) used in the DWT are implemented using DAA. Simulation and synthesis were carried out for the Virtex-5 XC5VLX20T-2FF323 FPGA device. The Conventional Multiplier Architecture (CMA) based DWT system uses 8 high-cost DSP48E resources for its eight 8×8-bit multipliers, whereas the DAA based DWT system uses only LUT slices. With the CMA design, the maximum throughput achieved by the system is 1009.72 Mbit/s, whereas the DAA based system provides a throughput of 301.761 Mbit/s, a metric which restricts DAA based DWT to low-speed applications.
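The idea behind distributed arithmetic, replacing multipliers with a lookup table addressed by one bit from each input sample, can be modelled in software as follows. This is a generic textbook-style sketch, not the architecture of [10]: the function names `build_da_lut` and `da_inner_product`, the 8-bit two's-complement word length, and the LSB-first left-shift accumulation (hardware typically uses a right-shifting scaler instead) are assumptions made for illustration.

```python
def build_da_lut(coeffs):
    """LUT entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    k = len(coeffs)
    return [
        sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(coeffs, samples, bits=8):
    """Compute sum(c * s) bit-serially using only LUT reads, shifts and adds.

    Samples must fit in `bits`-bit two's complement.
    """
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's-complement bit patterns
    acc = 0
    for b in range(bits):
        # Form the LUT address from bit b of every sample.
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> b) & 1) << j
        partial = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -partial if b == bits - 1 else partial
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, -3, 7, -8]
assert da_inner_product(coeffs, samples) == sum(c * s for c, s in zip(coeffs, samples))
```

One result needs `bits` LUT reads instead of one multiply per tap, which is consistent with the trade-off reported above: no DSP multipliers are consumed, but throughput is lower.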

One method of decimation is to use a down-sampling operator at the end of the FIR filters. The output of the down-sampling operator is v(2n); every alternate sample is blocked. This is implemented by disabling the output register on every alternate output sample. Since this method disables only the output register, the blocks responsible for computing the samples in the DAA based FIR filter, i.e. the LUT, accumulator and 1-bit scaler, perform computations even during those cycles when the output sample is to be blocked. This results in unnecessary power consumption. The paper proposes a simple methodology to disable these blocks during the cycles when computation is not needed. Disabling these blocks on every alternate output sample results in a decimate-by-2 operation and functions as an inbuilt decimator for the FIR filter, as required for the DWT. The authors discuss the computation of a level-1 DWT characterized by the Daubechies db2 wavelet, implemented on Virtex-5. Compared with the conventional method, this method gives a frequency of about 301.761 MHz. Because the observed throughput is reduced, the authors suggest that the scheme is suitable for low-speed applications where a low-cost solution is needed.
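The difference between blocking only the output register and gating the computation itself can be illustrated with a small behavioral model. The sketch below is plain Python rather than a hardware description, and the helper names (`fir_output`, `decimate_output_register_only`, `decimate_gated`) and the zero initial filter state are assumptions for illustration. Both routines return the same decimated stream, but the gated version evaluates only half as many filter outputs.

```python
def fir_output(x, taps, n):
    """FIR output at time n, treating samples before x[0] as zero."""
    return sum(taps[k] * x[n - k] for k in range(len(taps)) if 0 <= n - k < len(x))


def decimate_output_register_only(x, taps):
    """Compute every output, then keep the even-indexed ones."""
    computed = [fir_output(x, taps, n) for n in range(len(x))]
    return computed[::2], len(computed)  # (kept outputs, outputs evaluated)


def decimate_gated(x, taps):
    """Skip the computation entirely on the cycles whose output is dropped."""
    kept = [fir_output(x, taps, n) for n in range(0, len(x), 2)]
    return kept, len(kept)


x = list(range(1, 9))
taps = [1, 2, 1]
y_full, evaluated_full = decimate_output_register_only(x, taps)
y_gated, evaluated_gated = decimate_gated(x, taps)
assert y_full == y_gated
print(evaluated_full, evaluated_gated)
```

In hardware, the input samples still have to be shifted into the filter every cycle; only the compute blocks are idled on the skipped cycles, which is where the power saving would come from.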

Biswas et al. [11] present a high-precision, low-area lifting-based architecture for the unified implementation of both lossy and lossless 3D multi-level discrete wavelet transform (DWT) using the LeGall 5/3 wavelet and the Cohen-Daubechies-Feauveau (CDF) 9/7 wavelet. The proposed system is parallel-pipelined, and resources are shared between the lossy and lossless modes, producing a throughput of 2 outputs/clock and achieving a high-speed, low-area solution. The data width of the design is 20 bits, chosen to reach a high PSNR value for multi-level 3D DWT. Targeting a portable and real-time solution, the architecture was implemented on a Xilinx Virtex-5 series field programmable gate array (FPGA), achieving a clock speed of 290 MHz with a power consumption of 467 mW at a 200 MHz clock frequency. The design has also been implemented in UMC 90 nm CMOS technology, where it consumes 329 mW at a 200 MHz clock frequency. The proposed solution may be configured for lossless or lossy compression in a 3D image compression system, according to the needs of the user.

In [12], Jain et al. presented a mapping of a configurable 2D-DWT algorithm, using the convolution method with a separable filter approach and filter lengths of up to 8 taps, onto reconfigurable hardware. The reconfigurable hardware architecture mapped with the 2D-DWT is ported onto an FPGA with an operating frequency of 37.26 MHz. For a 1-level decomposition, 496 clock cycles are needed per 8×8 block of the N×N image, for a total of $31N^2/4$ clock cycles with 75% compression; this can be further improved by computing higher levels of decomposition of the 2D-DWT.
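The total cycle count quoted for [12] can be recovered from the per-block figure, assuming the N×N image is tiled into non-overlapping 8×8 blocks (so that N is a multiple of 8):

$$\left(\frac{N}{8}\right)^2 \times 496 = \frac{496}{64}\,N^2 = \frac{31N^2}{4}$$

As an illustration only, a hypothetical 512×512 image would need $31 \times 512^2 / 4 = 2{,}031{,}616$ cycles, which is about 54.5 ms at the reported 37.26 MHz.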

## V. CONCLUSION

The discrete wavelet transform is used not only in image compression but also in many other image processing applications. Hence, developing an efficient architecture for computing the DWT on FPGA is a major concern at present. In this paper we have highlighted a few techniques recently used by various researchers to increase the speed of the computation.

## REFERENCES

- [1] Maurizio Martina, "Multiplier-less, Folded 9/7 5/3 Wavelet VLSI Architecture", IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 9, September 2007.
- [2] M. Maamoun et al., "Low cost VLSI discrete wavelet transform and FIR filters architectures for very high-speed signal and image processing", Cybernetic Intelligent Systems (CIS), 2010 IEEE 9th International Conference on.
- [3] Ingrid Daubechies et al., "Factoring wavelet transforms into lifting steps", November 2007.
- [4] Dewasthale, M. M., & Mukherji, P. (2009). FPGA Implementation of Wavelet Transform Based on Lifting Scheme. 2009 International Conference on Information Management and Engineering. doi:10.1109/icime.2009.108
- [5] Jarrah, A., & Jamali, M. M. (2014). Optimized FPGA based implementation of discrete wavelet transform. 2014 48th Asilomar Conference on Signals, Systems and Computers.
- [6] Wu, Z., & Wang, W. (2011). Pipelined architecture for FPGA implementation of lifting-based DWT. 2011 International Conference on Electric Information and Control Engineering.
- [7] Bahoura, M., & Ezzaidi, H. (2010). Real-time implementation of discrete wavelet transform on FPGA. IEEE 10th International Conference on Signal Processing Proceedings.
- [8] Bahoura, M., & Ezzaidi, H. (2010). Pipelined architecture for discrete wavelet transform implementation on FPGA. 2010 International Conference on Microelectronics.
- [9] Li, J., Su, B., Yan, Y., & Jiang, C. (2012). Discrete wavelet transform implementation based on FPGA. 2012 IEEE 11th International Conference on Signal Processing.
- [10] Avinash, C. S., & Alex, J. S. R. (2015). FPGA implementation of Discrete Wavelet Transform using Distributed Arithmetic Architecture. 2015 International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM).
- [11] Biswas, R., Malreddy, S. R., & Banerjee, S. (2017). A High Precision-Low Area Unified Architecture for Lossy and Lossless 3D Multi-Level Discrete Wavelet Transform. IEEE Transactions on Circuits and Systems for Video Technology, 1–1.
- [12] Jain, N., Singh, M., & Mishra, B. (2018). Image Compression Using 2D-Discrete Wavelet Transform on a Light Weight Reconfigurable Hardware. 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID). doi:10.1109/vlsid.2018.38