# DESIGN OF INTEGER TRANSFORM ARCHITECTURE FOR HEVC BY USING DECOMPOSITION MATRIX FOR VLSI APPLICATION

<sup>1</sup>G. Sreeram, <sup>2</sup>K. Ashok Kumar

<sup>1</sup>PG Scholar, <sup>2</sup>Assistant Professor, Dept. of ECE

<sup>1,2</sup>PBR Visvodaya Institute of Technology & Science, Kavali, S.P.S.R Nellore Dt, A.P

#### Abstract

In this paper, a new very-large-scale integration (VLSI) integer transform architecture is proposed for the High Efficiency Video Coding (HEVC) encoder. The architecture is based on signed bit-plane transform (SBT) matrices, which are derived from the bit-plane decomposition of the integer transform matrices in HEVC. Mathematically, an integer transform matrix can be equivalently expressed as the binary weighted sum of several SBT matrices that are composed only of the elements 0 and ±1. The SBT matrices are very simple and have a lower bit width than the original integer transform matrix. They are also sparse, containing many zero elements, which saves addition operators in the SBT. In the proposed architecture, instead of applying the original integer transform at a high bit width, the video data are transformed separately with each SBT matrix at a lower bit width. As a result, the delay of the transform unit circuit is significantly reduced with the proposed SBT.

#### **I.INTRODUCTION**

The discrete cosine transform (DCT) is a key technology for video coding. It was first applied to image coding by Ahmed et al. in 1974 [1] and, over the following decades, was widely adopted in video coding. Early coding standards, such as JPEG [2], MPEG-2/H.262 [3], and H.263 [4], employed the real-number DCT directly. The real-number DCT has to be implemented in floating-point precision, which is highly complex, with bit widths of up to 64 bits in a digital system. Implementing the real-number DCT at such a high bit width is impractical. To reduce the complexity of DCT implementation, integer transforms, which approximate the real-number DCT, are widely used in actual encoders in place of the real-number DCT. However, these integer transforms were not specified in the standards, which led to error drift problems that greatly degraded the decoded video.

HEVC is the latest video coding standard and offers higher coding performance than the existing ones. Many novel coding algorithms are introduced in HEVC. In particular, integer transforms of up to $32 \times 32$ are applied to improve coding performance. Theoretically, a large transform is usually efficient for coding a large prediction block, and vice versa. However, the implementation complexity of a transform increases with the transform size, so reducing this complexity is becoming more and more important. The $32 \times 32$ transform is the most complex of the HEVC transforms; thus, improving the $32 \times 32$ transform also benefits the whole transform circuit.

#### **II.LITERATURE REVIEW:**

Many research works on transform implementation optimization for HEVC have been carried out. Meher et al. [13] proposed an efficient constant matrix multiplication scheme to derive parallel architectures of a transform for HEVC that can support a real-time ultra HD video codec. In other work, simplification strategies such as the reuse of the transform structure and multiplierless implementation were adopted to save hardware cost. The work in [15] presented a transform architecture that uses the canonical signed digit representation and the common subexpression elimination technique to perform multiplication with shift-add operations. Based on these optimizations, the transform architecture is greatly simplified for practical application. However, with the increasing use of high definition (HD) and ultra HD video coding, codecs require a higher processing capacity. Thus, all modules in a video codec, including the transform, need to be further improved for real-time coding with low complexity.

P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo presented area- and power-efficient architectures for the implementation of the integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency Video Coding (HEVC). They show that an efficient constant matrix-multiplication scheme can be used to derive parallel architectures for the 1-D integer DCT of different lengths. They also show that the proposed structure can be reused for DCTs of lengths 4, 8, 16, and 32 with a throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the proposed architecture can be pruned to reduce the implementation complexity substantially with only a marginal effect on the coding performance. They also proposed power-efficient structures for folded and full-parallel implementations of the 2-D DCT. From the synthesis results, the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) than the direct implementation of the reference algorithm, on average, for integer DCTs of lengths 4, 8, 16, and 32. An additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed pruning algorithm at nearly the same throughput rate. The proposed architecture is found to support ultrahigh definition $7680 \times 4320$ video at 60 frames/s, which is one of the applications of HEVC.

High Efficiency Video Coding (HEVC) is a video compression standard, the successor to H.264/MPEG-4 Advanced Video Coding (AVC), that was jointly developed by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) as ISO/IEC 23008-2 MPEG-H Part 2 and ITU-T H.265. A. D. Darji and R. P. Makwana proposed an efficient architecture for the computation of the 4-, 8-, 16-, and 32-point DCT used in the HEVC standard. The architecture uses the Canonical Signed Digit (CSD) representation and the Common Sub-expression Elimination (CSE) technique to perform multiplication with shift-add operations. The proposed architecture requires fewer adders and shifters and gives almost double the throughput of the previous work. The number of Logic Elements (LEs) required for the implementation is reduced by almost 36% without compromising throughput. The hardware cost falls because of the reduction in arithmetic operations.

# **III.EXISTING METHODS**

The existing transform architectures focus on reducing the number of arithmetic operators, such as additions and multiplications, rather than the data bit width in the transform. In fact, the data bit width is also an important factor affecting the circuit speed and area of a VLSI architecture. A circuit with a large bit width needs logic gates with a larger fan-in or fan-out, and more MOS devices are required in the logic gate circuit. Thus, the capacitive load and resistance of the logic gate both increase as the bit width widens. According to the first-order resistance and capacitance (RC) circuit model, the delay of the circuit is related to RC: a large RC leads to a long circuit delay. The circuit delay varying with the increasing input bit width in two typical CMOS processes (SMIC 40 nm and GF 28 nm) is shown in Fig. As for the adder, the carry chain is the critical path for the circuit delay, which also depends on the input and output bit width. Each extra bit leads to a larger delay. Thus, aside from the number of arithmetic operations, the bit width is the other optimization factor for a fast transform architecture.
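The dependence of adder delay on bit width can be illustrated with a small behavioural model. The sketch below is only an illustration, assuming a plain ripple-carry adder; the function name `ripple_carry_add` and the use of the longest run of consecutive carries as a stand-in for the critical path are assumptions made here, not part of the proposed circuit.

```python
def ripple_carry_add(a, b, width):
    """Add two unsigned `width`-bit integers one bit at a time.

    Returns (sum modulo 2**width, carry out, longest run of consecutive
    bit positions that produced a carry). The run length is used here as
    a rough proxy for how far a carry has to ripple.
    """
    carry = 0
    result = 0
    run = 0
    longest = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                       # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))   # full-adder carry out
        result |= s << i
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return result, carry, longest


# Worst case: the carry ripples through every bit position.
for width in (8, 16, 32):
    total, carry_out, chain = ripple_carry_add((1 << width) - 1, 1, width)
    print(width, chain)
```

In this worst case the carry chain is as long as the operand, which is why narrower intermediate data shortens the critical path of each adder.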

In this approach, the separability property of the DCT is exploited. An 8-point 1-D DCT is applied to each of the 8 rows and then to each of the 8 columns. The same 1-D algorithm is applied to both rows and columns, so identical hardware can be used for the row computation as well as the column computation. A transposition matrix separates the two stages, as the functional description in Fig. 1 shows. The bulk of the design and computation lies in the 8-point 1-D DCT block, which can potentially be reused 16 times: 8 times for the rows and 8 times for the columns. Therefore, a fast algorithm is usually selected for computing the 1-D DCT. The high regularity of this approach is attractive because it reduces cell count and eases very-large-scale integration (VLSI) implementation.
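The row-column scheme described above can be sketched in software as follows. This is a minimal illustration, assuming the orthonormal floating-point DCT-II computed in direct form rather than a fast algorithm; the function names `dct_1d`, `transpose`, and `dct_2d` are chosen here for illustration only.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a length-N sequence (direct O(N^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(block):
    """Swap rows and columns (the role of the transposition stage)."""
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """Row pass, transpose, second pass with the same 1-D block, transpose back."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


# Example 8x8 block with arbitrary sample values.
block = [[float(8 * r + c) for c in range(8)] for r in range(8)]
coeffs = dct_2d(block)
```

For an 8 × 8 block, `dct_1d` is called 8 times in each pass, which corresponds to the 16 uses of the 1-D block mentioned in the text.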



Fig 1: 2-D DCT implementation

# **IV.PROPOSED SYSTEM**

In this project, a new VLSI architecture is proposed for the integer transforms of the HEVC standard in order to reduce the bit widths of the data. The integer transform matrix is decomposed into several signed bit-plane transform (SBT) matrices that are used in the proposed architecture. Moreover, a number of adders are reused by exploiting the redundancy among the elements of the bit matrices. With the bit matrix-based transform algorithm, the proposed VLSI transform architecture can process a maximum data throughput of 32 pixels/cycle at a very high working frequency and with moderate area.

#### A.Signed Bit Matrix-Based Transform Algorithm

When the SBT algorithm is applied to the transform architecture, SBT matrix circuits are implemented instead of the integer transform matrix circuits, and the input data are transformed with each SBT matrix circuit separately. Because the elements of the SBT matrices are so simple, the bit widths of the intermediate transformed data and the output data are significantly reduced. The bit width of the output data is at most $n + \log_2 N$ for $n$-bit input data and an $N$-point transform. Taking the $32 \times 32$ 1-D integer transform as an example, the output bit width grows by only 5 bits with the SBT algorithm, compared with 11 bits for the straightforward integer transform. Although the delay of the integer transform circuit is reduced by the proposed bit transform algorithm, more adders are required because there are more SBTs. However, the bit widths of the adders used in the SBT are so low that each addition is very fast. Additionally, the SBT matrices contain many zero elements. According to the rules of matrix multiplication, a zero element contributes no addition, so the number of additions actually required is small. This sparsity of the SBT matrices therefore helps reduce the addition operations in the transform process.
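The decomposition idea can be checked numerically. The sketch below assumes a sign-magnitude bit-plane split, in which plane $k$ holds the sign of each coefficient multiplied by bit $k$ of its magnitude; the exact construction used in the proposed architecture may differ. The matrix `HEVC4` is the 4-point HEVC core transform, and the names `sbt_planes`, `mat_vec`, and `sbt_transform` are illustrative.

```python
HEVC4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def sbt_planes(t):
    """Split integer matrix t into planes B_k with entries in {-1, 0, 1}
    such that t == sum over k of 2**k * B_k (assumed sign-magnitude split)."""
    n_planes = max(abs(v) for row in t for v in row).bit_length()
    planes = []
    for k in range(n_planes):
        planes.append([[(1 if v > 0 else -1) * ((abs(v) >> k) & 1) for v in row]
                       for row in t])
    return planes


def mat_vec(m, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def sbt_transform(planes, x):
    """Transform x with every plane separately, then combine by shift-and-add."""
    partial = [mat_vec(b, x) for b in planes]
    return [sum(p[i] << k for k, p in enumerate(partial)) for i in range(len(x))]


planes = sbt_planes(HEVC4)
x = [10, -3, 7, 2]
print(sbt_transform(planes, x) == mat_vec(HEVC4, x))

# Fraction of zero entries over all planes (the sparsity the text refers to).
zeros = sum(v == 0 for b in planes for row in b for v in row)
print(zeros, "zero entries out of", len(planes) * 16)
```

Each per-plane output is a sum of at most $N$ inputs with weights in {-1, 0, 1}, so its magnitude never exceeds $N$ times the largest input magnitude, which is where the $n + \log_2 N$ bound comes from.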



Fig. 2: Hierarchical structure of SBT.



Fig. 3: Adder reuse circuit of 2-SSBT.

The SBT vector is divided into multiple 2-SSBT vectors. The number of possible element combinations of a 2-SSBT is $3^2 = 9$: (1, 1), (1, -1), (-1, 1), (-1, -1), (1, 0), (-1, 0), (0, 1), (0, -1), and (0, 0). It can be seen from (10) that four adders are required in the sub-transform of a 2-SSBT. In fact, considering the relationship between positive and negative signs, all the additions of a 2-SSBT can be done with only two adders. The negation operator, which is implemented by inverting each bit and then adding 1, is very simple compared with the addition operator and has negligible circuit implementation cost. By exploiting the addition redundancy of the 2-SSBT vector, only two adders are actually required.
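The two-adder claim can be verified by enumeration. The following sketch, with the hypothetical helper name `two_ssbt`, computes one sum and one difference and derives all nine combinations from them by negation or pass-through; it models the arithmetic only, not the circuit in Fig. 3.

```python
from itertools import product


def two_ssbt(a, b):
    """Return c0*a + c1*b for all nine (c0, c1) in {-1, 0, 1}^2.

    Only two additions are performed; every other result is a negation
    or a direct copy of an input.
    """
    s = a + b   # adder 1
    d = a - b   # adder 2 (a plus the negation of b)
    return {
        (1, 1): s, (-1, -1): -s,
        (1, -1): d, (-1, 1): -d,
        (1, 0): a, (-1, 0): -a,
        (0, 1): b, (0, -1): -b,
        (0, 0): 0,
    }


table = two_ssbt(13, -5)
for c0, c1 in product((-1, 0, 1), repeat=2):
    assert table[(c0, c1)] == c0 * 13 + c1 * (-5)
```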

According to the relationship between the M-SSBT and the M/2-SSBT, the SBT can be implemented in a hierarchical way. An M-SSBT can be implemented by joining two M/2-SSBTs. The hierarchical structure of a 4-SSBT is illustrated as an example in Fig where the outputs of two 2-SSBT units are input to a 4-SSBT unit for 4-SSBT computation.
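The hierarchical construction can be sketched as a recursion in which each M-point unit adds the outputs of two M/2-point units. The sketch below is an illustrative software model under that assumption; the function name `ssbt_matrix`, the cache used to model sharing of sub-units between rows, and the example coefficient rows are not taken from the paper.

```python
def ssbt_matrix(rows, x):
    """Evaluate every row of a {-1, 0, 1} matrix on x hierarchically.

    A sub-result is identified by its input range and coefficient pattern;
    rows that share a pattern on the same half reuse the cached sub-result.
    Returns (outputs, number of distinct sub-units evaluated).
    """
    cache = {}

    def unit(lo, hi, coeffs):
        key = (lo, hi, coeffs)
        if key not in cache:
            if hi - lo == 2:
                # Base 2-SSBT unit.
                cache[key] = coeffs[0] * x[lo] + coeffs[1] * x[lo + 1]
            else:
                mid = (lo + hi) // 2
                half = mid - lo
                # Join two half-size units with one addition.
                cache[key] = (unit(lo, mid, coeffs[:half])
                              + unit(mid, hi, coeffs[half:]))
        return cache[key]

    outputs = [unit(0, len(x), tuple(row)) for row in rows]
    return outputs, len(cache)


rows = [
    (1, 1, 1, 1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
    (1, -1, 1, -1),
]
x = [5, -2, 9, 4]
outputs, distinct_units = ssbt_matrix(rows, x)
print(outputs, distinct_units)
```

Without sharing, four rows of length 4 would each need two 2-point units and one 4-point unit, 12 units in total; with the cache, rows that have the same left-half pattern share that sub-result.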


The inner circuit design of the 4-SSBT circuit is also shown in Fig.



Fig.4: One-dimensional  $32 \times 32$  transform top-level architecture.



Fig. 5: Part of the adder reuse circuit of 4-SSBT.

HEVC adopts several integer transforms of different sizes, from $4 \times 4$ to $32 \times 32$, all of which are integrated in the proposed transform VLSI architecture. The transform architecture is implemented based on Chen's transform framework [8]. The transform of size N is recursively decomposed into transforms of size N/2. The SBT algorithm and the corresponding adder reuse algorithm are implemented together in each N × N transform unit. To implement the 2-D transform, two 1-D transforms are implemented and connected by a transpose buffer. Meher's transpose buffer solution is employed in our 2-D transform architecture.
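The recursive N to N/2 decomposition can be sketched with an even-odd split, which relies on the even-indexed rows of the transform matrix being symmetric and the odd-indexed rows antisymmetric. The sketch below uses the 4-point HEVC core transform matrix as the example; the function names are illustrative, and the code models only the arithmetic of the recursion, not the proposed hardware.

```python
HEVC4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def mat_vec(m, x):
    """Plain matrix-vector product used as the reference."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def even_odd_transform(t, x):
    """Apply t to x by recursing on the even rows at half size.

    Assumes even rows of t are symmetric and odd rows are antisymmetric.
    """
    n = len(x)
    if n == 1:
        return [t[0][0] * x[0]]
    h = n // 2
    e = [x[i] + x[n - 1 - i] for i in range(h)]   # folded sums
    o = [x[i] - x[n - 1 - i] for i in range(h)]   # folded differences
    even_rows = [t[2 * k][:h] for k in range(h)]  # the N/2-point transform
    odd_rows = [t[2 * k + 1][:h] for k in range(h)]
    y = [0] * n
    y[0::2] = even_odd_transform(even_rows, e)
    y[1::2] = mat_vec(odd_rows, o)
    return y


x = [12, -7, 3, 30]
print(even_odd_transform(HEVC4, x) == mat_vec(HEVC4, x))
```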

The transpose buffer is designed to pipeline the data with only a few initial latency cycles, which guarantees that the 2-D transform circuit has the same throughput as the 1-D transform circuit.

The proposed 1-D transform architecture is synthesized with SMIC 40 nm and GF 28 nm CMOS standard cell libraries and compared with existing transform circuits. In this comparison, the circuit area is estimated from the number of transistors by normalizing with respect to a basic two-input NAND gate. The synthesis results show that, with 16-bit input data, the proposed architecture costs about 255 K logic gates at a maximum working frequency of 450 MHz in a 40-nm CMOS process and 223 K logic gates at a maximum working frequency of 700 MHz in a 28-nm CMOS process. The high frequency is achieved through the optimized intermediate data with narrower bit width. The results show that the proposed 1-D DCT architecture has almost twice the gate count and a significantly higher working frequency than Meher's and Darji's architectures, and has comparable circuit area and stronger coding capability than Zhao's. A high working frequency means that the proposed transform circuit can be synchronized with other high-frequency modules in the video encoder.

# **V.RESULTS AND DISCUSSION**

#### a. Block diagram of existing system:

| Topmodule_2D_DCT |         |
|------------------|---------|
| X(63:0)          | Y(63:0) |
| Clk              |         |
| reset            |         |

Fig 6: Existing system block diagram

The RTL block diagram of the existing DCT method has a 64-bit input X and a 64-bit output Y.

#### b. Block diagram of proposed system:



Fig 7: Proposed block diagram

The RTL block diagram of the proposed system has a 64-bit input x and a 128-bit output y.

#### c. Simulation results

[Simulation waveform of the existing system. Signals: Y[63:0], N[63:0], Clk, reset. Time axis in ns.]

Fig 8: Existing system simulation results

[Simulation waveform of the proposed system. Signals: x[63:0], y[127:0], w1[31:0], w2[31:0], w3[15:0], w4[15:0], w5[7:0], w6[7:0], r1[7:0]. Time axis in ps.]

Fig 9: Simulation results

These are the simulation results of the proposed system, with a 64-bit input and a 128-bit output.

#### d. Comparison Table:

#### Table I

#### **Comparison Table**

| Parameter | Existing system | Proposed system |
|-----------|-----------------|-----------------|
| Area      | 482             | 16              |
| Delay     | 8 sec           | 6 sec           |
| Power     | 0.081           | 0.081           |

Table I compares the existing and proposed systems in terms of area, delay, and power.

# VI.CONCLUSION

A fast integer transform VLSI architecture based on the sparse SBT is proposed for real-time ultra HD video coding conforming to the HEVC standard. Considering the effect of bit width on circuit delay, the bit width of the integer transform matrix is optimized in the proposed VLSI architecture. The integer transform matrix with high-bit-width elements is decomposed into several SBT matrices with low-bit-width elements using the proposed matrix bit-plane decomposition method. The transform architecture with the SBT algorithm works more efficiently because its computations are of low bit width. A circuit reuse strategy is proposed specifically for the SBT to reduce the number of adders in the VLSI architecture, and it saves a large number of adders. The proposed transform hardware architecture can process video data at a higher speed and with moderate area compared with previous work.

### REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, "On image processing and a discrete cosine transform," IEEE Trans. Comput., vol. C-23, no. 1, pp. 90–93, Jan. 1974.

[2] G. K. Wallace, "JPEG still image data compression standard," Commun. ACM, vol. 34, no. 4, pp. 30–44, Apr. 1991.

[3] "Generic coding of moving pictures and associated audio information Part 2: Video ITU-T and ISO/IEC JTC1," in ITU-T Recommendation H.262, ISO/IEC 13818-2 (MPEG-2), Nov. 1994.

[4] "Video coding for low bit-rate communication Version 1," in ITU-T Recommendation H.263, Nov. 1995.

[5] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[6] L. Yu, S. Chen, and J. Wang, "Overview of AVS-video coding standards," Signal Process., Image Commun., vol. 24, no. 4, pp. 247–262, Apr. 2009.

[7] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[8] W.-H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. COM-25, no. 9, pp. 1004–1009, Sep. 1977.

[9] A. Ahmed, M. U. Shahid, and A. ur Rehman, "N-point DCT VLSI architecture for emerging HEVC standard," VLSI Design, vol. 2012, 2012, Art. no. 752024.

[10] Y. Arai, T. Agui, and M. Nakajima, "A fast DCT-SQ scheme for images," Trans. IEICE, vol. E71, no. 11, pp. 1095–1097, Nov. 1988.

[11] K. M. Tsui and S. C. Chan, "Error analysis and efficient realization of the multiplierless FFT-like transformation (ML-FFT) and related sinusoidal transformations," J. VLSI Signal Process. Syst., vol. 44, no. 1–2, pp. 97–115, Aug. 2006.

[12] S. H. Zhao and S. C. Chan, "Design and multiplierless realization of digital synthesis filters for hybrid-filter-bank A/D converters," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 10, pp. 2221–2233, Oct. 2009.

[13] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, "Efficient integer DCT architectures for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168–178, Jan. 2014.

[14] W. Zhao, T. Onoye, and T. Song, "High-performance multiplierless transform architecture for HEVC," in Proc. IEEE Int. Symp. Circuits Syst., May 2013, pp. 1668–1671.

[15] A. D. Darji and R. P. Makwana, "High-performance multiplierless DCT architecture for HEVC," in Proc. IEEE Int. Symp. VLSI Design Test, Jun. 2015, pp. 1–5.

#### Author's Profile:



G. Sreeram is currently pursuing an M.Tech degree in VLSI in Electronics and Communication Engineering at PBR Visvodaya Institute of Technology & Science, Kavali, Nellore District, A.P, India. He received his B.Tech in Electronics and Communication Engineering from PRRM Engineering College (affiliated to JNTU Hyderabad), RR Dist., A.P.

K. Ashok Kumar, M.Tech, MISTE, is currently working as an Assistant Professor in the Electronics and Communication Engineering (ECE) Department, PBR Visvodaya Institute of Technology & Science, Kavali, Nellore District, A.P, affiliated to the Jawaharlal Nehru Technological University Ananthapur.