# ALU DESIGN AND SIMULATION USING FOLDED TREE AND CLOCK GATING

<sup>1</sup> Sunitha K, <sup>2</sup> Nagesh K N, <sup>3</sup> H VenkateshKumar <sup>1</sup>PG Student, <sup>2</sup>Professor, <sup>3</sup>Professor <sup>1</sup> Department of Electronics and Communication. <sup>1</sup>Nagarjuna College of Engineering and Technology, Bengaluru, India

Abstract: An arithmetic and logic unit (ALU) is a core component of a processor. It performs arithmetic and logic operations such as addition, subtraction, multiplication, logical AND and logical OR on input operands and produces an output. In applications such as digital signal processing and cryptographic implementations, delay and power consumption are important design parameters. An ALU built with a ripple carry adder and a Dadda multiplier has a comparatively large delay. Because the adder and multiplier are the most important modules for the arithmetic operations of an ALU, this 8-bit ALU design focuses on them: addition is performed by a Ladner-Fischer Adder (LFA) and multiplication by a folded tree architecture, together with clock gating, which decreases the maximum combinational path delay by about 30%. Synthesis and simulation of the 8-bit ALU are performed in Xilinx ISE, and the output waveforms are observed using the Xilinx ISE simulator and ModelSim.

#### Index Terms -ALU, Adder, Multiplier, Clock gating.

## I. INTRODUCTION

An arithmetic logic unit (ALU) is a versatile digital circuit that performs arithmetic and logical operations on binary integers. The concept of the ALU was proposed in 1945 by the mathematician John von Neumann. The ALU is the core component of the central processing unit (CPU) of a computer. It accepts n-bit inputs called operands and executes the operation specified by the opcode to produce an n-bit output, as shown in Fig. 1. In applications such as image processing and digital signal processing (DSP), high-performance CPUs use powerful and complex ALUs. There is therefore a demand for ALUs with reduced delay and lower power consumption [1].

To allow the development of smaller but more powerful computer systems, the ALU is becoming both more complex and smaller. However, some limiting factors slow this development. There is considerable scope for speed improvement at the circuit design level through a suitable choice of logic style for the combinational circuits. The chosen logic style has a strong impact on the parameters that govern speed: switching capability, transition activity, and short-circuit and leakage currents. Which aspects of performance matter most depends on the application and the design technique used. Delay considerations in particular attract the attention of the VLSI design community [2].

In this paper, the traditional ripple carry adder block of the ALU is replaced by a parallel prefix adder, the Ladner-Fischer Adder, and a folded tree architecture is introduced for multiplication. The folded tree architecture maximizes the reuse of processing elements in order to reduce power and, mainly, delay.



Fig. 1. ALU block

## II. DESIGN OF ALU

The building blocks of an ALU generally consist of inverters, AND gates, OR gates, flip-flops, multiplexers and registers, which execute and control the arithmetic and logical operations. An ALU operates on input data of the fixed width for which it is designed, such as 8, 16, 32 or 64 bits. The performance of any processing circuit depends on the performance of its ALU. Logic operations such as AND and OR are implemented directly with logic gates, which are simple to design and control. The design of the arithmetic operations is chosen to provide the desired performance to the processing circuit [3].

Clock gating involves clocking modules selectively, only when they are required. A gated clock is formed by ANDing the clock signal with an enable signal [4] and is then applied to the modules of the circuit. Depending on the opcode, the gated clock is applied to the one module that performs the specified operation [5].
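The AND-based gating described here can be illustrated with a small behavioral model. This is only a sketch in Python, not the Verilog used for the design; the function name `gated_clock` and the sample waveforms are assumptions made for illustration.

```python
def gated_clock(clk, enable):
    """Return the gated clock level: the clock ANDed with the enable."""
    return clk & enable

# Hypothetical sample waveforms (not taken from the paper): a free-running
# clock and an enable that is high only while the module is needed.
clk    = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [0, 0, 1, 1, 1, 1, 0, 0]

# The module receives clock pulses only while its enable is high.
gclk = [gated_clock(c, e) for c, e in zip(clk, enable)]
print(gclk)
```

The model ignores timing. In hardware, a plain AND gate can glitch if the enable changes while the clock is high, which is why latch-based clock gating cells are commonly used.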

An ALU that uses a ripple carry adder (RCA) and a Dadda multiplier without clock gating has a larger delay. The RCA cascades n one-bit full adders to form an n-bit adder, so each stage must wait for the carry from the previous stage. The Dadda multiplier is one of the fastest column compression multipliers; it is similar to the Wallace multiplier and requires relatively few gates.
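To make the baseline concrete, the following sketch models an n-bit ripple carry adder as a chain of full adders. It is a behavioral illustration in Python under the assumption of unsigned operands; the function names are hypothetical and do not come from the paper's Verilog sources.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, width=8):
    """Add two unsigned `width`-bit numbers with cascaded full adders."""
    carry = 0
    total = 0
    for i in range(width):
        # Stage i cannot finish until the carry from stage i-1 is known,
        # which is the source of the long carry chain delay.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The sequential loop mirrors the hardware dependency: the worst-case delay grows linearly with the word width.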

The proposed ALU design uses the parallel prefix Ladner-Fischer Adder (LFA) for addition of 8-bit numbers and the folded tree architecture (FTA) for multiplication, together with clock gating.

Depending on the opcode value, the corresponding arithmetic or logical operation is performed, as shown in Table 1.

| Opcode Values | Operations             |
|---------------|------------------------|
| 000           | Addition (A+B)         |
| 001           | Subtraction (A-B)      |
| 010           | Multiplication (A*B)   |
| 011           | Logical AND (A.B)      |
| 100           | Logical OR (A OR B)    |

Table 1. ALU operations with opcode values

The flow chart of the ALU operations is shown in Fig. 2. The clock signal runs throughout each step. The reset control signal enables circuit operation. Depending on the opcode value, clock gating is enabled by setting the corresponding enable input, for example ADD\_select or OR\_select, to logic high. The functional module whose enable input is high performs its arithmetic or logical operation on operands A and B.
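The opcode-driven selection can be sketched as a behavioral model. The enable names follow the gated clock signals named in the simulation results (en\_add, en\_sub, en\_mul, en\_and, en\_or); the 16-bit output width, the two's complement wrap-around for subtraction, and the zero output for unused opcodes are assumptions, since the text does not specify them.

```python
# Opcode map taken from Table 1.
OPCODES = {0b000: "add", 0b001: "sub", 0b010: "mul", 0b011: "and", 0b100: "or"}

def decode_enables(opcode):
    """One-hot enables: only the module selected by the opcode is clocked."""
    return {f"en_{name}": int(code == opcode) for code, name in OPCODES.items()}

def alu(a, b, opcode, reset, out_width=16):
    """Behavioral 8-bit ALU; out_width=16 is an assumption, not from the paper."""
    if reset == 0:                  # output is zero while reset is logic zero
        return 0
    mask = (1 << out_width) - 1
    en = decode_enables(opcode)
    if en["en_add"]:
        return (a + b) & mask
    if en["en_sub"]:
        return (a - b) & mask       # assumed two's complement wrap-around
    if en["en_mul"]:
        return (a * b) & mask
    if en["en_and"]:
        return a & b
    if en["en_or"]:
        return a | b
    return 0                        # unused opcodes: assumed to give zero

print(alu(12, 5, 0b010, reset=1))
```

Decoding the opcode into one-hot enables first, and only then selecting the operation, mirrors the flow in which exactly one functional module receives a gated clock at a time.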

### 2.1 Ladner Fischer Adder

The Ladner-Fischer adder used for addition is a parallel prefix adder developed by R. Ladner and M. Fischer in 1980. Parallel prefix addition computes all carries in parallel, so that a bit position does not have to wait for a carry to ripple from the adder stage where it originated. The construction has three stages: the pre-processing stage, the carry generate stage and the post-processing stage [6].

Pre-Process Stage: The generate (gi) and propagate (pi) signals are produced in this stage. The generate signal gi indicates whether a carry out is generated at bit i, while the propagate signal pi indicates whether an incoming carry is forwarded to the next bit. If A and B are the inputs, gi and pi are given by Eq.1 and Eq.2 respectively.

$$g_i = A_i \text{ AND } B_i \qquad (1)$$

$$p_i = A_i \text{ XOR } B_i \qquad (2)$$

Carry Generate Stage: This stage has a tree structure of black cells and buffers in which computation occurs in parallel. Generate and propagate signals are provided as input pairs (gx, px) and (gy, py) to the group PG stage, which computes the group propagate and group generate signals given by Eq.3 and Eq.4 respectively, shown here for (x, y) = (0, 1).

$$C_p = p_1 \text{ AND } p_0 \qquad (3)$$

$$C_g = g_1 \text{ OR } (p_1 \text{ AND } g_0) \qquad (4)$$

Post Processing Stage: This is the final step, in which each sum bit is calculated by XORing the carry out of the previous bit position with the propagate bit of the current position, as given by Eq.5 [7].

$$S_i = C_{i-1} \text{ XOR } p_i \qquad (5)$$
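The three stages can be sketched as a bit-level Python model. The text does not give the exact cell arrangement of the carry tree, so the prefix network below is an assumption: the minimum-depth arrangement commonly drawn for this adder, with log2(width) levels. The function name and the optional `carry_in` argument are also hypothetical.

```python
def ladner_fischer_add(a, b, width=8, carry_in=0):
    """Add two unsigned `width`-bit numbers with a parallel prefix carry tree."""
    # Pre-processing stage: per-bit generate and propagate signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]

    # Working copies that grow into group signals covering bits i down to 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & carry_in)      # fold the carry-in into bit 0

    # Carry generate stage: at each level, the upper half of every block
    # combines with the top bit of the lower half (group PG cell).
    span = 1
    while span < width:
        for i in range(width):
            if i & span:
                j = (i & ~(span - 1)) - 1    # top bit of the lower half
                G[i] = G[i] | (P[i] & G[j])  # group generate
                P[i] = P[i] & P[j]           # group propagate
        span <<= 1

    # Post-processing stage: sum bit = propagate XOR carry into that bit.
    total = 0
    for i in range(width):
        c_prev = carry_in if i == 0 else G[i - 1]
        total |= (p[i] ^ c_prev) << i
    return total, G[width - 1]               # (sum, carry out)

print(ladner_fischer_add(200, 100))
```

Within one level every update reads only positions that are not written at that level, so all cells of a level could evaluate concurrently in hardware. This gives the logarithmic depth, at the cost of higher fan-out on some nodes.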

### 2.2 Folded Tree Architecture for Multiplication

Folded Tree Architecture reduces the total number of processing elements (PEs) in a VLSI design [8]. PEs are reused with the help of a counter and a finite state machine (FSM), which contains the counter. The number of times a given PE is to be reused is set by the iteration count. The PE most commonly used in digital systems is the adder. In each iteration the counter value is incremented and compared with the iteration count. When the counter reaches the iteration count, the result held in the PE array is sent to the output. Here FTA is used for multiplication, where the PEs are half adders and full adders, which are used to process the partial products and produce the final product [9].
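One possible reading of this scheme is sketched below: the levels of an adder tree that sums the partial products are folded onto a single row of adder PEs, which is reused once per iteration under control of a counter. The paper does not give the exact PE arrangement, so the structure, the function name and the choice of log2(width) iterations are assumptions.

```python
def folded_tree_multiply(a, b, width=8):
    """Multiply two unsigned numbers by reusing one row of adder PEs.

    Assumes `width` is a power of two.
    """
    # Partial products: a shifted left by i wherever bit i of b is set.
    values = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]

    iteration_count = width.bit_length() - 1   # tree depth, log2(width)
    counter = 0
    while counter < iteration_count:           # FSM loop
        # The same adder PEs are reused on every pass; each PE adds one
        # adjacent pair, halving the number of operands.
        values = [values[2 * k] + values[2 * k + 1]
                  for k in range(len(values) // 2)]
        counter += 1

    # The counter has reached the iteration count: release the product.
    return values[0], counter

print(folded_tree_multiply(13, 11))
```

For an 8-bit multiplication this model needs at most four adder PEs and three iterations, instead of the seven adders of an unfolded tree. The extra iterations are consistent with the delay before the final product that is visible in the multiplication waveform.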



Fig. 2. Flow chart of ALU operations with opcode

## III. SYNTHESIS AND SIMULATION RESULT

The ALU is designed in Verilog HDL, and the Xilinx ISE Design Suite 14.6 Project Navigator is used for synthesis. The top view and RTL view are shown in Fig. 3 and Fig. 4 respectively.



Fig. 3. Basic top view of ALU



Fig. 4. RTL view of ALU

The built-in Xilinx ISE simulator is used with the selected device 6slx4tqg144-3, and waveforms are observed using ModelSim Altera 10.4b. Figure 5 shows the output of the ALU designed with a ripple carry adder for addition and a Dadda multiplier for multiplication, for the opcode values in Table 1.



Fig. 5. Simulation output of ALU using ripple carry adder and Dadda multiplier

Figure 6 shows the output for the arithmetic operations: addition using the LFA, subtraction, and multiplication using the folded tree. The output is zero when reset is logic zero, and during multiplication there is a delay before the final product appears.



Fig. 6. Simulation output of ALU for arithmetic operations

Figure 7 shows the output for the logical operations AND and OR, selected by the opcode: opcode 011 performs the AND operation and opcode 100 performs the OR operation.



Fig. 7. Simulation output of ALU for logical operations

The overall ALU output is shown in Fig. 8 together with the clock and reset signals. When reset is 0 the output is zero, and when reset is 1 the operation selected according to Table 1 is performed. Clock gating is also visible: depending on the opcode value, the gated clock signals en\_add, en\_sub, en\_mul, en\_and and en\_or are applied to the Ladner-Fischer adder, SUB, folded multiplier, AND\_gate and OR\_gate functional units respectively.



Fig. 8. ALU simulation output waveform with gated clock signals

The power summary of the ALU using the LFA and folded tree multiplier is shown in Fig. 9. The dynamic power is 1 mW and the total power is 15 mW.

| On-Chip | Power (W) | Used | Available | Utilization (%) |
|---------|-----------|------|-----------|-----------------|
| Clocks  | 0.001     | 7    |           |                 |
| Logic   | 0.000     | 280  | 2400      | 12              |
| Signals | 0.000     | 383  |           |                 |
| IOs     | 0.000     | 37   | 102       | 36              |
| Leakage | 0.014     |      |           |                 |
| Total   | 0.015     |      |           |                 |

|                  | Total | Dynamic | Quiescent |
|------------------|-------|---------|-----------|
| Supply Power (W) | 0.015 | 0.00    | 0.014     |

Fig. 9. Power analysis summary of proposed ALU

The timing summary with speed grade -3 for both ALUs is shown in Table 2.

Table 2. Comparison of combinational path delay

| ALU                                           | Maximum combinational path delay |
|-----------------------------------------------|----------------------------------|
| ALU using RCA and Dadda multiplier            | 11.674 ns                        |
| ALU using LFA and multiplier using folded tree | 8.066 ns                         |

The ALU designed using the LFA and the folded tree architecture for multiplication shows a maximum combinational path delay of 8.066 ns, a maximum frequency of 226.134 MHz, a minimum period of 4.422 ns, a minimum input arrival time before clock of 5.032 ns and a maximum output required time after clock of 7.747 ns. The maximum combinational path delay is thus reduced by about 3.6 ns, as shown in Table 2.
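The reported figures can be cross-checked with a short calculation. The snippet uses only the values from Table 2 and the timing summary; the variable names are illustrative.

```python
rca_dadda_ns = 11.674     # Table 2: ALU using RCA and Dadda multiplier
lfa_folded_ns = 8.066     # Table 2: ALU using LFA and folded tree multiplier

reduction_ns = rca_dadda_ns - lfa_folded_ns
reduction_pct = 100 * reduction_ns / rca_dadda_ns

min_period_ns = 4.422     # timing summary of the proposed ALU
max_freq_mhz = 1000 / min_period_ns

# About 3.608 ns, i.e. roughly a 30.9 % reduction in path delay.
print(f"{reduction_ns:.3f} ns, {reduction_pct:.1f} %")
# About 226.1 MHz, matching the reported maximum frequency up to rounding.
print(f"{max_freq_mhz:.1f} MHz")
```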

## IV. CONCLUSION

An 8-bit ALU is designed with its arithmetic and logical functions in separate modules so as to optimize ALU performance. Adders are crucial modules in digital systems because of their extensive use; here the adder is designed as a parallel prefix LFA, multiplication uses the folded tree architecture, and clock gating is applied to all modules. The main advantage of this 8-bit ALU design is the reduction in delay shown in Table 2, which increases the speed of ALU operation. This results from the Ladner-Fischer adder, in which the carry bits are calculated in parallel with minimum logic depth at the cost of area, and from the folded tree architecture for multiplication, in which PEs are reused. The delay is reduced by about 3.6 ns, a reduction of about 30% compared with the ALU using the RCA and Dadda multiplier. The ALU design can be extended further to 64 bits for use in high-speed DSP applications and wireless sensor nodes.

## REFERENCES

- [1] Amirthalakshmi T. M., S. Selvakumarraja, "Design of Low Power Four Function 8-Bit ALU for Nano based systems", IEEE ICCSP, 2015.
- [2] Manit Kantawala, "Design and Implementation of 8 Bit and 16 Bit ALU Using Verilog Language", International Journal of Engineering Applied Sciences and Technology, Vol. 3, Issue 2, 2018.
- [3] M. S. Sikarwar, Sudha Nair, "Clock Gated and Enable Controlled 64 bit ALU architecture for Low Power Applications", International Journal of Research in Engineering and Technology, Volume 3, Issue 12, Dec 2014.
- [4] Jitesh Shinde, Dr. S. S. Salankar, "Clock Gating - A Power Optimizing Technique for VLSI Circuits", India Conference, Annual
- [5] P. C. Bhaskar, Vikas K. Jathar, "Development of Processor Engine for FPGA Based Clock Gating and Performing Power Analysis", Computing Communication Control and Automation, 2016.
- [6] Cedric Walravens, Wim Dehaene, "Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 22, Issue 2, 2014.
- [7] P. Divya, M. Purna Sekhar, "Design of 16 Bit Ladner Fischer Based Modified Carry Select Adder Using D-Latch", International Journal of VLSI System Design and Communication Systems, Volume 02, Issue no. 11, December 2014.
- [8] K. Hemapriya, R. Karthikeyan, "Low-Power Folded Tree Architecture and Multi-Bit Flip-Flop Merging Technique for WSN Nodes", International Conference on Information Communication and Embedded Systems, 2014.
- [9] K. Ranjithkumar, T. R. V. Anandharajan, "DSP Architecture with Folded Tree for Power Constraint Devices", Sixth International Conference on Advanced Computing, 2014.