A Low Latency Low Area Floating Point Multiply-Accumulate Unit for DNN Acceleration
Digital systems use multiply-accumulate (MAC) units to perform a fused multiply-and-add operation in a single step. MAC units form the foundation of a variety of applications, such as digital signal processing, deep learning, and artificial intelligence, all of which require repetitive multiplications and additions; MAC units are ideal for these operations. The performance of these applications depends heavily on the speed of the adders and multipliers used in the MAC units.

Deep neural networks (DNNs) have recently gained popularity in scientific computing and are widely used to solve complex problems. Convolutional neural networks (CNNs), a popular class of DNNs, have shown considerable performance in many applications, such as image processing, signal processing, pattern recognition, and computer vision. In these networks, convolution operations account for more than 90% of the computation, and MAC units are the primary components that perform these convolution operations.

The primary objective of this thesis is to design and implement mixed-precision MAC units with low latency and low area for deep neural network acceleration. This research investigates 16-bit MAC units designed with different multiplier and adder algorithms. MAC units with various multiplier–adder combinations are analyzed, and the best combination in terms of latency and area is implemented. The MAC units are designed and implemented in Verilog and synthesized and simulated using Xilinx Vivado. The physical realization of the design is carried out in Cadence Innovus using the 45 nm TSMC technology node. Lastly, the timing, power, and area reports are generated using Synopsys Design Compiler.