Power Efficient Design of High-Performance Convolutional Neural Networks Hardware Accelerator on FPGA: A Case Study With GoogLeNet


Bibliographic Details
Main authors: Ahmed J. Abd El-Maksoud, Mohamed Ebbed, Ahmed H. Khalil, Hassan Mostafa
Format: article
Language: EN
Published: IEEE 2021
Online access: https://doaj.org/article/76953a014d404d07a2d8a929652c98f7
Description
Summary: Convolutional neural networks (CNNs) have dominated image recognition and object detection models in the last few years. They achieve the highest accuracies in several application domains, such as automotive and biomedical applications. CNNs are usually implemented using Graphics Processing Units (GPUs) or generic processors. Although GPUs are capable of performing the complex computations needed by CNNs, their power consumption is huge compared to generic processors. Moreover, current generic processors are unable to cope with the growing computational demands of CNNs. Therefore, hardware accelerators are the best choice for providing the computation performance that CNNs require at an affordable power consumption. Several techniques are adopted in hardware accelerators, such as pruning and quantization. In this paper, a low-power dedicated CNN hardware accelerator is proposed, with the GoogLeNet CNN as a case study. Weight pruning and quantization are applied to reduce the memory size by $57.6\times$. Consequently, only FPGA on-chip memory is used for weight and activation storage, without using off-chip DRAMs (Dynamic Random Access Memories). In addition, the proposed hardware accelerator uses zero DSP (Digital Signal Processing) units, as all multiplications are replaced by shift operations. The accelerator is based on a time-sharing/pipelined architecture that processes the CNN model layer by layer. The architecture introduces a new data fetching mechanism that increases data reuse. Moreover, the proposed accelerator units are implemented in native RTL (Register Transfer Level). The accelerator classifies 25.1 frames per second (fps) while consuming only 3.92 W, which is more power-efficient than other GoogLeNet implementations on FPGA in the literature.
In addition, the proposed accelerator achieves an average classification efficiency of 91%, which is significantly higher than comparable architectures. Furthermore, this accelerator surpasses popular CPUs such as the Intel Core-i7 and GPUs such as the GTX 1080Ti in terms of the number of frames processed per watt.
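
The summary states that all multiplications are replaced by shift operations but does not describe the quantization scheme. As a minimal sketch, assuming weights are quantized to signed powers of two (an assumption, not something the record confirms), the idea could look like this in Python. The function names, the exponent range, and the integer fixed-point activation format are all hypothetical choices for illustration.

```python
import math


def quantize_to_power_of_two(w: float, min_exp: int = -7, max_exp: int = 0):
    """Approximate a weight as sign * 2**exp (illustrative scheme, assumed).

    Returns (sign, exp). A sign of 0 marks a zero (for example, pruned) weight.
    The exponent is rounded in the log2 domain and clamped to
    [min_exp, max_exp]; both bounds are hypothetical defaults.
    """
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(activation: int, sign: int, exp: int) -> int:
    """Multiply an integer activation by sign * 2**exp using only shifts."""
    if sign == 0:
        return 0
    # Positive exponents shift left, negative exponents shift right.
    shifted = activation << exp if exp >= 0 else activation >> -exp
    return shifted if sign > 0 else -shifted


# Usage: a weight of 0.25 becomes (+1, -2), so multiplying is a right shift by 2.
sign, exp = quantize_to_power_of_two(0.25)
product = shift_multiply(64, sign, exp)
```

With weights stored as a sign and a small exponent, each product needs only a barrel shifter and a negation rather than a hardware multiplier, which is consistent with the zero-DSP claim in the summary. Note that a right shift truncates low-order bits, so a real design would have to choose the activation bit width with that in mind.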