TY - JOUR
T1 - EPA
T2 - The effective pipeline architecture for CNN accelerator with high performance and computing efficiency based on FPGA
AU - Zhang, Junjie
AU - Yin, Qiao
AU - Hu, Weicheng
AU - Li, Yunfeng
AU - Li, Hu
AU - Ye, Nan
AU - Cao, Bingyao
N1 - Publisher Copyright:
© 2021 John Wiley & Sons, Ltd.
PY - 2023/8/15
Y1 - 2023/8/15
AB - Thanks to the great developments of the latest Field Programmable Gate Array (FPGA), the performance bottleneck of Deep Learning hardware accelerators has been converted to computing ability. In this paper, a novel FPGA-based Convolutional Neural Network (CNN) Accelerator architecture, named the Effective Pipeline Architecture (EPA) is proposed to optimize the resource usage for the implementation of the CNN calculation. As the unique storage strategies, which contain many creative designing details, are adopted and optimized for different CNN models and layers, great DSP computing efficiency can be achieved in the fine-grained pipeline. Moreover, compared with the traditional architectures, through the kernel combination and data scheduling, twice throughput for the general matrix multiplication is realized in a great many parallel DSP48E resources. As a result, the realization of Yolov2-Tiny achieves 873 Giga Operations Per Second (GOPS) by 902 DSPs with 67 Frames Per Second (FPS), and the computing efficiency in most layers can even reach more than 90%, which improves the calculation performance and efficiency comparing with the previous designs, and is significant to meet the increasing computing requirement.
KW - CNN accelerator
KW - FPGA
KW - computing efficiency
KW - fine-grained pipeline
KW - high performance
KW - pipeline architecture
UR - http://www.scopus.com/inward/record.url?scp=85103415502&partnerID=8YFLogxK
U2 - 10.1002/cpe.6198
DO - 10.1002/cpe.6198
M3 - Article
AN - SCOPUS:85103415502
SN - 1532-0626
VL - 35
JO - Concurrency and Computation: Practice and Experience
JF - Concurrency and Computation: Practice and Experience
IS - 18
M1 - e6198
ER -