Efficiently Exploiting Inference Parallelism in Two-Sided Sparse CNNs for a High-Speed, Low-Cost Accelerator
The inherent parallelism of convolutional neural network (CNN) inference enables efficient and flexible data processing, so highly parallel compute paradigms are widely adopted to achieve high CNN performance. Exploiting sparsity has likewise become an indispensable technique for accelerating CNN inference. However, fully exploiting two-sided random sparsity (in both weights and input activations) can hinder parallel processing: the non-uniformity of sparse inputs incurs significant synchronization overhead, because inputs must be matched to supply enough valid operand pairs to the downstream parallel processing units. Although various architectures have been proposed, this challenge remains inadequately addressed. In this paper, we introduce a stride-aware data compression method coupled with a weight-stationary dataflow that fully leverages the parallel characteristics of CNNs to accelerate inference at low hardware cost and power consumption. Experimental results demonstrate that our technique achieves speedups of 1.17×, 1.16×, and 1.32× over the recent accelerator SparTen for VGG16, GoogLeNet, and ResNet34, respectively (and 0.82× for MobileNetV1). Furthermore, an FPGA implementation of our core shows a notable 4.8× reduction in hardware size and a 5.25× improvement in energy efficiency compared to SparTen.
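To make the input-matching problem concrete, the sketch below illustrates (in software, and not as the paper's method) how a bitmask-based two-sided sparse accelerator such as SparTen pairs operands: only positions where both the weight bitmask and the activation bitmask are set produce a valid multiply, and prefix counts over the masks locate the corresponding entries in the compressed value arrays. All names here are hypothetical and for illustration only.

```python
def sparse_dot(w_mask, w_vals, a_mask, a_vals):
    """Illustrative two-sided sparse dot product.

    w_mask/a_mask: 0/1 lists marking nonzero positions (same length).
    w_vals/a_vals: compressed arrays holding only the nonzero values,
    in position order. A real accelerator performs the mask AND and
    the prefix counting in parallel hardware; the irregular number of
    matches per step is what causes the synchronization overhead
    discussed above.
    """
    acc = 0
    wi = ai = 0  # running prefix counts into the compressed arrays
    for wm, am in zip(w_mask, a_mask):
        if wm and am:          # valid pair: both operands are nonzero
            acc += w_vals[wi] * a_vals[ai]
        wi += wm               # advance compressed-weight index
        ai += am               # advance compressed-activation index
    return acc
```

Because the number of surviving (wm AND am) pairs varies randomly from step to step, parallel units fed by this matching stage can starve or stall, which is the non-uniformity problem the proposed stride-aware compression targets.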