SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU
Description
Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision tasks by effectively extracting global features. However, their self-attention mechanism suffers from quadratic time and memory complexity as image resolution or video duration grows, making them inefficient on GPUs. Existing works accelerate ViTs mainly by pruning tokens based on value-level sparsity, but they fall short of peak performance because they overlook bit-level sparsity. Instead, we propose the Inter-token Bit-sparsity Awareness (IBA) algorithm, which accelerates ViTs by exploiting the bit-level sparsity between similar tokens. We then implement IBA on GPUs, synergizing CUDA Cores and Tensor Cores by addressing two issues: first, bandwidth congestion in the Register File hinders the parallel execution of CUDA Cores and Tensor Cores; second, the varying exponents of floating-point vectors make it hard to accelerate bit-sparse matrix multiplication and accumulation (MMA) in the Tensor Core with fixed-point bit-level circuits. Therefore, we present SynGPU, an algorithm-hardware co-design framework for accelerating ViTs. SynGPU enhances data reuse through a novel data mapping that enables full parallelism of CUDA Cores and Tensor Cores, and it introduces a Bit-Serial Tensor Core (BSTC) that supports both fixed- and floating-point MMA by combining fixed-point Bit-Serial Dot Product (BSDP) with exponent alignment. Extensive experiments show that SynGPU achieves an average speedup of 2.15$\times$ $\sim$ 3.95$\times$ and 2.49$\times$ $\sim$ 3.81$\times$ higher compute density over the A100 GPU.
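To make the BSDP and exponent-alignment ideas concrete, below is a minimal scalar C++ sketch; it is not the paper's BSTC hardware, and the function names (bsdp, align_block), the 8-bit mantissa width, and the non-negative-activation assumption are ours for illustration. It decomposes one operand into bit-planes so that all-zero planes (the bit-level sparsity being exploited) contribute nothing, and it aligns a block of floats to the block-wide maximum exponent so the same fixed-point dot product can process floating-point data.

#include <algorithm>
#include <climits>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fixed-point bit-serial dot product (illustrative): operand A is consumed
// one bit-plane at a time; plane j contributes (sum of b[i] over indices
// where bit j of a[i] is set) << j. All-zero planes cost nothing.
int64_t bsdp(const std::vector<uint8_t>& a, const std::vector<int32_t>& b) {
    int64_t acc = 0;
    for (int j = 0; j < 8; ++j) {
        int64_t plane = 0;
        for (size_t i = 0; i < a.size(); ++i)
            if ((a[i] >> j) & 1) plane += b[i];
        if (plane != 0) acc += plane << j;   // zero bit-planes are skipped
    }
    return acc;
}

// Exponent alignment (illustrative): quantize each non-negative float to an
// 8-bit fixed-point mantissa scaled by the block-wide maximum exponent.
// Assumes at least one nonzero element in the block.
std::vector<uint8_t> align_block(const std::vector<float>& x, int& max_exp) {
    max_exp = INT_MIN;
    for (float v : x) {
        int e;
        std::frexp(v, &e);                   // v = f * 2^e with |f| in [0.5, 1)
        if (v != 0.0f && e > max_exp) max_exp = e;
    }
    std::vector<uint8_t> m(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int e;
        float f = std::frexp(x[i], &e);
        int shift = 7 - (max_exp - e);       // smaller exponent -> fewer bits kept
        long q = shift >= 0 ? std::lround(f * (1 << shift)) : 0;
        m[i] = (uint8_t)std::min(q, 127L);   // clamp to the 8-bit mantissa
    }
    return m;
}

int main() {
    std::vector<float> a = {0.75f, 1.5f, 0.0f, 3.0f};  // activations (non-negative)
    std::vector<int32_t> b = {2, -1, 4, 3};            // integer weights
    int max_exp;
    std::vector<uint8_t> m = align_block(a, max_exp);
    double approx = std::ldexp((double)bsdp(m, b), max_exp - 7);
    double exact = 0.0;
    for (size_t i = 0; i < a.size(); ++i) exact += a[i] * b[i];
    std::printf("exact %.4f  bit-serial %.4f\n", exact, approx);  // both 9.0000
}

For the sample vectors the bit-serial path reproduces the exact dot product (9.0) because the aligned mantissas lose no precision; in general, alignment truncates low-order bits of small-exponent elements, which is the trade-off that lets a fixed-point bit-serial unit handle floating-point MMA.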
Event Type
Research Manuscript
Time
Wednesday, June 25, 1:45pm - 2:00pm PDT
Location
3001, Level 3