Presentation
PARO: Hardware-software Co-design with Pattern-aware Reorder-based Attention Quantization in Video Generation Models
Description
Transformer-based video generation models have demonstrated significant potential in content creation. However, the current state-of-the-art models, which employ "3D full attention", face substantial computation and storage challenges. For instance, the attention map for CogVideoX-5B requires 56.50 GB, and generating a 49-frame video takes approximately 1 minute on an NVIDIA A100 GPU under FP16. Although model quantization has proven effective in reducing both memory and computational costs, applying it to video generation models remains difficult: algorithmic performance must be preserved while hardware processing stays efficient. To address these issues, we introduce PARO, a video generation accelerator with pattern-aware reorder-based attention quantization. PARO investigates the diverse attention patterns of 3D full attention and proposes a novel reorder technique that unifies these patterns into a single "block diagonal" structure. Block-wise mixed-precision quantization is then applied to achieve lossless compression at an average bitwidth of 4.80 bits. On the hardware side, existing mixed-precision computing units cannot fully exploit the attention map bitwidth to accelerate QK multiplication. To overcome this limitation, PARO designs an output-bitwidth-aware mixed-precision processing element (PE) array through hardware-software co-design, so that the mixed-precision characteristics are fully exploited to improve hardware efficiency in the bottleneck attention computation. Experiments demonstrate that PARO delivers up to 2.71× improvement in end-to-end performance compared to an NVIDIA A100 GPU and achieves speedups of up to 6.38× to 7.05× over state-of-the-art ASIC-based accelerators on the CogVideoX-2B and 5B models.
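To make the idea of block-wise mixed-precision quantization concrete, the following is a minimal illustrative sketch and not PARO's actual algorithm. It assumes an attention map that is already block diagonal after reordering, with values in [0, 1], and equal-sized blocks. The bitwidth set (8, 4, 2), the peak-value thresholds, and the function names `choose_bits`, `quantize_block`, and `blockwise_quantize` are all hypothetical choices made for illustration only.

```python
# Illustrative sketch only: the bitwidths, thresholds, and names are assumptions,
# not taken from the PARO paper.

def choose_bits(block, thresholds=((0.5, 8), (0.05, 4))):
    """Hypothetical rule: blocks with a larger peak value get more bits."""
    peak = max(max(row) for row in block)
    for threshold, bits in thresholds:
        if peak >= threshold:
            return bits
    return 2  # lowest precision for blocks with only small values


def quantize_block(block, bits):
    """Uniformly quantize values in [0, 1] onto 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [[round(v * levels) / levels for v in row] for row in block]


def blockwise_quantize(attn, block_size):
    """Quantize each block_size x block_size tile with its own bitwidth.

    Assumes a square map whose side is a multiple of block_size.
    Returns the quantized map and the per-block bitwidth map.
    """
    n = len(attn)
    out = [[0.0] * n for _ in range(n)]
    bit_map = []
    for r0 in range(0, n, block_size):
        bit_row = []
        for c0 in range(0, n, block_size):
            block = [row[c0:c0 + block_size] for row in attn[r0:r0 + block_size]]
            bits = choose_bits(block)
            for i, q_row in enumerate(quantize_block(block, bits)):
                out[r0 + i][c0:c0 + block_size] = q_row
            bit_row.append(bits)
        bit_map.append(bit_row)
    return out, bit_map


# Toy 4x4 attention map with a block-diagonal pattern (large values on the diagonal blocks).
ATTN = [
    [0.60, 0.30, 0.06, 0.04],
    [0.25, 0.65, 0.04, 0.06],
    [0.01, 0.02, 0.55, 0.42],
    [0.02, 0.01, 0.40, 0.57],
]

quantized, bit_map = blockwise_quantize(ATTN, block_size=2)
flat_bits = [b for row in bit_map for b in row]
average_bits = sum(flat_bits) / len(flat_bits)  # valid because all blocks have equal size
print(bit_map, average_bits)
```

In this toy case the two diagonal blocks keep 8 bits while the off-diagonal blocks drop to 4 and 2 bits, giving an average of 5.5 bits. The point of the sketch is only that concentrating large values into diagonal blocks lets most of the map use low precision; the 4.80-bit average reported above comes from the authors' own method and models.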
Event Type
Research Manuscript
Time
Tuesday, June 24, 10:45am - 11:00am PDT
Location
3001, Level 3
AI1: AI/ML Algorithms