BBAL: A Bidirectional Block Floating Point-Based Quantization Accelerator for Large Language Models
Description
Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block floating-point (BFP) quantization reduces memory and computational overhead by converting high-overhead floating-point operations into low-bit fixed-point operations. However, BFP requires aligning all data to the maximum exponent, which causes loss of small and moderate values, resulting in quantization error and degraded LLM accuracy. To address this issue, we propose a Bidirectional Block Floating-Point (BBFP) data format, which reduces the probability of selecting the maximum as the shared exponent, thereby reducing quantization error. By exploiting the features of BBFP, we present a full-stack Bidirectional Block Floating Point-Based Quantization Accelerator for LLMs (BBAL), primarily comprising a BBFP-based processing element (PE) array paired with our proposed cost-effective nonlinear computation unit. Experimental results show that BBAL achieves a 22% accuracy improvement over an outlier-aware accelerator at similar efficiency, and a 40% efficiency improvement over a vanilla BFP-based accelerator at similar accuracy.
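
To make the quantization error concrete, here is a minimal NumPy sketch of vanilla BFP, the baseline the abstract critiques; it is not the proposed BBFP format, whose construction the abstract does not detail. The function name, the 4-bit mantissa width, and the sample block are illustrative assumptions.

import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    """Vanilla BFP: every element shares the exponent of the
    largest-magnitude value, and mantissas are rounded to
    `mantissa_bits`-bit signed fixed point."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block)
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Fixed-point step implied by the shared exponent and mantissa width.
    step = 2.0 ** (shared_exp - (mantissa_bits - 2))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / step), lo, hi)
    return mantissas * step

# One outlier forces a large shared exponent, so the small values
# round to zero: the loss BBFP is designed to reduce.
block = np.array([6.3, 0.11, 0.07, -0.02])
print(bfp_quantize(block))  # [ 6.  0.  0. -0.]

The sketch shows why aligning a whole block to the maximum exponent is lossy: the fixed-point step is sized for the outlier, so values far below it fall beneath the rounding threshold.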
Event Type
Research Manuscript
Time
Tuesday, June 24, 5:00pm - 5:15pm PDT
Location
3000, Level 3
AI
AI3: AI/ML Architecture Design