Presentation
Grasp: Group-based Prediction of Activation Sparsity for Fast LLM Inference
Description
Optimizing LLM inference has become increasingly important as the demand for efficient on-device deployments grows. To reduce the computational overhead in the MLP components, which account for a significant portion of LLM inference, ReLU-fied LLMs have been introduced to maximize activation sparsity. Several sparsity prediction methods have been developed to efficiently skip unnecessary memory accesses and computations by predicting activation sparsity. In this paper, we propose a novel magnitude-based, training-free sparsity prediction technique called Grasp that builds on the existing sign bit-based method for ReLU-fied LLMs. The proposed method enhances prediction accuracy by grouping values according to the distribution within vectors and explicitly accounting for statistical outliers. This allows us to estimate the impact of each element more accurately yet efficiently, improving both activation sparsity prediction accuracy and computational efficiency. Compared to the state-of-the-art technique, Grasp achieves higher sparsity prediction accuracy and 11% higher skipping efficiency, which corresponds to a 1.85× speedup over dense inference.
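The abstract's premise can be illustrated with a minimal numpy sketch: in a ReLU-fied MLP layer, neurons with negative pre-activations produce exact zeros, so their rows of the down-projection can be skipped without changing the output. The predictor below is an oracle (the true pre-activation sign); Grasp's contribution is approximating this decision cheaply from sign bits and grouped magnitudes, which this sketch does not implement. All names (`d_model`, `d_ff`, `W_up`, `W_down`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_up = rng.standard_normal((d_model, d_ff))    # up-projection
W_down = rng.standard_normal((d_ff, d_model))  # down-projection
x = rng.standard_normal(d_model)

# Dense path: full matmuls through the MLP block.
pre = x @ W_up
act = np.maximum(pre, 0.0)  # ReLU zeroes out negative pre-activations
dense_out = act @ W_down

# Sparse path: compute only the neurons predicted active.
# Here the mask is exact; a practical predictor (like Grasp) would
# estimate it before computing `pre` in full, to skip memory accesses.
active = pre > 0
sparse_out = act[active] @ W_down[active]

sparsity = 1.0 - active.mean()  # fraction of neurons skipped
```

Because ReLU zeros contribute nothing to the down-projection, `sparse_out` matches `dense_out` exactly; the speedup comes from touching only the `active` rows of `W_down`.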
Event Type
Research Manuscript
Time
Monday, June 23, 1:30pm - 1:45pm PDT
Location
3000, Level 3
AI
AI1: AI/ML Algorithms