Presentation
MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization
Description: Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management.
The primary bottleneck in long-context LLM inference is the quadratic computational complexity of attention mechanisms, causing substantial slowdowns as sequence length increases. The KV cache mechanism alleviates this issue by storing precomputed key and value tensors, but introduces memory requirements that scale linearly with context length, hindering efficient LLM deployment. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors:
i) On-the-fly quantization and de-quantization, causing significant performance overhead;
ii) Prevalence of outliers in KV values, challenging low-bitwidth uniform quantization.
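To make point (ii) concrete, here is a minimal sketch (hypothetical values, not drawn from the paper) of how a single outlier inflates the range of a uniform quantizer and wastes its precision on the bulk of the values:

```python
import numpy as np

def uniform_quantize(x, bits):
    # Map values linearly onto 2**bits levels spanning [min, max],
    # then reconstruct; the step size grows with the value range.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, 1024)        # typical, well-behaved values
with_outlier = np.append(normal, 50.0) # one large outlier widens the range

err_clean = np.abs(uniform_quantize(normal, 4) - normal).mean()
# Quantize with the outlier present, but measure error on the normal values.
err_outlier = np.abs(uniform_quantize(with_outlier, 4)[:-1] - normal).mean()
print(err_clean, err_outlier)  # the outlier sharply increases per-value error
```

This is why low-bitwidth uniform quantization degrades on outlier-heavy KV values, motivating a non-uniform scheme.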
To this end, we propose MILLION, a novel quantization framework achieving low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework for MILLION that leverages sparse computation and asynchronous quantization, significantly enhancing inference speed.
Comprehensive evaluation results demonstrate that MILLION can achieve 4-bit quantization with trivial perplexity and accuracy loss.
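For readers unfamiliar with product quantization, the following is a minimal generic sketch of the underlying idea (a textbook PQ pipeline, not MILLION's actual algorithm or hyperparameters): split each vector into subvectors, learn a small k-means codebook per subspace, and store only the per-subspace codebook indices.

```python
import numpy as np

def train_codebooks(X, n_sub, k, iters=10, seed=0):
    # Learn one k-centroid codebook per subspace with naive k-means.
    rng = np.random.default_rng(seed)
    d = X.shape[1] // n_sub
    books = []
    for s in range(n_sub):
        sub = X[:, s * d:(s + 1) * d]
        cent = sub[rng.choice(len(sub), k, replace=False)].copy()
        for _ in range(iters):
            dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        books.append(cent)
    return books

def pq_encode(X, books):
    # Replace each subvector with the index of its nearest centroid.
    d = books[0].shape[1]
    codes = []
    for s, cent in enumerate(books):
        sub = X[:, s * d:(s + 1) * d]
        dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
        codes.append(dist.argmin(1))
    return np.stack(codes, 1)  # (n, n_sub) small integer codes

def pq_decode(codes, books):
    # Reconstruct by concatenating the looked-up centroids.
    return np.concatenate(
        [books[s][codes[:, s]] for s in range(len(books))], axis=1)

X = np.random.default_rng(1).normal(size=(256, 32)).astype(np.float32)
books = train_codebooks(X, n_sub=8, k=16)  # 16 centroids -> 4-bit codes
codes = pq_encode(X, books)
rec = pq_decode(codes, books)
print(np.abs(X - rec).mean())  # reconstruction error stays modest
```

Because the codebooks are learned from the data, the quantization grid is non-uniform and can place centroids where values actually cluster, which is the property that makes PQ robust to outliers compared with a single uniform grid.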
Event Type
Research Manuscript
Time: Monday, June 23, 2:15pm - 2:30pm PDT
Location: 3000, Level 3
AI1: AI/ML Algorithms


