Presentation

DIAS: Distance-based Attention Sparsity for Ultra-Long-Sequence Transformer with Tree-like Processing-in-Memory Architecture
Description

Long-context inference has become a central focus of recent autoregressive Transformer research. However, the decode stage remains challenging due to the computational complexity of attention mechanisms and the substantial overhead of KV cache storage. Although attention sparsity has been proposed as a potential solution, conventional sparsity methods that rely on heuristic algorithms often suffer accuracy degradation when applied to ultra-long sequences.

To resolve the dilemma between accuracy and performance, this work proposes DIAS, a distance-based irregular attention sparsity approach with a processing-in-memory (PIM) architecture. DIAS approximates top-K attention scores through graph-based search to improve inference efficiency while maintaining accuracy. Furthermore, a scalable gather-and-scatter-based PIM architecture is introduced to manage the storage demands of large-scale KV caches and to enable efficient sparse attention computation. Various configurations of DIAS evaluated on LongBench with Mistral and Llama3 models show xx times speedup and xx times energy-efficiency improvement, with an accuracy drop of less than 1%.
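The core idea of top-K attention sparsity can be sketched as follows. This is a minimal illustration, not the DIAS method itself: it selects the exact top-K keys by score for a single decode step, whereas DIAS approximates this selection with a graph-based search over the KV cache; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """One decode-step attention in which only the k highest-scoring
    keys contribute (exact top-K shown here; DIAS instead approximates
    the selection via graph-based search)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)               # (seq_len,) attention logits
    topk = np.argpartition(scores, -k)[-k:]   # indices of the k largest logits
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                              # softmax over selected keys only
    return w @ V[topk]                        # weighted sum of selected values

# Tiny usage example with a random KV cache.
rng = np.random.default_rng(0)
seq_len, d = 1024, 64
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))
q = rng.standard_normal(d)
out = topk_sparse_attention(q, K, V, k=32)    # attends to 32 of 1024 cached tokens
```

Because only k of the seq_len cached key/value pairs are gathered, the per-step attention cost drops from O(seq_len · d) to O(k · d) after selection, which is what the gather-and-scatter PIM architecture is designed to exploit.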