Presentation

Blaze: An Efficient Bit-Sparse Attention Architecture With Workload Orchestration Optimization
Description

The attention mechanism is a key component of neural networks and is essential for retrieving relevant information in Natural Language Processing (NLP). However, its high computational complexity and substantial power consumption limit the deployment of attention-based models. To overcome these issues, we introduce Blaze, an efficient attention architecture that exploits both value-level and bit-level sparsity with workload orchestration optimization. Our Approximate-Computing-Based (ACB) mechanism addresses workload imbalance in bit-sparse architectures, while the Leading-Booth mechanism further improves the performance of attention computations. We also design a reconfigurable computing engine to support these mechanisms, improving performance in attention inference tasks.
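
To illustrate why bit-level sparsity leads to workload imbalance, the sketch below models a generic bit-serial dot product that skips zero operands (value-level sparsity) and zero bits (bit-level sparsity). It is not Blaze's design: the function name `bit_serial_dot`, the sign-magnitude treatment of weights, and the "one cycle per nonzero bit" cost model are all assumptions made for illustration, and neither the ACB nor the Leading-Booth mechanism is modeled.

```python
def bit_serial_dot(weights, activations, width=8):
    """Hypothetical bit-serial dot product.

    Returns (result, cycles), where cycles counts one shift-add
    per nonzero bit in the magnitude of each weight (assumed cost model).
    """
    total, cycles = 0, 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            # Value-level sparsity: a zero operand contributes nothing.
            continue
        sign = -1 if w < 0 else 1
        magnitude = abs(w)
        for bit in range(width):
            if (magnitude >> bit) & 1:
                # Bit-level sparsity: only nonzero bits cost a cycle.
                total += sign * (a << bit)
                cycles += 1
    return total, cycles


# Two lanes share the same activations but hold different weights.
activations = [2, 5, -1]
lane_a = bit_serial_dot([3, 0, 64], activations)
lane_b = bit_serial_dot([127, 85, -1], activations)
print(lane_a, lane_b)
```

Under these assumptions both lanes produce the correct dot product, but the second lane needs four times as many cycles as the first because its weights contain more nonzero bits. That gap is the kind of imbalance a bit-sparse architecture has to schedule around.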