Presentation
EMSTrans: An Efficient Hardware Accelerator for Transformer with Multi-Level Sparsity Awareness
Description
Transformer-based models deliver high performance, but this comes at the expense of substantial computational overhead and data movement, particularly in the attention layer. Although self-attention exhibits multi-level sparsity, effectively leveraging this sparsity to design hardware and dataflow that optimize both matrix multiplications (MMs) and softmax computation remains a significant challenge: the accelerator must be flexible enough to support element-wise and bit-wise sparsity simultaneously, and it is difficult to fully identify the opportunities sparsity offers for optimizing softmax computation. In this paper, we propose EMSTrans, a specialized Transformer accelerator that addresses these challenges through two innovations. First, we propose a power-of-2 (PO2) based fusion approach and a sparsity-aware bit-element-fusion (SABEF) systolic array. This approach uses a systolic-gating state-shifting method to exploit both bit-level and element-level sparsity, significantly reducing the energy consumption and latency of MM operations. Second, we propose a two-stage operation fusion (TSOF) softmax computation scheme, which allows the sparse score matrix (S) to be stored in pruned and quantized form using the proposed subtraction-based quantization pruning (SBQP) method. By fully exploiting sparsity, the proposed accelerator achieves up to 93.34% energy savings and 1.98× acceleration for MM computation, along with up to 79% memory-access reduction for softmax computation, yielding a 1.93× improvement in the accelerator's energy efficiency.
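The abstract only names the PO2-based fusion and SBQP ideas, so the following minimal Python/NumPy sketch illustrates one plausible reading: rounding operand magnitudes to powers of two (so multiplies reduce to shifts, exposing bit-level sparsity) and a subtraction-based prune of attention scores before softmax (so only significant elements are stored and exponentiated). The function names (po2_quantize, sbqp_softmax), the prune_thresh parameter, and all numeric choices are illustrative assumptions, not the paper's actual PO2 fusion, SBQP, or TSOF definitions.

```python
import numpy as np

def po2_quantize(x, num_bits=4):
    """Round nonzero magnitudes to the nearest power of two so that each
    multiply could be realized as a shift. Illustrative only; the paper's
    PO2-based fusion encoding may differ."""
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.round(np.log2(np.maximum(mag, 1e-12)))
    exp = np.clip(exp, -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
    return np.where(mag > 0, sign * 2.0 ** exp, 0.0)

def sbqp_softmax(scores, prune_thresh=8.0):
    """Hypothetical subtraction-based pruning before softmax: subtract the
    row maximum (stage 1), drop scores far below it (their exp is ~0, giving
    element-level sparsity), then exponentiate only the survivors (stage 2)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    mask = shifted > -prune_thresh
    exps = np.where(mask, np.exp(shifted), 0.0)
    return exps / exps.sum(axis=-1, keepdims=True)

# Toy example: PO2-quantized weights and pruned softmax on one attention head.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = po2_quantize(rng.standard_normal((8, 8)))   # shift-friendly operand
S = Q @ K.T                                     # score matrix
P = sbqp_softmax(S)
print(P.sum(axis=-1))                           # rows still sum to 1
```

The sketch only captures the arithmetic intuition; the hardware-level gains reported in the abstract come from how the SABEF systolic array and TSOF pipeline exploit the resulting sparsity, which a NumPy model cannot reflect.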
Event Type
Networking
Work-in-Progress Poster
Time
Sunday, June 22, 6:00pm - 7:00pm PDT
Location
Level 3 Lobby


