Presentation
PIMoE: Towards Efficient MoE Transformer Deployment on NPU-PIM System through Throttle-Aware Task Offloading
Description
The mixture-of-experts (MoE) technique holds significant promise for scaling up Transformer models.
However, data transfer overhead and workload imbalance hinder efficient deployment.
This work presents PIMoE, a heterogeneous system that combines processing-in-memory (PIM) with a neural processing unit (NPU) to enable efficient MoE Transformer inference.
We propose a throttle-aware task offloading method that addresses the workload imbalance between the NPU and PIM, achieving an optimal task distribution.
Furthermore, we design a near-memory-controller data condenser that resolves the sparse data-layout mismatch between the NPU and PIM, improving data transfer efficiency.
Experimental results demonstrate that PIMoE achieves a 4.5× speedup and 13.7× higher energy efficiency compared with an NVIDIA A100 GPU, and a 1.4× speedup over a state-of-the-art MoE platform.
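The abstract only names the throttle-aware offloading policy without detailing it; the sketch below is a minimal, hypothetical illustration of the general idea of splitting MoE expert workloads between a throttled PIM device and an NPU. The cost model, the tokens_per_expert and pim_throttle_budget inputs, and the split_experts helper are illustrative assumptions, not the method described in the paper.

# Illustrative sketch only: a greedy throttle-aware split of MoE expert FFNs
# between PIM and an NPU. The threshold and cost model are hypothetical;
# the paper's actual offloading policy is not described in this abstract.

def split_experts(tokens_per_expert, pim_throttle_budget):
    """Assign experts to PIM until its throttled work budget is exhausted;
    the remaining (typically hotter) experts stay on the NPU."""
    # Low-traffic experts tend to be memory-bound and benefit most from PIM,
    # so consider experts in ascending order of routed token count.
    order = sorted(range(len(tokens_per_expert)), key=lambda e: tokens_per_expert[e])
    pim, npu, used = [], [], 0
    for e in order:
        cost = tokens_per_expert[e]          # proxy for PIM work per expert
        if used + cost <= pim_throttle_budget:
            pim.append(e)
            used += cost
        else:
            npu.append(e)                    # compute-heavy experts go to the NPU
    return pim, npu

# Example: 8 experts with skewed routing, PIM throttled to ~40 tokens of work.
pim_experts, npu_experts = split_experts([64, 3, 12, 5, 40, 7, 2, 9], 40)
print("PIM:", pim_experts, "NPU:", npu_experts)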
Event Type
Research Manuscript
Time
Wednesday, June 25, 4:15pm - 4:30pm PDT
Location
3001, Level 3
Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems