
Hydra: Harnessing Expert Popularity for Efficient Mixture-of-Expert Inference on Chiplet System
Description
The Mixture-of-Experts (MoE) mechanism has been widely adopted in Transformer-based large language models (LLMs) to enhance generalization and enable model scaling. However, the increasing size of MoE models imposes significant memory demands, leading to suboptimal hardware performance. The emerging multi-chiplet system, with its inherent scalability, offers a potential solution. Yet deploying MoE models on chiplet-based architectures introduces new challenges: extensive all-to-all communication and computational inefficiency. To alleviate these issues, this paper presents Hydra, a software/hardware co-design aimed at accelerating MoE inference on chiplet-based architectures. In software, Hydra employs a popularity-aware expert mapping strategy to optimize inter-chiplet communication. In hardware, it incorporates Content Addressable Memory (CAM) to eliminate the expensive explicit token (un-)permutation otherwise performed via sparse matrix multiplications, along with a redundant-calculation-skipping softmax engine that bypasses unnecessary division and exponential operations. Evaluated in 22 nm technology, Hydra achieves latency reductions of 14.2× and 3.5× and power reductions of 169.1× and 18.9× over a GPU and a state-of-the-art MoE accelerator, respectively, offering a scalable and efficient solution for MoE model deployment.
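To make the software-side idea concrete, the sketch below shows one plausible form of popularity-aware expert mapping: experts whose popularity (fraction of routed tokens) is known in advance are placed greedily so that expected load is balanced across chiplets, limiting the all-to-all traffic any single chiplet must handle. This is a minimal illustration only, not Hydra's published algorithm; the greedy heuristic, function name, and data shapes are assumptions.

```python
# Illustrative sketch only: a greedy popularity-aware expert-to-chiplet mapping.
# This is NOT Hydra's actual method; heuristic and names are assumptions.
from typing import List


def map_experts_to_chiplets(popularity: List[float], num_chiplets: int) -> List[List[int]]:
    """Assign experts to chiplets so that expected routed-token load
    (approximated by per-expert popularity) is balanced across chiplets."""
    # Visit experts from most to least popular.
    order = sorted(range(len(popularity)), key=lambda e: popularity[e], reverse=True)
    assignment: List[List[int]] = [[] for _ in range(num_chiplets)]
    load = [0.0] * num_chiplets
    for e in order:
        # Greedily place the next-most-popular expert on the least-loaded chiplet.
        target = min(range(num_chiplets), key=lambda c: load[c])
        assignment[target].append(e)
        load[target] += popularity[e]
    return assignment


if __name__ == "__main__":
    # Example: 8 experts with skewed popularity mapped onto 4 chiplets.
    pop = [0.30, 0.22, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03]
    print(map_experts_to_chiplets(pop, num_chiplets=4))
```

Under this kind of placement, hot experts are spread across chiplets rather than co-located, so tokens routed to popular experts do not all converge on one chiplet's links.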
Event Type
Research Manuscript
Time
Wednesday, June 25, 1:30pm - 1:45pm PDT
Location
3001, Level 3
Topics
AI
Tracks
AI3: AI/ML Architecture Design