Presentation

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Description: Mixture of Experts (MoE) models enable efficient scaling of Large Language Models (LLMs) but demand substantial memory, necessitating offloading and on-demand loading to manage constraints. CPUs are often leveraged to compute expert layers during cache misses, reducing the need for costly GPU loading. However, unpredictable activation patterns in MoE models make task-to-hardware mapping in CPU-GPU hybrid systems highly complex.
We propose HybriMoE, a system that addresses these challenges with (i) dynamic intra-layer scheduling, (ii) impact-driven prefetching, and (iii) score-based caching. Evaluated on kTransformers and Llama.cpp, HybriMoE achieves 1.33x and 1.70x speedups in prefill and decode stages, respectively.
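To make the idea of score-based caching concrete, here is a minimal sketch of a GPU expert cache that evicts the resident expert with the lowest running score. It illustrates the general technique only and is not HybriMoE's actual policy: the class name `ScoreBasedExpertCache`, the `decay` parameter, and the rule of blending router scores into a decayed running score are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreBasedExpertCache:
    """Toy GPU expert cache that evicts the lowest-scoring resident expert.

    Hypothetical illustration; not the policy used by HybriMoE.
    """
    capacity: int                                # experts that fit on the GPU (assumed)
    decay: float = 0.5                           # assumed weight given to older scores
    scores: dict = field(default_factory=dict)   # expert id -> running score
    resident: set = field(default_factory=set)   # expert ids currently on the GPU

    def update_scores(self, router_scores):
        """Blend the latest router scores into the running per-expert scores."""
        for expert in set(self.scores) | set(router_scores):
            old = self.scores.get(expert, 0.0)
            self.scores[expert] = self.decay * old + router_scores.get(expert, 0.0)

    def access(self, expert):
        """Return True on a hit; on a miss, load the expert, evicting if full."""
        if expert in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the lowest running score.
            victim = min(self.resident, key=lambda e: self.scores.get(e, 0.0))
            self.resident.remove(victim)
        self.resident.add(expert)
        return False


cache = ScoreBasedExpertCache(capacity=2)
cache.update_scores({0: 0.6, 1: 0.3, 2: 0.1})
cache.access(0)                         # miss: expert 0 is loaded
cache.access(1)                         # miss: expert 1 is loaded
cache.update_scores({0: 0.1, 2: 0.9})   # expert 2 becomes the highest-scoring expert
cache.access(2)                         # miss: expert 1 has the lowest score and is evicted
```

Unlike plain LRU, a policy of this shape can keep an expert resident when it was not used in the most recent step but has been routed to consistently, which is one plausible motivation for scoring experts rather than tracking recency alone.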