Presentation
LLMShare: Optimizing LLM Inference Serving with Hardware Architecture Exploration
Session: LLM Uprising: Fast & Furious
Description: Large Language Models (LLMs) have revolutionized language tasks but pose significant deployment challenges due to their substantial computational demands during inference. The hardware configurations of existing LLM serving systems do not optimize for the different computational and bandwidth needs of the prefill and decoding phases of LLM inference, leading to inefficient resource use and increased costs. In this paper, we systematically investigate promising hardware configurations for LLM inference serving. We develop a simulator that models the performance and cost across different hardware solutions and introduce a customized design space exploration framework to identify optimal setups efficiently. By aligning hardware capabilities with the specific demands of the prefill and decoding phases, we achieve 13% cost savings and over 4x throughput improvements compared to conventional serving system setups. (A rough code sketch of this phase-aware modeling follows the listing below.)
Event Type: Research Manuscript
Time: Tuesday, June 24, 1:45pm - 2:00pm PDT
Location: 3002, Level 3
Topics: AI; AI4: AI/ML System and Platform Design
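A minimal sketch of the kind of phase-aware modeling the abstract describes: a roofline-style estimator that scores candidate hardware separately for the compute-bound prefill phase and the bandwidth-bound decoding phase, then brute-forces the cheapest hardware pair that meets a throughput target. This is not the paper's simulator or DSE framework; all class names, hardware specs, and per-token demand figures are illustrative assumptions.

```python
# Illustrative sketch only: a roofline-style cost model that assigns separate
# hardware to the prefill and decoding phases. Every number below is a
# placeholder assumption, not a figure from the paper.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Hardware:
    name: str
    peak_tflops: float      # peak compute throughput (TFLOP/s)
    mem_bw_gbps: float      # memory bandwidth (GB/s)
    cost_per_hr: float      # rental cost (USD/hour)

@dataclass(frozen=True)
class Phase:
    flops_per_token: float  # compute demand per token (FLOPs)
    bytes_per_token: float  # memory traffic per token (bytes)

def tokens_per_sec(hw: Hardware, ph: Phase) -> float:
    """Roofline estimate: each token costs max(compute time, memory time)."""
    t_compute = ph.flops_per_token / (hw.peak_tflops * 1e12)
    t_memory = ph.bytes_per_token / (hw.mem_bw_gbps * 1e9)
    return 1.0 / max(t_compute, t_memory)

def best_config(hws, prefill: Phase, decode: Phase, target_tps: float):
    """Cheapest (prefill HW, decode HW) pair whose pipeline throughput,
    taken as the slower of the two stages, meets the target."""
    best = None
    for p_hw, d_hw in product(hws, hws):
        tps = min(tokens_per_sec(p_hw, prefill), tokens_per_sec(d_hw, decode))
        if tps < target_tps:
            continue
        cost = p_hw.cost_per_hr + d_hw.cost_per_hr
        if best is None or cost < best[0]:
            best = (cost, p_hw.name, d_hw.name, tps)
    return best

if __name__ == "__main__":
    # Hypothetical hardware options and per-token demands.
    hws = [
        Hardware("compute-heavy", peak_tflops=989.0, mem_bw_gbps=3350.0, cost_per_hr=4.0),
        Hardware("bandwidth-heavy", peak_tflops=312.0, mem_bw_gbps=2039.0, cost_per_hr=2.5),
    ]
    # Prefill processes the whole prompt in parallel (compute-bound);
    # decode streams weights and KV cache per output token (bandwidth-bound).
    prefill = Phase(flops_per_token=1.4e10, bytes_per_token=1.4e7)
    decode = Phase(flops_per_token=1.4e10, bytes_per_token=1.4e10)
    print(best_config(hws, prefill, decode, target_tps=50.0))
```

Under these placeholder numbers the search can assign different hardware to each phase, which is the intuition behind the abstract's claim: matching compute-rich hardware to prefill and bandwidth-rich hardware to decode avoids paying for capability one phase never uses.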