LLMShare: Optimizing LLM Inference Serving with Hardware Architecture Exploration
Description
Large Language Models (LLMs) have revolutionized language tasks but pose significant deployment challenges due to their substantial computational demands during inference. The hardware configurations of existing LLM serving systems are not tailored to the distinct compute and bandwidth needs of the prefill and decoding phases of LLM inference, leading to inefficient resource use and increased costs. In this paper, we systematically investigate promising hardware configurations for LLM inference serving. We develop a simulator that models performance and cost across different hardware solutions, and we introduce a customized design space exploration framework to identify optimal setups efficiently. By aligning hardware capabilities with the specific demands of the prefill and decoding phases, we achieve 13% cost savings and over 4x throughput improvement compared to conventional serving system setups.
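The abstract does not give the simulator's internals, but the core idea, modeling prefill and decode separately and searching hardware pairings by cost, can be illustrated with a minimal roofline-style sketch. Everything below is hypothetical: the candidate hardware list, the workload numbers, and the function names are illustrative placeholders, not the paper's actual simulator, design space, or results.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical hardware candidates: (name, peak TFLOP/s, memory BW GB/s, $/hour).
# Numbers are illustrative placeholders, not measurements from the paper.
CANDIDATES = [
    ("compute-heavy", 300.0, 1000.0, 3.0),
    ("bandwidth-heavy", 120.0, 2500.0, 2.5),
    ("balanced", 200.0, 1600.0, 2.8),
]

@dataclass
class Phase:
    flops: float        # total FLOPs for the phase, per request
    bytes_moved: float  # total bytes read/written, per request

def phase_latency(phase: Phase, tflops: float, bw_gbs: float) -> float:
    """Roofline-style estimate: the phase is limited by whichever of
    compute or memory traffic takes longer on the given hardware."""
    compute_s = phase.flops / (tflops * 1e12)
    memory_s = phase.bytes_moved / (bw_gbs * 1e9)
    return max(compute_s, memory_s)

def explore(prefill: Phase, decode: Phase):
    """Exhaustively pair hardware for the two phases (a stand-in for the
    paper's design space exploration) and rank pairings by $ per request."""
    best = None
    for (pn, pf, pb, pc), (dn, df, db, dc) in product(CANDIDATES, repeat=2):
        lat_p = phase_latency(prefill, pf, pb)
        lat_d = phase_latency(decode, df, db)
        cost = (lat_p * pc + lat_d * dc) / 3600.0  # hourly price -> $ per request
        if best is None or cost < best[0]:
            best = (cost, pn, dn, lat_p + lat_d)
    return best

# Illustrative workload: prefill is compute-bound, decode is bandwidth-bound.
prefill = Phase(flops=2e12, bytes_moved=2e10)
decode = Phase(flops=5e10, bytes_moved=1.5e11)
cost, p_hw, d_hw, latency = explore(prefill, decode)
print(f"prefill on {p_hw}, decode on {d_hw}: "
      f"{latency * 1e3:.1f} ms, ${cost:.2e}/request")
```

Under these toy numbers the search tends to assign compute-rich hardware to prefill and bandwidth-rich hardware to decode, which is the mismatch the paper argues conventional homogeneous setups fail to exploit.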
Event Type
Research Manuscript
Time
Tuesday, June 24, 1:45pm - 2:00pm PDT
Location
3002, Level 3
Topics
AI
Tracks
AI4: AI/ML System and Platform Design