Presentation
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Description
Speculative decoding accelerates LLM inference by using smaller draft models to generate candidates for verification by larger models. However, existing approaches are limited by sequential dependencies between stages. We present PipeSpec, a novel framework that breaks these dependencies through asynchronous execution across a hierarchical pipeline of k models, enabling continuous parallel execution. Our approach achieves up to 2.54x speedup over conventional autoregressive decoding while outperforming state-of-the-art speculative methods. Results across text summarization and code generation tasks demonstrate that pipeline efficiency increases with model depth, providing a scalable solution for multi-device systems.
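The core idea can be sketched in a few lines of Python. This is a toy illustration, not the PipeSpec implementation: the DRAFT and TARGET lookup tables stand in for the draft and target models, and all names (run_pipeline, draft_stage, verify_stage) are hypothetical. A draft stage speculates tokens ahead through a bounded queue while a verify stage consumes them asynchronously, rolling the draft back to a corrected context whenever the larger model disagrees.

import queue
import threading

# Hypothetical stand-ins for the draft and target LLMs: each maps the
# last token of the context to a next token. They disagree after "on",
# which forces one rollback, and agree on the final token, so the toy
# run is guaranteed to terminate.
DRAFT  = {"the": "cat", "cat": "sat", "sat": "on", "on": "the", "mat": "<eos>"}
TARGET = {"the": "cat", "cat": "sat", "sat": "on", "on": "mat", "mat": "<eos>"}

def run_pipeline(prompt):
    candidates = queue.Queue(maxsize=4)        # bounded lookahead between stages
    state = {"gen": 0, "ctx": list(prompt)}    # generation counter + verified context
    lock = threading.Lock()
    output = list(prompt)

    def draft_stage():
        gen, ctx = 0, list(prompt)
        while True:
            with lock:
                if state["gen"] != gen:        # a rollback happened:
                    gen, ctx = state["gen"], list(state["ctx"])  # resync and redraft
            tok = DRAFT.get(ctx[-1], "<eos>")
            candidates.put((gen, list(ctx), tok))  # speculate ahead, never waiting
            if tok == "<eos>":
                return
            ctx.append(tok)

    def verify_stage():
        while True:
            gen, ctx, tok = candidates.get()
            with lock:
                if gen != state["gen"]:
                    continue                   # stale: drafted before the rollback
                expected = TARGET.get(ctx[-1], "<eos>")
                if tok == expected:
                    output.append(tok)         # accepted draft token: free speedup
                else:
                    output.append(expected)    # mispredicted: correct and roll back
                    state["gen"] += 1
                    state["ctx"] = ctx + [expected]
            if output[-1] == "<eos>":
                return

    d = threading.Thread(target=draft_stage, daemon=True)
    v = threading.Thread(target=verify_stage)
    d.start(); v.start()
    v.join()
    return output

print(run_pipeline(["the"]))   # -> ['the', 'cat', 'sat', 'on', 'mat', '<eos>']

A real system would verify whole draft blocks in one batched forward pass and, per the abstract, chain k such stages so every model in the hierarchy drafts and verifies concurrently; this sketch shows only the two-stage asynchronous hand-off and rollback.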
Event Type
Networking
Work-in-Progress Poster
Time
Monday, June 23, 6:00pm - 7:00pm PDT
Location
Level 2 Lobby