Session

Research Manuscript: Balancing Speed and Memory: Advancing LLM Acceleration
Description
As LLMs grow in scale, optimizing both memory usage and computational throughput becomes essential. This session presents six approaches to overcoming key bottlenecks in LLM and MoE inference, including semantic-aware KV cache compression, outlier-free quantization for FPGA acceleration, and hybrid CPU-GPU execution strategies. Additionally, new techniques in sparse attention balancing, FPGA overlays for state-space models, and fusion-aware workload optimization enable more efficient processing. These works come as a timely effort to inspire next-generation AI accelerators that achieve higher performance while maintaining resource efficiency.
Event Type
Research Manuscript
Time
Monday, June 23, 10:30am - 12:00pm PDT
Location
3001, Level 3
Topics
AI
Tracks
AI4: AI/ML System and Platform Design