Session
Balancing Speed and Memory: Advancing LLM Acceleration
Description
As LLMs grow in scale, optimizing both memory usage and computational throughput becomes essential. This session introduces six approaches to overcoming key bottlenecks in LLM and MoE inference, including semantic-aware KV cache compression, outlier-free quantization for FPGA acceleration, and hybrid CPU-GPU execution strategies. Additionally, new techniques in sparse attention balancing, FPGA overlays for state-space models, and fusion-aware workload optimization enable more efficient processing. Together, these works are a timely effort to inspire next-generation AI accelerators that achieve higher performance while maintaining resource efficiency.
Event Type
Research Manuscript
Time
Monday, June 23, 10:30am - 12:00pm PDT
Location
3001, Level 3
AI
AI4: AI/ML System and Platform Design
Presentations
| 10:30am - 10:45am PDT | MambaOPU: An FPGA Overlay Processor for State-space-duality-based Mamba Models |
| 10:45am - 11:00am PDT | A Cross-model Fusion-aware Framework for Optimizing (gather-matmul-scatter)s Workload |
| 11:00am - 11:15am PDT | HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference |
| 11:15am - 11:30am PDT | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression |
| 11:30am - 11:45am PDT | DuoQ: A DSP Utilization-aware and Outlier-free Quantization for FPGA-based LLMs Acceleration |
| 11:45am - 12:00pm PDT | Libra: A Hybrid-Sparse Attention Accelerator Featuring Multi-Level Workload Balance |