Presentation
SambaNova SN40L: Unleashing Agentic AI with Dataflow
Description
Agentic AI involves multiple specialized language models working in concert to perform complex tasks. Open-source large language models (LLMs) have enabled the machine learning community to build agentic systems from smaller models that exceed the capabilities of monolithic LLMs. Techniques like chain-of-thought reasoning and prompt caching accomplish complex tasks during inference, shifting the performance bottleneck to the autoregressive decode phase of token generation. However, token generation is inefficient on GPUs for two main reasons: (1) GPUs utilize only 20% of their peak memory bandwidth, due to inadequate operator fusion coupled with synchronization overheads at kernel boundaries, and (2) hosting and dynamically switching between a large number of models can be prohibitively expensive and slow.
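The bandwidth argument above can be illustrated with a back-of-the-envelope sketch. It is not taken from the talk: it assumes decode is purely memory-bandwidth bound, so that each generated token requires reading every active weight byte once, and the function name and the numbers in the usage note are hypothetical.

```python
def decode_tokens_per_second(peak_bw_gbs: float, utilization: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode rate under a bandwidth-bound model.

    Assumption (not from the source): each token reads all `weight_gb`
    gigabytes of weights exactly once, and nothing else limits throughput.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if peak_bw_gbs <= 0 or weight_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    # Achieved bandwidth divided by bytes read per token gives tokens per second.
    return peak_bw_gbs * utilization / weight_gb


# Hypothetical device: 1000 GB/s peak bandwidth, 10 GB of weights.
low = decode_tokens_per_second(1000.0, 0.20, 10.0)
high = decode_tokens_per_second(1000.0, 0.75, 10.0)
print(f"20% utilization: {low:.1f} tokens/s, 75% utilization: {high:.1f} tokens/s")
```

Under this simplified model the decode rate scales linearly with achieved bandwidth utilization, which is why low utilization translates directly into slow token generation.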
This talk describes SambaNova's approach to addressing these challenges with the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) chip. The SN40L RDU is a 2.5D CoWoS chiplet-based design containing two RDU dies on a silicon interposer. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. Model parameters reside in DDR, and actively used models are cached in and served from HBM. On-chip streaming dataflow enables an unprecedented level of operator fusion: entire decoder blocks can be automatically fused into a single kernel call. Furthermore, streaming dataflow eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a modified kernel containing a pipelined outer loop, achieving over 75% of peak performance during token generation on 8 and 16 RDUs. At the time of writing, a single rack of 16 SN40L RDUs serves the DeepSeek-R1 671B model at 198 tokens/s, the fastest in the world, and SambaNova is the first non-GPU vendor to host the model. Techniques described in this talk are deployed in production in a commercial AI inference cloud at cloud.sambanova.ai.
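The idea of keeping all model parameters in DDR while serving active models from HBM can be sketched as a small cache simulation. This is only an illustration: the abstract does not state the replacement policy, so least-recently-used eviction is an assumption here, and the class name, method names, and sizes are hypothetical.

```python
from collections import OrderedDict


class HbmModelCache:
    """Toy model of HBM acting as a cache over models stored in DDR.

    Assumption (not from the source): LRU eviction, sizes tracked in GB.
    """

    def __init__(self, hbm_capacity_gb: float):
        self.capacity = hbm_capacity_gb
        self.resident = OrderedDict()  # model name -> size in GB, least recently used first
        self.used = 0.0

    def activate(self, name: str, size_gb: float) -> bool:
        """Make a model servable from HBM. Returns True on a hit, False on a DDR load."""
        if size_gb > self.capacity:
            raise ValueError("model does not fit in HBM")
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return True
        # Evict least recently used models until the new one fits.
        while self.used + size_gb > self.capacity:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
        self.resident[name] = size_gb  # stands in for the DDR-to-HBM copy
        self.used += size_gb
        return False


cache = HbmModelCache(hbm_capacity_gb=64)
for model in ["router", "coder", "router", "summarizer"]:
    hit = cache.activate(model, size_gb=30)
    print(model, "HBM hit" if hit else "loaded from DDR")
```

In this sketch a switch between resident models costs nothing beyond the lookup, while a miss pays for a DDR-to-HBM copy, which is the cost the three-tier design aims to keep lower than reloading models from host storage.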
Event Type
Research Special Session
Time
Tuesday, June 24, 11:30am - 12:00pm PDT
Location
3010, Level 3


