Presentation
Near-Memory LLM Inference Processor based on 3D DRAM-to-logic Hybrid Bonding
Description
Large language model (LLM) inference poses a dual challenge: it demands both substantial memory bandwidth and substantial compute resources. Near-memory accelerators that leverage 3D DRAM-to-logic hybrid-bonding (HB) interconnects have recently gained attention for their highly parallel data transfer capabilities. We address limitations of previous HB-DRAM accelerators, such as those stemming from distributed controller designs, by introducing an architecture with a centralized controller and a dual-IO scheme. This approach not only reduces chip area overhead but also enables reconfigurable GEMV/GEMM operation, boosting performance. Simulations on the OPT 66B model show that the proposed accelerator achieves 2.9X, 3.5X, and 2.5X higher performance than an NPU, a DRAM-PIM, and a heterogeneous design (DRAM-PIM + NPU), respectively.
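The value of reconfigurable GEMV/GEMM support comes from the two phases of LLM inference: prefill processes many tokens at once (a matrix-matrix product, GEMM), while autoregressive decode emits one token at a time (a matrix-vector product, GEMV). The sketch below is a hypothetical, pure-Python illustration of that distinction, not the paper's implementation; all names and shapes are assumptions for illustration only.

```python
# Hypothetical illustration (not the paper's code): LLM decode is a
# matrix-vector product (GEMV), while prefill over a batch of tokens is a
# matrix-matrix product (GEMM) with the same weight matrix.

def gemv(W, x):
    # y = W @ x, with W of shape m x n and x of length n (one decode token).
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def gemm(W, X):
    # Y = W @ X, with X of shape n x b (b token columns during prefill).
    n, b = len(X), len(X[0])
    return [[sum(W[i][j] * X[j][k] for j in range(n)) for k in range(b)]
            for i in range(len(W))]

W = [[1, 2], [3, 4]]            # toy weight matrix
print(gemv(W, [1, 1]))          # decode step: one token vector -> [3, 7]
print(gemm(W, [[1], [1]]))      # prefill with one token column -> [[3], [7]]
```

Hardware that can switch between these two dataflows can keep its compute units utilized in both phases, which is the motivation for the reconfigurable design described above.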
Event Type
Research Manuscript
Time
Wednesday, June 25, 5:00pm - 5:15pm PDT
Location
3001, Level 3
Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems