Presentation
Bank-Split PIM: Enabling Concurrent PIM and Memory Operations for LLM Inference in Heterogeneous Systems
Description
The rising popularity of Large Language Models (LLMs) has intensified the demand for efficient inference acceleration. While GPUs and NPUs are adept at handling General Matrix-Matrix Multiplication (GEMM) operations, the memory-intensive tasks inherent in LLMs are better suited to Processing-In-Memory (PIM) architectures. However, integrating PIM into heterogeneous systems presents challenges, particularly in enabling concurrent PIM and standard memory operations; without such concurrency, the system can suffer significant bottlenecks and underutilization of PIM resources.
In this paper, we propose a novel PIM architecture that addresses these issues through two key innovations: 1) Bank-Split Architecture: Segregates memory banks and assigns independent I/O buffers to each, enabling the simultaneous execution of PIM and normal memory operations by decoupling their access patterns. 2) Partial Batch Offloading: Duplicates weight data to alternate I/O buffers during GEMM operations on Processing Units (e.g., GPUs or NPUs), enabling independent partial batch processing and significantly enhancing PIM utilization.
Experimental results demonstrate that our architecture achieves up to a 7.31× speedup compared to the NPU baseline, an average 20.3% performance improvement over the latest heterogeneous system PIM, NeuPIMs, and an overall 21% increase in PIM utilization.
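As a rough illustration of why decoupling PIM and normal memory accesses can raise PIM utilization, the following is a minimal sketch of an idealized timing model. It is not the authors' simulator: the function names and the example durations are hypothetical, and the model assumes that a shared I/O path fully serializes the two kinds of operations while a bank-split design overlaps them perfectly.

```python
def serialized_time(pim_time, mem_time):
    """Total time when PIM and normal memory operations share one I/O path
    and must run one after the other (assumed baseline behavior)."""
    return pim_time + mem_time


def bank_split_time(pim_time, mem_time):
    """Total time when the two kinds of operations use independent I/O
    buffers and overlap perfectly (idealized upper bound)."""
    return max(pim_time, mem_time)


def pim_utilization(pim_time, total_time):
    """Fraction of the total time during which the PIM units are busy."""
    return pim_time / total_time


# Hypothetical durations in arbitrary time units, chosen only for illustration.
pim_time, mem_time = 6, 4

base_total = serialized_time(pim_time, mem_time)
split_total = bank_split_time(pim_time, mem_time)

print("serialized total:", base_total,
      "PIM utilization:", pim_utilization(pim_time, base_total))
print("bank-split total:", split_total,
      "PIM utilization:", pim_utilization(pim_time, split_total))
```

Real hardware will not overlap perfectly, so this model only bounds the possible gain; the measured figures reported above come from the authors' experiments, not from this sketch.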
Event Type
Networking
Work-in-Progress Poster
Time
Sunday, June 22, 6:00pm - 7:00pm PDT
Location
Level 3 Lobby


