Presentation

BlockPIM: Optimizing Memory Management for PIM-enabled Long-Context LLM Inference
Description

Processing-In-Memory (PIM) architectures alleviate the memory bottleneck in the decode phase of large language model (LLM) inference by performing operations such as GEMV and Softmax in memory. However, the fragmented data layout in current PIM architectures limits end-to-end acceleration for long-context LLMs. In this paper, we propose BlockPIM, a cross-channel block memory layout strategy that maximizes memory utilization and eliminates the context length constraint. Additionally, we introduce a cross-channel attention computation scheme that is compatible with the current architecture to support distributed attention operations on BlockPIM. Experimental results demonstrate that our approach achieves a 62% average throughput increase compared to existing state-of-the-art PIM solutions, enabling efficient and scalable deployment of large language models on PIM architectures.
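The abstract does not spell out how attention is computed once key/value blocks are spread across channels. As a rough illustration only, the sketch below uses the standard log-sum-exp merge of partial softmax results, which is one common way to distribute a single-query (decode-phase) attention; it is not taken from the paper and may differ from BlockPIM's actual scheme. The round-robin block placement, the `num_channels` and `block_size` parameters, and all function names are assumptions made for this example.

```python
import math

def split_round_robin(items, num_channels, block_size):
    """Hypothetical layout: cut the sequence into fixed-size blocks and
    assign block b to channel b % num_channels."""
    channels = [[] for _ in range(num_channels)]
    for b, start in enumerate(range(0, len(items), block_size)):
        channels[b % num_channels].extend(items[start:start + block_size])
    return channels

def partial_attention(q, keys, values):
    """Attention over one channel's local KV entries.
    Returns (local max score, local exp-sum, unnormalized weighted value sum)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), acc

def merge_partials(partials):
    """Combine per-channel partials by rescaling each to the global max."""
    m_global = max(m for m, _, _ in partials)
    denom = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, d, a in partials:
        c = math.exp(m - m_global)  # correction factor for this channel
        denom += c * d
        acc = [x + c * y for x, y in zip(acc, a)]
    return [x / denom for x in acc]

def distributed_attention(q, keys, values, num_channels, block_size):
    """Single-query attention with KV blocks spread across channels."""
    k_ch = split_round_robin(keys, num_channels, block_size)
    v_ch = split_round_robin(values, num_channels, block_size)
    # Channels that received no block are skipped.
    partials = [partial_attention(q, k, v) for k, v in zip(k_ch, v_ch) if k]
    return merge_partials(partials)

def reference_attention(q, keys, values):
    """Ordinary softmax attention over the whole sequence, for comparison."""
    _, d, a = partial_attention(q, keys, values)
    return [x / d for x in a]
```

Because each channel returns only a running maximum, an exponent sum, and an unnormalized output vector, the merge step exchanges a small, fixed amount of data per channel regardless of context length. In this sketch, `distributed_attention` and `reference_attention` agree up to floating-point rounding for any block placement that keeps keys and values aligned.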