Presentation
Move Less, Retrieve Fast: A Retrieval-in-Memory Architecture for Language Models
Description
Retrieval-augmented language models (RALMs) have attracted widespread attention for addressing the limitations of traditional large language models. However, challenges involved in retrieval, including substantial data movement and irregular access patterns, seriously impact the efficiency and deployment of RALMs. The emerging 3D-stacked processing-in-memory (PIM) architecture, characterized by its high memory bandwidth and near-data computing capabilities, presents a promising solution for efficient retrieval. To support large-scale retrieval in RALMs, the PIM architecture should be carefully designed with joint software and hardware optimization.
This paper presents Rimast, a retrieval-in-memory architecture for fast retrieval in RALMs. The objective is to minimize data movement and improve overall performance through hardware-software co-design. At the hardware level, a hierarchical PIM architecture with a retrieval-in-memory dataflow is designed to reduce unnecessary data transfer. At the software level, skew-free data mapping and adaptive offloading strategies are proposed to address the irregular access patterns associated with retrieval in RALMs. Extensive experiments show that Rimast effectively reduces data movement, achieving average speedups of 273×, 55×, and 2.41× over CPUs, GPUs, and prior-art accelerators, respectively.
Event Type
Research Manuscript
Time
Monday, June 23, 2:00pm - 2:15pm PDT
Location
3001, Level 3
AI
AI3: AI/ML Architecture Design