PIMPAL: Accelerating LLM Inference on Edge Devices via In-DRAM Arithmetic Lookup
Deploying Large Language Models (LLMs) on edge devices poses significant challenges due to their high computational and memory demands. In particular, General Matrix-Vector Multiplication (GEMV), a key operation in LLM inference, is highly memory-intensive, making it difficult to accelerate using conventional edge computing systems. While Processing-in-Memory (PIM) architectures have emerged as a promising solution to this challenge, they often suffer from high area overhead or restricted computational precision.
This paper proposes PIMPAL (Process-In-Memory architecture with Parallel Arithmetic Lookup), a cost-effective PIM architecture leveraging LookUp Table (LUT)-based computation for GEMV acceleration in sLLMs (small LLMs). By replacing traditional arithmetic operations with parallel in-DRAM LUT lookups, PIMPAL significantly reduces area overhead while maintaining high performance. PIMPAL introduces three key innovations: (1) it divides DRAM bank subarrays into compute blocks for parallel LUT processing; (2) it employs Locality-aware Compute Mapping (LCM) to reduce DRAM row activations by maximizing LUT access locality; and (3) it enables multi-precision computations through a LUT Aggregation (LAG) mechanism that combines results from multiple small LUTs. Experimental results show that PIMPAL achieves up to 17x higher performance than previous LUT-based PIM designs and reduces area overhead by 40% compared to conventional processing unit-based PIM designs.
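To make the core idea concrete, here is a minimal software sketch of LUT-based arithmetic and LUT aggregation. This is an illustrative analogy only, not PIMPAL's actual hardware design: a small table of 4-bit products stands in for an in-DRAM LUT, and an 8-bit multiply is assembled from four small-table lookups, loosely mirroring how a LAG-style mechanism could combine small LUTs to reach higher precision. All function names here are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): replacing a
# multiplier with table lookups, and combining small LUTs for precision.

# Precompute all 4-bit x 4-bit products: 256 entries, small enough
# to imagine residing in a handful of DRAM rows.
MUL4 = [[a * b for b in range(16)] for a in range(16)]

def mul8_via_lut(a: int, b: int) -> int:
    """Multiply two 8-bit values using only 4-bit LUT lookups.

    The operands are split into high/low nibbles; four lookups yield
    partial products that are shifted and summed -- a software analogy
    for aggregating results from multiple small LUTs.
    """
    ah, al = a >> 4, a & 0xF
    bh, bl = b >> 4, b & 0xF
    return ((MUL4[ah][bh] << 8)
            + ((MUL4[ah][bl] + MUL4[al][bh]) << 4)
            + MUL4[al][bl])

def dot_via_lut(xs, ws):
    """One GEMV output element reduced to lookups and additions."""
    return sum(mul8_via_lut(x, w) for x, w in zip(xs, ws))
```

In this analogy, the shift-and-add aggregation plays the role the abstract ascribes to LAG, while grouping lookups that hit the same table rows would correspond to the locality that LCM exploits to cut row activations.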