BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260402T024533Z
LOCATION:3002\, Level 3
DTSTART;TZID=America/Los_Angeles:20250623T103000
DTEND;TZID=America/Los_Angeles:20250623T104500
UID:dac_DAC 2025_sess120_RESEARCH768@linklings.com
SUMMARY:UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cac
 he Pruning for Efficient Long-Context LLM Inference
DESCRIPTION:Weikai Xu, Wenxuan Zeng, Qianqian Huang, Meng Li, and Ru Huang
  (Peking University)\n\nTransformer-based large language models (LLMs) hav
 e achieved impressive performance in various natural language processing (
 NLP) applications. However, the high memory and computation cost induced b
 y the KV cache limits the inference efficiency, especially for long input 
 sequences. Compute-in-memory (CIM)-based accelerators have been proposed f
 or LLM acceleration with KV cache pruning. However, as existing accelerato
 rs only support static pruning with a fixed pattern or dynamic pruning wit
 h primitive implementations, they suffer from either high accuracy degrada
 tion or low efficiency. In this paper, we propose a ferroelectric FET (FeF
 ET)-based unified content-addressable memory (CAM) and CIM architecture, d
 ubbed UniCAIM. UniCAIM features simultaneous support for static and dyn
 amic pruning with 3 computation modes: 1) in the CAM mode, UniCAIM enables
  approximate similarity measurement in O(1) time for dynamic KV cache prun
 ing with high energy efficiency; 2) in the charge-domain CIM mode, static 
 pruning can be supported based on an accumulative similarity score, which 
 is much more flexible than fixed patterns; 3) in the current-domain mo
 de, exact attention computation can be conducted with a subset of selected
  KV cache. We further propose a novel CAM/CIM cell design that leverages t
 he multi-level characteristics of FeFETs for signed multi-bit storage of t
 he KV cache and in-place attention computation. With extensive experimenta
 l results, we demonstrate UniCAIM can reduce the area-energy-delay product
  (AEDP) by 8.2×~831× over the state-of-the-art CIM-based LLM accelerators 
 at the circuit level, along with high accuracy comparable with dense atten
 tion at the application level, showing its great potential for efficient l
 ong-context LLM inference.\n\nTopics: Design\n\nTracks: DES2A: In-memory a
 nd Near-memory Computing Circuits\n\nSession Chairs: Yu Cao (University of
  Minnesota) and Sumitha George (North Dakota State University)
END:VEVENT
END:VCALENDAR
