BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260402T024533Z
LOCATION:3001\, Level 3
DTSTART;TZID=America/Los_Angeles:20250623T111500
DTEND;TZID=America/Los_Angeles:20250623T113000
UID:dac_DAC 2025_sess113_RESEARCH556@linklings.com
SUMMARY:ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recalla
 ble Compression
DESCRIPTION:Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi 
 Guo (Shanghai Jiao Tong University)\n\nLarge Language Models (LLMs) have b
 een widely deployed in a variety of applications, and the context length i
 s rapidly increasing to handle tasks such as long-document QA and complex 
 logical reasoning. However, long context poses significant challenges for 
 inference efficiency, including high memory costs of key-value (KV) cache 
 and increased latency due to extensive memory accesses. Recent works have 
 proposed compressing KV cache to approximate computation, but these method
 s either evict tokens permanently, never recalling them for later inferenc
 e, or recall previous tokens at the granularity of pages divided by textua
 l positions. Both approaches degrade the model accuracy and output quality
 . To achieve efficient and accurate recallable KV cache compression, we in
 troduce ClusterKV, which recalls tokens at the granularity of semantic clu
 sters. We design and implement efficient algorithms and systems for cluste
 ring, selection, indexing and caching. Experimental results show that
  ClusterKV attains negligible accuracy loss across various tasks with 32k
  context
  lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2× s
 peedup in latency and a 2.5× improvement in decoding throughput. Compared 
 to state-of-the-art recallable KV compression methods, ClusterKV
  demonstrates higher model accuracy and output quality, while maintaining
  or exceeding inference
  efficiency.\n\nTopics: AI\n\nTracks: AI4: AI/ML System and Platform Desig
 n\n\nSession Chairs: Chaojian Li (Georgia Institute of Technology) and Zho
 ngzhi Yu (Nvidia)\n\n
END:VEVENT
END:VCALENDAR
