Presentation
EdgeMM: Multi-Core CPU with Heterogeneous AI-Extension and Activation-aware Weight Pruning for Multimodal LLMs at Edge
Description
Emerging multimodal LLMs (MLLMs) exhibit strong cross-modal perception and reasoning capabilities and hold great potential for various applications at the edge. However, MLLMs normally consist of compute-intensive modality encoders and memory-bound LLMs, leading to distinct performance bottlenecks for hardware designs. In this work, we present a multi-core CPU solution with heterogeneous AI extensions, which are based on either compute-centric systolic array or memory-centric digital compute-in-memory (CIM) coprocessors. Furthermore, dynamic activation-aware weight pruning and bandwidth management are developed to enhance bandwidth efficiency and core utilization, improving overall system performance. We implemented our solution using commercial 22nm technology. For a representative MLLM, our solution achieves a 2.84x performance speedup compared to a laptop 3060 GPU, reaching a throughput of 138 tokens/s and an efficiency of 0.28 tokens/J.
Event Type
Research Manuscript
Time
Wednesday, June 25, 4:00pm - 4:15pm PDT
Location
3002, Level 3


