AdaMAP: Adaptive Hardware Mapping for Model Compression using Low-Rank Decomposition
Description: With the increasing size of Large Language Models (LLMs), low-rank decomposition is widely used for model compression. Although these methods achieve high compression ratios, they suffer from poor hardware utilization on conventional AI accelerators, whose regular architectures are mismatched to the irregular, low-rank matrices the decomposition produces. This paper presents AdaMAP, a hybrid algorithm-to-hardware mapping strategy that optimizes low-rank matrix multiplications using input-stationary and output-stationary mappings. To fully leverage low-rank decomposition, we also propose hardware optimizations for efficient data loading and output flushing. Applied to a 92.5%-compressed BERT model, our approach achieves up to a 75× average layer-wise speed-up over the uncompressed model and a 31× speed-up over the compressed model under weight-stationary mapping. Post-layout simulations on a 65 nm process show 15.9× higher hardware utilization and 77%–96% energy savings compared to weight-stationary mapping.
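The abstract hinges on replacing a dense weight matrix W with a rank-r product U·V, so that y = Wx becomes two skinny products, y = U(Vx). The following is a minimal NumPy sketch of that substitution and the resulting parameter/MAC savings; the dimensions, the truncated-SVD factorization, and all variable names are illustrative assumptions, not AdaMAP's actual compression method or mapping.

```python
import numpy as np

# Hypothetical layer dimensions; the 92.5% figure in the abstract refers
# to the BERT model as a whole, not to this toy example.
d_out, d_in, rank = 768, 768, 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))  # dense weight matrix
x = rng.standard_normal(d_in)           # input activation vector

# Truncated SVD: W ~= U @ V, with U of shape (d_out, rank) and V of
# shape (rank, d_in). Random weights compress poorly; trained weights
# are typically much closer to low-rank.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :rank] * s[:rank]  # fold singular values into U
V = Vt[:rank, :]

# One wide matmul becomes two skinny ones.
y_dense = W @ x            # d_out * d_in MACs
y_lowrank = U @ (V @ x)    # rank * (d_in + d_out) MACs

params_dense = d_out * d_in
params_lowrank = rank * (d_in + d_out)
print(f"parameter ratio: {params_lowrank / params_dense:.3f}")   # ~0.083
print(f"approx. error:   {np.linalg.norm(y_dense - y_lowrank):.3f}")
```

Each of the two skinny matmuls (tall-thin U, short-wide V) is exactly the shape that, per the abstract, underutilizes a regular weight-stationary array and motivates the input-stationary and output-stationary mappings.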
Event Type: Networking, Work-in-Progress Poster
Time: Sunday, June 22, 6:00pm - 7:00pm PDT
Location: Level 3 Lobby


