
SSFT: Algorithm and Hardware Co-design for Structured Sparse Fine-Tuning of Large Language Models
Description
A significant number of users depend on Large Language Models (LLMs) for downstream tasks, but training LLMs from scratch remains prohibitively expensive. Sparse fine-tuning (SFT) has emerged as an effective strategy to reduce both the time and memory requirements of fine-tuning LLMs, achieving accuracy on par with fully fine-tuned models. Although SFT has the potential to achieve superior performance by minimizing computational requirements, it often underperforms dense algorithms such as LoRA on GPUs because modern GPUs cannot efficiently handle its irregular sparse data accesses. To address these issues, we propose Structured Sparse Fine-Tuning (SSFT). It comprises a novel algorithm, SSFT-Alg, which introduces predictable sparsity patterns to reduce memory access overhead and enhance regularity in the SFT process. To support SSFT-Alg, we propose an accelerator, SSFT-Hw, which accelerates SSFT-Alg through an innovative sparsity-aware design, avoiding the overhead of sparse operations on GPUs and optimizing latency and energy efficiency. Experiments with relevant models and benchmarks demonstrate that SSFT achieves accuracy comparable to state-of-the-art methods on BERT, LLaMA 2 7B, and LLaMA 2 13B. Moreover, SSFT-Hw outperforms both GPUs and state-of-the-art sparsity-aware transformer accelerators in throughput by 51.0× and 1.32×, respectively, while improving energy efficiency by 19.0× and 1.48×.
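
The abstract does not detail SSFT-Alg's sparsity pattern or selection rule. The minimal PyTorch sketch below only illustrates the general idea of structured sparse fine-tuning: restricting weight updates to a fixed, regular set of tiles so that memory accesses stay predictable. The block size, the magnitude-based tile-selection heuristic, and the gradient-mask mechanism are assumptions made here for illustration, not the authors' method.

import torch
import torch.nn as nn


def make_block_mask(weight: torch.Tensor, block: int = 32, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep the top-`keep_ratio` fraction of (block x block) tiles, scored by weight magnitude."""
    rows, cols = weight.shape
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                     # one score per tile
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.zeros(scores.numel(), dtype=torch.bool)
    keep[scores.reshape(-1).topk(k).indices] = True
    keep = keep.reshape_as(scores)
    # Expand tile-level decisions back to element granularity.
    return keep[:, None, :, None].expand(-1, block, -1, block).reshape(rows, cols).float()


# Stand-in for a single LLM projection layer; the real target layers are not stated in the abstract.
linear = nn.Linear(4096, 4096, bias=False)
mask = make_block_mask(linear.weight.detach(), block=32, keep_ratio=0.1)

# Zero out gradients that fall outside the selected tiles, so only structured regions are fine-tuned.
linear.weight.register_hook(lambda grad: grad * mask)

optimizer = torch.optim.SGD(linear.parameters(), lr=1e-2)
x = torch.randn(8, 4096)
loss = linear(x).pow(2).mean()                               # dummy objective for the sketch
loss.backward()
optimizer.step()                                             # updates touch only the masked tiles

Because the surviving tiles form contiguous blocks rather than scattered elements, the nonzero updates can in principle be stored and streamed in a regular layout, which is the kind of predictability the abstract attributes to SSFT-Alg and exploits in SSFT-Hw.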
Event Type
Research Manuscript
Time
Monday, June 23, 1:30pm - 1:45pm PDT
Location
3001, Level 3
Topics
AI
Tracks
AI3: AI/ML Architecture Design