Presentation
FLAG: An FPGA-Based System for Low-Latency GNN Inference Service Using Vector Quantization
Description
Enabling real-time GNN inference services requires low end-to-end latency to meet service level agreements. However, intensive preparation steps and the neighborhood explosion problem pose significant challenges to efficient GNN inference serving. In this paper, we propose FLAG, an FPGA-based GNN inference serving system using vector quantization. To reduce preparation overhead, we introduce offline preprocessing to precompute and compress hidden embeddings for serving. A dedicated FPGA accelerator leverages the precomputed data to enable lightweight aggregation. As a result, FLAG achieves average speedups of 154×, 176×, and 333× on three GNN models compared to the baseline system.
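As a rough illustration of the general idea behind vector-quantized aggregation (not FLAG's actual design, which the abstract does not detail), the sketch below compresses precomputed hidden embeddings into codebook indices offline, then aggregates neighbors at serving time by looking up centroids. The function names, the tiny fixed codebook, and the mean aggregator are all assumptions made for this example; a real system would learn the codebook (for instance with k-means) and run the aggregation on the accelerator.

```python
from typing import List, Sequence

Vector = List[float]


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook centroid closest to vec (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def quantize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Offline step (hypothetical): replace each embedding by a centroid index."""
    return [nearest_code(v, codebook) for v in embeddings]


def aggregate_mean(neighbors: Sequence[int],
                   codes: Sequence[int],
                   codebook: Sequence[Sequence[float]]) -> Vector:
    """Serving step (hypothetical): mean of the neighbors' centroids."""
    dim = len(codebook[0])
    if not neighbors:
        return [0.0] * dim
    total = [0.0] * dim
    for n in neighbors:
        centroid = codebook[codes[n]]  # table lookup instead of a full embedding
        for i in range(dim):
            total[i] += centroid[i]
    return [t / len(neighbors) for t in total]


# Toy data: four nodes with 2-dimensional hidden embeddings, two centroids.
codebook = [[0.0, 0.0], [1.0, 1.0]]
embeddings = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [0.0, 0.2]]

codes = quantize(embeddings, codebook)            # done once, offline
result = aggregate_mean([1, 2, 3], codes, codebook)  # done per request
```

In this toy setup each node stores a single small integer rather than a full embedding vector, which is the kind of compression that makes a lookup-based aggregation step cheap.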
Event Type
Research Manuscript
Time
Wednesday, June 25, 11:45am - 12:00pm PDT
Location
3001, Level 3


