Presentation
ScaleX: A Scalable and Flexible Architecture for Efficient GNN Inference
Description
Graph Neural Networks (GNNs) are among the most popular machine learning models, driven by the need to process relational information embedded in graph data for numerous learning tasks. These tasks span diverse fields, from social science and physical systems to molecular medicine. However, accelerating these tasks is challenging due to significant variations in workload dimensions and sparsity. For example, molecular medicine requires inductive inference on many small molecular graphs, while social science involves transductive inference on a single large social network. Prior works typically optimize for one inference type and fail to efficiently support the other due to two fundamental design limitations: (a) limited scalability, resulting in high latency on large workloads or energy inefficiency on small workloads, and (b) limited flexibility, which leads to a high number of off-chip memory accesses and, in turn, energy inefficiency.
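For context on what such an inference workload computes, the sketch below implements a single GCN layer in NumPy/SciPy: a compute-bound dense feature transform followed by a memory-bound sparse aggregation over the normalized adjacency. This is standard GCN math, not ScaleX code; the graph, shapes, and the `gcn_layer` helper are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def gcn_layer(adj_norm: csr_matrix, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: H = ReLU(A_hat @ X @ W).

    adj_norm: normalized sparse adjacency A_hat (N x N), typically very sparse.
    X:        dense node features (N x F_in).
    W:        dense weights (F_in x F_out).
    """
    XW = X @ W            # dense transform (compute-bound GEMM)
    H = adj_norm @ XW     # sparse aggregation (memory-bound SpMM)
    return np.maximum(H, 0.0)

# Tiny transductive-style example: one graph, a few nodes.
N, F_in, F_out = 4, 8, 16
rng = np.random.default_rng(0)
A = csr_matrix(np.array([[1, 1, 0, 0],
                         [1, 1, 1, 0],
                         [0, 1, 1, 1],
                         [0, 0, 1, 1]], dtype=np.float32))
# Symmetric normalization D^-1/2 A D^-1/2 (self-loops already on the diagonal).
d = np.asarray(A.sum(axis=1)).ravel()
D_inv_sqrt = csr_matrix(np.diag(1.0 / np.sqrt(d)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt
H = gcn_layer(A_hat, rng.standard_normal((N, F_in)).astype(np.float32),
              rng.standard_normal((F_in, F_out)).astype(np.float32))
print(H.shape)  # (4, 16)
```

The two phases stress hardware differently: the SpMM's irregular, sparsity-dependent access pattern is what makes a single fixed design struggle across both many small graphs and one large one.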
To address these limitations, we propose ScaleX, a spatially scalable accelerator architecture with flexible processing elements (PEs) that efficiently speeds up inference for a wide range of GNN workloads. To support scalability, we introduce a lightweight, dynamic load-balancing technique that uniformly distributes sparse workloads, achieving high speedup as the design scales. To improve flexibility and reduce off-chip memory accesses, each PE is equipped with an elastic on-chip memory allocator, enabling dynamic memory allocation based on workload size. Furthermore, each PE's dataflow can be configured, allowing it to adapt to diverse workloads and improve data reuse. Our results show that for GCN, ScaleX is 5.25× more energy efficient and 2.25× faster than prior works; for GIN and GraphSAGE, ScaleX achieves 2.2× to 432× speedup over an A100 GPU. Our scalability evaluation shows that ScaleX uniformly balances workloads among PEs as the design scales, achieving superlinear speedup.
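The abstract does not detail the load-balancing mechanism, so the following is only a generic software analogue of uniformly distributing a sparse workload: a greedy least-loaded (longest-processing-time) assignment of adjacency rows to PEs, using per-row nonzero counts as the cost proxy. The `balance_rows` helper and the 8-PE configuration are hypothetical, not ScaleX's hardware scheduler.

```python
import heapq
from scipy.sparse import random as sparse_random

def balance_rows(row_nnz, num_pes):
    """Greedy LPT assignment: each row goes to the currently least-loaded PE.

    row_nnz:  nonzeros per adjacency row, a proxy for aggregation work.
    num_pes:  number of processing elements.
    Returns per-PE row lists and per-PE total load.
    """
    heap = [(0, pe) for pe in range(num_pes)]   # (load, pe id), min-heap on load
    assignment = [[] for _ in range(num_pes)]
    # Longest rows first, so stragglers are placed while PEs are still empty.
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(row)
        heapq.heappush(heap, (load + row_nnz[row], pe))
    loads = {pe: load for load, pe in heap}
    return assignment, [loads[pe] for pe in range(num_pes)]

# Skewed sparse graph: a few hub rows dominate the work.
A = sparse_random(1024, 1024, density=0.01, format="csr", random_state=0)
row_nnz = A.indptr[1:] - A.indptr[:-1]
_, loads = balance_rows(row_nnz.tolist(), num_pes=8)
print(max(loads) - min(loads))  # small gap => balanced PEs
```

Sorting rows by descending nonzero count before assignment bounds the load gap even under power-law degree distributions, the kind of skew that makes a static, equal-row partition leave some PEs idle while others process hub nodes.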
Event Type: Networking, Work-in-Progress Poster
Time: Monday, June 23, 6:00pm - 7:00pm PDT
Location: Level 2 Lobby