Presentation
A Cross-model Fusion-aware Framework for Optimizing (gather-matmul-scatter)s Workload
Description
Modern deep learning models, such as Relational Graph Convolutional Networks (RGCNs), Sparse Convolutional Networks (SpConv), and Mixture-of-Experts networks (MoE), depend heavily on the (gather - matrix multiplication - scatter)s paradigm (abbreviated as (g-mm-s)s) as their fundamental computational pattern. While existing works have attempted optimizations, several critical challenges remain unsolved: (1) Suboptimal operator fusion due to incomplete dataflow analysis. Current approaches lack a systematic analysis of fusion strategies within the (g-mm-s)s paradigm, resulting in up to 35% performance degradation from suboptimal operator fusion and inefficient computation patterns. (2) Time-consuming exploration of a large configuration space. Finding optimal configurations in the complete solution space (often exceeding 10,000 configurations) can take up to 2,000 seconds, while experience-based configurations often yield suboptimal performance. (3) Inefficient static dataflow under dynamic inputs. Dataflow performance is strongly affected by input dynamism; fixed dataflow patterns can cause up to 1.7× performance degradation when input characteristics vary significantly. To address these challenges, we introduce Efficient-GMS, a comprehensive framework that enhances the (g-mm-s)s paradigm across diverse input scenarios. Our framework introduces: (1) Complete dataflow analysis enabling efficient operator fusion strategies. We systematically analyze the efficiency of operator fusion and propose four optimal dataflow patterns with segment-GEMM optimization, specifically designed for unbalanced inputs. (2) Performance-model-guided configuration space reduction. We develop a performance model that predicts the relative execution efficiency of configurations, reducing the search space and minimizing search time while preserving optimal configuration selection. (3) Adaptive dataflow selection mechanism.
We implement a lightweight heuristic model that dynamically selects the optimal dataflow pattern based on the characteristics of the input and the hardware. Experimental results demonstrate that Efficient-GMS achieves significant performance improvements across various applications: up to 3× speedup in RGCN computations, 1.23-1.59× acceleration in sparse convolution operations, and 1.17× improvement in MoE computations compared to state-of-the-art implementations.
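To make the underlying workload concrete, the following is a minimal NumPy sketch of the (gather - matrix multiplication - scatter) pattern in an RGCN-style layer. All names, shapes, and the per-relation edge lists are illustrative assumptions, not the paper's implementation; the point is only to show where the gather, dense GEMM, and scatter-accumulate steps occur.

```python
import numpy as np

np.random.seed(0)

num_nodes, in_dim, out_dim, num_relations = 8, 4, 5, 2
features = np.random.randn(num_nodes, in_dim)
weights = np.random.randn(num_relations, in_dim, out_dim)

# Hypothetical per-relation (src, dst) edge index lists.
edges = {
    0: (np.array([0, 1, 2]), np.array([3, 3, 4])),
    1: (np.array([4, 5]), np.array([6, 7])),
}

output = np.zeros((num_nodes, out_dim))
for r, (src, dst) in edges.items():
    gathered = features[src]             # gather: select rows by src indices
    transformed = gathered @ weights[r]  # matmul: dense GEMM per relation
    np.add.at(output, dst, transformed)  # scatter: accumulate into dst rows

print(output.shape)  # (8, 5)
```

Because each relation runs its own GEMM over a differently sized gathered batch, relation (or expert) imbalance is what makes fusion and segment-GEMM choices performance-critical, which is the setting the framework targets.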
Event Type
Research Manuscript
Time
Monday, June 23, 10:45am - 11:00am PDT
Location
3001, Level 3