BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260402T024532Z
LOCATION:3001\, Level 3
DTSTART;TZID=America/Los_Angeles:20250623T104500
DTEND;TZID=America/Los_Angeles:20250623T110000
UID:dac_DAC 2025_sess113_RESEARCH618@linklings.com
SUMMARY:A Cross-model Fusion-aware Framework for Optimizing (gather-matmul
 -scatter)s Workload
DESCRIPTION:Yaoxiu Lian, Zhihong Gou, and Yibo Han (Shanghai Jiao Tong Uni
 versity); Zhongming Yu (University of California, San Diego); Jiaming Xu (
 Shanghai Jiao Tong University); Sheng Yuan, Zhilin Pei, and Xingcheng Zhan
 g (Shanghai Artificial Intelligence Laboratory); and Ningyi Xu and Guohao 
 Dai (Shanghai Jiao Tong University)\n\nModern deep learning models, such a
 s Relational Graph Convolutional Networks (RGCNs), Sparse Convolu
 tional Networks (SpConv), and Mixture of Experts Networks (MoE), dep
 end heavily on the (gather − matrix multiplication − scatter)s parad
 igm (abbreviated as (g−mm−s)s) as their fundamental computational pat
 tern. Whil
 e existing works have made optimization attempts, several critical ch
 allenges remain unsolved: (1) Suboptimal operator fusion due to incom
 plete dataflow analysis. Current approaches lack a systematic analys
 is of fusion strategies within the (g−mm−s)s paradigm, resulting in u
 p to 35% performance degradation due to suboptimal operator fusion an
 d inefficient computation patt
 erns. (2) Time-consuming exploration within a large configuration spa
 ce. Finding optimal configurations in the complete solution space (of
 ten exceeding 10,000 configurations) can take up to 2000 seconds, whi
 le experience-based configurations often lead to suboptimal performan
 ce. (3) Inefficient static dataflow with dynamic inputs. Dataflow per
 formance is significantly affected by input dynamism; fixed dataflow p
 atterns can cause up to 1.7× performance degradation when input charac
 teristics vary significantly. To address these challenges, we introdu
 ce Efficient-GMS, a comprehensiv
 e framework that enhances the (g−mm−s)s paradigm across diverse inp
 ut scenarios. Our framework introduces: (1) Complete dataflow analysi
 s enabling efficient operator fusion strategies. We systematically an
 alyze the efficiency of operator fusion and propose four optimal dat
 aflow patterns with segment-gemm optimization, specifically designed f
 or unbalanced inputs. (2
 ) Performance model-guided configuration space reduction. We develo
 p a performance model to predict the relative execution efficiency acr
 oss configurations, thereby reducing the search space and minimizing s
 earch time w
 hile ensuring optimal configuration selection. (3) Adaptive dataflow selec
 tion mechanism. We implement a lightweight heuristic model that dynam
 ically selects the optimal dataflow pattern based on characteristics o
 f the input and the hardware. Experimental results demonstrate that Ef
 ficient-GMS ach
 ieves significant performance improvements across various applicatio
 ns: up to 3× speedup in RGCN computations, 1.23× to 1.59× acceleratio
 n in sparse convolution operations, and 1.17× improvement in MoE compu
 tations compared to state-of-the-art implementations.\n\nTopics: AI\n\nTr
 acks: AI4: AI/ML System and Platform Design\n\nSession Chairs: Chaojia
 n Li (Georgia Institu
 te of Technology) and Zhongzhi Yu (Nvidia)\n\n
END:VEVENT
END:VCALENDAR
