Presentation
Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads
Description
Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3% of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.
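The abstract's core idea (fit a predictor from instruction-accurate simulation statistics to measured run time on the target HW, then rank candidate implementations by predicted run time) could be sketched as below. This is purely illustrative: all statistics, run times, and the choice of a linear least-squares model are invented for demonstration; the paper's actual predictors and feature set are not specified here.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's implementation):
# each row holds statistics from an instruction-accurate simulation of
# one candidate implementation, e.g. total instructions, loads, stores,
# and branches. All numbers are made up.
sim_stats = np.array([
    [1.00e6, 2.0e5, 1.0e5, 5.0e4],
    [0.80e6, 1.5e5, 0.9e5, 4.0e4],
    [1.20e6, 2.5e5, 1.1e5, 6.0e4],
    [0.90e6, 1.8e5, 0.8e5, 4.5e4],
    [1.10e6, 2.2e5, 1.2e5, 5.5e4],
])
# Measured run times (seconds) on the target HW for a small set of
# calibration samples.
runtimes = np.array([0.052, 0.041, 0.063, 0.046, 0.058])

# Fit a simple linear predictor: runtime ~ sim_stats @ w.
w, *_ = np.linalg.lstsq(sim_stats, runtimes, rcond=None)

# Rank unseen candidates by predicted run time; an autotuner would
# then validate only the top-ranked few natively on the target HW.
candidates = np.array([
    [0.85e6, 1.6e5, 0.95e5, 4.2e4],
    [1.15e6, 2.4e5, 1.15e5, 5.8e4],
])
predicted = candidates @ w
ranking = np.argsort(predicted)
print("fastest predicted candidate:", ranking[0])
```

A linear fit is only the simplest option; the abstract's "various predictors" suggests richer models (e.g. tree ensembles) would be trained on the same kind of simulator statistics.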
Event Type
Research Manuscript
Time
Tuesday, June 24, 11:45am - 12:00pm PDT
Location
3000, Level 3
AI
AI4: AI/ML System and Platform Design