Presentation
NDFT: Accelerating Density Functional Theory Calculations via Hardware/Software Co-Design on Near-Data Computing System
Description
Linear-response time-dependent Density Functional Theory (LR-TDDFT) is a widely used method for accurately predicting the excited-state properties of physical systems.
Previous works have attempted to accelerate LR-TDDFT using heterogeneous systems such as GPUs, FPGAs, and the Sunway architecture.
However, a major drawback of these approaches is the constant data movement between host memory and the memory of the heterogeneous systems, which results in substantial data movement overhead.
Moreover, these works focus primarily on optimizing the compute-intensive portions of LR-TDDFT, even though the calculation steps are fundamentally memory-bound.
To address these challenges, we propose NDFT, a Near-Data Density Functional Theory framework.
Specifically, we design a novel task partitioning and scheduling mechanism to offload each part of LR-TDDFT to the most suitable computing units within a CPU-NDP (near-data processing) system.
Additionally, we implement a hardware/software co-optimization of a critical kernel in LR-TDDFT to further enhance performance on the CPU-NDP system.
Our results show that NDFT achieves performance improvements of 5.2x and 2.5x over CPU and GPU baselines, respectively, on a large physical system.
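As a rough illustration of what assigning work to "the most suitable computing units" can mean, the sketch below uses a simple roofline-style rule: kernels whose arithmetic intensity falls below the machine's ridge point are treated as memory-bound and sent to the near-data units, and the rest stay on the CPU. This is an assumption made for illustration, not the partitioning mechanism NDFT actually uses; the kernel names, FLOP and byte counts, and hardware figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical description of one computation step."""
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def arithmetic_intensity(kernel: Kernel) -> float:
    """FLOPs performed per byte of memory traffic."""
    return kernel.flops / kernel.bytes_moved


def assign_unit(kernel: Kernel, peak_flops: float, mem_bandwidth: float) -> str:
    """Pick a unit with a roofline-style rule (an illustrative assumption).

    The ridge point is the arithmetic intensity at which a processor
    shifts from being limited by memory bandwidth to being limited by
    compute throughput.
    """
    ridge_point = peak_flops / mem_bandwidth
    if arithmetic_intensity(kernel) >= ridge_point:
        return "CPU"  # compute-bound: keep on the host cores
    return "NDP"      # memory-bound: move the work close to the data


def partition(kernels, peak_flops: float, mem_bandwidth: float) -> dict:
    """Map every kernel name to the unit chosen for it."""
    return {k.name: assign_unit(k, peak_flops, mem_bandwidth) for k in kernels}


# Hypothetical workload and host figures, chosen only to exercise both branches.
kernels = [
    Kernel("kernel_a", flops=2e12, bytes_moved=1e10),  # 200 FLOPs per byte
    Kernel("kernel_b", flops=5e10, bytes_moved=2e10),  # 2.5 FLOPs per byte
]
plan = partition(kernels, peak_flops=1e12, mem_bandwidth=1e11)  # ridge point: 10
print(plan)
```

With these made-up numbers, kernel_a lands on the CPU and kernel_b on the near-data units. A real scheduler would also have to weigh the cost of moving data between units and the dependencies between steps, which this sketch ignores.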
Event Type
Research Manuscript
Time
Tuesday, June 24, 5:00pm - 5:15pm PDT
Location
3001, Level 3
Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems