Presentation

ROPE-MLA: Row-Access Optimized Processing Element Machine Learning Accelerator
Description
The parameter counts and complexity of machine learning (ML) models are growing exponentially. Inference and training access memory in unpredictable orders, which makes memory traffic hard to optimize. This work presents an ML acceleration architecture with two key features. First, it reduces the number of unique memory page accesses by using the data in the currently open page to compute and store partial results asynchronously. Second, it reduces memory latency across different memory organizations by asynchronously fetching and interleaving data from independent banks. The design achieves an area efficiency of 1896.30 GFLOPS/mm² and a power efficiency of 17.07 GFLOPS/W.
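
The first feature can be illustrated with a toy model. The sketch below is not the authors' design: it assumes a single bank with an open-page policy, a hypothetical page size of 8 elements, and a dot product whose operands are requested in a strided order. It only shows why computing partial results from whatever page is already open reduces the number of page opens without changing the result. All names (`PAGE_SIZE`, `count_page_opens`, and so on) are made up for illustration.

```python
PAGE_SIZE = 8  # elements per memory page (hypothetical value)


def count_page_opens(addresses, page_size=PAGE_SIZE):
    """Count page opens for one bank that keeps the last page open."""
    opens = 0
    open_page = None
    for addr in addresses:
        page = addr // page_size
        if page != open_page:  # page miss: a different page must be opened
            opens += 1
            open_page = page
    return opens


def dot_in_request_order(weights, inputs, order):
    """Consume operands strictly in the order they are requested."""
    total = 0
    for i in order:
        total += weights[i] * inputs[i]
    return total, count_page_opens(order)


def dot_page_at_a_time(weights, inputs, order, page_size=PAGE_SIZE):
    """Group requests by page and keep one partial sum per page."""
    by_page = {}
    for i in order:
        by_page.setdefault(i // page_size, []).append(i)

    partial_sums = {}
    visit_order = []
    for page, indices in by_page.items():
        # Everything needed from this page is used while it is open.
        partial_sums[page] = sum(weights[i] * inputs[i] for i in indices)
        visit_order.extend(indices)

    return sum(partial_sums.values()), count_page_opens(visit_order)


# Integer data so both accumulation orders give exactly the same result.
N = 64
weights = [(3 * i + 1) % 7 for i in range(N)]
inputs = [(5 * i + 2) % 11 for i in range(N)]

# A strided request order that touches a different page on every access.
order = [(9 * i) % N for i in range(N)]

order_result, naive_opens = dot_in_request_order(weights, inputs, order)
grouped_result, grouped_opens = dot_page_at_a_time(weights, inputs, order)

print("results match:", order_result == grouped_result)
print("page opens, request order:", naive_opens)
print("page opens, page at a time:", grouped_opens)
```

In this toy setting, the page-at-a-time version opens each page once, while the request-order version changes page on every access. The second feature, overlapping fetches from independent banks, is not modeled here.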