Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Type: Preprint

Publication Date: 2020-05-01

Citations: 7

DOI: https://doi.org/10.1109/ipdps47924.2020.00053

Abstract

As Deep Learning (DL) models are increasingly used in latency-sensitive applications, there has been growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, current profiling tools lack the ability to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. These deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the "lower-bound" latency of DL models using the benchmark data and informs optimizations of model execution. The "lower-bound" latency metric estimates the ideal execution of a model on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and Tensor Core usage.
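
The "lower-bound" latency metric lends itself to a small concrete illustration. The Python sketch below sums per-layer micro-benchmark latencies to estimate a model's ideal sequential execution on a GPU; the Layer type, the BENCHMARK_DB mapping, and the example numbers are hypothetical stand-ins for the components the abstract names, not Benanza's actual API.

    # Minimal sketch of the "lower-bound" latency idea, under assumed names:
    # Layer, BENCHMARK_DB, and the example model are illustrative, not Benanza's API.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Layer:
        op_type: str   # e.g. "Conv", "Relu", "Gemm"
        config: str    # key encoding the layer's shapes/attributes

    # Hypothetical micro-benchmark database: best standalone latency (ms)
    # measured in isolation for each (op_type, config) pair on a target GPU.
    BENCHMARK_DB = {
        ("Conv", "3x224x224->64x112x112,k7,s2"): 0.42,
        ("Relu", "64x112x112"): 0.05,
        ("Gemm", "512x1000"): 0.08,
    }

    def lower_bound_latency(layers, db):
        """Sequential lower bound: every layer runs back-to-back at its best
        benchmarked latency, with zero framework overhead in between."""
        return sum(db[(layer.op_type, layer.config)] for layer in layers)

    model = [
        Layer("Conv", "3x224x224->64x112x112,k7,s2"),
        Layer("Relu", "64x112x112"),
        Layer("Gemm", "512x1000"),
    ]

    print(f"lower-bound latency: {lower_bound_latency(model, BENCHMARK_DB):.2f} ms")

The gap between this ideal figure and a framework's measured end-to-end latency is what exposes framework overhead and motivates the fusion, algorithm-selection, and parallel-execution optimizations listed above.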

Locations

  • arXiv (Cornell University)
  • 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DataCite API

Similar Works

  • Demystifying the MLPerf Benchmark Suite (2019). Snehil Verma, Qinzhe Wu, Bagus Hanindhito, Gunjan Jha, Eugene John, Ramesh Radhakrishnan, Lizy K. John
  • Across-Stack Profiling and Characterization of Machine Learning Models on GPUs (2019). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen‐mei Hwu
  • XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen‐mei Hwu
  • Forecasting GPU Performance for Deep Learning Training and Inference (2024). Seonho Lee, Amar Phanishayee, Divya Mahajan
  • The Design and Implementation of a Scalable Deep Learning Benchmarking Platform (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen‐mei Hwu
  • TVM: End-to-End Optimization Stack for Deep Learning (2018). Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luís Ceze, Carlos Guestrin, Arvind Krishnamurthy
  • Deep Learning Models on CPUs: A Methodology for Efficient Training (2022). Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt
  • Time-Based Roofline for Deep Learning Performance Analysis (2020). Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams
  • Third ArchEdge Workshop: Exploring the Design Space of Efficient Deep Neural Networks (2020). Fuxun Yu, Dimitrios Stamoulis, Di Wang, Dimitrios Lymberopoulos, Xiang Chen
  • DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads (2024). Qidong Zhao, Hao Wu, Yue Hao, Z. Ye, Jiajia Li, Xu Liu, Keren Zhou
  • Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks (2021). Chuan-Chi Wang, Ying-Chiao Liao, Ming-Chang Kao, Wen-Yew Liang, Shih‐Hao Hung
  • Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks (2020). Chuan-Chi Wang, Ying-Chiao Liao, Ming-Chang Kao, Wen-Yew Liang, Shih‐Hao Hung
  • DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen‐mei Hwu
  • DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture (2021). Minjia Zhang, Zehua Hu, Mingqin Li
  • GPUNet: Searching the Deployable Convolution Neural Networks for GPUs (2022). Linnan Wang, Chenhan D. Yu, Satish Salian, Slawomir Kierat, Szymon Migacz, Alex Fit Florea
  • Fathom: reference workloads for modern deep learning methods (2016). Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, David Brooks
  • Violet: Architecturally Exposed Orchestration, Movement, and Placement for Generalized Deep Learning (2021). Michael C. R. Davies, Adam Labiosa, Karthikeyan Sankaralingam
  • TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting (2022). Xiaonan Nie, Xupeng Miao, Zhi Yang, Bin Cui
  • Benchmarking TPU, GPU, and CPU Platforms for Deep Learning (2019). Yu Emma Wang, Gu-Yeon Wei, David Brooks