Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Type: Preprint

Publication Date: 2020-05-01

Citations: 7

DOI: https://doi.org/10.1109/ipdps47924.2020.00053

Abstract

As Deep Learning (DL) models are increasingly used in latency-sensitive applications, there has been growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, current profiling tools lack the ability to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. These deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the "lower-bound" latency of DL models using the benchmark data and informs optimizations of model execution. The "lower-bound" latency metric estimates the ideal execution of a model on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and Tensor Core usage.
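
The "lower-bound" latency metric lends itself to a small concrete illustration. The Python sketch below sums per-layer micro-benchmark latencies to estimate a model's ideal sequential execution on a GPU; the Layer type, the BENCHMARK_DB mapping, and the example numbers are hypothetical stand-ins for the components the abstract names, not Benanza's actual API.

    # Minimal sketch of the "lower-bound" latency idea, under assumed names:
    # Layer, BENCHMARK_DB, and the example model are illustrative, not Benanza's API.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Layer:
        op_type: str   # e.g. "Conv", "Relu", "Gemm"
        config: str    # key encoding the layer's shapes/attributes

    # Hypothetical micro-benchmark database: best standalone latency (ms)
    # measured in isolation for each (op_type, config) pair on a target GPU.
    BENCHMARK_DB = {
        ("Conv", "3x224x224->64x112x112,k7,s2"): 0.42,
        ("Relu", "64x112x112"): 0.05,
        ("Gemm", "512x1000"): 0.08,
    }

    def lower_bound_latency(layers, db):
        """Sequential lower bound: every layer runs back-to-back at its best
        benchmarked latency, with zero framework overhead in between."""
        return sum(db[(layer.op_type, layer.config)] for layer in layers)

    model = [
        Layer("Conv", "3x224x224->64x112x112,k7,s2"),
        Layer("Relu", "64x112x112"),
        Layer("Gemm", "512x1000"),
    ]

    print(f"lower-bound latency: {lower_bound_latency(model, BENCHMARK_DB):.2f} ms")

The gap between this ideal figure and a framework's measured end-to-end latency is what exposes framework overhead and motivates the fusion, algorithm-selection, and parallel-execution optimizations listed above.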

Locations

  • arXiv (Cornell University)
  • 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DataCite API

Similar Works

  • Demystifying the MLPerf Benchmark Suite (2019). Snehil Verma, Qinzhe Wu, Bagus Hanindhito, Gunjan Jha, Eugene John, Ramesh Radhakrishnan, Lizy K. John
  • Across-Stack Profiling and Characterization of Machine Learning Models on GPUs (2019). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen‐mei Hwu
  • XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen‐mei Hwu
  • Forecasting GPU Performance for Deep Learning Training and Inference (2024). Seonho Lee, Amar Phanishayee, Divya Mahajan
  • The Design and Implementation of a Scalable Deep Learning Benchmarking Platform (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen‐mei Hwu
  • TVM: End-to-End Optimization Stack for Deep Learning (2018). Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luís Ceze, Carlos Guestrin, Arvind Krishnamurthy
  • Deep Learning Models on CPUs: A Methodology for Efficient Training (2022). Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt
  • Time-Based Roofline for Deep Learning Performance Analysis (2020). Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams
  • Third ArchEdge Workshop: Exploring the Design Space of Efficient Deep Neural Networks (2020). Fuxun Yu, Dimitrios Stamoulis, Di Wang, Dimitrios Lymberopoulos, Xiang Chen
  • DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads (2024). Qidong Zhao, Hao Wu, Yue Hao, Z. Ye, Jiajia Li, Xu Liu, Keren Zhou
  • Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks (2021). Chuan-Chi Wang, Ying-Chiao Liao, Ming-Chang Kao, Wen-Yew Liang, Shih‐Hao Hung
  • Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks (2020). Chuan-Chi Wang, Ying-Chiao Liao, Ming-Chang Kao, Wen-Yew Liang, Shih‐Hao Hung
  • DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs (2020). Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen‐mei Hwu
  • DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture (2021). Minjia Zhang, Zehua Hu, Mingqin Li
  • GPUNet: Searching the Deployable Convolution Neural Networks for GPUs (2022). Linnan Wang, Chenhan D. Yu, Satish Salian, Slawomir Kierat, Szymon Migacz, Alex Fit Florea
  • Fathom: reference workloads for modern deep learning methods (2016). Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, David Brooks
  • Violet: Architecturally Exposed Orchestration, Movement, and Placement for Generalized Deep Learning (2021). Michael C. R. Davies, Adam Labiosa, Karthikeyan Sankaralingam
  • TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting (2022). Xiaonan Nie, Xupeng Miao, Zhi Yang, Bin Cui
  • Benchmarking TPU, GPU, and CPU Platforms for Deep Learning (2019). Yu Emma Wang, Gu-Yeon Wei, David Brooks