All published works (16)

Directed acyclic graphs (DAGs) serve as crucial data representations in domains such as hardware synthesis and compiler/program optimization for computing systems. DAG generative models facilitate the creation of synthetic DAGs, which can be used for benchmarking computing systems while preserving intellectual property. However, generating realistic DAGs is challenging due to their inherent directional and logical dependencies. This paper introduces LayerDAG, an autoregressive diffusion model, to address these challenges. LayerDAG decouples the strong node dependencies into manageable units that can be processed sequentially. By interpreting the partial order of nodes as a sequence of bipartite graphs, LayerDAG leverages autoregressive generation to model directional dependencies and employs diffusion models to capture logical dependencies within each bipartite graph. Comparative analyses demonstrate that LayerDAG outperforms existing DAG generative models in both expressiveness and generalization, particularly for generating large-scale DAGs with up to 400 nodes, a critical scenario for system benchmarking. Extensive experiments on both synthetic and real-world flow graphs from various computing platforms show that LayerDAG generates valid DAGs with superior statistical properties and benchmarking performance. The synthetic DAGs generated by LayerDAG enhance the training of ML-based surrogate models, resulting in improved accuracy in predicting performance metrics of real-world DAGs across diverse computing platforms.
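To make the layer-wise view concrete: grouping a DAG's nodes by longest-path depth yields layers such that consecutive layers, together with the edges between them, form the sequence of bipartite graphs a layer-wise generator emits. A minimal sketch in plain Python (toy graph; the function and variable names are ours, not the paper's):

```python
from collections import defaultdict

def dag_layers(edges, nodes):
    """Group DAG nodes into layers by longest-path depth from any source.

    Consecutive layers, plus the edges between them, are the bipartite
    graphs that a layer-wise generative model would emit one at a time."""
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    depth = {}
    def d(v):
        if v not in depth:
            depth[v] = 1 + max((d(u) for u in preds[v]), default=-1)
        return depth[v]
    layers = defaultdict(list)
    for v in nodes:
        layers[d(v)].append(v)
    return [layers[k] for k in sorted(layers)]

# Toy DAG: 0 and 1 are sources; 4 depends on both branches.
edges = [(0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]
print(dag_layers(edges, nodes=range(5)))  # [[0, 1], [2, 3], [4]]
```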
Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: with the collective communication scheduling techniques used on systems today, it is hard to keep all network dimensions busy and maximize the network BW in such hybrid environments. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72× (2.70× max), and improves the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.
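The scheduling idea can be illustrated with a toy greedy balancer: place each chunk on the network dimension where it would complete earliest, so no dimension idles while another saturates. A simplified sketch under our own assumptions, not the published Themis algorithm:

```python
def schedule_chunks(chunk_sizes, dim_bandwidths):
    """Greedy illustration of balancing collective chunks across network
    dimensions with heterogeneous bandwidths: each chunk goes to the
    dimension where it would finish earliest."""
    finish = [0.0] * len(dim_bandwidths)            # per-dimension busy time
    assignment = []
    for size in sorted(chunk_sizes, reverse=True):  # largest chunks first
        d = min(range(len(finish)),
                key=lambda i: finish[i] + size / dim_bandwidths[i])
        finish[d] += size / dim_bandwidths[d]
        assignment.append((size, d))
    return assignment, finish

# Toy case: two fast intra-package dimensions and one slow scale-out one.
assignment, finish = schedule_chunks([4, 4, 2, 2, 1, 1], [100.0, 50.0, 12.5])
print(assignment, [round(t, 3) for t in finish])
```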
The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low precision floating point operations, and in particular, FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand, while a lot of research has also happened in the domain of low and mixed-precision Integer training, these works either present results for non-SOTA networks (for instance, only AlexNet for ImageNet-1K) or relatively small datasets (like CIFAR-10). In this work, we train state-of-the-art visual understanding neural networks on the ImageNet-1K dataset, with Integer operations on General Purpose (GP) hardware. In particular, we focus on Integer Fused-Multiply-and-Accumulate (FMA) operations which take two pairs of INT16 operands and accumulate results into an INT32 output. We propose a shared exponent representation of tensors and develop a Dynamic Fixed Point (DFP) scheme suitable for common neural network operations. The nuances of developing an efficient integer convolution kernel are examined, including methods to handle overflow of the INT32 accumulator. We implement CNN training for ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet; these networks achieve or exceed SOTA accuracy within the same number of iterations as their FP32 counterparts without any change in hyper-parameters, and with a 1.8X improvement in end-to-end training throughput. To the best of our knowledge, these results represent the first INT16 training results on GP hardware for the ImageNet-1K dataset using SOTA CNNs, and achieve the highest reported accuracy using half-precision.
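A shared-exponent scheme of the kind described can be sketched in a few lines of NumPy: pick one power-of-two scale per tensor so the largest magnitude fits the INT16 range, then store the integers plus that single exponent. This is an illustrative quantizer under our own assumptions, not the authors' kernel (which additionally manages INT32 accumulator overflow inside the FMA loop):

```python
import numpy as np

def dfp_quantize(x, bits=16):
    """Shared-exponent dynamic fixed point (illustrative sketch).

    All elements of the tensor share one power-of-two scale, chosen so the
    largest magnitude just fits the signed integer range."""
    qmax = 2 ** (bits - 1) - 1                   # 32767 for INT16
    # Smallest exponent e with max|x| / 2**e <= qmax (guard avoids log2(0)).
    e = int(np.ceil(np.log2(np.abs(x).max() / qmax + 1e-30)))
    q = np.clip(np.round(x / 2.0 ** e), -qmax, qmax).astype(np.int16)
    return q, e

def dfp_dequantize(q, e):
    return q.astype(np.float32) * 2.0 ** e

x = np.random.randn(4, 4).astype(np.float32)
q, e = dfp_quantize(x)
print("shared exponent:", e,
      "max abs error:", np.abs(x - dfp_dequantize(q, e)).max())
```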
This paper presents the first 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our Intelcaffe-based implementation obtains $\sim$2TFLOP/s on a single Cori Phase-II Xeon-Phi node. We use a hybrid strategy employing synchronous node-groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to $\sim$9600 Xeon-Phi nodes, obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core, HPC systems.
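The hybrid strategy can be pictured with a toy simulation: workers inside a node-group average their gradients synchronously, while groups apply their updates to the shared model without a global barrier. The sketch below is illustrative only; the names and the sequential stand-in for asynchrony are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def hybrid_step(weights, group_grads, lr=0.1):
    """Hybrid sync/async sketch: average gradients inside each node-group
    (synchronous), then apply each group's averaged gradient to the shared
    weights without a global barrier (asynchronous across groups; the
    sequential loop stands in for groups arriving in any order)."""
    for grads in group_grads:
        g = np.mean(np.stack(grads), axis=0)  # synchronous intra-group step
        weights -= lr * g                     # applied without global sync
    return weights

# 3 groups of 8 workers, each worker holding a 4-dim gradient.
groups = [[rng.standard_normal(4) for _ in range(8)] for _ in range(3)]
print(hybrid_step(np.zeros(4), groups))
```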
We design and implement a distributed multinode synchronous SGD algorithm, without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling and identify optimal design points for different networks. We demonstrate scaling of CNNs on 100s of nodes, and present what we believe to be record training throughputs. A 512-minibatch VGG-A CNN training run is scaled 90X on 128 nodes. Likewise, 256-minibatch VGG-A and OverFeat-FAST networks are scaled 53X and 42X respectively on a 64-node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt to democratize deep learning by training on an Ethernet-based AWS cluster and show ~14X scaling on 16 nodes.
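The core primitive being scaled here is a synchronous all-reduce of gradients. Below is a single-process toy simulation of ring all-reduce, a standard bandwidth-optimal way to realize that collective; this is our illustration, not the paper's own communication implementation:

```python
import numpy as np

def ring_allreduce(chunks):
    """Toy single-process simulation of ring all-reduce. chunks[r][c] is
    chunk c of worker r's gradient; on return, every worker holds the
    global sum of every chunk. P-1 reduce-scatter steps are followed by
    P-1 all-gather steps, each moving one chunk per worker rightward."""
    P = len(chunks)
    buf = [[np.array(c, dtype=float) for c in w] for w in chunks]
    for s in range(P - 1):                        # reduce-scatter phase
        for r in range(P):
            c = (r - s) % P                       # chunk r passes rightward
            buf[(r + 1) % P][c] = buf[(r + 1) % P][c] + buf[r][c]
    for s in range(P - 1):                        # all-gather phase
        for r in range(P):
            c = (r + 1 - s) % P                   # completed chunk circulates
            buf[(r + 1) % P][c] = buf[r][c]
    return buf

# 3 workers, gradient split into 3 scalar chunks each.
out = ring_allreduce([[1, 10, 100], [2, 20, 200], [3, 30, 300]])
print(out[0])  # every worker ends with the sums [6, 60, 600]
```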
Recently a new class of techniques termed the max-plus curse of dimensionality-free methods has been developed to solve nonlinear optimal control problems. In these methods the discretization in state space is avoided by using a max-plus basis expansion of the value function. This requires storing only the coefficients of the basis functions used for representation. However, the number of basis functions grows exponentially with respect to the number of time steps of propagation to the time horizon of the control problem. This so-called "curse of complexity" can be managed by applying a pruning procedure which selects the subset of basis functions that contribute most to the approximation of the value function. The pruning procedures described thus far in the literature rely on the solution of a sequence of high dimensional optimization problems which can become computationally expensive. In this paper we show that if the max-plus basis functions are linear and the region of interest in state space is convex, the pruning problem can be efficiently solved by the bundle method. This approach, combining the bundle method and semidefinite formulations, is applied to the quantum gate synthesis problem, in which the state space is the special unitary group (which is non-convex). This is based on the observation that the convexification of the unitary group leads to an exact relaxation. The results are studied and validated via examples.
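For orientation, the max-plus expansion that pruning operates on can be written out; the notation here ($a_i$, $\psi_i$, $X$) is ours:

```latex
% Max-plus basis expansion of the value function (illustrative notation).
% With linear basis functions \psi_i(x) = b_i^{\top} x over a convex region
% X, the approximation is stored via the coefficients a_i alone:
\[
  V(x) \;\approx\; \max_{1 \le i \le N} \bigl( a_i + \psi_i(x) \bigr)
       \;=\; \bigoplus_{i=1}^{N} a_i \otimes \psi_i(x),
  \qquad \oplus = \max,\; \otimes = +.
\]
% Pruning removes dominated terms: basis function j may be discarded when
\[
  \max_{x \in X}\,\Bigl[\, a_j + \psi_j(x)
      - \max_{i \ne j} \bigl( a_i + \psi_i(x) \bigr) \Bigr] \;\le\; 0,
\]
% and it is this maximization over the convex region X that the paper
% proposes to solve efficiently with the bundle method.
```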
This article approaches the deterministic filtering problem for a discrete-time nonlinear system (nonlinear dynamics and output functions) with a sum quadratic constraint on the process and the measurement disturbances. The problem is formulated using an optimal control framework and the solution to the associated Hamilton-Jacobi-Bellman (HJB) equation is obtained. This approach enables the design of the filter without recourse to linearization of the system dynamics/output equation and yields a set-valued state estimator. A computationally tractable numerical algorithm is introduced by utilizing the min-plus linearity of the corresponding dynamic programming operator.
This article approaches deterministic filtering via an application of the min-plus linearity of the corresponding dynamic programming operator. This filter design method yields a set-valued state estimator for discrete-time nonlinear systems (nonlinear dynamics and output functions). The energy bounds on the process and the measurement disturbances are modeled using a sum quadratic constraint. The filtering problem is recast into an optimal control problem in the form of a Hamilton-Jacobi-Bellman (HJB) equation, the solution to which is obtained by employing the min-plus linearity property of the dynamic programming operator. This approach enables the solution to the HJB equation and the design of the filter without recourse to linearization of the system dynamics/output equation.
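The min-plus linearity being exploited here can be stated compactly (notation ours; $Q_t$ is the information state and $K_t$ the one-step kernel):

```latex
% Min-plus linearity of the dynamic programming operator (sketch).
% One DP step for the information state has the kernel form
\[
  Q_{t+1}(x) \;=\; \min_{\xi}\,\bigl[\, K_t(x,\xi) + Q_t(\xi) \,\bigr],
\]
% an operator that distributes over pointwise minima (\oplus = \min):
\[
  \min_{\xi}\bigl[ K_t(x,\xi) + \min\{f(\xi),\, g(\xi)\} \bigr]
  \;=\; \min\Bigl\{ \min_{\xi}\bigl[K_t(x,\xi) + f(\xi)\bigr],\;
                    \min_{\xi}\bigl[K_t(x,\xi) + g(\xi)\bigr] \Bigr\}.
\]
% Hence an information state stored as a pointwise minimum of quadratics
% can be propagated one quadratic at a time, with no linearization of the
% dynamics or the output map.
```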
In this article we introduce the use of recently developed min/max-plus techniques in order to solve the optimal attitude estimation problem in filtering for nonlinear systems on the special orthogonal (SO(3)) group. This work helps obtain computationally efficient methods for the synthesis of deterministic filters for nonlinear systems, i.e., optimal filters which estimate the state using a related optimal control problem. The technique indicated herein is validated using a set of optimal attitude estimation example problems on SO(3).
The design of deterministic filters can be cast as a problem of minimizing an associated cost function for an optimal control problem. Employing the min-plus linearity property of the dynamic programming operator (associated with the control problem) results in a computationally feasible approach (while avoiding linearization of the system dynamics/output). This article describes the salient features of this approach and a specific form of pruning/projection, based on clustering, which serves to facilitate the numerical efficiency of these methods.
A robust (deterministic) filtering approach to the problem of optimal sensor selection is considered herein. For a given system with several sensors, at each time step the output of one of the sensors must be chosen in order to obtain the best state estimate. We reformulate this problem in an optimal control framework which can then be solved using dynamic programming. In order to tackle the numerical computation of the solution in an efficient manner, we exploit the preservation of the min-plus structure of the optimal cost function when acted upon by the dynamic programming operator. This technique yields a grid-free numerical approach to the problem. Simulations on an example problem serve to highlight the efficacy of this generalizable approach to robust multi-sensor state estimation.
This article approaches deterministic filtering via an application of the min-plus linearity of the corresponding dynamic programming operator. This filter design method yields a set-valued state estimator for discrete-time nonlinear systems (nonlinear dynamics and output functions). The energy bounds on the process and the measurement disturbances are modeled using a sum quadratic constraint. The filtering problem is recast into an optimal control problem in the form of a Hamilton-Jacobi-Bellman (HJB) equation, the solution to which is obtained by employing the min-plus linearity property of the dynamic programming operator. This approach enables the solution to the HJB equation and the design of the filter without recourse to linearization of the system dynamics/output equation.
In this article we explore a modification in the problem of controlling the rotation of a two-level quantum system from an initial state to a final state in minimum time. Specifically, we consider the case where the qubit is being weakly monitored, with both the measurement strength and the angular velocity treated as control signals. This modification alters the dynamics significantly and enables the exploitation of the measurement backaction to assist in achieving the control objective. The proposed method yields a significant speedup in achieving the desired state transfer compared to previous approaches. These results are demonstrated via numerical solutions for an example problem on a single qubit.
Although quantum computers have the potential to efficiently solve certain problems considered difficult by known classical approaches, the design of a quantum circuit remains computationally difficult. It is known that the optimal gate-design problem is equivalent to the solution of an associated optimal-control problem, the solution to which is also computationally intensive. Hence, in this article, we introduce the application of a class of numerical methods (termed the max-plus curse of dimensionality-free techniques) that determine the optimal control, thereby synthesizing the desired unitary gate. The application of this technique to quantum systems has a growth in complexity that depends on the cardinality of the control-set approximation rather than the much larger growth with respect to spatial dimensions in approaches based on gridding of the space, which is used in previous research. This technique is demonstrated by obtaining an approximate solution for the gate synthesis on SU(4), a problem that is computationally intractable by grid-based approaches.
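The complexity claim can be made explicit (illustrative notation; the symbols $U$, $k$, and $q$ are ours):

```latex
% Growth of the representation (sketch). With a finite control-set
% approximation U and k propagation steps, the value function is a
% max-plus combination over control sequences,
\[
  V_k(x) \;=\; \max_{(u_1,\dots,u_k) \in U^{k}} \; q_{u_1,\dots,u_k}(x),
\]
% so the number of constituent functions grows like |U|^k, independent of
% the dimension of the underlying state space (SU(4) has dimension 15).
% Pruning of dominated terms keeps this "curse of complexity" manageable.
```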

Commonly Cited References

In previous works of the author and others, max-plus methods have been explored for the solution of first-order, nonlinear Hamilton–Jacobi–Bellman partial differential equations (HJB PDEs) and corresponding nonlinear control problems. These methods exploit the max-plus linearity of the associated semigroups. In particular, although the problems are nonlinear, the semigroups are linear in the max-plus sense. These methods have been used successfully to compute solutions. Although they provide certain computational-speed advantages, they still generally suffer from the curse of dimensionality. Here we consider HJB PDEs in which the Hamiltonian takes the form of a (pointwise) maximum of linear/quadratic forms. The approach to the solution will be rather general, but in order to ground the work, we consider only constituent Hamiltonians corresponding to long-run average-cost-per-unit-time optimal control problems for the development. We obtain a numerical method not subject to the curse of dimensionality. The method is based on construction of the dual-space semigroup corresponding to the HJB PDE. This dual-space semigroup is constructed from the dual-space semigroups corresponding to the constituent linear/quadratic Hamiltonians. The dual-space semigroup is particularly useful due to its form as a max-plus integral operator with a kernel obtained from the originating semigroup. One considers repeated application of the dual-space semigroup to obtain the solution.
The Hamilton-Jacobi-Bellman (HJB) equation associated with the robust/$H_\infty$ filter (as well as the Mortensen filter) is considered. These filters employ a model where the disturbances have finite power. The HJB equation for the filter information state is a first-order equation with a term that is quadratic in the gradient. Yet the solution operator is linear in the max-plus algebra. This property is exploited by the development of a numerical algorithm where the effect of the solution operator on a set of basis functions is computed off-line. The precomputed solutions are stored as vectors of coefficients of the basis functions. These coefficients are then used directly in the real-time computations.
Recently, a curse-of-dimensionality-free method was developed for solution of Hamilton-Jacobi-Bellman partial differential equations (HJB PDEs) for nonlinear control problems, using semiconvex duality and max-plus analysis. The curse-of-dimensionality-free method may be applied to HJB PDEs where the Hamiltonian is given as (or well-approximated by) a pointwise maximum of quadratic forms. Such HJB PDEs also arise in certain switched linear systems. The method constructs the correct solution of an HJB PDE from a max-plus linear combination of quadratics. The method completely avoids the curse-of-dimensionality, and is subject to cubic computational growth as a function of space dimension. However, it is subject to a curse-of-complexity. In particular, the number of quadratics in the approximation grows exponentially with the number of iterations. Efficacy of such a method depends on the pruning of quadratics to keep the complexity growth at a reasonable level. Here we apply a pruning algorithm based on semidefinite programming. Computational speeds are exceptional, with an example HJB PDE in six-dimensional Euclidean space solved to the indicated quality in approximately 30 minutes on a typical desktop machine.
Quantum computers hold great promise, but it remains a challenge to find efficient quantum circuits that solve interesting computational problems. We show that finding optimal quantum circuits is essentially equivalent to finding the shortest path between two points in a certain curved geometry. By recasting the problem of finding quantum circuits as a geometric problem, we open up the possibility of using the mathematical techniques of Riemannian geometry to suggest new quantum algorithms, or to prove limitations on the power of quantum computers.
Recently M. G. Crandall and P. L. Lions introduced the notion of "viscosity solutions" of scalar nonlinear first order partial differential equations. Viscosity solutions need not be differentiable anywhere and thus are not sensitive to the classical problem of the crossing of characteristics. The value of this concept is established by the fact that very general existence, uniqueness and continuous dependence results hold for viscosity solutions of many problems arising in fields of application. The notion of a "viscosity solution" admits several equivalent formulations. Here we look more closely at two of these equivalent criteria and exhibit their virtues by both proving several new facts and reproving various known results in a simpler manner. Moreover, by forsaking technical generality we hereby provide a more congenial introduction to this subject than the original paper.
We prove upper and lower bounds relating the quantum gate complexity of a unitary operation, $U$, to the optimal control cost associated to the synthesis of $U$. These bounds apply for any optimal control problem, and can be used to show that the quantum gate complexity is essentially equivalent to the optimal control cost for a wide range of problems, including time-optimal control and finding minimal distances on certain Riemannian, sub-Riemannian, and Finslerian manifolds. These results generalize the results of [Nielsen, Dowling, Gu, and Doherty, Science 311, 1133 (2006)], which showed that the gate complexity can be related to distances on a Riemannian manifold.
The relationship between efficient quantum gate synthesis and control theory has been a topic of interest in the quantum control literature. Motivated by this work, we describe in the present article how the dynamic programming technique from optimal control may be used for the optimal synthesis of quantum circuits. We demonstrate simulation results on an example system on SU(2), to obtain plots related to the gate complexity and sample paths for different logic gates.
Although quantum computers have the potential to efficiently solve certain problems considered difficult by known classical approaches, the design of a quantum circuit remains computationally difficult. It is known that the optimal gate-design problem is equivalent to the solution of an associated optimal-control problem, the solution to which is also computationally intensive. Hence, in this article, we introduce the application of a class of numerical methods (termed the max-plus curse of dimensionality-free techniques) that determine the optimal control, thereby synthesizing the desired unitary gate. The application of this technique to quantum systems has a growth in complexity that depends on the cardinality of the control-set approximation rather than the much larger growth with respect to spatial dimensions in approaches based on gridding of the space, which is used in previous research. This technique is demonstrated by obtaining an approximate solution for the gate synthesis on SU(4), a problem that is computationally intractable by grid-based approaches.
During recent years various proposals for the minimization of a nonsmooth functional have been made. Amongst these, the bundle concept turned out to be an especially fruitful idea. Based on this concept, a number of authors have developed codes that can successfully deal with nonsmooth problems. The aim of the paper is to show that, by adding some features of the trust region philosophy to the bundle concept, the end result is a distinguished member of the bundle family with a more stable behaviour than some other bundle versions. The reliability and efficiency of this code is demonstrated on the standard academic test examples and on some real-life problems.
A central drawback of primal-dual interior point methods for semidefinite programs is their lack of ability to exploit problem structure in cost and coefficient matrices. This restricts applicability to problems of small dimension. Typically, semidefinite relaxations arising in combinatorial applications have sparse and well-structured cost and coefficient matrices of huge order. We present a method that allows us to compute acceptable approximations to the optimal solution of large problems within reasonable time. Semidefinite programming problems with constant trace on the primal feasible set are equivalent to eigenvalue optimization problems. These are convex nonsmooth programming problems and can be solved by bundle methods. We propose replacing the traditional polyhedral cutting plane model constructed from subgradient information by a semidefinite model that is tailored to eigenvalue problems. Convergence follows from the traditional approach but a proof is included for completeness. We present numerical examples demonstrating the efficiency of the approach on combinatorial examples.
In this paper we consider the minimum time population transfer problem for the $z$-component of the spin of a (spin 1/2) particle driven by a magnetic field, controlled along the $x$ axis, with bounded amplitude. On the Bloch sphere (i.e. after a suitable Hopf projection), this problem can be attacked with techniques of optimal syntheses on 2-D manifolds. Let $(-E,E)$ be the two energy levels, and $|\Omega(t)|\leq M$ the bound on the field amplitude. For each couple of values $E$ and $M$, we determine the time optimal synthesis starting from the level $-E$ and we provide the explicit expression of the time optimal trajectories steering state one to state two, in terms of a parameter that can be computed by solving numerically a suitable equation. For $M/E \ll 1$, every time optimal trajectory is bang-bang, and in particular the corresponding control is periodic with frequency of the order of the resonance frequency $\omega_R = 2E$. On the other side, for $M/E > 1$, the time optimal trajectory steering state one to state two is bang-bang with exactly one switching. For fixed $E$ we also prove that for $M\to\infty$ the time needed to reach state two tends to zero. In the case $M/E > 1$ there are time optimal trajectories containing a singular arc. Finally, we compare these results with some known results of Khaneja, Brockett and Glaser and with those obtained by controlling the magnetic field both on the $x$ and $y$ directions (or with one external field, but in the rotating wave approximation). As a byproduct we prove that the qualitative shape of the time optimal synthesis presents different patterns that cyclically alternate as $M/E \to 0$, giving a partial proof of a conjecture formulated in a previous paper.
We explicitly compute the optimal cost for a class of example problems in geometric quantum control. These problems are defined by a Cartan decomposition of $\mathrm{su}(2^{n})$ into orthogonal subspaces $\mathfrak{l}$ and $\mathfrak{p}$ such that $[\mathfrak{l},\mathfrak{l}]\subseteq\mathfrak{l}$, $[\mathfrak{p},\mathfrak{l}]=\mathfrak{p}$, $[\mathfrak{p},\mathfrak{p}]\subseteq\mathfrak{l}$. Motion in the $\mathfrak{l}$ direction is assumed to have negligible cost, while motion in the $\mathfrak{p}$ direction does not. In the special case of two qubits, our results correspond to the minimal interaction cost of a given unitary.
A digital computer is generally believed to be an efficient universal computing device; that is, it is believed to be able to simulate any physical computing device with an increase in computation time by at most a polynomial factor. This may not be true when quantum mechanics is taken into consideration. This paper considers factoring integers and finding discrete logarithms, two problems that are generally thought to be hard on classical computers and that have been used as the basis of several proposed cryptosystems. Efficient randomized algorithms are given for these two problems on a hypothetical quantum computer. These algorithms take a number of steps polynomial in the input size, for example, the number of digits of the integer to be factored.
In this paper, we study the design of pulse sequences for NMR spectroscopy as a problem of time optimal control of the unitary propagator. Radio frequency pulses are used in coherent spectroscopy to implement a unitary transfer of state. Pulse sequences that accomplish a desired transfer should be as short as possible in order to minimize the effects of relaxation and to optimize the sensitivity of the experiments. Here, we give an analytical characterization of such time optimal pulse sequences applicable to coherence transfer experiments in multiple-spin systems. We have adopted a general mathematical formulation, and present many of our results in this setting, mindful of the fact that new structures in optimal pulse design are constantly arising. Moreover, the general proofs are no more difficult than the specific problems of current interest. From a general control theory perspective, the problems we want to study have the following character. Suppose we are given a controllable right invariant system on a compact Lie group: what is the minimum time required to steer the system from some initial point to a specified final point? In NMR spectroscopy and quantum computing, this translates to: what is the minimum time required to produce a unitary propagator? We also give an analytical characterization of maximum achievable transfer in a given time for the two spin system.
We consider the construction of Markov chain approximations for an important class of deterministic control problems. The emphasis is on the construction of schemes that can be easily implemented and which possess a number of highly desirable qualitative properties. The class of problems covered is that for which the control is affine in the dynamics and with quadratic running cost. This class covers a number of interesting areas of application, including problems that arise in large deviations, risk-sensitive and robust control, robust filtering, and certain problems in computer vision. Examples are given as well as a proof of convergence.
Semidefinite programs derived from the Kalman-Yakubovich-Popov (KYP) lemma are quite common in control and signal processing applications. The programs are often of high dimension, which makes them hard or impossible to solve with general-purpose solvers. Here we present a customized preprocessor, KYPD, that utilizes the inherent structure of this particular optimization problem. The key to an efficient implementation is to transform the optimization problem into an equivalent semidefinite program. This equivalent problem has far fewer variables, and the matrices in the linear matrix inequality constraints are of low rank. KYPD can use any primal-dual solver for semidefinite programs as an underlying solver.
Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.
By J. R. Munkres: pp. x, 107; 22s. (Princeton U.P.—O.U.P.; 1963).
We prove comparison, uniqueness and existence results for viscosity solutions to a wide class of fully nonlinear second order partial differential equations $F(x, u, du, d^{2}u)=0$ defined on a finite-dimensional Riemannian manifold $M$. The finest results (with hypotheses that require the function $F$ to be degenerate elliptic, that is, nonincreasing in the second order derivative variable, and uniformly continuous with respect to the variable $x$) are obtained under the assumption that $M$ has nonnegative sectional curvature, while, if one additionally requires $F$ to depend on $d^{2}u$ in a uniformly continuous manner, then comparison results are established with no restrictive assumptions on curvature.
This article approaches deterministic filtering via an application of the min-plus linearity of the corresponding dynamic programming operator. This filter design method yields a set-valued state estimator for discrete-time nonlinear systems (nonlinear dynamics and output functions). The energy bounds on the process and the measurement disturbances are modeled using a sum quadratic constraint. The filtering problem is recast into an optimal control problem in the form of a Hamilton-Jacobi-Bellman (HJB) equation, the solution to which is obtained by employing the min-plus linearity property of the dynamic programming operator. This approach enables the solution to the HJB equation and the design of the filter without recourse to linearization of the system dynamics/output equation.
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model, called OverFeat.
Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's-law bottleneck of data-parallel training. This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesize collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula which can be discharged to a theorem prover. We further demonstrate how to scale our synthesis by exploiting symmetries in topologies and collectives. We synthesize and introduce novel latency- and bandwidth-optimal algorithms not seen in the literature on two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand-optimized collective communication libraries.
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world's largest language model at the time (17B parameters), with record-breaking accuracy.
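ZeRO's memory argument is simple arithmetic. The sketch below follows the paper's 16-bytes-per-parameter mixed-precision accounting (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 Adam states) to show per-GPU model-state memory under each partitioning stage; the function name and example sizes are ours:

```python
def per_gpu_gb(params, n_gpus, stage):
    """Back-of-envelope model-state memory per GPU under ZeRO partitioning.
    Uses the 2 + 2 + 12 bytes/parameter mixed-precision accounting (fp16
    weights, fp16 gradients, fp32 Adam states). Illustrative arithmetic."""
    p, g, o = 2 * params, 2 * params, 12 * params   # bytes across the job
    if stage >= 1: o /= n_gpus    # stage 1: partition optimizer states
    if stage >= 2: g /= n_gpus    # stage 2: also partition gradients
    if stage >= 3: p /= n_gpus    # stage 3: also partition parameters
    return (p + g + o) / 1e9

for s in range(4):  # a 7.5B-parameter model on 64 GPUs
    print(f"ZeRO stage {s}: {per_gpu_gb(7.5e9, 64, s):6.1f} GB/GPU")
# stage 0: 120.0, stage 1: 31.4, stage 2: 16.6, stage 3: 1.9 GB/GPU
```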
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and communication. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in average speedups in iteration time of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) for ResNet-50, GNMT and DLRM, respectively, when compared to the best baseline configuration.
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and usage of multi-scale high-resolution images. Our method achieves excellent results on multiple challenging computer vision benchmarks.
Max-plus based methods have been recently developed to approximate the value function of possibly high dimensional optimal control problems. A critical step of these methods consists in approximating a function by a supremum of a small number of functions (max-plus "basis functions") taken from a prescribed dictionary. We study several variants of this approximation problem, which we show to be continuous versions of the facility location and $k$-center combinatorial optimization problems, in which the connection costs arise from a Bregman distance. We give theoretical error estimates, quantifying the number of basis functions needed to reach a prescribed accuracy. We derive from our approach a refinement of the curse of dimensionality free method introduced previously by McEneaney, with a higher accuracy for a comparable computational cost.
We introduce a max-plus analogue of the Petrov-Galerkin finite element method to solve finite horizon deterministic optimal control problems. The method relies on a max-plus variational formulation. We show that the error in the sup norm can be bounded from the difference between the value function and its projections on max-plus and min-plus semimodules, when the max-plus analogue of the stiffness matrix is exactly known. In general, the stiffness matrix must be approximated: this requires approximating the operation of the Lax-Oleinik semigroup on finite elements. We consider two approximations relying on the Hamiltonian. We derive a convergence result, in arbitrary dimension, showing that for a class of problems, the error estimate is of order $\delta+\Delta x(\delta)^{-1}$ or $\sqrt{\delta}+\Delta x(\delta)^{-1}$, depending on the choice of the approximation, where $\delta$ and $\Delta x$ are respectively the time and space discretization steps. We compare our method with another max-plus based discretization method previously introduced by Fleming and McEneaney. We give numerical examples in dimension 1 and 2.
We present a general framework for finding the time-optimal evolution and the optimal Hamiltonian for a quantum system with a given set of initial and final states. Our formulation is based on the variational principle and is analogous to that for the brachistochrone in classical mechanics. We reduce the problem to a formal equation for the Hamiltonian which depends on certain constraint functions specifying the range of available Hamiltonians. For some simple examples of the constraints, we explicitly find the optimal solutions.
Recent studies on control of the aggregate power of an ensemble of thermostatically controlled loads (TCLs) have concentrated on shifting the temperature set points of each TCL in the population. A sudden shift in the set point, however, is known to be associated with undesirable power oscillations which require closed-loop control strategies to regulate the aggregate power consumption of the population. In this article, we propose a new approach, which we term a "safe protocol," to implement the shift in temperature set point. It is shown analytically and verified numerically that by shifting the set point "safely" the aggregate power consumption can be changed to a different value within a time frame of the order of a TCL's cycle duration, avoiding the undesired oscillations seen otherwise in a "sudden" shift. We discuss how the excess aggregate energy transferred under a safe shift in the set point could potentially mitigate the burden due to abnormal energy generation within a short time span.
Logistic regression with $\ell_1$ regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale $\ell_1$-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, which uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
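For reference, the problem being solved is the standard $\ell_1$-regularized logistic regression objective (notation ours):

```latex
% Training examples (a_i, b_i) with labels b_i in {-1, +1}, weights w,
% intercept v (left unpenalized), and regularization parameter \lambda > 0:
\[
  \min_{v,\, w} \;\; \frac{1}{m} \sum_{i=1}^{m}
      \log\!\Bigl( 1 + \exp\bigl( -b_i \,( w^{\top} a_i + v ) \bigr) \Bigr)
      \;+\; \lambda \,\lVert w \rVert_1 .
\]
% Warm-starting in \lambda traces out the regularization path efficiently.
```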
Equations of Hamilton-Jacobi type arise in many areas of application, including the calculus of variations, control theory and differential games. The associated initial-value problems almost never have global-time classical solutions, and one must deal with suitable generalized solutions. The correct class of generalized solutions has only recently been established by the authors. This article establishes the convergence of a class of difference approximations to these solutions by obtaining explicit error estimates. Analogous results are proved by similar means for the method of vanishing viscosity.