In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.
This article introduces a new class of fast algorithms to approximate variational problems involving unbalanced optimal transport. While classical optimal transport considers only normalized probability distributions, it is important for many applications to be able to compute some sort of relaxed transportation between arbitrary positive measures. A generic class of such "unbalanced" optimal transport problems has been recently proposed by several authors. In this paper, we show how to extend the now classical entropic regularization scheme to these unbalanced problems. This gives rise to fast, highly parallelizable algorithms that operate by performing only diagonal scaling (i.e., pointwise multiplications) of the transportation couplings. They are generalizations of the celebrated Sinkhorn algorithm. We show how these methods can be used to solve unbalanced transport, unbalanced gradient flows, and to compute unbalanced barycenters. We showcase applications to 2-D shape modification, color transfer, and growth models.
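The diagonal-scaling iterations mentioned above can be sketched concretely. Below is a minimal Python illustration, assuming the common variant where the marginal constraints are relaxed by a Kullback-Leibler penalty of strength rho and the coupling is regularized entropically with parameter eps; the function name and toy data are hypothetical, and the exponent rho/(rho+eps) corresponds to that specific relaxation, not necessarily to every model covered by the article.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.1, rho=1.0, n_iter=500):
    """Diagonal-scaling iterations for entropic unbalanced OT (sketch).

    Marginal constraints are relaxed by a KL penalty of strength `rho`;
    `eps` is the entropic regularization. Only pointwise operations are used.
    """
    K = np.exp(-C / eps)                   # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    w = rho / (rho + eps)                  # exponent induced by the KL relaxation
    for _ in range(n_iter):
        u = (a / (K @ v)) ** w             # pointwise (diagonal) scaling only
        v = (b / (K.T @ u)) ** w
    return u[:, None] * K * v[None, :]     # transport coupling

# toy usage: two positive measures with different total masses
a = np.array([0.5, 1.0, 0.3])
b = np.array([0.2, 0.9])
C = np.abs(np.arange(3)[:, None] - np.arange(2)[None, :]).astype(float) ** 2
P = unbalanced_sinkhorn(a, b, C)
```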
Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
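To make the particle discretization described above concrete, here is a minimal Python sketch on a toy instance (1-D sparse spikes deconvolution with a Gaussian impulse response); the problem, step size and particle count are illustrative and not taken from the paper.

```python
import numpy as np

# Sketch: the unknown measure is discretized into particles (a_i, w_i)
# and plain gradient descent is run jointly on weights and positions.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)          # observation grid
sigma = 0.05

def feat(w):                               # impulse response centered at w
    return np.exp(-(grid - w) ** 2 / (2 * sigma ** 2))

y = 1.2 * feat(0.3) - 0.7 * feat(0.75)     # observed signal (two spikes)

m, lr = 50, 0.1                            # number of particles, step size
a = np.zeros(m)                            # weights, initialized at zero
w = rng.uniform(0.0, 1.0, m)               # positions spread over the domain

for _ in range(5000):
    Phi = np.stack([feat(wi) for wi in w])            # (m, n_grid) features
    r = a @ Phi - y                                   # residual of the mixture
    grad_a = Phi @ r / grid.size                      # gradient of 0.5*mean(r^2) in a_i
    grad_w = a * ((Phi * (grid - w[:, None]) / sigma ** 2) @ r) / grid.size
    a -= lr * grad_a
    w -= lr * grad_w
```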
Optimal transport (OT) and maximum mean discrepancies (MMD) are now routinely used in machine learning to compare probability measures. We focus in this paper on \emph{Sinkhorn divergences} (SDs), a regularized variant of OT distances which can interpolate, depending on the regularization strength $\varepsilon$, between OT ($\varepsilon=0$) and MMD ($\varepsilon=\infty$). Although the tradeoff induced by that regularization is now well understood computationally (OT, SDs and MMD require respectively $O(n^3\log n)$, $O(n^2)$ and $n^2$ operations given a sample size $n$), much less is known in terms of their \emph{sample complexity}, namely the gap between these quantities, when evaluated using finite samples \emph{vs.} their respective densities. Indeed, while the sample complexities of OT and MMD stand at two extremes, $1/n^{1/d}$ for OT in dimension $d$ and $1/\sqrt{n}$ for MMD, that of SDs has only been studied empirically. In this paper, we \emph{(i)} derive a bound on the approximation error made with SDs when approximating OT as a function of the regularizer $\varepsilon$, \emph{(ii)} prove that the optimizers of regularized OT are bounded in a Sobolev (RKHS) ball independent of the two measures and \emph{(iii)} provide the first sample complexity bound for SDs, obtained by reformulating SDs as a maximization problem in a RKHS. We thus obtain a scaling in $1/\sqrt{n}$ (as in MMD), with a constant that depends however on $\varepsilon$, making the bridge between OT and MMD complete.
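For reference, one common convention for these quantities (normalizations and the presence of debiasing terms vary across papers) writes the entropic OT cost and the Sinkhorn divergence built from it as follows.

```latex
% Entropic OT cost between probability measures \alpha, \beta
% (one standard convention; normalizations vary across papers):
\mathrm{OT}_\varepsilon(\alpha,\beta)
  \;=\; \min_{\pi \in \Pi(\alpha,\beta)} \int c \, d\pi
        \;+\; \varepsilon\, \mathrm{KL}\!\left(\pi \,\middle\|\, \alpha\otimes\beta\right),
% and the debiased Sinkhorn divergence built from it:
\mathrm{SD}_\varepsilon(\alpha,\beta)
  \;=\; \mathrm{OT}_\varepsilon(\alpha,\beta)
        \;-\; \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\alpha,\alpha)
        \;-\; \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\beta,\beta).
```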
In a series of recent theoretical works, it has been shown that strongly over-parameterized neural networks trained with gradient-based methods could converge linearly to zero loss, with their parameters hardly varying. In this note, our goal is to exhibit the simple structure that is behind these results. In a simplified setting, we prove that lazy training essentially solves a kernel regression. We also show that this behavior is not so much due to over-parameterization as to a choice of scaling, often implicit, that allows the model to be linearized around its initialization. These theoretical results, complemented with simple numerical experiments, make it seem unlikely that lazy training is behind the many successes of neural networks in high dimensional tasks.
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
This article introduces a new class of fast algorithms to approximate variational problems involving unbalanced optimal transport. While classical optimal transport considers only normalized probability distributions, it is important for many applications to be able to compute some sort of relaxed transportation between arbitrary positive measures. A generic class of such "unbalanced" optimal transport problems has been recently proposed by several authors. In this paper, we show how to extend the, now classical, entropic regularization scheme to these unbalanced problems. This gives rise to fast, highly parallelizable algorithms that operate by performing only diagonal scaling (i.e. pointwise multiplications) of the transportation couplings. They are generalizations of the celebrated Sinkhorn algorithm. We show how these methods can be used to solve unbalanced transport, unbalanced gradient flows, and to compute unbalanced barycenters. We showcase applications to 2-D shape modification, color transfer, and growth models.
This article presents a new class of distances between arbitrary nonnegative Radon measures inspired by optimal transport. These distances are defined by two equivalent alternative formulations: (i) a dynamic formulation defining the distance as a geodesic distance over the space of measures; (ii) a static "Kantorovich" formulation where the distance is the minimum of an optimization problem over pairs of couplings describing the transfer (transport, creation and destruction) of mass between two measures. Both formulations are convex optimization problems, and the ability to switch from one to the other depending on the targeted application is a crucial property of our models. Of particular interest is the Wasserstein-Fisher-Rao metric recently introduced independently by Chizat et al. and Kondratyev et al. Defined initially through a dynamic formulation, it belongs to this class of metrics and hence automatically benefits from a static Kantorovich formulation.
This paper defines a new transport metric over the space of non-negative measures. This metric interpolates between the quadratic Wasserstein and the Fisher-Rao metrics and generalizes optimal transport to measures with different masses. It is defined as a generalization of the dynamical formulation of optimal transport of Benamou and Brenier, by introducing a source term in the continuity equation. The influence of this source term is measured using the Fisher-Rao metric, and is averaged with the transportation term. This gives rise to a convex variational problem defining our metric. Our first contribution is a proof of the existence of geodesics (i.e. solutions to this variational problem). We then show that (generalized) optimal transport and Fisher-Rao metrics are obtained as limiting cases of our metric. Our last theoretical contribution is a proof that geodesics between mixtures of sufficiently close Diracs are made of translating mixtures of Diracs. Lastly, we propose a numerical scheme making use of first order proximal splitting methods and we show an application of this new distance to image interpolation.
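For concreteness, the dynamic formulation alluded to above takes roughly the following shape; the constant in front of the growth term varies across references, and $\delta>0$ plays the role of the interpolation parameter between transport and Fisher-Rao.

```latex
% Dynamic formulation with a source term g_t (constants vary across references;
% \delta > 0 weights growth against transport):
\mathrm{WFR}_\delta^2(\rho_0,\rho_1)
  \;=\; \inf_{(\rho_t, v_t, g_t)} \int_0^1 \!\!\int_\Omega
        \Big( |v_t(x)|^2 + \tfrac{\delta^2}{4}\, g_t(x)^2 \Big)\, d\rho_t(x)\, dt
\quad \text{s.t.} \quad
  \partial_t \rho_t + \nabla\!\cdot(\rho_t v_t) \;=\; \rho_t\, g_t .
```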
The squared Wasserstein distance is a natural quantity to compare probability distributions in a non-parametric setting. This quantity is usually estimated with the plug-in estimator, defined via a discrete optimal transport problem which can be solved to $\varepsilon$-accuracy by adding an entropic regularization of order $\varepsilon$ and using for instance Sinkhorn's algorithm. In this work, we propose instead to estimate it with the Sinkhorn divergence, which is also built on entropic regularization but includes debiasing terms. We show that, for smooth densities, this estimator has a comparable sample complexity but allows higher regularization levels, of order $\varepsilon^{1/2}$, which leads to improved computational complexity bounds and a strong speedup in practice. Our theoretical analysis covers the case of both randomly sampled densities and deterministic discretizations on uniform grids. We also propose and analyze an estimator based on Richardson extrapolation of the Sinkhorn divergence which enjoys improved statistical and computational efficiency guarantees, under a condition on the regularity of the approximation error, which is in particular satisfied for Gaussian densities. We finally demonstrate the efficiency of the proposed estimators with numerical experiments.
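The Richardson-extrapolation idea can be stated generically: if the approximation error admits an expansion $S_\varepsilon \approx S_0 + c\,\varepsilon^p$, then combining two regularization levels cancels the leading term. The exponent $p$ and the exact combination used in the paper may differ from this generic sketch.

```latex
% Generic Richardson extrapolation in \varepsilon, assuming an error expansion
% S_\varepsilon \approx S_0 + c\,\varepsilon^p + o(\varepsilon^p):
S^{\mathrm{ext}}_{\varepsilon}
  \;=\; \frac{2^p\, S_{\varepsilon/2} - S_{\varepsilon}}{2^p - 1}
  \;=\; S_0 + o(\varepsilon^p).
```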
This article presents a new class of distances between arbitrary nonnegative Radon measures inspired by optimal transport. These distances are defined by two equivalent alternative formulations: (i) a dynamic formulation defining the distance as a geodesic distance over the space of measures; (ii) a static formulation where the distance is the minimum of an optimization problem over pairs of couplings describing the transfer (transport, creation and destruction) of mass between two measures. Both formulations are convex optimization problems, and the ability to switch from one to the other depending on the targeted application is a crucial property of our models. Of particular interest is the Wasserstein-Fisher-Rao metric recently introduced independently by Chizat et al. and Kondratyev et al. Defined initially through a dynamic formulation, it belongs to this class of metrics and hence automatically benefits from a static Kantorovich formulation.
This article introduces a new notion of optimal transport (OT) between tensor fields, which are measures whose values are positive semidefinite (PSD) matrices. This "quantum" formulation of optimal transport (Q-OT) corresponds to a relaxed version of the classical Kantorovich transport problem, where the fidelity between the input PSD-valued measures is captured using the geometry of the Von-Neumann quantum entropy. We propose a quantum-entropic regularization of the resulting convex optimization problem, which can be solved efficiently using an iterative scaling algorithm. This method is a generalization of the celebrated Sinkhorn algorithm to the quantum setting of PSD matrices. We extend this formulation and the quantum Sinkhorn algorithm to compute barycentres within a collection of input tensor fields. We illustrate the usefulness of the proposed approach on applications to procedural noise generation, anisotropic meshing, diffusion tensor imaging and spectral texture synthesis.
In this paper, we characterize a degenerate PDE as the gradient flow in the space of nonnegative measures endowed with an optimal transport-growth metric. The PDE of concern, of Hele-Shaw type, was introduced by Perthame et al. as a mechanical model for tumor growth, and the metric was introduced recently in several articles as the analogue of the Wasserstein metric for nonnegative measures. We show existence of solutions using minimizing movements and show uniqueness of solutions on convex domains by proving the Evolution Variational Inequality. Our analysis does not require any regularity assumption on the initial condition. We also derive a numerical scheme based on the discretization of the gradient flow and the idea of entropic regularization. We assess the convergence of the scheme on explicit solutions. In doing this analysis, we prove several new properties of the optimal transport-growth metric, which generally have a known counterpart for the Wasserstein metric.
The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a 'base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then make these results more precise in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
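As an illustration of the construction, here is a minimal Python sketch of a sliced divergence where the base divergence is the one-dimensional Wasserstein-1 distance between equally weighted samples (computable in closed form by sorting); the function name and sample sizes are illustrative.

```python
import numpy as np

def sliced_w1(X, Y, n_proj=100, rng=None):
    """Monte Carlo estimate of a sliced divergence (sketch).

    The base divergence is the 1-D Wasserstein-1 distance between empirical
    measures of equal size; any other 1-D base divergence could be plugged in.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)           # random direction on the sphere
        x1d, y1d = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(x1d - y1d))      # closed-form 1-D W1 (equal weights)
    return total / n_proj

# toy usage in dimension 10
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
Y = rng.standard_normal((500, 10)) + 0.5
print(sliced_w1(X, Y))
```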
This thesis generalizes optimal transport beyond the classical balanced setting of probability distributions. We define unbalanced optimal transport models between nonnegative measures, based either on the notion of interpolation or the notion of coupling of measures. We show relationships between these approaches. One of the outcomes of this framework is a generalization of the p-Wasserstein metrics. Secondly, we build numerical methods to solve interpolation and coupling-based models. We study, in particular, a new family of scaling algorithms that generalize Sinkhorn's algorithm. The third part deals with applications. It contains a theoretical and numerical study of a Hele-Shaw type gradient flow in the space of nonnegative measures. It also addresses the case of measures taking values in the cone of positive semi-definite matrices, for which we introduce a model that achieves a balance between geometrical accuracy and algorithmic efficiency.
This article describes a set of methods for quickly computing the solution to the regularized optimal transport problem. It generalizes and improves upon the widely used iterative Bregman projections algorithm (or Sinkhorn–Knopp algorithm). We first proposed to rely on regularized nonlinear acceleration schemes. In practice, such approaches lead to fast algorithms, but their global convergence is not ensured. Hence, we next proposed a new algorithm with convergence guarantees. The idea is to overrelax the Bregman projection operators, allowing for faster convergence. We proposed a simple method for establishing global convergence by ensuring the decrease of a Lyapunov function at each step. An adaptive choice of the overrelaxation parameter based on the Lyapunov function was constructed. We also suggested a heuristic to choose a suitable asymptotic overrelaxation parameter, based on a local convergence analysis. Our numerical experiments showed a gain in convergence speed by an order of magnitude in certain regimes.
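The overrelaxed diagonal updates can be sketched as follows in Python: omega = 1 recovers the usual Sinkhorn-Knopp iteration, while omega in (1, 2) overrelaxes the Bregman projections. The adaptive, Lyapunov-based choice of omega from the article is not reproduced; the fixed value below is purely illustrative.

```python
import numpy as np

def overrelaxed_sinkhorn(a, b, C, eps=0.05, omega=1.5, n_iter=500):
    """Sinkhorn-Knopp with overrelaxed diagonal updates (sketch).

    omega = 1 gives the standard algorithm; omega in (1, 2) overrelaxes
    the Bregman projections, which can speed up convergence.
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = u ** (1 - omega) * (a / (K @ v)) ** omega
        v = v ** (1 - omega) * (b / (K.T @ u)) ** omega
    return u[:, None] * K * v[None, :]     # transport coupling
```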
Many supervised machine learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many mathematical guarantees exist. Models which are nonlinear in their parameters such as neural networks lead to nonconvex optimization problems for which guarantees are harder to obtain. In this paper, we consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived.
With the huge data acquisition progress realized in the past decades and acquisition systems now able to produce high resolution grids and point clouds, the digitization of physical terrains becomes increasingly precise. Such extreme quantities of generated and modeled data greatly impact computational performance on many levels of high-performance computing (HPC): storage media, memory requirements, transfer capability, and finally simulation interactivity, necessary to exploit this instance of big data. Efficient representations and storage are thus becoming "enabling technologies" in HPC experimental and simulation science. We propose HexaShrink, an original decomposition scheme for structured hexahedral volume meshes. The latter are used for instance in biomedical engineering, materials science, or geosciences. HexaShrink provides a comprehensive framework allowing efficient mesh visualization and storage. Its exactly reversible multiresolution decomposition yields a hierarchy of meshes of increasing levels of details, in terms of either geometry, continuous or categorical properties of cells. Starting with an overview of volume mesh compression techniques, our contribution blends coherently different multiresolution wavelet schemes in different dimensions. It results in a global framework preserving discontinuities (faults) across scales, implemented as a fully reversible upscaling at different resolutions. Experimental results are provided on meshes of varying size and complexity. They emphasize the consistency of the proposed representation, in terms of visualization, attribute downsampling and distribution at different resolutions. Finally, HexaShrink yields gains in storage space when combined with lossless compression techniques.
This article introduces a new notion of optimal transport (OT) between tensor fields, which are measures whose values are positive semidefinite (PSD) matrices. This "quantum" formulation of OT (Q-OT) corresponds to a relaxed version of the classical Kantorovich transport problem, where the fidelity between the input PSD-valued measures is captured using the geometry of the Von-Neumann quantum entropy. We propose a quantum-entropic regularization of the resulting convex optimization problem, which can be solved efficiently using an iterative scaling algorithm. This method is a generalization of the celebrated Sinkhorn algorithm to the quantum setting of PSD matrices. We extend this formulation and the quantum Sinkhorn algorithm to compute barycenters within a collection of input tensor fields. We illustrate the usefulness of the proposed approach on applications to procedural noise generation, anisotropic meshing, diffusion tensor imaging and spectral texture synthesis.
This article describes a set of methods for quickly computing the solution to the regularized optimal transport problem. It generalizes and improves upon the widely-used iterative Bregman projections algorithm (or Sinkhorn--Knopp algorithm). We first propose to rely on regularized nonlinear acceleration schemes. In practice, such approaches lead to fast algorithms, but their global convergence is not ensured. Hence, we next propose a new algorithm with convergence guarantees. The idea is to overrelax the Bregman projection operators, allowing for faster convergence. We propose a simple method for establishing global convergence by ensuring the decrease of a Lyapunov function at each step. An adaptive choice of overrelaxation parameter based on the Lyapunov function is constructed. We also suggest a heuristic to choose a suitable asymptotic overrelaxation parameter, based on a local convergence analysis. Our numerical experiments show a gain in convergence speed by an order of magnitude in certain regimes.
Noisy particle gradient descent (NPGD) is an algorithm to minimize convex functions over the space of measures that include an entropy term. In the many-particle limit, this algorithm is described by a Mean-Field Langevin dynamics - a generalization of the Langevin dynamics with a non-linear drift - which is our main object of study. Previous works have shown its convergence to the unique minimizer via non-quantitative arguments. We prove that this dynamics converges at an exponential rate, under the assumption that a certain family of Log-Sobolev inequalities holds. This assumption holds for instance for the minimization of the risk of certain two-layer neural networks, where NPGD is equivalent to standard noisy gradient descent. We also study the annealed dynamics, and show that for a noise decaying at a logarithmic rate, the dynamics converges in value to the global minimizer of the unregularized objective function.
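A minimal sketch of NPGD in the two-layer setting mentioned above: each particle is a hidden neuron, the drift is the (measure-space) gradient of the risk, and isotropic Gaussian noise with variance proportional to the entropy weight is injected at every step. The toy data, activation, step size and noise level are hypothetical.

```python
import numpy as np

# Sketch of noisy particle gradient descent for an entropy-regularized
# two-layer mean-field model f(x) = mean_i tanh(w_i . x).
rng = np.random.default_rng(0)
d, m, n = 5, 200, 256          # input dimension, particles (neurons), samples
lam, lr = 1e-3, 0.5            # entropy weight (noise level) and step size

X = rng.standard_normal((n, d))            # training inputs
y = np.tanh(X @ rng.standard_normal(d))    # targets from a single teacher neuron

W = rng.standard_normal((m, d))            # particles = hidden neurons
for _ in range(2000):
    pre = X @ W.T                          # (n, m) pre-activations
    r = np.tanh(pre).mean(axis=1) - y      # residuals of the mean-field predictor
    # measure-space gradient of the risk at each particle (scaling conventions glossed over)
    grad = ((r[:, None] * (1.0 - np.tanh(pre) ** 2)).T @ X) / n
    W = W - lr * grad + np.sqrt(2 * lr * lam) * rng.standard_normal((m, d))
```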
This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We obtain that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear neural network. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of neurons. We finally study the continuous-time limit obtained for infinitely wide linear neural networks and show that the linear predictors of the neural network converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)\,k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates for some $q\in\{1,2,4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.
We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates for some $q \in \{1, 2, 4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.
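To illustrate the multiplicative updates (and the discretization cost noted above), here is a minimal Python sketch of entropic mirror descent for a toy convex objective over probability vectors on a fixed grid; the objective, matrix and step size are illustrative.

```python
import numpy as np

# Sketch of the multiplicative (entropic mirror descent) update for a convex
# objective over probability measures discretized on a grid.
# Toy objective: F(mu) = ||A mu - y||^2 / (2 n_obs) on the simplex.
rng = np.random.default_rng(0)
n_grid, n_obs = 200, 50
A = rng.standard_normal((n_obs, n_grid))
y = A @ np.full(n_grid, 1.0 / n_grid) + 0.01 * rng.standard_normal(n_obs)

mu = np.full(n_grid, 1.0 / n_grid)        # uniform initialization on the grid
eta = 0.5                                  # step size
for _ in range(1000):
    grad = A.T @ (A @ mu - y) / n_obs      # gradient of F at mu
    mu = mu * np.exp(-eta * grad)          # multiplicative update
    mu /= mu.sum()                         # Bregman projection back to the simplex
```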
The function that maps a family of probability measures to the solution of the dual entropic optimal transport problem is known as the Schrödinger map. We prove that when the cost function is $\mathcal{C}^{k+1}$ with $k\in\mathbb{N}^*$ then this map is Lipschitz continuous from the $L^2$-Wasserstein space to the space of $\mathcal{C}^k$ functions. Our result holds on compact domains and covers the multi-marginal case. We also include regularity results under negative Sobolev metrics weaker than Wasserstein under stronger smoothness assumptions on the cost. As applications, we prove displacement smoothness of the entropic optimal transport cost and the well-posedness of certain Wasserstein gradient flows involving this functional, including the Sinkhorn divergence and a multi-species system.
The function that maps a family of probability measures to the solution of the dual entropic optimal transport problem is known as the Schr\"odinger map. We prove that when the cost function is $\mathcal{C}^{k+1}$ with $k\in \mathbb{N}^*$ then this map is Lipschitz continuous from the $L^2$-Wasserstein space to the space of $\mathcal{C}^k$ functions. Our result holds on compact domains and covers the multi-marginal case. We also include regularity results under negative Sobolev metrics weaker than Wasserstein under stronger smoothness assumptions on the cost. As applications, we prove displacement smoothness of the entropic optimal transport cost and the well-posedness of certain Wasserstein gradient flows involving this functional, including the Sinkhorn divergence and a multi-species system.
We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretisation cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
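As a caricature of the particle parametrization, the following Python sketch runs explicit (rather than proximal-point) gradient steps on the atoms of both players: positions move along the payoff gradient and weights are updated multiplicatively then renormalized, mirroring the two components of the Wasserstein-Fisher-Rao geometry. The payoff function and step size are illustrative, and the convergence guarantees above concern the proximal variant, not this explicit sketch.

```python
import numpy as np

# Sketch of a conic particle method for a two-player zero-sum game over
# continuous (here 1-D) strategy spaces: mixed strategies are atomic measures
# and both weights and positions are updated.
rng = np.random.default_rng(0)

def payoff(x, y):                      # illustrative payoff f(x, y)
    return np.sin(x[:, None] - y[None, :])

def dpayoff_dx(x, y):                  # d f / d x
    return np.cos(x[:, None] - y[None, :])

m, eta = 32, 0.05
x, y = rng.uniform(0, 2 * np.pi, m), rng.uniform(0, 2 * np.pi, m)   # atom positions
a, b = np.full(m, 1 / m), np.full(m, 1 / m)                         # atom weights

for _ in range(2000):
    F = payoff(x, y)                   # (m, m) payoff between atoms
    Gx = dpayoff_dx(x, y)
    Vx, Vy = F @ b, F.T @ a            # first variations at the atoms
    # position updates (transport part): min player descends, max player ascends
    x_new = x - eta * (Gx @ b)
    y_new = y + eta * (-(Gx.T @ a))    # d f / d y = -cos(x - y)
    # weight updates (Fisher-Rao part): multiplicative, then renormalize
    a = a * np.exp(-eta * Vx); a /= a.sum()
    b = b * np.exp(+eta * Vy); b /= b.sum()
    x, y = x_new, y_new
```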
We study a general formulation of regularized Wasserstein barycenters that enjoys favorable regularity, approximation, stability and (grid-free) optimization properties. This barycenter is defined as the unique probability measure that minimizes the sum of entropic optimal transport (EOT) costs with respect to a family of given probability measures, plus an entropy term. We denote it the $(\lambda,\tau)$-barycenter, where $\lambda$ is the inner regularization strength and $\tau$ the outer one. This formulation recovers several previously proposed EOT barycenters for various choices of $\lambda,\tau\geq 0$ and generalizes them. First, in spite of -- and in fact owing to -- being \emph{doubly} regularized, we show that our formulation is debiased for $\tau=\lambda/2$: the suboptimality in the (unregularized) Wasserstein barycenter objective is, for smooth densities, of the order of the strength $\lambda^2$ of entropic regularization, instead of $\max\{\lambda,\tau\}$ in general. We discuss this phenomenon for isotropic Gaussians, where all $(\lambda,\tau)$-barycenters have closed form. Second, we show that for $\lambda,\tau>0$, this barycenter has a smooth density and is strongly stable under perturbation of the marginals. In particular, it can be estimated efficiently: given $n$ samples from each of the probability measures, it converges in relative entropy to the population barycenter at a rate $n^{-1/2}$. And finally, this formulation lends itself naturally to a grid-free optimization algorithm: we propose a simple \emph{noisy particle gradient descent} which, in the mean-field limit, converges globally at an exponential rate to the barycenter.
We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games. It is well-known that such dynamics converge locally when $S \succ 0$ and may diverge when $S=0$, where $S\succeq 0$ is the symmetric part of the Jacobian at equilibrium that accounts for the "potential" component of the game. We show that these dynamics also converge as soon as $S$ is nonzero (partial curvature) and the eigenvectors of the antisymmetric part $A$ are in general position with respect to the kernel of $S$. We then study the convergence rates when $S \ll A$ and prove that they typically depend on the average of the eigenvalues of $S$, instead of the minimum as an analogy with minimization problems would suggest. To illustrate our results, we consider the problem of computing mixed Nash equilibria of continuous games. We show that, thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods. For min-max games, it is thus beneficial to add degrees of freedom "with curvature": this can be interpreted as yet another benefit of over-parameterization.
This paper studies the infinite-width limit of deep linear neural networks (NNs) initialized with random parameters. We obtain that, when the number of parameters diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear NN. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of parameters. We finally study the continuous-time limit obtained for infinitely wide linear NNs and show that the linear predictors of the NN converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of small initialization corresponding to mean-field limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization µP. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. (arXiv:2102.09204), and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points.
We study the computation of doubly regularized Wasserstein barycenters, a recently introduced family of entropic barycenters governed by inner and outer regularization strengths. Previous research has demonstrated that various regularization parameter choices unify several notions of entropy-penalized barycenters while also revealing new ones, including a special case of debiased barycenters. In this paper, we propose and analyze an algorithm for computing doubly regularized Wasserstein barycenters. Our procedure builds on damped Sinkhorn iterations followed by exact maximization/minimization steps and guarantees convergence for any choice of regularization parameters. An inexact variant of our algorithm, implementable using approximate Monte Carlo sampling, offers the first non-asymptotic convergence guarantees for approximating Wasserstein barycenters between discrete point clouds in the free-support/grid-free setting.
Deep learning succeeds by doing hierarchical feature learning, yet tuning Hyper-Parameters (HP) such as initialization scales, learning rates, etc., only gives indirect control over this behavior. In this paper, we propose the alignment between the feature updates and the backward pass as a key notion to predict, measure and control feature learning. On the one hand, we show that when alignment holds, the magnitude of feature updates after one SGD step is related to the magnitude of the forward and backward passes by a simple and general formula. This leads to techniques to automatically adjust HPs (initialization scales and learning rates) at initialization and throughout training to attain a desired feature learning behavior. On the other hand, we show that, at random initialization, this alignment is determined by the spectrum of a certain kernel, and that well-conditioned layer-to-layer Jacobians (aka dynamical isometry) imply alignment. Finally, we investigate ReLU MLPs and ResNets in the large width-then-depth limit. Combining hints from random matrix theory and numerical experiments, we show that (i) in MLPs with iid initializations, alignment degenerates with depth, making it impossible to start training, and that (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one maintaining non-trivial alignment at infinite depth.
The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for overdetermined univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
Mean-field Langevin dynamics (MFLD) is a class of interacting particle methods that tackle convex optimization over probability measures on a manifold, which are scalable, versatile, and enjoy computational guarantees. However, some important problems -- such as risk minimization for infinite width two-layer neural networks, or sparse deconvolution -- are originally defined over the set of signed, rather than probability, measures. In this paper, we investigate how to extend the MFLD framework to convex optimization problems over signed measures. Among two known reductions from signed to probability measures -- the lifting and the bilevel approaches -- we show that the bilevel reduction leads to stronger guarantees and faster rates (at the price of a higher per-iteration complexity). In particular, we investigate the convergence rate of MFLD applied to the bilevel reduction in the low-noise regime and obtain two results. First, this dynamics is amenable to an annealing schedule, adapted from Suzuki et al. (2023), that results in improved convergence rates to a fixed multiplicative accuracy. Second, we investigate the problem of learning a single neuron with the bilevel approach and obtain local exponential convergence rates that depend polynomially on the dimension and noise level (to compare with the exponential dependence that would result from prior analyses).
We study the convergence rate of Sinkhorn's algorithm for solving entropy-regularized optimal transport problems when at least one of the probability measures, $\mu$, admits a density over $\mathbb{R}^d$. For a semi-concave cost function bounded by $c_{\infty}$ and a regularization parameter $\lambda > 0$, we obtain exponential convergence guarantees on the dual sub-optimality gap with contraction rate polynomial in $\lambda/c_{\infty}$. This represents an exponential improvement over the known contraction rate $1 - \Theta(\exp(-c_{\infty}/\lambda))$ achievable via Hilbert's projective metric. Specifically, we prove a contraction rate value of $1-\Theta(\lambda^2/c_\infty^2)$ when $\mu$ has a bounded log-density. In some cases, such as when $\mu$ is log-concave and the cost function is $c(x,y)=-\langle x, y \rangle$, this rate improves to $1-\Theta(\lambda/c_\infty)$. The latter rate matches the one that we derive for the transport between isotropic Gaussian measures, indicating tightness in the dependency in $\lambda/c_\infty$. Our results are fully non-asymptotic and explicit in all the parameters of the problem.
Sinkhorn's algorithm is a method of choice to solve large-scale optimal transport (OT) problems. In this context, it involves an inverse temperature parameter $\beta$ that determines the speed-accuracy trade-off. To improve this trade-off, practitioners often use a variant of this algorithm, Annealed Sinkhorn, that uses a nondecreasing sequence $(\beta_t)_{t\in \mathbb{N}}$ where $t$ is the iteration count. However, apart from the schedule $\beta_t=\Theta(\log t)$, which is impractically slow, it is not known whether this variant is guaranteed to actually solve OT. Our first contribution answers this question: we show that a concave annealing schedule asymptotically solves OT if and only if $\beta_t\to+\infty$ and $\beta_t-\beta_{t-1}\to 0$. The proof is based on an equivalence with Online Mirror Descent and further suggests that the iterates of Annealed Sinkhorn follow the solutions of a sequence of relaxed, entropic OT problems, the regularization path. An analysis of this path reveals that, in addition to the well-known "entropic" error in $\Theta(\beta^{-1}_t)$, the annealing procedure induces a "relaxation" error in $\Theta(\beta_{t}-\beta_{t-1})$. The best error trade-off is achieved with the schedule $\beta_t = \Theta(\sqrt{t})$ which, albeit slow, is a universal limitation of this method. Going beyond this limitation, we propose a simple modification of Annealed Sinkhorn that reduces the relaxation error, and therefore enables faster annealing schedules. In toy experiments, we observe the effectiveness of our Debiased Annealed Sinkhorn algorithm: a single run of this algorithm spans the whole speed-accuracy Pareto front of the standard Sinkhorn algorithm.
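A minimal log-domain sketch of Annealed Sinkhorn with the $\beta_t = \Theta(\sqrt{t})$ schedule discussed above; the debiased modification proposed in the paper is not reproduced here, and the helper name is illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def annealed_sinkhorn(a, b, C, n_iter=2000, beta_fn=lambda t: np.sqrt(t + 1.0)):
    """Log-domain Sinkhorn with an increasing inverse temperature beta_t (sketch).

    beta_t = sqrt(t) is the schedule discussed above; the Gibbs kernel is
    recomputed implicitly as beta changes at every iteration.
    """
    f, g = np.zeros_like(a), np.zeros_like(b)          # dual potentials
    log_a, log_b = np.log(a), np.log(b)
    for t in range(n_iter):
        beta = beta_fn(t)
        f = -logsumexp(beta * (g[None, :] - C) + log_b[None, :], axis=1) / beta
        g = -logsumexp(beta * (f[:, None] - C) + log_a[:, None], axis=0) / beta
    # primal coupling at the final temperature
    return np.exp(beta * (f[:, None] + g[None, :] - C)
                  + log_a[:, None] + log_b[None, :])
```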
The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for overdetermined univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
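Since the result turns on the sharpness at a minimizer, a small numerical illustration may help; the data, target slope, and finite-difference Hessian below are our own illustrative choices, assuming scalar weights so that the end-to-end map is the product of the layer weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)
a_star = 2.0                        # hypothetical target slope
y = a_star * x                      # overdetermined univariate regression data

def loss(w):
    # Deep linear network with scalar layers: predicted slope is prod(w)
    return np.mean((np.prod(w) * x - y) ** 2)

def hessian_fd(f, w, eps=1e-4):
    """Finite-difference Hessian (central differences), enough for a tiny example."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

for depth in [2, 4, 8, 16]:
    w = np.full(depth, a_star ** (1 / depth))   # balanced zero-loss minimizer
    sharpness = np.linalg.eigvalsh(hessian_fd(loss, w)).max()
    print(depth, sharpness)                     # grows linearly with depth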
Many supervised machine learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many mathematical guarantees exist. Models which are nonlinear in their parameters such as neural networks lead to nonconvex optimization problems for which guarantees are harder to obtain. In this paper, we consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived.
We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
We study a general formulation of regularized Wasserstein barycenters that enjoys favorable regularity, approximation, stability and (grid-free) optimization properties. This barycenter is defined as the unique probability measure that minimizes the sum of entropic optimal transport (EOT) costs with respect to a family of given probability measures, plus an entropy term. We denote it the $(\lambda,\tau)$-barycenter, where $\lambda$ is the inner regularization strength and $\tau$ the outer one. This formulation recovers several previously proposed EOT barycenters for various choices of $\lambda,\tau\geq 0$ and generalizes them. First, in spite of -- and in fact owing to -- being \emph{doubly} regularized, we show that our formulation is debiased for $\tau=\lambda/2$: the suboptimality in the (unregularized) Wasserstein barycenter objective is, for smooth densities, of the order of the strength $\lambda^2$ of entropic regularization, instead of $\max\{\lambda,\tau\}$ in general. We discuss this phenomenon for isotropic Gaussians where all $(\lambda,\tau)$-barycenters have closed form. Second, we show that for $\lambda,\tau>0$, this barycenter has a smooth density and is strongly stable under perturbation of the marginals. In particular, it can be estimated efficiently: given $n$ samples from each of the probability measures, it converges in relative entropy to the population barycenter at a rate $n^{-1/2}$. And finally, this formulation lends itself naturally to a grid-free optimization algorithm: we propose a simple \emph{noisy particle gradient descent} which, in the mean-field limit, converges globally at an exponential rate to the barycenter.
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points.
We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games. It is well-known that such dynamics converge locally when $S \succ 0$ and may diverge when $S=0$, where $S\succeq 0$ is the symmetric part of the Jacobian at equilibrium that accounts for the "potential" component of the game. We show that these dynamics also converge as soon as $S$ is nonzero (partial curvature) and the eigenvectors of the antisymmetric part $A$ are in general position with respect to the kernel of $S$. We then study the convergence rates when $S \ll A$ and prove that they typically depend on the average of the eigenvalues of $S$, instead of the minimum as an analogy with minimization problems would suggest. To illustrate our results, we consider the problem of computing mixed Nash equilibria of continuous games. We show that, thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods. For min-max games, it is thus beneficial to add degrees of freedom "with curvature": this can be interpreted as yet another benefit of over-parameterization.
We study the computation of doubly regularized Wasserstein barycenters, a recently introduced family of entropic barycenters governed by inner and outer regularization strengths. Previous research has demonstrated that various regularization parameter choices unify several notions of entropy-penalized barycenters while also revealing new ones, including a special case of debiased barycenters. In this paper, we propose and analyze an algorithm for computing doubly regularized Wasserstein barycenters. Our procedure builds on damped Sinkhorn iterations followed by exact maximization/minimization steps and guarantees convergence for any choice of regularization parameters. An inexact variant of our algorithm, implementable using approximate Monte Carlo sampling, offers the first non-asymptotic convergence guarantees for approximating Wasserstein barycenters between discrete point clouds in the free-support/grid-free setting.
Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HPs) such as initialization scales and learning rates gives only indirect control over this behavior. In this paper, we propose the alignment between the feature updates and the backward pass as a key notion to predict, measure and control feature learning. On the one hand, we show that when alignment holds, the magnitude of feature updates after one SGD step is related to the magnitude of the forward and backward passes by a simple and general formula. This leads to techniques to automatically adjust HPs (initialization scales and learning rates) at initialization and throughout training to attain a desired feature learning behavior. On the other hand, we show that, at random initialization, this alignment is determined by the spectrum of a certain kernel, and that well-conditioned layer-to-layer Jacobians (aka dynamical isometry) implies alignment. Finally, we investigate ReLU MLPs and ResNets in the large width-then-depth limit. Combining hints from random matrix theory and numerical experiments, we show that (i) in MLPs with iid initializations, alignment degenerates with depth, making it impossible to start training, and that (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one maintaining non-trivial alignment at infinite depth.
Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite-dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also show how to adapt the method to deal with mass variations, a useful extension when dealing with single-cell RNA-sequencing data where cells can branch and die.
Noisy particle gradient descent (NPGD) is an algorithm to minimize convex functions over the space of measures that include an entropy term. In the many-particle limit, this algorithm is described by a Mean-Field Langevin dynamics -- a generalization of the Langevin dynamics with a non-linear drift -- which is our main object of study. Previous works have shown its convergence to the unique minimizer via non-quantitative arguments. We prove that this dynamics converges at an exponential rate, under the assumption that a certain family of Log-Sobolev inequalities holds. This assumption holds for instance for the minimization of the risk of certain two-layer neural networks, where NPGD is equivalent to standard noisy gradient descent. We also study the annealed dynamics, and show that for a noise decaying at a logarithmic rate, the dynamics converges in value to the global minimizer of the unregularized objective function.
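To make the discretized dynamics concrete, here is a hedged NPGD sketch for one convex objective over measures; the feature map, target vector, and step sizes are hypothetical choices, not taken from the paper. Each particle follows the gradient of the first variation of the fitting term, and the Gaussian noise accounts for the entropy term.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = np.arange(1, 6)

def Phi(x):                      # feature map R -> R^5 (illustrative choice)
    return np.sin(np.outer(x, freqs))

def dPhi(x):                     # derivative of Phi in x, shape (n, 5)
    return freqs * np.cos(np.outer(x, freqs))

# Convex objective over measures: F(mu) = 0.5*||E_mu[Phi] - y||^2 + lam*Ent(mu)
y = np.array([0.5, -0.2, 0.1, 0.0, 0.3])    # hypothetical target moments
X = rng.standard_normal(500)                # particles ~ initial measure
eta, lam = 0.05, 0.01
for _ in range(2000):
    resid = Phi(X).mean(axis=0) - y         # E_mu[Phi] - y
    grad = dPhi(X) @ resid                  # grad_x of the first variation at each particle
    X = X - eta * grad + np.sqrt(2 * eta * lam) * rng.standard_normal(X.shape)
print(np.abs(Phi(X).mean(axis=0) - y).max())  # fitting residual decreases along the flow
```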
The function that maps a family of probability measures to the solution of the dual entropic optimal transport problem is known as the Schr\"odinger map. We prove that when the cost function is $\mathcal{C}^{k+1}$ with $k\in \mathbb{N}^*$ then this map is Lipschitz continuous from the $L^2$-Wasserstein space to the space of $\mathcal{C}^k$ functions. Our result holds on compact domains and covers the multi-marginal case. We also include regularity results under negative Sobolev metrics weaker than Wasserstein under stronger smoothness assumptions on the cost. As applications, we prove displacement smoothness of the entropic optimal transport cost and the well-posedness of certain Wasserstein gradient flows involving this functional, including the Sinkhorn divergence and a multi-species system.
We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretization cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
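For intuition, the following sketch implements an explicit (forward-Euler) surrogate of the interacting Wasserstein-Fisher-Rao updates; the paper analyzes the implicit proximal-point discretization, and the periodic payoff function below is a toy choice of ours. Weights move by multiplicative (Fisher-Rao) updates and positions by gradient (Wasserstein) updates.

```python
import numpy as np

def payoff(x, y):                 # toy payoff: x-player minimizes, y-player maximizes
    return np.cos(x[:, None] - y[None, :])

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 30)             # atom positions of mu (min player)
y = rng.uniform(0, 2 * np.pi, 30)             # atom positions of nu (max player)
w = np.full(30, 1 / 30); v = np.full(30, 1 / 30)   # atom weights
eta = 0.1
for _ in range(500):
    F = payoff(x, y)
    Vx, Vy = F @ v, F.T @ w                   # payoffs of each atom vs the opponent
    # Fisher-Rao part: multiplicative updates on the weights (simplex mirror steps)
    w = w * np.exp(-eta * Vx); w /= w.sum()
    v = v * np.exp(+eta * Vy); v /= v.sum()
    # Wasserstein part: gradient descent/ascent on the positions
    gx = -np.sin(x[:, None] - y[None, :]) @ v     # d/dx cos(x - y), averaged over nu
    gy = +np.sin(x[:, None] - y[None, :]).T @ w   # d/dy cos(x - y), averaged over mu
    x = x - eta * gx
    y = y + eta * gy
print(w @ payoff(x, y) @ v)       # value of the game under the current mixed strategies
```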
This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We obtain that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear neural network. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of neurons. We finally study the continuous-time limit obtained for infinitely wide linear neural networks and show that the linear predictors of the neural network converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)\,k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates for some $q \in \{1, 2, 4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.
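A minimal sketch of the multiplicative update analyzed above, applied to a convex least-squares objective over probability measures discretized on a one-dimensional grid; the feature matrix, target moments, and step size are illustrative assumptions of ours.

```python
import numpy as np

grid = np.linspace(-1, 1, 200)                 # discretization of the domain
A = np.stack([grid ** k for k in range(5)])    # moment features, shape (5, 200)
y = np.array([1.0, 0.0, 0.4, 0.0, 0.25])       # hypothetical (feasible) target moments

def grad(mu):                                  # gradient of F(mu) = 0.5*||A mu - y||^2
    return A.T @ (A @ mu - y)

mu = np.full(len(grid), 1 / len(grid))         # uniform initialization on the simplex
eta = 0.1
for k in range(2000):
    mu = mu * np.exp(-eta * grad(mu))          # multiplicative (entropic mirror) update
    mu /= mu.sum()                             # Bregman projection back onto the simplex
print(0.5 * np.sum((A @ mu - y) ** 2))         # suboptimality gap, decaying ~ log(k)/k
```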
To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a "base divergence" between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then make these results more precise in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
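The slicing construction itself is short to implement. Below is a sketch with the Wasserstein-2 distance as base divergence, where the one-dimensional distance between equal-size samples reduces to sorting; the sample sizes and number of projections are arbitrary choices.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=200, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-2 distance between samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)        # uniform random direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)      # squared W2 between 1-D projections
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
Y = rng.standard_normal((500, 10)) + 1.0      # shifted copy of the same law
print(sliced_w2(X, Y))                        # estimate with dimension-free sample complexity
```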
In this paper, we characterize a degenerate PDE as the gradient flow in the space of nonnegative measures endowed with an optimal transport-growth metric. The PDE of concern, of Hele-Shaw type, was introduced by Perthame et al. as a mechanical model for tumor growth and the metric was introduced recently in several articles as the analogue of the Wasserstein metric for nonnegative measures. We show existence of solutions using minimizing movements and show uniqueness of solutions on convex domains by proving the Evolution Variational Inequality. Our analysis does not require any regularity assumption on the initial condition. We also derive a numerical scheme based on the discretization of the gradient flow and the idea of entropic regularization. We assess the convergence of the scheme on explicit solutions. In doing this analysis, we prove several new properties of the optimal transport-growth metric, which generally have a known counterpart for the Wasserstein metric.
The squared Wasserstein distance is a natural quantity to compare probability distributions in a non-parametric setting. This quantity is usually estimated with the plug-in estimator, defined via a discrete optimal transport problem which can be solved to $\varepsilon$-accuracy by adding an entropic regularization of order $\varepsilon$ and using for instance Sinkhorn's algorithm. In this work, we propose instead to estimate it with the Sinkhorn divergence, which is also built on entropic regularization but includes debiasing terms. We show that, for smooth densities, this estimator has a comparable sample complexity but allows higher regularization levels, of order $\varepsilon^{1/2}$, which leads to improved computational complexity bounds and a strong speedup in practice. Our theoretical analysis covers the case of both randomly sampled densities and deterministic discretizations on uniform grids. We also propose and analyze an estimator based on Richardson extrapolation of the Sinkhorn divergence which enjoys improved statistical and computational efficiency guarantees, under a condition on the regularity of the approximation error, which is in particular satisfied for Gaussian densities. We finally demonstrate the efficiency of the proposed estimators with numerical experiments.
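A hedged sketch of the debiasing idea: the Sinkhorn divergence corrects the entropic cost with two self-transport terms, $S_\varepsilon(\mu,\nu) = \mathrm{OT}_\varepsilon(\mu,\nu) - \tfrac12 \mathrm{OT}_\varepsilon(\mu,\mu) - \tfrac12 \mathrm{OT}_\varepsilon(\nu,\nu)$. The log-domain solver and parameter values below are illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

def ot_eps(C, a, b, eps, n_iter=500):
    """Entropy-regularized OT cost (dual value at convergence), log-domain Sinkhorn."""
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        f = -eps * logsumexp(np.log(b)[None, :] + (g[None, :] - C) / eps, axis=1)
        g = -eps * logsumexp(np.log(a)[:, None] + (f[:, None] - C) / eps, axis=0)
    return f @ a + g @ b

def sinkhorn_divergence(X, Y, eps):
    sq = lambda U, V: ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    a, b = np.full(len(X), 1 / len(X)), np.full(len(Y), 1 / len(Y))
    return (ot_eps(sq(X, Y), a, b, eps)
            - 0.5 * ot_eps(sq(X, X), a, a, eps)
            - 0.5 * ot_eps(sq(Y, Y), b, b, eps))

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((200, 2)), rng.standard_normal((200, 2)) + 0.5
print(sinkhorn_divergence(X, Y, eps=0.5))   # debiasing tolerates larger eps
```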
With the huge progress in data acquisition realized over the past decades, and acquisition systems now able to produce high-resolution grids and point clouds, the digitization of physical terrains becomes increasingly precise. Such extreme quantities of generated and modeled data greatly impact computational performance on many levels of high-performance computing (HPC): storage media, memory requirements, transfer capability, and finally simulation interactivity, necessary to exploit this instance of big data. Efficient representations and storage are thus becoming "enabling technologies" in HPC experimental and simulation science. We propose HexaShrink, an original decomposition scheme for structured hexahedral volume meshes. The latter are used for instance in biomedical engineering, materials science, or geosciences. HexaShrink provides a comprehensive framework allowing efficient mesh visualization and storage. Its exactly reversible multiresolution decomposition yields a hierarchy of meshes of increasing levels of details, in terms of either geometry, continuous or categorical properties of cells. Starting with an overview of volume mesh compression techniques, our contribution blends coherently different multiresolution wavelet schemes in different dimensions. It results in a global framework preserving discontinuities (faults) across scales, implemented as a fully reversible upscaling at different resolutions. Experimental results are provided on meshes of varying size and complexity. They emphasize the consistency of the proposed representation, in terms of visualization, attribute downsampling and distribution at different resolutions. Finally, HexaShrink yields gains in storage space when combined with lossless compression techniques.
Optimal transport (OT) and maximum mean discrepancies (MMD) are now routinely used in machine learning to compare probability measures. We focus in this paper on \emph{Sinkhorn divergences} (SDs), a regularized variant of OT distances which can interpolate, depending on the regularization strength $\varepsilon$, between OT ($\varepsilon=0$) and MMD ($\varepsilon=\infty$). Although the tradeoff induced by that regularization is now well understood computationally (OT, SDs and MMD require respectively $O(n^3\log n)$, $O(n^2)$ and $n^2$ operations given a sample size $n$), much less is known in terms of their \emph{sample complexity}, namely the gap between these quantities, when evaluated using finite samples \emph{vs.} their respective densities. Indeed, while the sample complexities of OT and MMD stand at two extremes, $1/n^{1/d}$ for OT in dimension $d$ and $1/\sqrt{n}$ for MMD, that of SDs has only been studied empirically. In this paper, we \emph{(i)} derive a bound on the approximation error made with SDs when approximating OT as a function of the regularizer $\varepsilon$, \emph{(ii)} prove that the optimizers of regularized OT are bounded in a Sobolev (RKHS) ball independent of the two measures and \emph{(iii)} provide the first sample complexity bound for SDs, obtained by reformulating SDs as a maximization problem in a RKHS. We thus obtain a scaling in $1/\sqrt{n}$ (as in MMD), with a constant that depends however on $\varepsilon$, making the bridge between OT and MMD complete.
In a series of recent theoretical works, it has been shown that strongly over-parameterized neural networks trained with gradient-based methods could converge linearly to zero loss, with their parameters hardly varying. In this note, our goal is to exhibit the simple structure that is behind these results. In a simplified setting, we prove that lazy training essentially solves a kernel regression. We also show that this behavior is not so much due to over-parameterization as to a choice of scaling, often implicit, that makes it possible to linearize the model around its initialization. These theoretical results, complemented with simple numerical experiments, make it seem unlikely that lazy training is behind the many successes of neural networks in high dimensional tasks.
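A toy numerical illustration of the scaling mechanism, using a two-parameter model of our own choosing (not from the note): scaling the centered model by $\alpha$ and the step size by $1/\alpha^2$ lets gradient descent fit the target while the parameters move less and less, as in the linearized (kernel) regime.

```python
import numpy as np

def h(th):                                    # tiny nonlinear 2-parameter model
    return np.tanh(th[0]) * th[1]

def grad_h(th):
    return np.array([(1 - np.tanh(th[0]) ** 2) * th[1], np.tanh(th[0])])

y = 1.0                                       # scalar regression target
for alpha in [1.0, 10.0, 100.0]:
    th0 = np.array([0.5, 0.5])
    th = th0.copy()
    lr = 0.4 / alpha ** 2                     # step size rescaled with alpha
    for _ in range(2000):
        resid = alpha * (h(th) - h(th0)) - y  # centered, scaled model output
        th = th - lr * resid * alpha * grad_h(th)
    print(alpha, np.linalg.norm(th - th0))    # parameter displacement shrinks like 1/alpha
```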
This thesis generalizes optimal transport beyond the classical balanced setting of probability distributions. We define unbalanced optimal transport models between nonnegative measures, based either on the notion of interpolation or the notion of coupling of measures. We show relationships between these approaches. One of the outcomes of this framework is a generalization of the $p$-Wasserstein metrics. Secondly, we build numerical methods to solve interpolation and coupling-based models. We study, in particular, a new family of scaling algorithms that generalize Sinkhorn's algorithm. The third part deals with applications. It contains a theoretical and numerical study of a Hele-Shaw type gradient flow in the space of nonnegative measures. It also addresses the case of measures taking values in the cone of positive semi-definite matrices, for which we introduce a model that achieves a balance between geometrical accuracy and algorithmic efficiency.
This article describes a set of methods for quickly computing the solution to the regularized optimal transport problem. It generalizes and improves upon the widely-used iterative Bregman projections algorithm (or Sinkhorn--Knopp algorithm). We first propose to rely on regularized nonlinear acceleration schemes. In practice, such approaches lead to fast algorithms, but their global convergence is not ensured. Hence, we next propose a new algorithm with convergence guarantees. The idea is to overrelax the Bregman projection operators, allowing for faster convergence. We propose a simple method for establishing global convergence by ensuring the decrease of a Lyapunov function at each step. An adaptive choice of the overrelaxation parameter based on the Lyapunov function is constructed. We also suggest a heuristic to choose a suitable asymptotic overrelaxation parameter, based on a local convergence analysis. Our numerical experiments show a gain in convergence speed by an order of magnitude in certain regimes.
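The core overrelaxation step is a one-line change to the Sinkhorn scalings: each Bregman projection is replaced by a geometric interpolation with parameter $\omega \in (1,2)$. The sketch below uses a fixed illustrative value of $\omega$, whereas the paper adapts it via a Lyapunov function, which is what secures global convergence.

```python
import numpy as np

def overrelaxed_sinkhorn(C, a, b, eps, omega=1.5, n_iter=300):
    """Sinkhorn scalings with overrelaxed Bregman projections (fixed omega)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = u ** (1 - omega) * (a / (K @ v)) ** omega      # overrelaxed row scaling
        v = v ** (1 - omega) * (b / (K.T @ u)) ** omega    # overrelaxed column scaling
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
C = rng.random((30, 30))
a = b = np.full(30, 1 / 30)
P = overrelaxed_sinkhorn(C, a, b, eps=0.05)
print(np.abs(P.sum(axis=1) - a).max())   # marginal violation, driven toward 0
```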
This article introduces a new notion of optimal transport (OT) between tensor fields, which are measures whose values are positive semidefinite (PSD) matrices. This "quantum" formulation of OT (Q-OT) corresponds to a relaxed version of the classical Kantorovich transport problem, where the fidelity between the input PSD-valued measures is captured using the geometry of the von Neumann quantum entropy. We propose a quantum-entropic regularization of the resulting convex optimization problem, which can be solved efficiently using an iterative scaling algorithm. This method is a generalization of the celebrated Sinkhorn algorithm to the quantum setting of PSD matrices. We extend this formulation and the quantum Sinkhorn algorithm to compute barycenters within a collection of input tensor fields. We illustrate the usefulness of the proposed approach on applications to procedural noise generation, anisotropic meshing, diffusion tensor imaging and spectral texture synthesis.
This article presents a new class of distances between arbitrary nonnegative Radon measures inspired by optimal transport. These distances are defined by two equivalent alternative formulations: (i) a dynamic formulation defining the distance as a geodesic distance over the space of measures; (ii) a static formulation where the distance is the minimum of an optimization problem over pairs of couplings describing the transfer (transport, creation and destruction) of mass between two measures. Both formulations are convex optimization problems, and the ability to switch from one to the other depending on the targeted application is a crucial property of our models. Of particular interest is the Wasserstein-Fisher-Rao metric, recently introduced independently by Chizat et al. and Kondratyev et al. Defined initially through a dynamic formulation, it belongs to this class of metrics and hence automatically benefits from a static Kantorovich formulation.
This paper defines a new transport metric over the space of non-negative measures. This metric interpolates between the quadratic Wasserstein and the Fisher-Rao metrics and generalizes optimal transport to measures with different masses. It is defined as a generalization of the dynamical formulation of optimal transport of Benamou and Brenier, by introducing a source term in the continuity equation. The influence of this source term is measured using the Fisher-Rao metric, and is averaged with the transportation term. This gives rise to a convex variational problem defining our metric. Our first contribution is a proof of the existence of geodesics (i.e., solutions to this variational problem). We then show that (generalized) optimal transport and Fisher-Rao metrics are obtained as limiting cases of our metric. Our last theoretical contribution is a proof that geodesics between mixtures of sufficiently close Diracs are made of translating mixtures of Diracs. Lastly, we propose a numerical scheme making use of first-order proximal splitting methods, and we show an application of this new distance to image interpolation.
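Schematically, and up to normalization constants that vary between references, the dynamic problem described above can be written as follows, where v is the transport velocity field, alpha the growth (source) rate, and delta a length-scale parameter balancing the two terms.

```latex
% Dynamic (Benamou-Brenier-type) formulation with a source term alpha,
% up to normalization constants; delta balances transport against growth.
\[
  d_\delta^2(\rho_0,\rho_1)
  \;=\;
  \inf_{(\rho_t,\, v_t,\, \alpha_t)}
  \int_0^1 \!\! \int_\Omega
  \bigl( \lvert v_t(x) \rvert^2 \;+\; \delta^2\, \alpha_t(x)^2 \bigr)\,
  \mathrm{d}\rho_t(x)\, \mathrm{d}t
\]
\[
  \text{subject to}\quad
  \partial_t \rho_t + \operatorname{div}(\rho_t v_t) \;=\; \rho_t\, \alpha_t,
  \qquad \rho_{t=0} = \rho_0, \quad \rho_{t=1} = \rho_1 .
\]
% Forcing alpha = 0 recovers the continuity-equation constraint of classical
% (Benamou-Brenier) optimal transport, while dropping the transport term
% leaves a pure Fisher-Rao-type growth cost.
```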
This article presents a new class of distances between arbitrary nonnegative Radon measures inspired by optimal transport. These distances are defined by two equivalent alternative formulations: (i) a dynamic formulation defining the distance as a geodesic distance over the space of measures; (ii) a static "Kantorovich" formulation where the distance is the minimum of an optimization problem over pairs of couplings describing the transfer (transport, creation and destruction) of mass between two measures. Both formulations are convex optimization problems, and the ability to switch from one to the other depending on the targeted application is a crucial property of our models. Of particular interest is the Wasserstein-Fisher-Rao metric, recently introduced independently by Chizat et al. and Kondratyev et al. Defined initially through a dynamic formulation, it belongs to this class of metrics and hence automatically benefits from a static Kantorovich formulation.