Concentration inequalities for functions of independent random variables is an area of probability theory that has witnessed a great revolution in the last few decades, and has applications in a wide variety of areas such as machine learning, statistics, discrete mathematics, and high-dimensional geometry. Roughly speaking, if a function of many independent random variables does not depend too much on any of the variables then it is concentrated in the sense that with high probability, it is close to its expected value. This book offers a host of inequalities to illustrate this rich theory in an accessible way by covering the key developments and applications in the field. The authors describe the interplay between the probabilistic structure (independence) and a variety of tools ranging from functional inequalities to transportation arguments to information theory. Applications to the study of empirical processes, random projections, random matrix theory, and threshold phenomena are also presented. A self-contained introduction to concentration inequalities, it includes a survey of concentration of sums of independent random variables, variance bounds, the entropy method, and the transportation method. Deep connections with isoperimetric problems are revealed whilst special attention is paid to applications to the supremum of empirical processes. Written by leading experts in the field and containing extensive exercise sections, this book will be an invaluable resource for researchers and graduate students in mathematics, theoretical computer science, and engineering.
We consider the problem of estimating $\|s\|^2$ when $s$ belongs to some separable Hilbert space $\mathbb{H}$ and one observes the Gaussian process $Y(t) = \langle s, t\rangle + \sigma L(t)$, for all $t \in \mathbb{H}$, where $L$ is some Gaussian isonormal process. This framework allows us in particular to consider the classical “Gaussian sequence model” for which $\mathbb{H} = \ell_2(\mathbb{N}^*)$ and $L(t) = \sum_{\lambda\geq1}t_{\lambda}\varepsilon_{\lambda}$, where $(\varepsilon_{\lambda})_{\lambda\geq1}$ is a sequence of i.i.d. standard normal variables. Our approach consists in considering some at most countable families of finite-dimensional linear subspaces of $\mathbb{H}$ (the models) and then using model selection via some conveniently penalized least squares criterion to build new estimators of $\|s\|^2$. We prove a general nonasymptotic risk bound which allows us to show that such penalized estimators are adaptive on a variety of collections of sets for the parameter $s$, depending on the family of models from which they are built. In particular, in the context of the Gaussian sequence model, a convenient choice of the family of models allows defining estimators which are adaptive over collections of hyperrectangles, ellipsoids, $\ell_p$-bodies or Besov bodies. We take special care to describe the conditions under which the penalized estimator is efficient when the level of noise $\sigma$ tends to zero. Our construction is an alternative to the one by Efroïmovich and Low for hyperrectangles and provides new results otherwise.
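To make the sequence-model setting concrete, here is a minimal numpy sketch (purely illustrative signal, noise level and dimensions) of the building block behind such penalized estimators: within a model of dimension $D$, $\sum_{\lambda\le D}(Y_\lambda^2-\sigma^2)$ is an unbiased estimator of $\sum_{\lambda\le D}s_\lambda^2$. The penalized choice of $D$ itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian sequence model: y_lambda = s_lambda + sigma * eps_lambda, lambda = 1, 2, ...
N, sigma = 2000, 0.05
lam = np.arange(1, N + 1)
s = lam ** (-1.2)                     # a square-summable signal (illustrative choice)
y = s + sigma * rng.standard_normal(N)

true_norm_sq = np.sum(s ** 2)

# Within a model of dimension D, sum_{lambda <= D} (y_lambda^2 - sigma^2) is an
# unbiased estimator of sum_{lambda <= D} s_lambda^2: the bias with respect to
# ||s||^2 decreases with D while the variance increases, which is the tradeoff
# that the penalized criterion has to balance.
for D in (10, 50, 200, 1000):
    est = np.sum(y[:D] ** 2 - sigma ** 2)
    print(f"D = {D:4d}   estimate = {est:.4f}   ||s||^2 = {true_norm_sq:.4f}")
```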
Let $\hat{F}_n$ denote the empirical distribution function for a sample of $n$ i.i.d. random variables with distribution function $F$. In 1956 Dvoretzky, Kiefer and Wolfowitz proved that $P\big(\sqrt{n} \sup_x(\hat{F}_n(x) - F(x)) > \lambda\big) \leq C \exp(-2\lambda^2),$ where $C$ is some unspecified constant. We show that $C$ can be taken as 1 (as conjectured by Birnbaum and McCarty in 1958), provided that $\exp(-2\lambda^2) \leq \frac{1}{2}$. In particular, the two-sided inequality $P\big(\sqrt{n} \sup_x|\hat{F}_n(x) - F(x)| > \lambda\big) \leq 2 \exp(-2\lambda^2)$ holds without any restriction on $\lambda$. In the one-sided as well as in the two-sided case, the constants cannot be further improved.
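The two-sided inequality is easy to probe numerically. The sketch below (uniform data, with arbitrary sample size, replication count and threshold) uses the closed-form expression of the Kolmogorov–Smirnov statistic in terms of order statistics and compares the empirical tail probability with the bound $2\exp(-2\lambda^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sim, lam = 100, 50_000, 1.0

# For U(0,1) samples, sup_x |F_n(x) - F(x)| = max_i max(i/n - U_(i), U_(i) - (i-1)/n).
u = np.sort(rng.uniform(0, 1, (n_sim, n)), axis=1)
i = np.arange(1, n + 1)
ks = np.maximum(i / n - u, u - (i - 1) / n).max(axis=1)

emp = np.mean(np.sqrt(n) * ks > lam)
print(f"empirical P(sqrt(n) sup|F_n - F| > {lam}) = {emp:.4f}")
print(f"DKW bound 2 exp(-2 lam^2)                = {2 * np.exp(-2 * lam ** 2):.4f}")
```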
We propose some explicit values for the constants involved in the exponential concentration inequalities for empirical processes which are due to Talagrand. It has been shown by Ledoux that deviation inequalities for empirical processes could be obtained by iteration of logarithmic Sobolev type inequalities. Our approach follows closely that of Ledoux. The improvements that we get with respect to Ledoux's work are based on refinements of his entropy inequalities and computations.
We investigate a new methodology, worked out by Ledoux and Massart, to prove concentration-of-measure inequalities. The method is based on certain modified logarithmic Sobolev inequalities. We provide some very simple and general ready-to-use inequalities. One of these inequalities may be considered as an exponential version of the Efron--Stein inequality. The main purpose of this paper is to point out the simplicity and the generality of the approach. We show how the new method can recover many of Talagrand's revolutionary inequalities and provide new applications in a variety of problems including Rademacher averages, Rademacher chaos, the number of certain small subgraphs in a random graph, and the minimum of the empirical risk in some statistical estimation problems.
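As a reminder of the inequality that this exponential version extends, here is a small Monte Carlo sketch of the Efron–Stein bound $\mathrm{Var}\,f(X)\le\frac12\sum_i\mathbb{E}[(f(X)-f(X^{(i)}))^2]$, where $X^{(i)}$ replaces the $i$-th coordinate by an independent copy; the statistic (a maximum of uniforms) and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 20, 20_000

def f(x):
    return x.max(axis=-1)             # the statistic: maximum of n uniform variables

X = rng.uniform(0, 1, (n_sim, n))
var_f = f(X).var()

# Efron-Stein: Var f(X) <= (1/2) * sum_i E[(f(X) - f(X^(i)))^2].
es_bound = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = rng.uniform(0, 1, n_sim)   # replace coordinate i by an independent copy
    es_bound += 0.5 * np.mean((f(X) - f(Xi)) ** 2)

print(f"Var f(X) = {var_f:.5f}   Efron-Stein bound = {es_bound:.5f}")
```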
The purpose of this paper is to illustrate the power of concentration inequalities by presenting some striking applications to various model selection problems in statistics. We study in detail two main examples. We first consider the long-standing problem of optimal selection of a histogram (following the lines of Castellan's work on this topic), for which the structure of Talagrand's concentration inequalities for empirical processes directly influences the construction of a modified Akaike criterion. We also present a new model selection theorem which can be applied to improve on Vapnik's structural risk minimization method for the statistical learning problem of pattern recognition.
We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov’s analysis of the risk of an ERM under margin type conditions by using concentration inequalities for conveniently weighted empirical processes. This allows us to deal with ways of measuring the “size” of a class of classifiers other than entropy with bracketing as in Tsybakov’s work. In particular, we derive new risk bounds for the ERM when the classification rules belong to some VC-class under margin conditions and discuss the optimality of these bounds in a minimax sense.
Let $\varphi$ be a smooth function of $k + 2$ variables. We shall investigate in this paper the rates of convergence of estimators of $T(f) = \int\varphi(f(x), f'(x), \ldots, f^{(k)}(x), x)\, dx$ when $f$ belongs to some class of densities of smoothness $s$. We prove that, when $s \geq 2k + \frac{1}{4}$, one can define an estimator $\hat{T}_n$ of $T(f)$, based on $n$ i.i.d. observations of density $f$ on the real line, which converges at the semiparametric rate $1/\sqrt{n}$. On the other hand, when $s < 2k + \frac{1}{4}$, $T(f)$ cannot be estimated at a rate faster than $n^{-\gamma}$ with $\gamma = 4(s - k)/(4s + 1)$. We shall also provide some extensions to the multidimensional case. Those results extend previous works of Levit, of Bickel and Ritov and of Donoho and Nussbaum on estimation of quadratic functionals.
We derive a new general concentration-of-measure inequality. The concentration inequality applies, among others, to configuration functions as defined by Talagrand and also to combinatorial entropies such as the logarithm of the number of increasing subsequences in a random permutation and to Vapnik-Chervonenkis (VC) entropies. The results find direct applications in statistical learning theory, substantiating the possibility to use the empirical VC entropy in penalization techniques.
A general method for obtaining moment inequalities for functions of independent random variables is presented. It is a generalization of the entropy method which has been used to derive concentration inequalities for such functions [Boucheron, Lugosi and Massart Ann. Probab. 31 (2003) 1583–1614], and is based on a generalized tensorization inequality due to Latała and Oleszkiewicz [Lecture Notes in Math. 1745 (2000) 147–168]. The new inequalities prove to be a versatile tool in a wide range of applications. We illustrate the power of the method by showing how it can be used to effortlessly re-derive classical inequalities including Rosenthal and Kahane–Khinchine-type inequalities for sums of independent random variables, moment inequalities for suprema of empirical processes and moment inequalities for Rademacher chaos and U-statistics. Some of these corollaries are apparently new. In particular, we generalize Talagrand's exponential inequality for Rademacher chaos of order 2 to any order. We also discuss applications for other complex functions of independent random variables, such as suprema of Boolean polynomials which include, as special cases, subgraph counting problems in random graphs.
The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes. Our main result builds on the observation made by other authors that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle using a penalized criterion. It is then possible to adapt general methods related to model selection in this framework to study two important points: (1) what is the minimum penalty and how does it compare to the penalty actually used in the SVM algorithm; (2) is it possible to obtain ``oracle inequalities'' in that setting, for the specific loss function used in the SVM algorithm? We show that the answer to the latter question is positive and provides relevant insight to the former. Our result shows that it is possible to obtain fast rates of convergence for SVMs.
Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. We propose a completely data-driven calibration algorithm for this parameter in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birge and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general.
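The slope heuristics can be illustrated on a toy regressogram example: sweep the constant $c$ in a penalty of the assumed form $\mathrm{pen}(D)=cD/n$, locate the dimension jump that signals the minimal penalty, and select the model with roughly twice that constant. The sketch below is only a schematic illustration under simplifying assumptions (homoscedastic noise, equally spaced bins, a fixed grid of constants), not the procedure analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

def regressogram_risk(x, y, D):
    """Empirical least-squares risk of the regressogram with D equal-width bins."""
    bins = np.minimum((x * D).astype(int), D - 1)
    pred = np.zeros_like(y)
    for b in range(D):
        mask = bins == b
        if mask.any():
            pred[mask] = y[mask].mean()
    return np.mean((y - pred) ** 2)

dims = np.arange(1, 60)
risks = np.array([regressogram_risk(x, y, D) for D in dims])

# Selected dimension as a function of the penalty constant c, pen(D) = c * D / n.
cs = np.linspace(0.0, 1.0, 201)
selected = np.array([dims[np.argmin(risks + c * dims / n)] for c in cs])

# The minimal penalty sits at the largest jump of c -> selected dimension;
# the slope heuristics then recommends roughly twice that constant.
jump = np.argmax(np.abs(np.diff(selected.astype(float))))
c_min = cs[jump + 1]
D_hat = dims[np.argmin(risks + 2 * c_min * dims / n)]
print(f"minimal constant ~ {c_min:.3f}; dimension selected with 2*c_min: {D_hat}")
```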
Let $x_1, \ldots, x_n$ be independent random variables with uniform distribution over $\lbrack 0, 1\rbrack$, defined on a rich enough probability space $\Omega$. Denoting by $\hat{\mathbb{F}}_n$ the empirical distribution function associated with these observations and by $\alpha_n$ the empirical Brownian bridge $\alpha_n(t) = \sqrt n(\hat{\mathbb{F}}_n(t) - t)$, Komlos, Major and Tusnady (KMT) showed in 1975 that a Brownian bridge $\mathbb{B}^0$ (depending on $n$) may be constructed on $\Omega$ in such a way that the uniform deviation $\|\alpha_n - \mathbb{B}^0\|_\infty$ between $\alpha_n$ and $\mathbb{B}^0$ is of order of $\log(n)/\sqrt n$ in probability. In this paper, we prove that a Poisson bridge $\mathbb{L}^0_n$ may be constructed on $\Omega$ (note that this construction is not the usual one) in such a way that the uniform deviations between any two of the three processes $\alpha_n, \mathbb{L}^0_n$ and $\mathbb{B}^0$ are of order of $\log(n)/\sqrt n$ in probability. Moreover, we give explicit exponential bounds for the error terms, intended for asymptotic as well as nonasymptotic use.
Let $P$ be the Lebesgue measure on the unit cube in $\mathbb{R}^d$ and $Z_n$ be the centered and normalized empirical process associated with $n$ independent observations with common law $P$. Given a collection of Borel sets $\mathscr{J}$ in $\mathbb{R}^d$, it is known since Dudley's work that if $\mathscr{J}$ is not too large (e.g., either $\mathscr{J}$ is a Vapnik–Cervonenkis class (VC class) or $\mathscr{J}$ fulfills a suitable "entropy with bracketing" condition), then $(Z_n)$ may be strongly approximated by some sequence of Brownian bridges indexed by $\mathscr{J}$, uniformly over $\mathscr{J}$ with some rate $b_n$. We apply the one-dimensional dyadic scheme previously used by Komlos, Major and Tusnady (KMT) to get as good rates of approximation as possible in the above general multidimensional situation. The most striking result is that, up to a possible power of $\log(n)$, $b_n$ may be taken as $n^{-1/(2d)}$, which is the best possible rate when $\mathscr{J}$ is the class of Euclidean balls (this is the KMT result when $d = 1$ and the lower bounds are due to Beck when $d \geq 2$). We also obtain some related results for the set-indexed partial-sum processes.
We prove a new exponential inequality for the Kaplan–Meier estimator of a distribution function in a right censored data model. This inequality is of the same type as the Dvoretzky–Kiefer–Wolfowitz inequality for the empirical distribution function in the non-censored case. Our approach is based on the Duhamel equation, which allows us to use empirical process theory.
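For readers unfamiliar with the estimator the inequality concerns, here is a minimal product-limit (Kaplan–Meier) implementation for right-censored data; the exponential lifetimes and censoring times below are an illustrative simulation, not part of the paper.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function 1 - F.
    times: observed times (min of lifetime and censoring time);
    events: 1 if the lifetime was observed, 0 if right-censored."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    n = len(t)
    surv, at_risk = 1.0, n
    jump_times, surv_values = [], []
    i = 0
    while i < n:
        j, deaths = i, 0
        while j < n and t[j] == t[i]:      # group ties at the same observed time
            deaths += d[j]
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            jump_times.append(t[i])
            surv_values.append(surv)
        at_risk -= j - i
        i = j
    return np.array(jump_times), np.array(surv_values)

rng = np.random.default_rng(4)
lifetime = rng.exponential(1.0, 200)
censor = rng.exponential(1.5, 200)
obs, evt = np.minimum(lifetime, censor), (lifetime <= censor).astype(int)
times, surv = kaplan_meier(obs, evt)
print(times[:5], surv[:5])
```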
We prove some new concentration inequalities for self-bounding functions using the entropy method. As an application, we recover Talagrand's convex distance inequality. The new Bernstein-like inequalities for self-bounding functions are derived thanks to a careful analysis of the so-called Herbst argument. The latter involves comparison results between solutions of differential inequalities that may be interesting in their own right.
Discussion of ``2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization'' by V. Koltchinskii [arXiv:0708.0083]
While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of these variables, much less attention has been paid to the oracle inequalities for the Lasso involving the ℓ1-norm of the target vector. Such inequalities proved in the literature show that, provided that the regularization parameter is properly chosen, the Lasso approximately mimics the deterministic Lasso. Some of them do not require any assumption at all, neither on the structure of the variables nor on the regression function. Our first purpose here is to provide a conceptually very simple result in this direction in the framework of Gaussian models with non-random regressors. Our second purpose is to propose a new estimator particularly adapted to deal with infinite countable dictionaries. This estimator is constructed as an ℓ0-penalized estimator among a sequence of Lasso estimators associated to a dyadic sequence of growing truncated dictionaries. The selection procedure automatically chooses the best level of truncation of the dictionary so as to make the best tradeoff between approximation, ℓ1-regularization and sparsity. From a theoretical point of view, we shall provide an oracle inequality satisfied by this selected Lasso estimator. The oracle inequalities presented in this paper are obtained via the application of a general theorem of model selection among a collection of nonlinear models which is a direct consequence of the Gaussian concentration inequality. The key idea that enables us to apply this general theorem is to see ℓ1-regularization as a model selection procedure among ℓ1-balls.
Distances to compact sets are widely used in the field of Topological Data Analysis for inferring geometric and topological features from point clouds. In this context, the distance to a probability measure (DTM) has been introduced by Chazal et al., 2011b as a robust alternative to the distance to a compact set. In practice, the DTM can be estimated by its empirical counterpart, that is, the distance to the empirical measure (DTEM). In this paper we give a tight control of the deviation of the DTEM. Our analysis relies on a local analysis of empirical processes. In particular, we show that the rate of convergence of the DTEM directly depends on the regularity at zero of a particular quantile function which contains some local information about the geometry of the support. This quantile function is the relevant quantity to describe precisely how difficult a geometric inference problem is. Several numerical experiments illustrate the convergence of the DTEM and also confirm that our bounds are tight.
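A minimal sketch of the empirical quantity studied here: the distance to the empirical measure with mass parameter $m$ can be computed as the root mean squared distance to the $k$ nearest sample points, with $k=\lceil mn\rceil$; the noisy circle, the outliers, the query points and the value of $m$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# A noisy circle plus a few outliers.
theta = rng.uniform(0, 2 * np.pi, 300)
cloud = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))
cloud = np.vstack([cloud, rng.uniform(-2, 2, (10, 2))])

def dtem(queries, sample, m):
    """Distance to the empirical measure with mass parameter m in (0, 1]:
    root mean squared distance to the k nearest sample points, k = ceil(m * n)."""
    n = len(sample)
    k = max(1, int(np.ceil(m * n)))
    d2 = ((queries[:, None, :] - sample[None, :, :]) ** 2).sum(axis=-1)
    knn = np.partition(d2, k - 1, axis=1)[:, :k]   # k smallest squared distances
    return np.sqrt(knn.mean(axis=1))

queries = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 2.0]])
print(dtem(queries, cloud, m=0.05))   # small near the circle, larger elsewhere
```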
The Lasso has attracted the attention of many authors in recent years. While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of these variables, much less attention has been paid to the analysis of the performance of the Lasso as a regularization algorithm. Our first purpose here is to provide a conceptually very simple result in this direction. We shall prove that, provided that the regularization parameter is properly chosen, the Lasso works almost as well as the deterministic Lasso. This result does not require any assumption at all, neither on the structure of the variables nor on the regression function. Our second purpose is to introduce a new estimator particularly adapted to deal with infinite countable dictionaries. This estimator is constructed as an ℓ0-penalized estimator among a sequence of Lasso estimators associated to a dyadic sequence of growing truncated dictionaries. The selection procedure automatically chooses the best level of truncation of the dictionary so as to make the best tradeoff between approximation, ℓ1-regularization and sparsity. From a theoretical point of view, we shall provide an oracle inequality satisfied by this selected Lasso estimator. The oracle inequalities established for the Lasso and the selected Lasso estimators enable us to derive rates of convergence on a wide class of functions, showing that these estimators perform at least as well as greedy algorithms. Besides, we shall prove that the rates of convergence achieved by the selected Lasso estimator are optimal in the orthonormal case by bounding from below the minimax risk on some Besov bodies. Finally, some theoretical results about the performance of the Lasso for infinite uncountable dictionaries are studied in the specific framework of neural networks. All the oracle inequalities presented in this paper are obtained via the application of a single general theorem of model selection among a collection of nonlinear models which is a direct consequence of the Gaussian concentration inequality. The key idea that enables us to apply this general theorem is to see ℓ1-regularization as a model selection procedure among ℓ1-balls.
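For concreteness, here is a generic coordinate-descent solver for the Lasso objective $\frac{1}{2n}\|y-X\beta\|_2^2+\lambda\|\beta\|_1$, using soft thresholding. It is a textbook sketch, not the selected Lasso estimator of the paper; the design, sparsity pattern and regularization level are arbitrary.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]                   # remove coordinate j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j] if col_sq[j] > 0 else 0.0
            resid -= X[:, j] * beta[j]                   # put it back
    return beta

rng = np.random.default_rng(6)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)
beta_hat = lasso_cd(X, y, lam=0.1)
print("nonzero coefficients:", np.flatnonzero(np.abs(beta_hat) > 1e-8))
```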
The optimal coupling between a variable with the Bin(n,1/2) distribution and a normal random variable lies at the heart of the proof of the KMT Theorem for the empirical distribution function. Tusnády's Lemma (published in 1977 in his dissertation, in Hungarian) provides an inequality with explicit absolute constants which says that, for this coupling, the distance between the random variables remains bounded in probability. In the appendix of a joint work with Jean Bretagnolle (1989), we proposed a proof of Tusnády's Lemma which, though elementary, is highly technical and considered rather obscure, at least judging from several conversations with motivated readers. The purpose of this paper is to provide an alternative proof which is still based on elementary computations but which we hope to be simpler and more illuminating. This new proof also leads to a slight improvement on the original result in terms of constants.
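The quantile coupling itself is straightforward to simulate: set $B=F_B^{-1}(\Phi(Z))$ for a standard normal $Z$, so that $B$ has the Bin(n, 1/2) distribution, and check a commonly cited Tusnády-type bound $|B-n/2-\sqrt{n}Z/2|\le 1+Z^2/8$. The exact constants vary across references (and this paper improves them), so the sketch below is a sanity check rather than a statement of the lemma.

```python
import numpy as np
from scipy.stats import binom, norm

rng = np.random.default_rng(7)
n, n_sim = 50, 20_000

z = rng.standard_normal(n_sim)
# Quantile coupling: B is a monotone function of Z with the Bin(n, 1/2) distribution.
b = binom.ppf(norm.cdf(z), n, 0.5)

lhs = np.abs(b - n / 2 - np.sqrt(n) * z / 2)
rhs = 1 + z ** 2 / 8
print("fraction of draws violating the bound:", np.mean(lhs > rhs))
```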
Kernel density estimation is a well known method involving a smoothing parameter (the bandwidth) that needs to be tuned by the user. Although this method has been widely used, bandwidth selection remains a challenging issue in terms of balancing algorithmic performance and statistical relevance. The purpose of this paper is to study a recently developed bandwidth selection method, called Penalized Comparison to Overfitting (PCO). We first provide new theoretical guarantees by proving that PCO performed with non-diagonal bandwidth matrices is optimal in the oracle and minimax approaches. PCO is then compared to other usual bandwidth selection methods (at least those implemented in standard R packages) for univariate and also multivariate kernel density estimation on the basis of intensive simulation studies. In particular, cross-validation and plug-in criteria are numerically investigated and compared to PCO. The take-home message is that PCO can outperform the classical methods without additional algorithmic cost.
Kernel density estimation is a well known method involving a smoothing parameter (the bandwidth) that needs to be tuned by the user. Although this method has been widely used, bandwidth selection remains a challenging issue in terms of balancing algorithmic performance and statistical relevance. The purpose of this paper is to compare a recently developed bandwidth selection method for kernel density estimation to those which are commonly used by now (at least those implemented in standard R packages). This new method is called Penalized Comparison to Overfitting (PCO). It has been proposed by some of the authors of this paper in a previous work devoted to its statistical relevance from a purely theoretical perspective. It is compared here to other usual bandwidth selection methods for univariate and also multivariate kernel density estimation on the basis of intensive simulation studies. In particular, cross-validation and plug-in criteria are numerically investigated and compared to PCO. The take-home message is that PCO can outperform the classical methods without additional algorithmic cost.
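A schematic version of the PCO idea: compare each kernel estimate, in squared $L_2$ distance, to a deliberately overfitting estimate built with the smallest bandwidth in the family, and add a penalty. The penalty $\|K\|_2^2/(nh)$ used below is a simplified stand-in chosen for illustration, not necessarily the calibrated penalty of these papers, and the data and bandwidth grid are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400
data = np.concatenate([rng.normal(-1.5, 0.5, n // 2), rng.normal(1.0, 0.8, n // 2)])

grid = np.linspace(-4, 4, 800)
dx = grid[1] - grid[0]

def kde(sample, h):
    """Gaussian kernel density estimate evaluated on the grid."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

bandwidths = np.geomspace(0.02, 1.0, 30)
f_over = kde(data, bandwidths.min())            # deliberately overfitting estimate

norm_K_sq = 1.0 / (2 * np.sqrt(np.pi))          # ||K||_2^2 for the Gaussian kernel
crit = []
for h in bandwidths:
    dist_sq = np.sum((kde(data, h) - f_over) ** 2) * dx   # ||f_h - f_{h_min}||^2
    crit.append(dist_sq + norm_K_sq / (n * h))            # comparison term + penalty
print(f"PCO-style selected bandwidth: {bandwidths[int(np.argmin(crit))]:.3f}")
```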
Nemirovski's inequality states that given independent random vectors $X_{1},\ldots,X_{n}$, centered at expectation, with values in $\ell^p(\mathbb{R}^d)$, there exists some constant $C(p,d)$ such that \[\mathbb{E}\Vert S_n\Vert _p^2\le C(p,d)\sum_{i=1}^{n}\mathbb{E}\Vert X_i\Vert _p^2.\] Furthermore, $C(p,d)$ can be taken as $\kappa(p\wedge \log(d))$. Two cases were studied further in [Am. Math. Mon. 117(2) (2010) 138–160]: general finite-dimensional Banach spaces and the special case $\ell^{\infty}(\mathbb{R}^{d})$. We show that in these two cases, it is possible to replace the quantity $\sum_{i=1}^n\mathbb{E}\Vert X_i\Vert _p^2$ by a smaller one without changing the order of magnitude of the constant when $d$ becomes large. In the spirit of [Am. Math. Mon. 117(2) (2010) 138–160], our approach is probabilistic. The derivation of our version of Nemirovski's inequality indeed relies on concentration inequalities.
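The $\ell^\infty$ case is easy to probe by simulation: for vectors with i.i.d. Rademacher coordinates, the ratio $\mathbb{E}\|S_n\|_\infty^2\big/\sum_i\mathbb{E}\|X_i\|_\infty^2$ should grow like $\log d$. The dimensions and number of replications below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, n_sim = 100, 500, 1000

acc = 0.0
for _ in range(n_sim):
    X = rng.choice([-1.0, 1.0], size=(n, d))    # rows X_i with Rademacher coordinates
    acc += np.max(np.abs(X.sum(axis=0))) ** 2   # ||S_n||_inf^2 for this replication

lhs = acc / n_sim                               # Monte Carlo estimate of E ||S_n||_inf^2
rhs = n * 1.0                                   # sum_i E ||X_i||_inf^2 (each term is 1)
print(f"ratio = {lhs / rhs:.2f}   log(d) = {np.log(d):.2f}")
```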
In a recent paper Birge and Massart (2006) have introduced the notion of minimal penalty in the context of penalized least squares for Gaussian regression. They have shown that for several model selection problems, simply multiplying by 2 the minimal penalty leads to some (nearly) optimal penalty in the sense that it approximately minimizes the resulting oracle inequality. Interestingly, the minimal penalty can be evaluated from the data themselves which leads to a data-driven choice of the penalty that one can use in practice. Unfortunately their approach heavily relies on the Gaussian nature of the stochastic framework that they consider. Our purpose in this paper is twofold: stating a heuristics to design a data-driven penalty (the slope heuristics) which is not sensitive to the Gaussian assumption as in (Birge and Massart, 2006) and proving that it works for penalized least squares random design regression. As a matter of fact, we could prove some precise mathematical results only for histogram bin-width selection. For some technical reasons which are explained in the paper, we could not work at the level of generality that we were expecting but still this is a first step towards further results and even if the mathematical results hold in some specific framework, the approach and the method that we use are indeed general.
This paper is concerned with adaptive nonparametric estimation using the Goldenshluger-Lepski selection method. This estimator selection method is based on pairwise comparisons between estimators with respect to some loss function. The method also involves a penalty term that typically needs to be large enough in order that the method works (in the sense that one can prove some oracle type inequality for the selected estimator). In the case of density estimation with kernel estimators and a quadratic loss, we show that the procedure fails if the penalty term is chosen smaller than some critical value for the penalty: the minimal penalty. More precisely we show that the quadratic risk of the selected estimator explodes when the penalty is below this critical value while it stays under control when the penalty is above this critical value. This kind of phase transition phenomenon for penalty calibration has already been observed and proved for penalized model selection methods in various contexts but appears here for the first time for the Goldenshluger-Lepski pairwise comparison method. Some simulations illustrate the theoretical results and lead to some hints on how to use the theory to calibrate the method in practice.
Discussion of “Least Angle Regression” by Efron et al. By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud. The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long time. Indeed, there has been an enormous range of contributions in model selection proposals, including work by Akaike (1973), Mallows (1973), Foster and George (1994), Birgé and Massart (2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000). Over the last decade, modern computer-driven methods have been developed such as All Subsets, Forward Selection, Forward Stagewise or Lasso. Such methods are useful in the setting of the standard linear model, where we observe noisy data and wish to predict the response variable using only a few covariates, since they automatically provide linear models that fit the data. The procedure described in this paper is, on the one hand, numerically very efficient and, on the other hand, very general, since, with slight modifications, it enables us to recover the estimates given by the Lasso and Stagewise. 1. Estimation procedure. The “LARS” method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response y. In the case of an orthogonal design, the estimates can then be viewed as an l…
We define a general V-fold cross-validation type method based on robust tests, which is an extension of the hold-out defined by Birgé [7, Section 9]. We give some theoretical results showing that, under some weak assumptions on the considered statistical procedures, our selected estimator satisfies an oracle type inequality. We also introduce a fast algorithm that implements our method. Moreover we show in our simulations that this V-fold method generally performs well for estimating a density for different sample sizes, and can handle well-known problems, such as binwidth selection for histograms or bandwidth selection for kernels. We finally provide a comparison with other classical V-fold methods and study empirically the influence of the value of V on the risk.
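As a point of comparison, here is a plain V-fold cross-validation criterion for choosing the number of bins of a histogram density estimator, based on the usual least-squares contrast rather than the robust tests of the paper; the data distribution, the value of V and the grid of dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n, V = 600, 5
data = rng.beta(2, 5, n)                       # a density supported on [0, 1]
folds = np.array_split(rng.permutation(n), V)

def hist_density(sample, D):
    counts, _ = np.histogram(sample, bins=D, range=(0.0, 1.0))
    return counts * D / len(sample)            # piecewise-constant density estimate

def evaluate(heights, D, x):
    idx = np.minimum((x * D).astype(int), D - 1)
    return heights[idx]

def vfold_score(D):
    score = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        heights = hist_density(data[train], D)
        # least-squares contrast: ||f_hat||^2 - 2 * (mean of f_hat over the test fold)
        score += np.sum(heights ** 2) / D - 2 * evaluate(heights, D, data[held_out]).mean()
    return score / V

dims = np.arange(1, 60)
print("selected number of bins:", dims[int(np.argmin([vfold_score(D) for D in dims]))])
```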
We present a new general concentration-of-measure inequality and illustrate its power by applications in random combinatorics. The results find direct applications in some problems of learning theory.
In this chapter we first review some elementary facts about tail probabilities. Then, we describe the so-called Cramér–Chernoff method. We single out two types of tail behaviors, sub-Gaussian and sub-gamma random variables. A simple useful inequality is presented for bounding the expected maximum of random variables. Hoeffding’s inequality, Bennett’s inequality and Bernstein’s inequality are shown and proved. We describe the Johnson–Lindenstrauss lemma as an interesting application of concentration of sums of independent random variables.
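The Johnson–Lindenstrauss application mentioned above can be demonstrated in a few lines: project a point cloud through a scaled Gaussian random matrix and inspect how well pairwise squared distances are preserved; all dimensions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(11)
n_points, d, k = 50, 2000, 300          # target dimension k of order log(n)/eps^2

X = rng.standard_normal((n_points, d))
A = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection, E||Ax||^2 = ||x||^2
Y = X @ A.T

def pdist_sq(Z):
    """Matrix of pairwise squared Euclidean distances."""
    G = Z @ Z.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G

iu = np.triu_indices(n_points, k=1)
ratios = pdist_sq(Y)[iu] / pdist_sq(X)[iu]
print(f"distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
```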
Positive-unlabeled learning (PU learning) is known as a special case of semi-supervised binary classification where only a fraction of positive examples are labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk, proving that the upper bound is almost optimal.
In this chapter we focus our attention on the variance of the supremum of an empirical process. In this relatively simple problem, we gain insight into some of the principal phenomena in a transparent way. In the two subsequent chapters technically more challenging exponential concentration inequalities are developed and some tools for bounding the expected value are surveyed. Our main tool is, once again, the Efron–Stein inequality. Various estimates of the variance are derived.
The purpose of this chapter is to explore further the rich connection between concentration and isoperimetry on the n-dimensional binary cube and also on $\mathbb{R}^n$, equipped with the canonical Gaussian measure. We introduce an alternative way of measuring the size of the boundary of a subset of the binary hypercube and prove a corresponding isoperimetric inequality. This isoperimetric result is a consequence of Bobkov’s inequality. One of the most important corollaries of Bobkov’s inequality is the Gaussian isoperimetric theorem. We also derive some further results on threshold widths for certain monotone sets.
Bounding the expected value of the supremum of an empirical process is a central object of the study of empirical processes and the purpose of this chapter is to present elements of this rich theory. We discuss the perhaps most important basic technique for obtaining sharp upper bounds for suprema of empirical processes, the so-called chaining argument. We investigate various scenarios, prove bounds for VC classes, Rudelson’s inequality and the Klartag–Mendelson theorem, and introduce the techniques of peeling and re-weighting.
In this chapter we prove a few inequalities known as logarithmic Sobolev inequalities and present Herbst’s argument to prove concentration inequalities. We also establish a collection of closely related results, starting from the so-called Bonami–Beckner inequality. We close this chapter by invoking Gaussian hypercontractive inequalities to prove a challenging tail bound for the largest eigenvalue of random matrices from the Gaussian unitary ensemble.
In this chapter we first review some elementary facts about tail probabilities. Then, we describe the so-called Cramér–Chernoff method. We single out two types of tail behaviors, sub-Gaussian and sub-gamma … In this chapter we first review some elementary facts about tail probabilities. Then, we describe the so-called Cramér–Chernoff method. We single out two types of tail behaviors, sub-Gaussian and sub-gamma random variables. A simple useful inequality is presented for bounding the expected maximum of random variables. Hoeffding’s inequality, Bennett’s inequality and Bernstein’s inequality are shown and proved. We describe the Johnson–Lindenstrauss lemma as an interesting application of concentration of sums of independent random variables.
<!-- *** Custom HTML *** --> Nemirovski's inequality states that given independent and centered at expectation random vectors $X_{1},\ldots,X_{n}$ with values in $\ell^p(\mathbb{R}^d)$, there exists some constant $C(p,d)$ such that … <!-- *** Custom HTML *** --> Nemirovski's inequality states that given independent and centered at expectation random vectors $X_{1},\ldots,X_{n}$ with values in $\ell^p(\mathbb{R}^d)$, there exists some constant $C(p,d)$ such that \[\mathbb{E}\Vert S_n\Vert _p^2\le C(p,d)\sum_{i=1}^{n}\mathbb{E}\Vert X_i\Vert _p^2.\] Furthermore $C(p,d)$ can be taken as $\kappa(p\wedge \log(d))$. Two cases were studied further in [ <i>Am. Math. Mon.</i> <b>117</b>(2) (2010) 138–160]: general finite-dimensional Banach spaces and the special case $\ell^{\infty}(\mathbb{R}^{d})$. We show that in these two cases, it is possible to replace the quantity $\sum_{i=1}^n\mathbb{E}\Vert X_i\Vert _p^2$ by a smaller one without changing the order of magnitude of the constant when $d$ becomes large. In the spirit of [ <i>Am. Math. Mon.</i> <b>117</b>(2) (2010) 138–160], our approach is probabilistic. The derivation of our version of Nemirovski's inequality indeed relies on concentration inequalities.
Concentration inequalities for functions of independent random variables is an area of probability theory that has witnessed a great revolution in the last few decades, and has applications in a … Concentration inequalities for functions of independent random variables is an area of probability theory that has witnessed a great revolution in the last few decades, and has applications in a wide variety of areas such as machine learning, statistics, discrete mathematics, and high-dimensional geometry. Roughly speaking, if a function of many independent random variables does not depend too much on any of the variables then it is concentrated in the sense that with high probability, it is close to its expected value. This book offers a host of inequalities to illustrate this rich theory in an accessible way by covering the key developments and applications in the field. The authors describe the interplay between the probabilistic structure (independence) and a variety of tools ranging from functional inequalities to transportation arguments to information theory. Applications to the study of empirical processes, random projections, random matrix theory, and threshold phenomena are also presented. A self-contained introduction to concentration inequalities, it includes a survey of concentration of sums of independent random variables, variance bounds, the entropy method, and the transportation method. Deep connections with isoperimetric problems are revealed whilst special attention is paid to applications to the supremum of empirical processes. Written by leading experts in the field and containing extensive exercise sections this book will be an invaluable resource for researchers and graduate students in mathematics, theoretical computer science, and engineering.
While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of … While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of these variables, much less attention has been paid to the oracle inequalities for the Lasso involving the ℓ1-norm of the target vector. Such inequalities proved in the literature show that, provided that the regularization parameter is properly chosen, the Lasso approximately mimics the deterministic Lasso. Some of them do not require any assumption at all, neither on the structure of the variables nor on the regression function. Our first purpose here is to provide a conceptually very simple result in this direction in the framework of Gaussian models with non-random regressors. Our second purpose is to propose a new estimator particularly adapted to deal with infinite countable dictionaries. This estimator is constructed as an ℓ0-penalized estimator among a sequence of Lasso estimators associated to a dyadic sequence of growing truncated dictionaries. The selection procedure automatically chooses the best level of truncation of the dictionary so as to make the best tradeoff between approximation, ℓ1-regularization and sparsity. From a theoretical point of view, we shall provide an oracle inequality satisfied by this selected Lasso estimator. The oracle inequalities presented in this paper are obtained via the application of a general theorem of model selection among a collection of nonlinear models which is a direct consequence of the Gaussian concentration inequality. The key idea that enables us to apply this general theorem is to see ℓ1-regularization as a model selection procedure among ℓ1-balls.
The Lasso has attracted the attention of many authors in recent years. While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at … The Lasso has attracted the attention of many authors in recent years. While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of these variables, much less attention has been paid to the analysis of the performance of the Lasso as a regularization algorithm. Our first purpose here is to provide a conceptually very simple result in this direction. We shall prove that, provided that the regularization parameter is properly chosen, the Lasso works almost as well as the deterministic Lasso. This result does not require any assumption at all, neither on the structure of the variables nor on the regression function. Our second purpose is to introduce a new estimator particularly adapted to deal with infinite countable dictionaries. This estimator is constructed as an l0-penalized estimator among a sequence of Lasso estimators associated to a dyadic sequence of growing truncated dictionaries. The selection procedure automatically chooses the best level of truncation of the dictionary so as to make the best tradeoff between approximation, l1-regularization and sparsity. From a theoretical point of view, we shall provide an oracle inequality satisfied by this selected Lasso estimator. The oracle inequalities established for the Lasso and the selected Lasso estimators shall enable us to derive rates of convergence on a wide class of functions, showing that these estimators perform at least as well as greedy algorithms. Besides, we shall prove that the rates of convergence achieved by the selected Lasso estimator are optimal in the orthonormal case by bounding from below the minimax risk on some Besov bodies. Finally, some theoretical results about the performance of the Lasso for infinite uncountable dictionaries will be studied in the specific framework of neural networks. All the oracle inequalities presented in this paper are obtained via the application of a single general theorem of model selection among a collection of nonlinear models which is a direct consequence of the Gaussian concentration inequality. The key idea that enables us to apply this general theorem is to see l1-regularization as a model selection procedure among l1-balls.
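For reference (standard definitions, not specific to this paper), the Lasso over a dictionary $\{\varphi_j\}$ solves a penalized least-squares problem, and the $\ell_1$ penalty can be read as model selection among $\ell_1$-balls of increasing radius:
\[
\hat\beta \in \arg\min_{\beta} \Bigl\{ \frac1n\sum_{i=1}^n \Bigl(Y_i - \sum_j \beta_j \varphi_j(x_i)\Bigr)^2 + \lambda \|\beta\|_1 \Bigr\},
\]
equivalently, for each radius $R$ one minimizes the least-squares criterion over the model $\{\beta : \|\beta\|_1 \le R\}$ and then selects $R$ through the penalty $\lambda R$.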
We prove some new concentration inequalities for self-bounding functions using the entropy method. As an application, we recover Talagrand's convex distance inequality. The new Bernstein-like inequalities for self-bounding functions are … We prove some new concentration inequalities for self-bounding functions using the entropy method. As an application, we recover Talagrand's convex distance inequality. The new Bernstein-like inequalities for self-bounding functions are derived thanks to a careful analysis of the so-called Herbst argument. The latter involves comparison results between solutions of differential inequalities that may be interesting in their own right.
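For orientation (standard definitions and bounds from this line of work, stated without the refinements of the paper): a nonnegative function $f$ of $n$ variables is self-bounding if there exist functions $f_i$ of $n-1$ variables with
\[
0 \le f(x) - f_i(x^{(i)}) \le 1 \quad\text{and}\quad \sum_{i=1}^n \bigl(f(x) - f_i(x^{(i)})\bigr) \le f(x),
\]
and for $Z = f(X_1,\dots,X_n)$ with independent $X_i$ one then obtains Bernstein-like tails of the form
\[
\mathbb{P}(Z \ge \mathbb{E}Z + t) \le \exp\Bigl(-\frac{t^2}{2\,\mathbb{E}Z + 2t/3}\Bigr), \qquad
\mathbb{P}(Z \le \mathbb{E}Z - t) \le \exp\Bigl(-\frac{t^2}{2\,\mathbb{E}Z}\Bigr).
\]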
The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this … The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes. Our main result builds on the observation made by other authors that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle using a penalized criterion. It is then possible to adapt general methods related to model selection in this framework to study two important points: (1) what is the minimum penalty and how does it compare to the penalty actually used in the SVM algorithm; (2) is it possible to obtain ``oracle inequalities'' in that setting, for the specific loss function used in the SVM algorithm? We show that the answer to the latter question is positive and provides relevant insight to the former. Our result shows that it is possible to obtain fast rates of convergence for SVMs.
In a recent paper Birge and Massart (2006) have introduced the notion of minimal penalty in the context of penalized least squares for Gaussian regression. They have shown that for … In a recent paper Birge and Massart (2006) have introduced the notion of minimal penalty in the context of penalized least squares for Gaussian regression. They have shown that for several model selection problems, simply multiplying by 2 the minimal penalty leads to some (nearly) optimal penalty in the sense that it approximately minimizes the resulting oracle inequality. Interestingly, the minimal penalty can be evaluated from the data themselves which leads to a data-driven choice of the penalty that one can use in practice. Unfortunately their approach heavily relies on the Gaussian nature of the stochastic framework that they consider. Our purpose in this paper is twofold: stating a heuristics to design a data-driven penalty (the slope heuristics) which is not sensitive to the Gaussian assumption as in (Birge and Massart, 2006) and proving that it works for penalized least squares random design regression. As a matter of fact, we could prove some precise mathematical results only for histogram bin-width selection. For some technical reasons which are explained in the paper, we could not work at the level of generality that we were expecting but still this is a first step towards further results and even if the mathematical results hold in some specific framework, the approach and the method that we use are indeed general.
Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. We propose a completely data-driven calibration algorithm … Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. We propose a completely data-driven calibration algorithm for this parameter in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birge and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general.
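A minimal sketch of the slope-heuristics recipe described above, assuming the penalty shape is proportional to the model dimension and that the empirical risk of the least-squares estimator has already been computed in each model (all names and arrays below are placeholders, not the authors' implementation):

```python
import numpy as np

def slope_heuristic_penalty(dims, emp_risk, fit_fraction=0.5):
    """Estimate the minimal penalty slope from the largest models, then double it.

    dims     : model dimensions D_m (assumed penalty shape: pen(m) = c * D_m)
    emp_risk : empirical risk of the least-squares estimator in each model
    Returns the calibrated constant 2 * c_hat of the final penalty.
    """
    dims, emp_risk = np.asarray(dims, float), np.asarray(emp_risk, float)
    # For large models the empirical risk behaves roughly like a - c_hat * D_m:
    large = dims >= np.quantile(dims, 1 - fit_fraction)
    slope, _ = np.polyfit(dims[large], emp_risk[large], 1)
    c_hat = -slope        # minimal penalty constant per unit of dimension
    return 2.0 * c_hat    # slope heuristics: optimal penalty ~ twice the minimal one

# usage sketch: pen = slope_heuristic_penalty(dims, emp_risk)
#               selected = np.argmin(emp_risk + pen * dims)
```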
Discussion of ``2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization'' by V. Koltchinskii [arXiv:0708.0083]
We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov’s analysis of the … We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov’s analysis of the risk of an ERM under margin type conditions by using concentration inequalities for conveniently weighted empirical processes. This allows us to deal with ways of measuring the “size” of a class of classifiers other than entropy with bracketing as in Tsybakov’s work. In particular, we derive new risk bounds for the ERM when the classification rules belong to some VC-class under margin conditions and discuss the optimality of these bounds in a minimax sense.
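For context, one common formulation of the margin condition referred to here, stated for the regression function $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$, is
\[
\mathbb{P}\bigl(|2\eta(X) - 1| \le t\bigr) \le C\, t^{\alpha} \quad \text{for all } t > 0,
\]
for some $\alpha > 0$: the larger $\alpha$, the less mass near the decision boundary, which is what allows rates faster than $n^{-1/2}$ for the ERM.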
A general method for obtaining moment inequalities for functions of independent random variables is presented. It is a generalization of the entropy method which has been used to derive concentration … A general method for obtaining moment inequalities for functions of independent random variables is presented. It is a generalization of the entropy method which has been used to derive concentration inequalities for such functions [Boucheron, Lugosi and Massart Ann. Probab. 31 (2003) 1583–1614], and is based on a generalized tensorization inequality due to Latała and Oleszkiewicz [Lecture Notes in Math. 1745 (2000) 147–168]. The new inequalities prove to be a versatile tool in a wide range of applications. We illustrate the power of the method by showing how it can be used to effortlessly re-derive classical inequalities including Rosenthal and Kahane–Khinchine-type inequalities for sums of independent random variables, moment inequalities for suprema of empirical processes and moment inequalities for Rademacher chaos and U-statistics. Some of these corollaries are apparently new. In particular, we generalize Talagrand's exponential inequality for Rademacher chaos of order 2 to any order. We also discuss applications for other complex functions of independent random variables, such as suprema of Boolean polynomials which include, as special cases, subgraph counting problems in random graphs.
DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL. By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud. The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long … DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL. By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud. The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long time. Indeed, there has been an enormous range of contribution in model selection proposals, including work by Akaike (1973), Mallows (1973), Foster and George (1994), Birgé and Massart (2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000). Over the last decade, modern computer-driven methods have been developed such as All Subsets, Forward Selection, Forward Stagewise or Lasso. Such methods are useful in the setting of the standard linear model, where we observe noisy data and wish to predict the response variable using only a few covariates, since they provide automatically linear models that fit the data. The procedure described in this paper is, on the one hand, numerically very efficient and, on the other hand, very general, since, with slight modifications, it enables us to recover the estimates given by the Lasso and Stagewise. 1. Estimation procedure. The “LARS” method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response y. In the case of an orthogonal design, the estimates can then be viewed as an l
We investigate a new methodology, worked out by Ledoux and Massart, to prove concentration-of-measure inequalities. The method is based on certain modified logarithmic Sobolev inequalities. We provide some very simple … We investigate a new methodology, worked out by Ledoux and Massart, to prove concentration-of-measure inequalities. The method is based on certain modified logarithmic Sobolev inequalities. We provide some very simple and general ready-to-use inequalities. One of these inequalities may be considered as an exponential version of the Efron--Stein inequality. The main purpose of this paper is to point out the simplicity and the generality of the approach. We show how the new method can recover many of Talagrand's revolutionary inequalities and provide new applications in a variety of problems including Rademacher averages, Rademacher chaos, the number of certain small subgraphs in a random graph, and the minimum of the empirical risk in some statistical estimation problems.
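The classical Efron–Stein inequality whose “exponential version” is alluded to above reads as follows: for $Z = f(X_1,\dots,X_n)$ with independent $X_i$, and $X_i'$ an independent copy of $X_i$,
\[
\operatorname{Var}(Z) \le \frac12 \sum_{i=1}^n \mathbb{E}\bigl[(Z - Z_i')^2\bigr], \qquad Z_i' = f(X_1,\dots,X_{i-1},X_i',X_{i+1},\dots,X_n).
\]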
The optimal coupling between a variable with the Bin(n,1/2) distribution and a normal random variable lies at the heart of the proof of the KMT Theorem for the empirical distribution … The optimal coupling between a variable with the Bin(n,1/2) distribution and a normal random variable lies at the heart of the proof of the KMT Theorem for the empirical distribution function. Tusnády's Lemma (published in 1977 in his dissertation and in Hungarian) provides an inequality with explicit absolute constants which says that for this coupling, the distance between the random variables remains bounded in probability. In the appendix of a joint work with Jean Bretagnolle (1989), we have proposed a proof of Tusnády's Lemma which though elementary is highly technical and considered as rather obscure, at least this is what we have understood from several conversations with motivated readers. The purpose of this paper is to provide an alternative proof which is still based on elementary computations but which we hope to be simpler and more illuminating. This new proof also leads to a slight improvement on the original result in terms of constants.
We derive a new general concentration-of-measure inequality. The concentration inequality applies, among others, to configuration functions as defined by Talagrand and also to combinatorial entropies such as the logarithm of … We derive a new general concentration-of-measure inequality. The concentration inequality applies, among others, to configuration functions as defined by Talagrand and also to combinatorial entropies such as the logarithm of the number of increasing subsequences in a random permutation and to Vapnik-Chervonenkis (VC) entropies. The results find direct applications in statistical learning theory, substantiating the possibility to use the empirical VC entropy in penalization techniques. © 2000 John Wiley & Sons, Inc. Random Struct. Alg., 16: 277–292, 2000
We propose some explicit values for the constants involved in the exponential concentration inequalities for empirical processes which are due to Talagrand. It has been shown by Ledoux that deviation … We propose some explicit values for the constants involved in the exponential concentration inequalities for empirical processes which are due to Talagrand. It has been shown by Ledoux that deviation inequalities for empirical processes could be obtained by iteration of logarithmic Sobolev type inequalities. Our approach follows closely that of Ledoux. The improvements that we get with respect to Ledoux's work are based on refinements of his entropy inequalities and computations.
The purpose of this paper is to illustrate the power of concentration inequalities by presenting some striking applications to various model selection problems in statistics. We shall study in detail two main examples. We shall consider the long-standing problem of optimal selection of a histogram (following the lines of Castellan's work on this topic), for which the structure of Talagrand's concentration inequalities for empirical processes directly influences the construction of a modified Akaike criterion. We shall also present a new model selection theorem which can be applied to improve on Vapnik's structural risk minimization method for the statistical learning problem of pattern recognition.
We prove a new exponential inequality for the Kaplan–Meier estimator of a distribution function in a right censored data model. This inequality is of the same type as the Dvoretzky–Kiefer–Wolfowitz … We prove a new exponential inequality for the Kaplan–Meier estimator of a distribution function in a right censored data model. This inequality is of the same type as the Dvoretzky–Kiefer–Wolfowitz inequality for the empirical distribution function in the non-censored case. Our approach is based on the Duhamel equation, which allows us to use empirical process theory.
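For comparison, the Dvoretzky–Kiefer–Wolfowitz inequality in the uncensored case (with the sharp constant) states that, for the empirical distribution function $F_n$ of an i.i.d. sample from $F$,
\[
\mathbb{P}\Bigl(\sup_{x\in\mathbb{R}} |F_n(x) - F(x)| > t\Bigr) \le 2\, e^{-2nt^2}, \qquad t > 0.
\]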
We present a new general concentration-of-measure inequality and illustrate its power by applications in random combinatorics. The results find direct applications in some problems of learning theory.
Let $\varphi$ be a smooth function of $k + 2$ variables. We shall investigate in this paper the rates of convergence of estimators of $T(f) = \int\varphi(f(x), f'(x), \ldots, f^{(k)}(x), x)\,dx$ … Let $\varphi$ be a smooth function of $k + 2$ variables. We shall investigate in this paper the rates of convergence of estimators of $T(f) = \int\varphi(f(x), f'(x), \ldots, f^{(k)}(x), x)\,dx$ when $f$ belongs to some class of densities of smoothness $s$. We prove that, when $s \geq 2k + \frac{1}{4}$, one can define an estimator $\hat{T}_n$ of $T(f)$, based on $n$ i.i.d. observations of density $f$ on the real line, which converges at the semiparametric rate $1/\sqrt{n}$. On the other hand, when $s < 2k + \frac{1}{4}$, $T(f)$ cannot be estimated at a rate faster than $n^{-\gamma}$ with $\gamma = 4(s - k)/(4s + 1)$. We shall also provide some extensions to the multidimensional case. Those results extend previous works of Levit, of Bickel and Ritov and of Donoho and Nussbaum on estimation of quadratic functionals.
We present a new and simple approach to some of the deviation inequalities for product measures deeply investigated by M. Talagrand in recent years. Our method is based on … We present a new and simple approach to some of the deviation inequalities for product measures deeply investigated by M. Talagrand in recent years. Our method is based on functional inequalities of Poincaré and logarithmic Sobolev type and iteration of these inequalities. In particular, we establish with these tools sharp deviation inequalities from the mean on norms of sums of independent random vectors and empirical processes. Concentration for the Hamming distance may also be deduced from this approach.
We discuss the interpretation of $C_p$-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis … We discuss the interpretation of $C_p$-plots and show how they can be calibrated in several ways. We comment on the practice of using the display as a basis for formal selection of a subset-regression model, and extend the range of application of the device to encompass arbitrary linear estimates of the regression coefficients, for example Ridge estimates.
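The statistic under discussion is Mallows' $C_p$; in its standard form, for a subset model with $p$ coefficients, residual sum of squares $\mathrm{RSS}_p$ and an estimate $\hat\sigma^2$ of the error variance,
\[
C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} - n + 2p,
\]
so that models with $C_p \approx p$ are deemed adequate.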
Density estimation is a commonly used test case for nonparametric estimation methods. We explore the asymptotic properties of estimators based on thresholding of empirical wavelet coefficients. Minimax rates of convergence … Density estimation is a commonly used test case for nonparametric estimation methods. We explore the asymptotic properties of estimators based on thresholding of empirical wavelet coefficients. Minimax rates of convergence are studied over a large range of Besov function classes $B_{\sigma pq}$ and for a range of global $L_{p'}$ error measures, $1 \leq p' < \infty$. A single wavelet threshold estimator is asymptotically minimax within logarithmic terms simultaneously over a range of spaces and error measures. In particular, when $p' > p$, some form of nonlinearity is essential, since the minimax linear estimators are suboptimal by polynomial powers of n. A second approach, using an approximation of a Gaussian white-noise model in a Mallows metric, is used to attain exactly optimal rates of convergence for quadratic error $(p' = 2)$.
Consider independent identically distributed observations whose distribution depends on a parameter $\theta$. Measure the distance between two parameter points $\theta_1, \theta_2$ by the Hellinger distance $h(\theta_1, \theta_2)$. Suppose that for … Consider independent identically distributed observations whose distribution depends on a parameter $\theta$. Measure the distance between two parameter points $\theta_1, \theta_2$ by the Hellinger distance $h(\theta_1, \theta_2)$. Suppose that for $n$ observations there is a good but not perfect test of $\theta_0$ against $\theta_n$. Then $n^{\frac{1}{2}}h(\theta_0, \theta_n)$ stays away from zero and infinity. The usual parametric examples, regular or irregular, also have the property that there are estimates $\hat{\theta}_n$ such that $n^{\frac{1}{2}}h(\hat{\theta}_n, \theta_0)$ stays bounded in probability, so that rates of separation for tests and estimates are essentially the same. The present paper shows that need not be true in general but is correct under certain metric dimensionality assumptions on the parameter set. It is then shown that these assumptions imply convergence at the required rate of the Bayes estimates or maximum probability estimates.
Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that … Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is explicitly given, indicating that optimal fast rates approaching $O(n^{-1})$ can be attained, where n is the sample size, and that the proposed classifiers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor.
We attempt to recover an unknown function from noisy, sampled data. Using orthonormal bases of compactly supported wavelets, we develop a nonlinear method which works in the wavelet domain by … We attempt to recover an unknown function from noisy, sampled data. Using orthonormal bases of compactly supported wavelets, we develop a nonlinear method which works in the wavelet domain by simple nonlinear shrinkage of the empirical wavelet coefficients. The shrinkage can be tuned to be nearly minimax over any member of a wide range of Triebel- and Besov-type smoothness constraints and asymptotically minimax over Besov bodies with $p \leq q$. Linear estimates cannot achieve even the minimax rates over Triebel and Besov classes with $p<2$, so the method can significantly outperform every linear method (e.g., kernel, smoothing spline, sieve) in a minimax sense. Variants of our method based on simple threshold nonlinear estimators are nearly minimax. Our method possesses the interpretation of spatial adaptivity; it reconstructs using a kernel which may vary in shape and bandwidth from point to point, depending on the data. Least favorable distributions for certain of the Triebel and Besov scales generate objects with sparse wavelet transforms. Many real objects have similarly sparse transforms, which suggests that these minimax results are relevant for practical problems. Sequels to this paper, which was first drafted in November 1990, discuss practical implementation, spatial adaptation properties, universal near minimaxity and applications to inverse problems.
I study the typical configuration of a $C_p$ plot when the number of variables in the regression problem is large and there are many weak effects. I show that a … I study the typical configuration of a $C_p$ plot when the number of variables in the regression problem is large and there are many weak effects. I show that a particular configuration that is very commonly seen can arise in a simple way. I give a formula by means of which the risk incurred by the "minimum $C_p$" rule can be estimated.
We derive inequalities of the form $\Delta (P, Q) \leq H(P|R) + H(Q|R)$ which hold for every choice of probability measures P, Q, R, where $H(P|R)$ denotes the relative entropy … We derive inequalities of the form $\Delta (P, Q) \leq H(P|R) + H(Q|R)$ which hold for every choice of probability measures P, Q, R, where $H(P|R)$ denotes the relative entropy of $P$ with respect to $R$ and $\Delta (P, Q)$ stands for a coupling type "distance" between $P$ and $Q$. Using the chain rule for relative entropies and then specializing to $Q$ with a given support we recover some of Talagrand's concentration of measure inequalities for product spaces.
Majorizing measures provide bounds for the supremum of stochastic processes. They represent the most general possible form of the chaining argument going back to Kolmogorov. Majorizing measures arose from the … Majorizing measures provide bounds for the supremum of stochastic processes. They represent the most general possible form of the chaining argument going back to Kolmogorov. Majorizing measures arose from the theory of Gaussian processes, but they now have applications far beyond this setting. The fundamental question is the construction of these measures. This paper focuses on the tools that have been developed for this purpose and, in particular, the use of geometric ideas. Applications are given to several natural problems where entropy methods are powerless.
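The chaining argument mentioned here yields, in its classical (Dudley) form and up to a universal constant $K$, the entropy-integral bound for a centered process $(X_t)_{t \in T}$ that is sub-Gaussian with respect to a metric $d$:
\[
\mathbb{E}\sup_{t\in T} X_t \le K \int_0^{\operatorname{diam}(T)} \sqrt{\log N(T,d,\varepsilon)}\, d\varepsilon,
\]
where $N(T,d,\varepsilon)$ denotes the covering number; majorizing measures replace this entropy integral by a sharper, measure-based functional.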
With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable knot spline, or variable bandwidth kernel, … With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the aid of an oracle offers dramatic advantages over traditional linear estimation by nonadaptive kernels; however, it is a priori unclear whether such performance can be obtained by a procedure relying on the data alone. We describe a new principle for spatially-adaptive estimation: selective wavelet reconstruction. We show that variable-knot spline fits and piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dramatically more powerful than selective wavelet reconstruction with an oracle. We develop a practical spatially adaptive method, RiskShrink, which works by shrinkage of empirical wavelet coefficients. RiskShrink mimics the performance of an oracle for selective wavelet reconstruction as well as it is possible to do so. A new inequality in multivariate normal decision theory which we call the oracle inequality shows that attained performance differs from ideal performance by at most a factor of approximately $2 \log n$, where n is the sample size. Moreover no estimator can give a better guarantee than this. Within the class of spatially adaptive procedures, RiskShrink is essentially optimal. Relying only on the data, it comes within a factor $\log^2 n$ of the performance of piecewise polynomial and variable-knot spline methods equipped with an oracle. In contrast, it is unknown how or if piecewise polynomial methods could be made to function this well when denied access to an oracle and forced to rely on data alone.
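A minimal numerical sketch (not the RiskShrink implementation) of shrinkage of empirical wavelet coefficients with a $\sqrt{2\log n}$-type threshold, using a one-level Haar transform for simplicity and assuming the noise level sigma is known; the signal and seed are arbitrary choices for the illustration.

```python
import numpy as np

def haar_level1(y):
    """One-level orthonormal Haar transform of a signal of even length."""
    s = (y[0::2] + y[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (y[0::2] - y[1::2]) / np.sqrt(2.0)   # detail coefficients
    return s, d

def inv_haar_level1(s, d):
    y = np.empty(2 * s.size)
    y[0::2] = (s + d) / np.sqrt(2.0)
    y[1::2] = (s - d) / np.sqrt(2.0)
    return y

def soft(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

rng = np.random.default_rng(0)
n, sigma = 1024, 0.3
t = np.linspace(0, 1, n)
y = np.where(t > 0.5, 1.0, 0.0) + sigma * rng.standard_normal(n)  # noisy step function

s, d = haar_level1(y)
thr = sigma * np.sqrt(2.0 * np.log(n))        # universal threshold
estimate = inv_haar_level1(s, soft(d, thr))   # shrink only the detail coefficients
```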
Discriminant analysis for two data sets in $\mathbb{R}^d$ with probability densities $f$ and $g$ can be based on the estimation of the set $G = \{x: f(x) \geq g(x)\}$. We … Discriminant analysis for two data sets in $\mathbb{R}^d$ with probability densities $f$ and $g$ can be based on the estimation of the set $G = \{x: f(x) \geq g(x)\}$. We consider applications where it is appropriate to assume that the region $G$ has a smooth boundary or belongs to another nonparametric class of sets. In particular, this assumption makes sense if discrimination is used as a data analytic tool. Decision rules based on minimization of empirical risk over the whole class of sets and over sieves are considered. Their rates of convergence are obtained. We show that these rules achieve optimal rates for estimation of $G$ and optimal rates of convergence for Bayes risks. An interesting conclusion is that the optimal rates for Bayes risks can be very fast, in particular, faster than the “parametric” root-$n$ rate. These fast rates cannot be guaranteed for plug-in rules.
Contents: Concentration functions and inequalities; Isoperimetric and functional examples; Concentration and geometry; Concentration in product spaces; Entropy and concentration; Transportation cost inequalities; Sharp bounds of Gaussian and empirical processes; Selected applications; References; Index.
The concentration of measure phenomenon in product spaces is a far-reaching abstract generalization of the classical exponential inequalities for sums of independent random variables. We attempt to explain in the … We attempt to explain in the simplest possible terms the basic concepts underlying this phenomenon, the basic method to prove concentration inequalities and the meaning of several of the most useful inequalities.
$C_p, C_L$, cross-validation and generalized cross-validation are useful data-driven techniques for selecting a good estimate from a proposed class of linear estimates. The asymptotic behaviors of these procedures are studied. … $C_p, C_L$, cross-validation and generalized cross-validation are useful data-driven techniques for selecting a good estimate from a proposed class of linear estimates. The asymptotic behaviors of these procedures are studied. Some easily interpretable conditions are derived to demonstrate the asymptotic optimality. It is argued that cross-validation and generalized cross-validation can be viewed as some special ways of applying $C_L$. Applications in nearest-neighbor nonparametric regression and in model selection are discussed in detail.
In the framework of an abstract statistical model we propose a procedure for selecting an estimator from a given family of linear estimators. We derive an upper bound on the … In the framework of an abstract statistical model we propose a procedure for selecting an estimator from a given family of linear estimators. We derive an upper bound on the risk of the selected estimator and demonstrate how this result can be used in order to develop minimax and adaptive minimax estimators in specific nonparametric estimation problems.
We address the problem of density estimation with $\mathbb{L}_s$-loss by selection of kernel estimators. We develop a selection procedure and derive corresponding $\mathbb{L}_s$-risk oracle inequalities. It is shown that the … We address the problem of density estimation with $\mathbb{L}_s$-loss by selection of kernel estimators. We develop a selection procedure and derive corresponding $\mathbb{L}_s$-risk oracle inequalities. It is shown that the proposed selection rule leads to the estimator being minimax adaptive over a scale of the anisotropic Nikol'skii classes. The main technical tools used in our derivations are uniform bounds on the $\mathbb{L}_s$-norms of empirical processes developed recently by Goldenshluger and Lepski [Ann. Probab. (2011), to appear].
This paper proves a number of inequalities which improve on existing upper limits to the probability distribution of the sum of independent random variables. The inequalities presented require knowledge … This paper proves a number of inequalities which improve on existing upper limits to the probability distribution of the sum of independent random variables. The inequalities presented require knowledge only of the variance of the sum and the means and bounds of the component random variables. They are applicable when the number of component random variables is small and/or when they have different distributions. Figures show the improvement on existing inequalities.
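These are the bounds now known as Bennett's inequality; in a standard form, for independent centered random variables $X_i \le b$ with $v = \sum_i \mathbb{E}X_i^2$,
\[
\mathbb{P}\Bigl(\sum_{i=1}^n X_i \ge t\Bigr) \le \exp\Bigl(-\frac{v}{b^2}\, h\Bigl(\frac{bt}{v}\Bigr)\Bigr), \qquad h(u) = (1+u)\log(1+u) - u,
\]
which in turn implies Bernstein's inequality $\exp\bigl(-t^2/(2v + 2bt/3)\bigr)$.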
A class $\mathcal{S}$ of subsets of a set $X$ is called a Vapnik-Cervonenkis class if the growth of the function $\Delta^{\mathcal{S}} : r \mapsto \sup\{|\{A \cap S : S \in \mathcal{S}\}| : A \subset X,\ |A| = r\}$ is polynomial; these classes … A class $\mathcal{S}$ of subsets of a set $X$ is called a Vapnik-Cervonenkis class if the growth of the function $\Delta^{\mathcal{S}} : r \mapsto \sup\{|\{A \cap S : S \in \mathcal{S}\}| : A \subset X,\ |A| = r\}$ is polynomial; these classes have proved to be useful in Statistics and Probability (see for example Vapnik, Cervonenkis [V. N. Vapnik, A. Ya. Cervonenkis, Theor. Prob. Appl., 16 (1971), 264–280], Dudley [R. M. Dudley, Ann. of Prob., 6 (1978), 899–929]).
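The polynomial growth in this definition is quantified by the Sauer–Shelah lemma: if $\mathcal{S}$ has Vapnik–Cervonenkis dimension $d$ (i.e. $\Delta^{\mathcal{S}}(r) < 2^r$ for every $r > d$), then
\[
\Delta^{\mathcal{S}}(r) \le \sum_{i=0}^{d} \binom{r}{i} \le \Bigl(\frac{er}{d}\Bigr)^{d} \quad \text{for } r \ge d.
\]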
In this paper, we study the problem of pointwise estimation of a multivariate function. We develop a general pointwise estimation procedure that is based on selection of estimators from a … In this paper, we study the problem of pointwise estimation of a multivariate function. We develop a general pointwise estimation procedure that is based on selection of estimators from a large parameterized collection. An upper bound on the pointwise risk is established and it is shown that the proposed selection procedure specialized for different collections of estimators leads to minimax and adaptive minimax estimators in various settings.
Tukey's jackknife estimate of variance for a statistic $S(X_1, X_2, \cdots, X_n)$ which is a symmetric function of i.i.d. random variables $X_i$, is investigated using an ANOVA-like decomposition of $S$. … Tukey's jackknife estimate of variance for a statistic $S(X_1, X_2, \cdots, X_n)$ which is a symmetric function of i.i.d. random variables $X_i$, is investigated using an ANOVA-like decomposition of $S$. It is shown that the jackknife variance estimate tends always to be biased upwards, a theorem to this effect being proved for the natural jackknife estimate of $\operatorname{Var} S(X_1, X_2, \cdots, X_{n-1})$ based on $X_1, X_2, \cdots, X_n$.
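In the notation above, with $S_i = S(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)$ and $\bar S = n^{-1}\sum_{i} S_i$, the upward bias can be written as the (standard) Efron–Stein inequality, recalled here for reference:
\[
\operatorname{Var}\, S(X_1,\dots,X_{n-1}) \le \mathbb{E}\sum_{i=1}^{n} (S_i - \bar S)^2 .
\]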
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients … We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
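A minimal usage sketch of the estimator described above, using the standard scikit-learn implementation; the data, sparsity pattern and regularization level below are placeholders chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                    # sparse ground truth
y = X @ beta + 0.5 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)             # alpha controls the l1 penalty
print(np.flatnonzero(model.coef_))             # many coefficients are exactly 0
```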
We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, … We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1\le p\le 2$ in the linear model when the number of variables can be much larger than the sample size.
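For reference, with design matrix $X$ and response $y$, the two estimators compared here are (standard definitions; normalization conventions vary across papers):
\[
\hat\beta^{\mathrm{Lasso}} \in \arg\min_{\beta}\ \frac{1}{n}\|y - X\beta\|_2^2 + 2\lambda \|\beta\|_1,
\qquad
\hat\beta^{\mathrm{Dantzig}} \in \arg\min\Bigl\{ \|\beta\|_1 : \tfrac1n\|X^{\top}(y - X\beta)\|_\infty \le \lambda \Bigr\}.
\]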
The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These … The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.
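The resulting criterion is what is now called BIC: among models $m$ with $D_m$ parameters and maximized likelihood $\hat L_m$ based on $n$ observations, one selects the model maximizing
\[
\log \hat L_m - \frac{D_m}{2}\,\log n .
\]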
This paper is devoted, in the main, to proving the asymptotic minimax character of the sample distribution function (d.f.) for estimating an unknown d.f. in $\mathscr{F}$ or $\mathscr{F}_c$ (defined in … This paper is devoted, in the main, to proving the asymptotic minimax character of the sample distribution function (d.f.) for estimating an unknown d.f. in $\mathscr{F}$ or $\mathscr{F}_c$ (defined in Section 1) for a wide variety of weight functions. Section 1 contains definitions and a discussion of measurability considerations. Lemma 2 of Section 2 is an essential tool in our proofs and seems to be of interest per se; for example, it implies the convergence of the moment generating function of $G_n$ to that of $G$ (definitions in (2.1)). In Section 3 the asymptotic minimax character is proved for a fundamental class of weight functions which are functions of the maximum deviation between estimating and true d.f. In Section 4 a device (of more general applicability in decision theory) is employed which yields the asymptotic minimax result for a wide class of weight functions of this character as a consequence of the results of Section 3 for weight functions of the fundamental class. In Section 5 the asymptotic minimax character is proved for a class of integrated weight functions. A more general class of weight functions for which the asymptotic minimax character holds is discussed in Section 6. This includes weight functions for which the risk function of the sample d.f. is not a constant over $\mathscr{F}_c.$ Most weight functions of practical interest are included in the considerations of Sections 3 to 6. Section 6 also includes a discussion of multinomial estimation problems for which the asymptotic minimax character of the classical estimator is contained in our results. Finally, Section 7 includes a general discussion of minimization of symmetric convex or monotone functionals of symmetric random elements, with special consideration of the "tied-down" Wiener process, and with a heuristic proof of the results of Sections 3, 4, 5, and much of Section 6.
There is a simple inequality by Pinsker between variational distance and informational divergence of probability measures defined on arbitrary probability spaces. We shall consider probability measures on sequences taken from … There is a simple inequality by Pinsker between variational distance and informational divergence of probability measures defined on arbitrary probability spaces. We shall consider probability measures on sequences taken from countable alphabets, and derive, from Pinsker's inequality, bounds on the $\bar{d}$-distance by informational divergence. Such bounds can be used to prove the "concentration of measure" phenomenon for some nonproduct distributions.
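The inequality referred to states that, for probability measures $P$ and $Q$ on the same space,
\[
\|P - Q\|_{\mathrm{TV}} \le \sqrt{\tfrac12\, D(P\|Q)},
\]
where $D(P\|Q)$ is the informational (Kullback–Leibler) divergence and $\|P - Q\|_{\mathrm{TV}} = \sup_A |P(A) - Q(A)|$ is the variational distance.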