Asymptotics for lasso-type estimators

Type: Article
Publication Date: 2000-10-01
Citations: 1338
DOI: https://doi.org/10.1214/aos/1015957397

Abstract

We consider the asymptotic behavior of regression estimators that minimize the residual sum of squares plus a penalty proportional to $\sum|\beta_j|^{\gamma}$ for some $\gamma > 0$. These estimators include the Lasso as a special case when $\gamma = 1$. Under appropriate conditions, we show that the limiting distributions can have positive probability mass at 0 when the true value of the parameter is 0. We also consider asymptotics for “nearly singular” designs.
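
As a rough numerical illustration of the penalized criterion described in the abstract, the sketch below minimizes the residual sum of squares plus $\lambda\sum|\beta_j|^{\gamma}$ on simulated data using a general-purpose optimizer; the data, tuning values and optimizer are illustrative assumptions, not the paper's own computations.

```python
# Minimal sketch of a lasso-type (bridge) estimator: minimize
# ||y - X b||^2 + lam * sum(|b_j|^gamma) on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

def bridge_objective(b, lam=5.0, gamma=1.0):
    # residual sum of squares plus lam * sum(|b_j|^gamma)
    resid = y - X @ b
    return resid @ resid + lam * np.sum(np.abs(b) ** gamma)

# gamma = 1.0 recovers the Lasso special case; other gamma > 0 can be tried
fit = minimize(bridge_objective, x0=np.zeros(p), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
print(np.round(fit.x, 3))
```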

Locations

  • The Annals of Statistics

We derive expressions for the finite-sample distribution of the Lasso estimator in the context of a linear regression model with normally distributed errors in low as well as in high dimensions by exploiting the structure of the optimization problem defining the estimator. In low dimensions we assume full rank of the regressor matrix and present expressions for the cumulative distribution function as well as the densities of the absolutely continuous parts of the estimator. Additionally, we establish an explicit formula for the correspondence between the Lasso and the least-squares estimator. We derive analogous results for the distribution in less explicit form in high dimensions where we make no assumptions on the regressor matrix at all. In this setting, we also investigate the model selection properties of the Lasso and illustrate that the set of models that may potentially be selected by the estimator can be completely independent of the observed response vector.
We derive expressions for the finite-sample distribution of the Lasso estimator in the context of a linear regression model in low as well as in high dimensions by exploiting the structure of the optimization problem defining the estimator. In low dimensions, we assume full rank of the regressor matrix and present expressions for the cumulative distribution function as well as the densities of the absolutely continuous parts of the estimator. Our results are presented for the case of normally distributed errors, but do not hinge on this assumption and can easily be generalized. Additionally, we establish an explicit formula for the correspondence between the Lasso and the least-squares estimator. We derive analogous results for the distribution in less explicit form in high dimensions where we make no assumptions on the regressor matrix at all. In this setting, we also investigate the model selection properties of the Lasso and show that possibly only a subset of models might be selected by the estimator, completely independently of the observed response vector. Finally, we present a condition for uniqueness of the estimator that is necessary as well as sufficient.
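
The explicit Lasso/least-squares correspondence mentioned above is stated in the paper for general full-rank designs; as a familiar special case (an illustration of ours, not the paper's general formula), with an orthonormal design the Lasso is simply a soft-thresholded least-squares estimator.

```python
# Orthonormal-design special case: with X'X = I and objective
# 0.5 * ||y - X b||^2 + lam * ||b||_1, the lasso equals the coordinate-wise
# soft-thresholded least-squares estimate.  Illustration only.
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

beta_ols = np.array([3.0, 0.4, -1.2, 0.05])
lam = 0.5
print(soft_threshold(beta_ols, lam))  # [ 2.5  0.  -0.7  0. ]
```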
We study the asymptotic properties of Lasso+mLS and Lasso+Ridge under the sparse high-dimensional linear regression model: Lasso selecting predictors and then modified Least Squares (mLS) or Ridge estimating their coefficients. First, we propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after Lasso+mLS and Lasso+Ridge. Second, we derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. More specifically, we show that their biases decay at an exponential rate and they can achieve the oracle convergence rate of $s/n$ (where $s$ is the number of nonzero regression coefficients and $n$ is the sample size) for mean squared error (MSE). Third, we show that Lasso+mLS and Lasso+Ridge are asymptotically normal. They have an oracle property in the sense that they can select the true predictors with probability converging to 1 and the estimates of nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance. In fact, our analysis is not limited to adopting Lasso in the selection stage, but is applicable to any other model selection criterion with an exponentially decaying rate for the probability of selecting wrong models.
In this article, we derive the asymptotic distribution of the bootstrapped Lasso estimator of the regression parameter in a multiple linear regression model. It is shown that under some mild regularity conditions on the design vectors and the regularization parameter, the bootstrap approximation converges weakly to a random measure. The convergence result rigorously establishes a previously known heuristic formula for the limit distribution of the bootstrapped Lasso estimator. It is also shown that when one or more components of the regression parameter vector are zero, the bootstrap may fail to be consistent.
We study the finite sample behavior of Lasso-based inference methods such as post double Lasso and debiased Lasso. Empirically and theoretically, we show that these methods can exhibit substantial omitted variable biases (OVBs) due to Lasso not selecting relevant controls. This phenomenon can be systematic in finite samples and occur even when the coefficients are very sparse and the sample size is large and larger than the number of controls. Therefore, relying on the existing asymptotic inference theory can be problematic in empirical applications. We compare the Lasso-based inference methods to modern high-dimensional OLS-based methods and provide practical guidance.
We develop a framework for post-selection inference with the lasso. At the core of our framework is a result that characterizes the exact (non-asymptotic) distribution of linear combinations/contrasts of truncated normal random variables. This result allows us to (i) obtain honest confidence intervals for the selected coefficients that account for the selection procedure, and (ii) devise a test statistic that has an exact (non-asymptotic) Unif(0,1) distribution when all relevant variables have been included in the model.
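
A minimal sketch of the truncated-normal pivot idea behind this framework, assuming the truncation interval implied by the selection event has already been computed (the limits below are placeholders, not the output of the actual polyhedral computation).

```python
# Conditional on selection, a linear contrast of y behaves like a normal
# variable truncated to [a, b]; its truncated CDF is Unif(0,1) under the null.
import numpy as np
from scipy.stats import truncnorm

mu0, sigma = 0.0, 1.0   # null-hypothesis mean and (known) standard deviation
a, b = 0.8, np.inf      # placeholder truncation limits from the selection event
z_obs = 1.7             # observed value of the selected contrast

pivot = truncnorm.cdf(z_obs, (a - mu0) / sigma, (b - mu0) / sigma,
                      loc=mu0, scale=sigma)
p_value = 1.0 - pivot   # one-sided post-selection p-value
print(round(p_value, 3))
```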
Large-scale empirical data, in which both the sample size and the dimension are high, often exhibit various characteristics. For example, the noise term may follow an unknown distribution, or the model may be very sparse in the sense that the number of critical variables is fixed while the dimensionality grows with $n$. We consider the model selection problem of the lasso for this kind of data. We investigate both theoretical guarantees and simulations, and show that the lasso is robust for various kinds of data.
We study the asymptotic properties of adaptive lasso estimators when some components of the parameter of interest $\beta$ are strictly different from zero, while other components may be zero or may converge to zero at rate $n^{-\delta}$, with $\delta > 0$, where $n$ denotes the sample size. To achieve this objective, we analyze the convergence/divergence rates of each term in the first-order conditions of adaptive lasso estimators. First, we derive conditions that allow selecting tuning parameters in order to ensure that adaptive lasso estimates of $n^{-\delta}$-components indeed collapse to zero. Second, in this case, we also derive asymptotic distributions of adaptive lasso estimators for nonzero components. When $\delta > 1/2$, we obtain the usual $n^{1/2}$-asymptotic normal distribution, while when $0 < \delta \le 1/2$, we show $n^{\delta}$-consistency combined with (biased) $n^{1/2-\delta}$-asymptotic normality for nonzero components. We call these properties Extended Oracle Properties. These results allow practitioners to exclude from their model the asymptotically negligible variables and make inferences on the asymptotically relevant variables.
Applying standard statistical methods after model selection may yield inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. The main issue is the fact that the post-selection distribution of the data differs from the original distribution. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and maximum likelihood inference difficult. In this work, we get around the intractable likelihood by generating noisy unbiased estimates of the post-selection score function and using them in a stochastic ascent algorithm that yields correct post-selection maximum likelihood estimates. We apply the proposed technique to the problem of estimating linear models selected by the lasso. In an asymptotic analysis the resulting estimates are shown to be consistent for the selected parameters and to have a limiting truncated normal distribution. Confidence intervals constructed based on the asymptotic distribution obtain close to nominal coverage rates in all simulation settings considered, and the point estimates are shown to be superior to the lasso estimates when the true model is sparse.
We consider estimation of conditional hazard functions and densities over the class of multivariate c\`adl\`ag functions with uniformly bounded sectional variation norm when data are either fully observed or subject to right-censoring. We demonstrate that the empirical risk minimizer is either not well-defined or not consistent for estimation of conditional hazard functions and densities. Under a smoothness assumption about the data-generating distribution, a highly-adaptive lasso estimator based on a particular data-adaptive sieve achieves the same convergence rate as has been shown to hold for the empirical risk minimizer in settings where the latter is well-defined. We use this result to study a highly-adaptive lasso estimator of a conditional hazard function based on right-censored data. We also propose a new conditional density estimator and derive its convergence rate. Finally, we show that the result is of interest also for settings where the empirical risk minimizer is well-defined, because the highly-adaptive lasso depends on a much smaller number of basis functions than the empirical risk minimizer.
A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.
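
A minimal sketch of the naïve two-step procedure on simulated data, assuming scikit-learn and statsmodels are available; the tuning parameter is illustrative and the intervals shown are exactly the "naïve" ones discussed above.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]
y = X @ beta + rng.standard_normal(n)

# step (i): lasso selects a subset of the variables
selected = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)
# step (ii): ordinary least squares on the lasso-selected set
ols = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
print(selected)
print(ols.conf_int())   # the "naive" confidence intervals
```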
Shrinkage estimation procedures such as ridge regression and the lasso have been proposed for stabilizing estimation in linear models when high collinearity exists in the design. In this paper, we consider asymptotic properties of shrinkage estimators in the case of “nearly singular” designs. I thank Hannes Leeb and Benedikt Pötscher and also the referees for their valuable comments. This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.
In a linear regression model with fixed dimension, we construct confidence sets for the unknown parameter vector based on the Lasso estimator in finite samples as well as in an asymptotic setup, thereby quantifying estimation uncertainty of this estimator. In finite samples with Gaussian errors and asymptotically in the case where the Lasso estimator is tuned to perform conservative model-selection, we derive formulas for computing the minimal coverage probability over the entire parameter space for a large class of shapes for the confidence sets, thus enabling the construction of valid confidence sets based on the Lasso estimator in these settings. The choice of shape for the confidence sets and comparison with the confidence ellipse based on the least-squares estimator is also discussed. Moreover, in the case where the Lasso estimator is tuned to enable consistent model-selection, we give a simple confidence set with minimal coverage probability converging to one.
We consider the problem of variable selection in high-dimensional statistical models where the goal is to report a set of variables, out of many predictors $X_{1},\dotsc ,X_{p}$, that are relevant to a response of interest. For the linear high-dimensional model, where the number of parameters exceeds the number of samples $(p>n)$, we propose a procedure for variable selection and prove that it controls the directional false discovery rate (FDR) below a pre-assigned significance level $q\in [0,1]$. We further analyze the statistical power of our framework and show that for designs with subgaussian rows and a common precision matrix $\Omega \in{\mathbb{R}}^{p\times p}$, if the minimum nonzero parameter $\theta_{\min }$ satisfies \[\sqrt{n}\theta_{\min }-\sigma \sqrt{2(\max_{i\in [p]}\Omega_{ii})\log \left(\frac{2p}{qs_{0}}\right)}\to \infty \,,\] then this procedure achieves asymptotic power one. Our framework is built upon the debiasing approach and assumes the standard condition $s_{0}=o(\sqrt{n}/(\log p)^{2})$, where $s_{0}$ indicates the number of true positives among the $p$ features. Notably, this framework achieves exact directional FDR control without any assumption on the amplitude of the unknown regression parameters, and does not require any knowledge of the distribution of covariates or the noise level. We test our method in synthetic and real data experiments to assess its performance and to corroborate our theoretical results.
We consider the problems of estimation and selection of parameters endowed with a known group structure, when the groups are assumed to be sign-coherent, that is, gathering either nonnegative, nonpositive or null parameters. To tackle this problem, we propose the cooperative-Lasso penalty. We derive the optimality conditions defining the cooperative-Lasso estimate for generalized linear models, and propose an efficient active set algorithm suited to high-dimensional problems. We study the asymptotic consistency of the estimator in the linear regression setup and derive its irrepresentable conditions, which are milder than the ones of the group-Lasso regarding the matching of groups with the sparsity pattern of the true parameters. We also address the problem of model selection in linear regression by deriving an approximation of the degrees of freedom of the cooperative-Lasso estimator. Simulations comparing the proposed estimator to the group and sparse group-Lasso comply with our theoretical results, showing consistent improvements in support recovery for sign-coherent groups. We finally propose two examples illustrating the wide applicability of the cooperative-Lasso: first to the processing of ordinal variables, where the penalty acts as a monotonicity prior; second to the processing of genomic data, where the set of differentially expressed probes is enriched by incorporating all the probes of the microarray that are related to the corresponding genes.
We consider statistical inference for a single coordinate of regression coefficients in high-dimensional linear models. Recently, debiased estimators have been popularly used for constructing confidence intervals and hypothesis testing in high-dimensional models. However, some representative numerical experiments show that they tend to be biased for large coefficients, especially when the number of large coefficients dominates the number of small coefficients. In this paper, we propose a modified debiased Lasso estimator based on the bootstrap, denoted BS-DB for short. We show that, under the irrepresentable condition and other mild technical conditions, the BS-DB has a smaller order of bias than the debiased Lasso in the presence of a large proportion of strong signals. If the irrepresentable condition does not hold, the BS-DB is guaranteed to perform no worse than the debiased Lasso asymptotically. Confidence intervals based on the BS-DB are proposed and proved to be asymptotically valid under mild conditions. Our study of the inference problem integrates the variable selection and estimation properties of the Lasso in a novel way. The superior performance of the BS-DB over the debiased Lasso is demonstrated via extensive numerical studies.
We propose a new class of non-convex penalties based on data depth functions for multitask sparse penalized regression. These penalties quantify the relative position of rows of the coefficient matrix from a fixed distribution centred at the origin. We derive the theoretical properties of an approximate one-step sparse estimator of the coefficient matrix using local linear approximation of the penalty function and provide an algorithm for its computation. For the orthogonal design and independent responses, the resulting thresholding rule enjoys near-minimax optimal risk performance, similar to the adaptive lasso (Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American Statistical Association, 101, 1418–1429). A simulation study and real data analysis demonstrate its effectiveness compared with some of the present methods that provide sparse solutions in multitask regression.
In this paper, a sequential change point detection method is developed to monitor structural change in smoothly clipped absolute deviation (SCAD) penalized quantile regression (SPQR) models. The asymptotic properties of the test statistic are derived under the null and alternative hypotheses. In order to improve the performance of the SPQR method, we propose a post-SCAD penalized quantile regression estimator (P-SPQR) for high-dimensional data. We examine the finite sample properties of the proposed methods via Monte Carlo studies under different scenarios. A real data application is provided to demonstrate the effectiveness of the method.
This paper is motivated by an HIV-1 drug resistance study where we encounter three analytical challenges: to analyze data with an informative subsample, to take into account the weak signals, and to detect important signals and also conduct statistical inference. We start with an initial estimation method, which adopts a penalized pairwise conditional likelihood approach for variable selection. This initial estimator incorporates the informative subsample issue. To account for the effect of weak signals, we use a key idea of partial ridge regression. We also propose a one-step estimation method for each of the signal coefficients and then construct confidence intervals accordingly. We apply the proposed method to the Stanford HIV-1 drug resistance study and compare the results with existing approaches. We also conduct comprehensive simulation studies to demonstrate the superior performance of our proposed method.
We consider the problem of estimating the parameters of a linear univariate autoregressive model with sub-Gaussian innovations from a limited sequence of consecutive observations. Assuming that the parameters are compressible, we analyze the performance of the $\ell_1$-regularized least squares as well as a greedy estimator of the parameters and characterize the sampling trade-offs required for stable recovery in the non-asymptotic regime. In particular, we show that for a fixed sparsity level, stable recovery of AR parameters is possible when the number of samples scales sub-linearly with the AR order. Our results improve over existing sampling complexity requirements in AR estimation using the LASSO, when the sparsity level scales faster than the square root of the model order. We further derive sufficient conditions on the sparsity level that guarantee the minimax optimality of the $\ell_1$-regularized least squares estimate. Applying these techniques to simulated data as well as real-world datasets from crude oil prices and traffic speed data confirms our predicted theoretical performance gains in terms of estimation accuracy and model selection.
This article introduces a new type of linear regression model with regularization. Each predictor is conditionally truncated through the presence of unknown thresholds. The new model, called the two-way truncated linear regression model (TWT-LR), is not only viewed as a nonlinear generalization of a linear model but is also a much more flexible model with greatly enhanced interpretability and applicability. The TWT-LR model performs classifications through thresholds similar to the tree-based methods and conducts inferences that are the same as the classical linear model on different segments. In addition, the innovative penalization, called the extremely thresholding penalty (ETP), is applied to thresholds. The ETP is independent of the values of regression coefficients and does not require any normalizations of regressors. The TWT-LR-ETP model detects thresholds at a wide range, including the two extreme ends where data are sparse. Under suitable conditions, both the estimators for coefficients and thresholds are consistent, with the convergence rate for threshold estimators being faster than n. Furthermore, the estimators for coefficients are asymptotically normal for fixed dimension p. It is demonstrated in simulations and real data analyses that the TWT-LR-ETP model illustrates various threshold features and provides better estimation and prediction results than existing models. Supplementary materials for this article are available online.
Variable selection plays an important role in high dimensional statistical modelling which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao have proposed the Dantzig selector using L1-regularization and showed that it achieves the ideal risk up to a logarithmic factor log(p). Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh as the factor log(p) can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated.
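
A minimal sketch of sure independence screening on simulated data: rank predictors by absolute marginal correlation with the response and keep the top $d$; the choice $d = n/\log n$ is one commonly suggested value and is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[10, 200, 1500]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

# absolute marginal correlations between each predictor and the response
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
corr = np.abs(Xs.T @ ys) / n

d = int(n / np.log(n))                    # size of the screened submodel
keep = np.argsort(corr)[::-1][:d]
print({10, 200, 1500}.issubset(int(j) for j in keep))  # True with high probability
```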
The problem of consistently estimating the sparsity pattern of a vector $\beta^* \in \mathbb{R}^p$ based on observations contaminated by noise arises in various contexts, including signal denoising, sparse approximation, compressed sensing, and model selection. We analyze the behavior of $\ell_1$-constrained quadratic programming (QP), also referred to as the Lasso, for recovering the sparsity pattern. Our main result is to establish precise conditions on the problem dimension $p$, the number $k$ of nonzero elements in $\beta^*$, and the number of observations $n$ that are necessary and sufficient for sparsity pattern recovery using the Lasso. We first analyze the case of observations made using deterministic design matrices and sub-Gaussian additive noise, and provide sufficient conditions for support recovery and $\ell_\infty$-error bounds, as well as results showing the necessity of incoherence and bounds on the minimum value. We then turn to the case of random designs, in which each row of the design is drawn from a $N(0, \Sigma)$ ensemble. For a broad class of Gaussian ensembles satisfying mutual incoherence conditions, we compute explicit values of thresholds $0 < \theta_l(\Sigma) \le \theta_u(\Sigma) < +\infty$ with the following properties: for any $\delta > 0$, if $n > 2(\theta_u + \delta)k\log(p-k)$, then the Lasso succeeds in recovering the sparsity pattern with probability converging to one for large problems, whereas for $n < 2(\theta_l - \delta)k\log(p-k)$, the probability of successful recovery converges to zero. For the special case of the uniform Gaussian ensemble ($\Sigma = I_{p\times p}$), we show that $\theta_l = \theta_u = 1$, so that the precise threshold $n = 2k\log(p-k)$ is exactly determined.
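
A quick back-of-the-envelope use of the sharp threshold $n = 2k\log(p-k)$ quoted above for the uniform Gaussian ensemble (illustrative numbers only).

```python
import math

# n = 2 * k * log(p - k): sample size at the recovery threshold
for p, k in [(1000, 10), (10000, 10), (10000, 100)]:
    print(p, k, round(2 * k * math.log(p - k), 1))
# e.g. p = 1000, k = 10 gives a threshold of roughly 138 observations
```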
We study the effective degrees of freedom of the lasso in the framework of Stein’s unbiased risk estimation (SURE). We show that the number of nonzero coefficients is an unbiased estimate for the degrees of freedom of the lasso, a conclusion that requires no special assumption on the predictors. In addition, the unbiased estimator is shown to be asymptotically consistent. With these results on hand, various model selection criteria ($C_p$, AIC and BIC) are available, which, along with the LARS algorithm, provide a principled and efficient approach to obtaining the optimal lasso fit with the computational effort of a single ordinary least-squares fit.
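
A minimal sketch of how the result is typically used: count the nonzero coefficients along a lasso path as the degrees-of-freedom estimate and plug it into a $C_p$-type criterion; scikit-learn's path routine and a known noise variance are assumptions of this illustration.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(3)
n, p, sigma2 = 100, 20, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [2.0, -1.0, 1.5, 0.5]
y = X @ beta + rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)   # coefs has shape (p, n_alphas)
rss = ((y[:, None] - X @ coefs) ** 2).sum(axis=0)
df = (coefs != 0).sum(axis=0)                      # unbiased degrees-of-freedom estimate
cp = rss / sigma2 - n + 2 * df                     # C_p along the path
best = np.argmin(cp)
print(alphas[best], df[best])
```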
This paper considers the problem of selection of weights for averaging across least squares estimates obtained from a set of models. Existing model average methods are based on exponential Akaike information criterion (AIC) and Bayesian information criterion (BIC) weights. In distinction, this paper proposes selecting the weights by minimizing a Mallows criterion, the latter an estimate of the average squared error from the model average fit. We show that our new Mallows model average (MMA) estimator is asymptotically optimal in the sense of achieving the lowest possible squared error in a class of discrete model average estimators. In a simulation experiment we show that the MMA estimator compares favorably with those based on AIC and BIC weights. The proof of the main result is an application of the work of Li (1987).
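
A minimal sketch of the Mallows model averaging idea on a small nested candidate set, assuming a known noise variance and using a generic constrained optimizer for the simplex-constrained weights (both are illustrative simplifications).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p, sigma2 = 100, 6, 1.0
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.8, 0.5, 0.0, 0.0, 0.0]) + rng.standard_normal(n)

# nested least-squares candidate models using the first m columns
fits, ks = [], []
for m in range(1, p + 1):
    Xm = X[:, :m]
    fits.append(Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0])
    ks.append(m)
fits, ks = np.column_stack(fits), np.array(ks, dtype=float)

def mallows(w):
    # ||y - sum_m w_m yhat_m||^2 + 2 * sigma^2 * sum_m w_m * k_m
    resid = y - fits @ w
    return resid @ resid + 2.0 * sigma2 * ks @ w

cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
res = minimize(mallows, np.full(p, 1.0 / p), method="SLSQP",
               bounds=[(0.0, 1.0)] * p, constraints=cons)
print(np.round(res.x, 3))   # weights concentrate on the better-fitting models
```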
We study the asymptotic properties of the adaptive Lasso estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size. We consider variable selection using the adaptive Lasso, where the L1 norms in the penalty are re-weighted by data-dependent weights. We show that, if a reasonable initial estimator is available, under appropriate conditions, the adaptive Lasso correctly selects covariates with nonzero coefficients with probability converging to one, and that the estimators of nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance. Thus, the adaptive Lasso has an oracle property in the sense of Fan and Li (2001) and Fan and Peng (2004). In addition, under a partial orthogonality condition in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, marginal regression can be used to obtain the initial estimator. With this initial estimator, the adaptive Lasso has the oracle property even when the number of covariates is much larger than the sample size.
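
A minimal sketch of the adaptive Lasso with a marginal-regression initial estimator, implemented through the usual column-rescaling trick; tuning values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

beta_init = X.T @ y / n                        # marginal regression initial estimates
w = 1.0 / (np.abs(beta_init) + 1e-8)           # data-dependent adaptive weights
gamma = Lasso(alpha=0.1).fit(X / w, y).coef_   # weighted lasso via column rescaling
beta_alasso = gamma / w                        # transform back to the original scale
print(np.flatnonzero(beta_alasso))
```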
This paper studies oracle properties of ℓ1-penalized least squares in a nonparametric regression setting with random design. We show that the penalized least squares estimator satisfies sparsity oracle inequalities, i.e., bounds in terms of the number of non-zero components of the oracle vector. The results are valid even when the dimension of the model is (much) larger than the sample size and the regression matrix is not positive definite. They can be applied to high-dimensional linear regression, to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators.
We consider the least-square linear regression problem with regularization by the l1-norm, a problem usually referred to as the Lasso. In this paper, we present a detailed asymptotic analysis of model consistency of the Lasso. For various decays of the regularization parameter, we compute asymptotic equivalents of the probability of correct model selection (i.e., variable selection). For a specific rate decay, we show that the Lasso selects all the variables that should enter the model with probability tending to one exponentially fast, while it selects all other variables with strictly positive probability. We show that this property implies that if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection. This novel variable selection algorithm, referred to as the Bolasso, is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository.
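
A minimal sketch of the Bolasso idea: run the Lasso on bootstrap resamples and intersect the selected supports; the number of replications and the tuning parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 2.0]
y = X @ beta + rng.standard_normal(n)

support = set(range(p))
for _ in range(32):                              # bootstrap replications
    idx = rng.integers(0, n, size=n)
    coef = Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_
    support &= set(np.flatnonzero(coef))         # intersect selected supports
print(sorted(support))
```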
We establish estimation and model selection consistency, prediction and estimation bounds and persistence for the group-lasso estimator and model selector proposed by Yuan and Lin (2006) for least squares problems when the covariates have a natural grouping structure. We consider the case of a fixed-dimensional parameter space with increasing sample size and the double asymptotic scenario where the model complexity changes with the sample size.
Time series analysis is widely used in the fields of economics, ecology and medicine. Robust variable selection procedures through penalized regression have been gaining increased attention. In our work, a robust penalized regression estimator based on exponential squared loss for autoregressive (AR) models is proposed and discussed. The objective model with adaptive Lasso penalty realizes variable selection and parameter estimation simultaneously. Under some regularity conditions, we establish the asymptotic and "Oracle" properties of the proposed estimator. In particular, the induced non-convex and non-differentiable mathematical programming problem offers challenges for solving algorithms. To solve this problem efficiently, we specially design a block coordinate descent (BCD) algorithm equipped with the concave-convex procedure (CCCP) and provide a convergence guarantee. Numerical simulation studies are carried out to show that the proposed method is particularly robust and applicable compared with some recent methods when there are different types of noise or different intensities of noise. Furthermore, an application on a dataset of daily minimum temperatures in Melbourne over 1981–1990 is performed.
Data simulation shows that there is a variable disturbance phenomenon in the variable selection process of the lasso. Because of this phenomenon, the step-by-step forward selection algorithm and the step-by-step backward selection algorithm cannot solve the lasso problem. The paper compares the orders of magnitude of the lasso and LARS, discusses them from the perspective of set theory, and points out the inevitability of variable disturbance in the variable selection process of the lasso, along with directions for future work.
High dimensional data are rapidly growing in many domains due to the development of technological advances which helps collect data with a large number of variables to better understand a given phenomenon of interest. Particular examples appear in genomics, fMRI data analysis, large-scale healthcare analytics, text/image analysis and astronomy. In the last two decades regularisation approaches have become the methods of choice for analysing such high dimensional data. This paper aims to study the performance of regularisation methods, including the recently proposed method called de-biased lasso, for the analysis of high dimensional data under different sparse and non-sparse situations. Our investigation concerns prediction, parameter estimation and variable selection. We particularly study the effects of correlated variables, covariate location and effect size which have not been well investigated. We find that correlated data when associated with important variables improve those common regularisation methods in all aspects, and that the level of sparsity can be reflected not only from the number of important variables but also from their overall effect size and locations. The latter may be seen under a non-sparse data structure. We demonstrate that the de-biased lasso performs well especially in low dimensional data, however it still suffers from issues, such as multicollinearity and multiple hypothesis testing, similar to the classical regression methods.
Upper and lower bounds are derived for the Gaussian mean width of a convex hull of $M$ points intersected with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one sided Restricted Isometry Property. An appealing aspect of the upper bound is that no assumption on the covariance structure of the extreme points is needed. This aspect is especially useful to study regression problems with anisotropic design distributions. We provide applications of this bound to the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.
We study the distribution of hard-, soft-, and adaptive soft-thresholding estimators within a linear regression model where the number of parameters k can depend on sample size n and may diverge with n. In addition to the case of known error-variance, we define and study versions of the estimators when the error-variance is unknown. We derive the finite-sample distribution of each estimator and study its behavior in the large-sample limit, also investigating the effects of having to estimate the variance when the degrees of freedom n-k does not tend to infinity or tends to infinity very slowly. Our analysis encompasses both the case where the estimators are tuned to perform consistent model selection and the case where the estimators are tuned to perform conservative model selection. Furthermore, we discuss consistency, uniform consistency and derive the uniform convergence rate under either type of tuning.
We study quantile trend filtering, a recently proposed method for nonparametric quantile regression, with the goal of generalizing existing risk bounds for the usual trend-filtering estimators that perform mean regression. We study both the penalized and the constrained versions, of order $r \geqslant 1$, of univariate quantile trend filtering. Our results show that both the constrained and the penalized versions of order $r \geqslant 1$ attain the minimax rate up to logarithmic factors, when the $(r-1)$th discrete derivative of the true vector of quantiles belongs to the class of bounded-variation signals. Moreover, we show that if the true vector of quantiles is a discrete spline with a few polynomial pieces, then both versions attain a near-parametric rate of convergence. Corresponding results for the usual trend-filtering estimators are known to hold only when the errors are sub-Gaussian. In contrast, our risk bounds are shown to hold under minimal assumptions on the error variables. In particular, no moment assumptions are needed and our results hold under heavy-tailed errors. Our proof techniques are general, and thus can potentially be used to study other nonparametric quantile regression methods. To illustrate this generality, we employ our proof techniques to obtain new results for multivariate quantile total-variation denoising and high-dimensional quantile linear regression.
Wasserstein distributionally robust optimization estimators are obtained as solutions of min-max problems in which the statistician selects a parameter minimizing the worst-case loss among all probability models within a certain distance (in a Wasserstein sense) from the underlying empirical measure. While motivated by the need to identify optimal model parameters or decision choices that are robust to model misspecification, these distributionally robust estimators recover a wide range of regularized estimators, including square-root lasso and support vector machines, among others, as particular cases. This paper studies the asymptotic normality of these distributionally robust estimators as well as the properties of an optimal (in a suitable sense) confidence region induced by the Wasserstein distributionally robust optimization formulation. In addition, key properties of min-max distributionally robust optimization problems are also studied; for example, we show that distributionally robust estimators regularize the loss based on its derivative, and we also derive general sufficient conditions which show the equivalence between the min-max distributionally robust optimization problem and the corresponding max-min formulation.
Asymptotic lower bounds for estimation play a fundamental role in assessing the quality of statistical procedures. In this paper we propose a framework for obtaining semi-parametric efficiency bounds for sparse high-dimensional models, where the dimension of the parameter is larger than the sample size. We adopt a semi-parametric point of view: we concentrate on one-dimensional functions of a high-dimensional parameter. We follow two different approaches to reach the lower bounds: asymptotic Cramer-Rao bounds and Le Cam's type of analysis. Both these approaches allow us to define a class of asymptotically unbiased or regular estimators for which a lower bound is derived. Consequently, we show that certain estimators obtained by de-sparsifying (or de-biasing) an $\ell_1$-penalized M-estimator are asymptotically unbiased and achieve the lower bound on the variance: thus in this sense they are asymptotically efficient. The paper discusses in detail the linear regression model and the Gaussian graphical model.
In the field of survey statistics, finite population quantities are often estimated based on complex survey data. In this thesis, estimation of the finite population total of a study variable is considered. The study variable is available for the sample and is supplemented by auxiliary information, which is available for every element in the finite population. Following a model-assisted framework, estimators are constructed that exploit the relationship which may exist between the study variable and ancillary data. These estimators have good design properties regardless of model accuracy. Nonparametric survey regression estimation is applicable in natural resource surveys where the relationship between the auxiliary information and study variable is complex and of an unknown form. Breidt, Claeskens, and Opsomer (2005) proposed a penalized spline survey regression estimator and studied its properties when the number of knots is fixed. To build on their work, the asymptotic properties of the penalized spline regression estimator are considered when the number of knots goes to infinity and the locations of the knots are allowed to change. The estimator is shown to be design consistent and asymptotically design unbiased. In the course of the proof, a result is established on the uniform convergence in probability of the survey-weighted quantile estimators. This result is obtained by deriving a survey-weighted Hoeffding inequality for bounded random variables. A variance estimator is proposed and shown to be design consistent for the asymptotic mean squared error. Simulation results demonstrate the usefulness of the asymptotic approximations. Also in natural resource surveys, a substantial amount of auxiliary information, typically derived from remotely-sensed imagery and organized in the form of spatial layers in a geographic information system (GIS), is available. Some of this ancillary data may be extraneous and a sparse model would be appropriate. Model selection methods are therefore warranted. The ‘least absolute shrinkage and selection operator’ (lasso), presented by Tibshirani (1996), conducts model selection and parameter estimation simultaneously by penalizing the sum of the absolute values of the model coefficients. A survey-weighted lasso criterion, which accounts for the sampling design, is derived and a survey-weighted lasso estimator is presented. The root-n design consistency of the estimator and a central limit theorem result are proved. Several variants of the survey-weighted lasso estimator are constructed. In particular, a calibration estimator and a ridge regression approximation estimator are constructed to produce lasso weights that can be applied to several study variables. Simulation studies show the lasso estimators are more efficient than the regression estimator when the true model is sparse. The lasso estimators are used to estimate the proportion of tree canopy cover for a region of Utah. Under a joint design-model framework, the survey-weighted lasso coefficients are shown to be root-N consistent for the parameters of the superpopulation model and a central limit theorem result is found. The methodology is applied to estimate the risk factors for the Zika virus from an epidemiological survey on the island of Yap. A logistic survey-weighted lasso regression model is fit to the data and important covariates are identified.
Fan and Li propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging, because the objective function is nondifferentiable and nonconcave. In this article, we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguished feature of the LLA algorithm is that at each LLA step, the LLA estimator can naturally adopt a sparse representation. Thus, we suggest using the one-step LLA estimator from the LLA algorithm as the final estimates. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimates enjoy the oracle properties with good initial estimators. Computationally, the one-step LLA estimation methods dramatically reduce the computational cost in maximizing the nonconcave penalized likelihood. We conduct some Monte Carlo simulation to assess the finite sample performance of the one-step sparse estimation methods. The results are very encouraging.
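
A minimal sketch of the one-step LLA idea for the SCAD penalty: weight the $\ell_1$ penalty by the SCAD derivative evaluated at an initial estimate and solve a single weighted lasso (here via column rescaling); the initial estimator, tuning value and small weight floor are illustrative assumptions rather than the article's prescriptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def scad_derivative(t, lam, a=3.7):
    # derivative of the SCAD penalty (Fan and Li) evaluated at |t|
    t = np.abs(t)
    return lam * ((t <= lam) + (t > lam) * np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

rng = np.random.default_rng(7)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)

lam = 0.3
beta_init = LinearRegression().fit(X, y).coef_               # initial estimator
w = np.maximum(scad_derivative(beta_init, lam) / lam, 1e-3)  # LLA weights (floored to avoid /0)
gamma = Lasso(alpha=lam).fit(X / w, y).coef_                 # one weighted lasso fit
beta_one_step = gamma / w
print(np.flatnonzero(np.abs(beta_one_step) > 1e-6))
```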
K-fold cross-validation (CV) is widely adopted as a model selection criterion. In K-fold CV, K − 1 folds are used for model construction and the hold-out fold is allocated to model validation. This implies model construction is more emphasised than the model validation procedure. However, some studies have revealed that more emphasis on the validation procedure may result in improved model selection. Specifically, leave-m-out CV with n samples may achieve variable-selection consistency when m/n approaches 1. In this study, a new CV method is proposed within the framework of K-fold CV. The proposed method uses K − 1 folds of the data for model validation, while the other fold is for model construction. This provides K − 1 predicted values for each observation. These values are averaged to produce a final predicted value. Then, the model selection based on the averaged predicted values can reduce variation in the assessment due to the averaging. The variable-selection consistency of the suggested method is established. Its advantage over K-fold CV with finite samples is examined under linear, non-linear, and high-dimensional models.
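A rough sketch of this reversed-emphasis CV score, assuming a linear model fit by OLS: each fold is used once for model construction, predictions are collected on the remaining folds, the K − 1 out-of-fold predictions each observation receives are averaged, and the averaged predictions are scored. The model class, random fold assignment, and squared-error scoring are illustrative assumptions; model selection would compare this score across candidate models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def averaged_prediction_cv_score(X, y, K=5, seed=0):
    """Illustrative CV score: fit on one fold, validate on the other K-1,
    then average each observation's out-of-fold predictions before scoring."""
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, K, size=n)            # random fold assignment
    pred_sum = np.zeros(n)
    pred_cnt = np.zeros(n)
    for k in range(K):
        train = fold == k                        # one small fold for fitting
        test = ~train                            # the remaining folds for validation
        model = LinearRegression().fit(X[train], y[train])
        pred_sum[test] += model.predict(X[test])
        pred_cnt[test] += 1
    avg_pred = pred_sum / pred_cnt               # each point is predicted K-1 times
    return np.mean((y - avg_pred) ** 2)
```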
In multiple regression problems when covariates can be naturally grouped, it is important to carry out feature selection at the group and within-group individual variable levels simultaneously. The existing methods, including the lasso and group lasso, are designed for either variable selection or group selection, but not for both. We propose a group bridge approach that is capable of simultaneous selection at both the group and within-group individual variable levels. The proposed approach is a penalized regularization method that uses a specially designed group bridge penalty. It has the oracle group selection property, in that it can correctly select important groups with probability converging to one. In contrast, the group lasso and group least angle regression methods in general do not possess such an oracle property in group selection. Simulation studies indicate that the group bridge has superior performance in group and individual variable selection relative to several existing methods.
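For concreteness, one common form of a group bridge criterion (a hedged reading of the "specially designed group bridge penalty" above; the group weights $c_j$ and the exact constants are assumptions) applies a bridge-type power $0<\gamma<1$ to the $\ell_1$-norm of each coefficient group $A_j$:

```latex
% Squared-error loss plus a bridge power of each group's l1-norm;
% c_j are group weights (often taken proportional to group size).
\min_{\beta}\; \|y - X\beta\|_2^2
  + \lambda \sum_{j=1}^{J} c_j \Big( \sum_{k \in A_j} |\beta_k| \Big)^{\gamma},
\qquad 0 < \gamma < 1 .
```

Shrinking a whole group norm to zero removes that group, while the inner $\ell_1$-norm can still zero out individual coefficients inside retained groups, which is the two-level selection described above.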
Given a data set with many features observed in a large number of conditions, it is desirable to fuse and aggregate conditions which are similar to ease the interpretation and extract the main characteristics of the data. This paper presents a multidimensional fusion penalty framework to address this question when the number of conditions is large. If the fusion penalty is encoded by an $\ell_q$-norm, we prove for uniform weights that the path of solutions is a tree which is suitable for interpretability. For the $\ell_1$ and $\ell_\infty$-norms, the path is piecewise linear and we derive a homotopy algorithm to recover exactly the whole tree structure. For weighted $\ell_1$-fusion penalties, we demonstrate that distance-decreasing weights lead to balanced tree structures. For a subclass of these weights that we call "exponentially adaptive", we derive an $\mathcal{O}(n\log(n))$ homotopy algorithm and we prove an asymptotic oracle property. This guarantees that we recover the underlying structure of the data efficiently both from a statistical and a computational point of view. We provide a fast implementation of the homotopy algorithm for the single feature case, as well as an efficient embedded cross-validation procedure that takes advantage of the tree structure of the path of solutions. Our proposal outperforms its competing procedures on simulations both in terms of timings and prediction accuracy. As an example we consider phenotypic data: given one or several traits, we reconstruct a balanced tree structure and assess its agreement with the known taxonomy.
Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a certain regression method, such as the ordinary least-square or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically. We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially for cases when the covariates are highly correlated and the influential variables are not overly sparse.
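A minimal sketch of the mirror-variable construction with ordinary least squares, under stated assumptions: the perturbation scale c and the FDR thresholding rule are omitted and simplified, so this only illustrates how one mirror statistic is formed, not the full GM procedure.

```python
import numpy as np

def gaussian_mirror_stat(X, y, j, c=1.0, rng=None):
    """Illustrative mirror statistic for predictor j (sketch only).

    Predictor j is replaced by the pair (x_j + c*z, x_j - c*z) with a fresh
    Gaussian perturbation z, the model is refit by OLS, and the statistic
    contrasts the two mirror coefficients: it tends to be large and positive
    for influential variables and roughly symmetric around zero for null ones.
    Assumes n > p + 1 so the augmented OLS fit is well determined.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    z = rng.normal(size=n)
    X_aug = np.column_stack([X[:, :j], X[:, j] + c * z, X[:, j] - c * z, X[:, j + 1:]])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    b_plus, b_minus = coef[j], coef[j + 1]
    return abs(b_plus + b_minus) - abs(b_plus - b_minus)
```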
We establish oracle inequalities for a version of the Lasso in high-dimensional fixed effects dynamic panel data models. The inequalities are valid for the coefficients of the dynamic and exogenous regressors. Separate oracle inequalities are derived for the fixed effects. Next, we show how one can conduct uniformly valid inference on the parameters of the model and construct a uniformly valid estimator of the asymptotic covariance matrix which is robust to conditional heteroskedasticity in the error terms. Allowing for conditional heteroskedasticity is important in dynamic models as the conditional error variance may be nonconstant over time and depend on the covariates. Furthermore, our procedure allows for inference on high-dimensional subsets of the parameter vector of an increasing cardinality. We show that the confidence bands resulting from our procedure are asymptotically honest and contract at the optimal rate. This rate is different for the fixed effects than for the remaining parts of the parameter vector.
1.1. Introduction.- 1.2. Outer Integrals and Measurable Majorants.- 1.3. Weak Convergence.- 1.4. Product Spaces.- 1.5. Spaces of Bounded Functions.- 1.6. Spaces of Locally Bounded Functions.- 1.7. The Ball Sigma-Field and Measurability of Suprema.- 1.8. Hilbert Spaces.- 1.9. Convergence: Almost Surely and in Probability.- 1.10. Convergence: Weak, Almost Uniform, and in Probability.- 1.11. Refinements.- 1.12. Uniformity and Metrization.- 2.1. Introduction.- 2.2. Maximal Inequalities and Covering Numbers.- 2.3. Symmetrization and Measurability.- 2.4. Glivenko-Cantelli Theorems.- 2.5. Donsker Theorems.- 2.6. Uniform Entropy Numbers.- 2.7. Bracketing Numbers.- 2.8. Uniformity in the Underlying Distribution.- 2.9. Multiplier Central Limit Theorems.- 2.10. Permanence of the Donsker Property.- 2.11. The Central Limit Theorem for Processes.- 2.12. Partial-Sum Processes.- 2.13. Other Donsker Classes.- 2.14. Tail Bounds.- 3.1. Introduction.- 3.2. M-Estimators.- 3.3. Z-Estimators.- 3.4. Rates of Convergence.- 3.5. Random Sample Size, Poissonization and Kac Processes.- 3.6. The Bootstrap.- 3.7. The Two-Sample Problem.- 3.8. Independence Empirical Processes.- 3.9. The Delta-Method.- 3.10. Contiguity.- 3.11. Convolution and Minimax Theorems.- A. Appendix.- A.1. Inequalities.- A.2. Gaussian Processes.- A.2.1. Inequalities and Gaussian Comparison.- A.2.2. Exponential Bounds.- A.2.3. Majorizing Measures.- A.2.4. Further Results.- A.3. Rademacher Processes.- A.4. Isoperimetric Inequalities for Product Measures.- A.5. Some Limit Theorems.- A.6. More Inequalities.- A.6.1. Binomial Random Variables.- A.6.2. Multinomial Random Vectors.- A.6.3. Rademacher Sums.- Notes.- References.- Author Index.- List of Symbols.
In this paper Chow and Robbins' (1965) sequential theory has been extended to construct a confidence region with prescribed maximum width and prescribed coverage probability for the linear regression parameters under weaker conditions than Srivastava (1967), Albert (1966), and Gleser (1965). An extension to the multivariate case has also been carried out.
Proposed by Tibshirani, the least absolute shrinkage and selection operator (LASSO) estimates a vector of regression coefficients by minimizing the residual sum of squares subject to a constraint on the $\ell_1$-norm of the coefficient vector. The LASSO estimator typically has one or more zero elements and thus shares characteristics of both shrinkage estimation and variable selection. In this article we treat the LASSO as a convex programming problem and derive its dual. Consideration of the primal and dual problems together leads to important new insights into the characteristics of the LASSO estimator and to an improved method for estimating its covariance matrix. Using these results we also develop an efficient algorithm for computing LASSO estimates which is usable even in cases where the number of regressors exceeds the number of observations. An S-Plus library based on this algorithm is available from StatLib.
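For orientation, the dual that pairs with the penalized (Lagrangian) form of the lasso is the following standard result; this is a hedged sketch of that well-known duality, not a reproduction of the article's constrained formulation or of its covariance estimator.

```latex
% Primal (penalized form) and its Lagrangian dual: u plays the role of the
% fitted residual vector, and the box constraint on X'u is what produces
% exact zeros in the primal solution.
\text{primal: } \min_{\beta}\ \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1
\qquad
\text{dual: } \max_{u}\ \tfrac12\|y\|_2^2 - \tfrac12\|y - u\|_2^2
\ \text{ subject to } \ \|X^{\top}u\|_\infty \le \lambda .
```

At the optimum $u = y - X\hat\beta$, and a coefficient $\hat\beta_j$ can be nonzero only when the corresponding dual constraint $|x_j^{\top}u| \le \lambda$ is active.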
The Cox regression model for censored survival data specifies that covariates have a proportional effect on the hazard function of the life-time distribution of an individual. In this paper we … The Cox regression model for censored survival data specifies that covariates have a proportional effect on the hazard function of the life-time distribution of an individual. In this paper we discuss how this model can be extended to a model where covariate processes have a proportional effect on the intensity process of a multivariate counting process. This permits a statistical regression analysis of the intensity of a recurrent event allowing for complicated censoring patterns and time dependent covariates. Furthermore, this formulation gives rise to proofs with very simple structure using martingale techniques for the asymptotic properties of the estimators from such a model. Finally an example of a statistical analysis is included.
We establish a new functional central limit theorem for empirical processes indexed by classes of functions. In a neighborhood of a fixed parameter point, an $n^{-1/3}$ rescaling of the parameter is compensated for by an $n^{2/3}$ rescaling of the empirical measure, resulting in a limiting Gaussian process. By means of a modified continuous mapping theorem for the location of the maximizing value, we deduce limit theorems for several statistics defined by maximization or constrained minimization of a process derived from the empirical measure. These statistics include the shorth, Rousseeuw's least median of squares estimator, Manski's maximum score estimator, and the maximum likelihood estimator for a monotone density. The limit theory depends on a simple new sufficient condition for a Gaussian process to achieve its maximum almost surely at a unique point.
Abstract : This report gives the most comprehensive and detailed treatment to date of some of the most powerful mathematical programming techniques currently known--sequential unconstrained methods for constrained minimization problems … Abstract : This report gives the most comprehensive and detailed treatment to date of some of the most powerful mathematical programming techniques currently known--sequential unconstrained methods for constrained minimization problems in Euclidean n-space--giving many new results not published elsewhere. It provides a fresh presentation of nonlinear programming theory, a detailed review of other unconstrained methods, and a development of the latest algorithms for unconstrained minimization. (Author)
The LAD estimator of the vector parameter in a linear regression is defined by minimizing the sum of the absolute values of the residuals. This paper provides a direct proof of asymptotic normality for the LAD estimator. The main theorem assumes deterministic carriers. The extension to random carriers includes the case of autoregressions whose error terms have finite second moments. For a first-order autoregression with Cauchy errors the LAD estimator is shown to converge at a 1/n rate.
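For reference, the LAD fit itself can be computed exactly through its standard linear-programming reformulation; the sketch below uses SciPy's LP solver and is an illustration of the estimator's definition, not the paper's method (the paper concerns asymptotics, not computation).

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Least-absolute-deviations fit via the standard LP reformulation.

    Each residual is written as u_i - v_i with u_i, v_i >= 0, and
    sum(u_i + v_i) is minimized subject to X beta + u - v = y.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])   # objective: sum of u and v
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])          # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)   # beta free, u, v nonnegative
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]
```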
Limit theorems for an $M$-estimate constrained to lie in a closed subset of $\mathbb{R}^d$ are given under two different sets of regularity conditions. A consistent sequence of global optimizers converges … Limit theorems for an $M$-estimate constrained to lie in a closed subset of $\mathbb{R}^d$ are given under two different sets of regularity conditions. A consistent sequence of global optimizers converges under Chernoff regularity of the parameter set. A $\sqrt n$-consistent sequence of local optimizers converges under Clarke regularity of the parameter set. In either case the asymptotic distribution is a projection of a normal random vector on the tangent cone of the parameter set at the true parameter value. Limit theorems for the optimal value are also obtained, agreeing with Chernoff's result in the case of maximum likelihood with global optimizers.
Chemometrics is a field of chemistry that studies the application of statistical methods to chemical data analysis. In addition to borrowing many techniques from the statistics and engineering literatures, chemometrics itself has given rise to several new data-analytical methods. This article examines two methods commonly used in chemometrics for predictive modeling—partial least squares and principal components regression—from a statistical perspective. The goal is to try to understand their apparent successes and in what situations they can be expected to work well and to compare them with other statistical methods intended for those situations. These methods include ordinary least squares, variable subset selection, and ridge regression.
Bridge regression, a special family of penalized regressions with penalty function $\sum|\beta_j|^{\gamma}$ for $\gamma \le 1$, is considered. A general approach to solve for the bridge estimator is developed. A new algorithm for the lasso ($\gamma = 1$) is obtained by studying the structure of the bridge estimators. The shrinkage parameter $\gamma$ and the tuning parameter $\lambda$ are selected via generalized cross-validation (GCV). Comparison between the bridge model ($\gamma \le 1$) and several other shrinkage models, namely the ordinary least squares regression ($\lambda = 0$), the lasso ($\gamma = 1$) and ridge regression ($\gamma = 2$), is made through a simulation study. It is shown that the bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of prostate cancer data. Some computational advantages and limitations are discussed.
SUMMARY We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients … SUMMARY We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
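A small illustration of the estimator's behaviour, assuming the penalized (Lagrangian) form, which for a suitable tuning value matches the constrained form described above; the simulated data and the tuning value are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only three of eight coefficients are truly nonzero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=100)

# Penalized-form lasso fit; several estimated coefficients come out exactly
# zero, which is the selection behaviour described in the abstract above.
fit = Lasso(alpha=0.5).fit(X, y)
print(fit.coef_)
```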
Maximum likelihood ratio theory contributes tremendous success to parametric inferences, due to the fundamental theory of Wilks (1938). Yet, there is no generally applicable approach for nonparametric inferences based on function estimation. Maximum likelihood ratio test statistics in general may not exist in the nonparametric function estimation setting. Even if they exist, they are hard to find and cannot be optimal, as shown in this paper. In this paper, we introduce the sieve likelihood statistics to overcome the drawbacks of nonparametric maximum likelihood ratio statistics. A new Wilks' phenomenon is unveiled. We demonstrate that the sieve likelihood statistics are asymptotically distribution free and follow $\chi^2$-distributions under the null hypotheses for a number of useful hypotheses and a variety of useful models, including Gaussian white noise models, nonparametric regression models, varying coefficient models and generalized varying coefficient models. We further demonstrate that sieve likelihood ratio statistics are asymptotically optimal in the sense that they achieve optimal rates of convergence given by Ingster (1993). They can even be adaptively optimal in the sense of Spokoiny (1996) by using a simple choice of adaptive smoothing parameter. Our work indicates that the sieve likelihood ratio statistics are indeed general and powerful for nonparametric inferences based on function estimation.