In inductive transfer learning, fine-tuning pre-trained convolutional networks substantially outperforms training from scratch. When using fine-tuning, the underlying assumption is that the pre-trained model extracts generic features, which are at least partially relevant for solving the target task, but would be difficult to extract from the limited amount of data available on the target task. However, besides the initialization with the pre-trained model and early stopping, there is no mechanism in fine-tuning for retaining the features learned on the source task. In this paper, we investigate several regularization schemes that explicitly promote the similarity of the final solution with the initial model. We show the benefit of having an explicit inductive bias towards the initial model, and we eventually recommend a simple $L^2$ penalty using the pre-trained model as a reference, as the baseline penalty for transfer learning tasks.
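A minimal sketch of such an explicit inductive bias, written here with PyTorch; the penalty strengths, the split between transferred and freshly initialized parameters, and all names are illustrative assumptions rather than the paper's exact recipe.

import torch

def l2_sp_penalty(model, reference_params, alpha=0.01, beta=0.01):
    """Quadratic penalty pulling fine-tuned weights towards the pre-trained
    reference; freshly initialized parameters (absent from the reference)
    get a plain L2 penalty instead."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if name in reference_params:
            penalty = penalty + alpha * ((param - reference_params[name]) ** 2).sum()
        else:
            penalty = penalty + beta * (param ** 2).sum()
    return penalty

During fine-tuning, this term is simply added to the task loss; reference_params would typically be a detached copy of the pre-trained weights, e.g. {n: p.detach().clone() for n, p in pretrained.named_parameters()}.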
Adaptive ridge is a special form of ridge regression, balancing the quadratic penalization on each parameter of the model. This paper shows the equivalence between adaptive ridge and lasso (least absolute shrinkage and selection operator). This equivalence states that both procedures produce the same estimate. Least absolute shrinkage can thus be viewed as a particular quadratic penalization. From this observation, we derive an EM algorithm to compute the lasso solution. We finally present a series of applications of this type of algorithm in regression problems: kernel regression, additive modeling and neural net training.
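As a sketch of this equivalence (the scaling convention for the hyper-parameter is one possible choice, not necessarily the paper's), adaptive ridge assigns an individual quadratic penalty to each coefficient under a coupling constraint,

$$\min_{\beta}\ \|y - X\beta\|^2 + \sum_{j=1}^{p} \lambda_j \beta_j^2 \qquad \text{subject to} \qquad \sum_{j=1}^{p} \frac{1}{\lambda_j} = \frac{p}{\lambda}, \quad \lambda_j > 0,$$

and minimizing over the $\lambda_j$ yields $\lambda_j \propto 1/|\beta_j|$, so the effective penalty on $\beta$ is $\frac{\lambda}{p}\big(\sum_j |\beta_j|\big)^2$: a squared $L^1$ penalty, whose solution path coincides with the lasso's up to a reparameterization of the penalty level.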
Adaptive Ridge is a special form of Ridge regression, balancing the quadratic penalization on each parameter of the model. It was shown to be equivalent to Lasso (least absolute shrinkage and selection operator), in the sense that both procedures produce the same estimate. Lasso can thus be viewed as a particular quadratic penalizer.
From this observation, we derive a fixed-point algorithm to compute the Lasso solution. The analogy also provides a new hyper-parameter for effectively tuning the model complexity. We finally present a series of possible extensions of the Lasso performing sparse regression in kernel smoothing, additive modeling and neural network training.
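A minimal sketch of such a fixed-point scheme in Python, where each iteration rebalances the quadratic penalties according to the current coefficient magnitudes and solves a weighted ridge problem; the epsilon safeguard and the stopping rule are implementation choices, not taken from the paper.

import numpy as np

def lasso_by_adaptive_ridge(X, y, lam, n_iter=100, eps=1e-8, tol=1e-6):
    """Fixed-point iteration: repeatedly solve
    min ||y - X b||^2 + lam * sum_j w_j * b_j^2
    with weights w_j rebalanced as 1/|b_j|; the fixed point matches the
    lasso estimate (small coefficients are driven towards zero)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # initial unpenalized fit
    for _ in range(n_iter):
        w = 1.0 / (np.abs(beta) + eps)               # rebalanced penalty weights
        beta_new = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta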
This paper considers the problem of estimation and variable selection for large high-dimensional data (high number of predictors p and large sample size N, without excluding the possibility that N < p) resulting from an individually matched case-control study. We develop a simple algorithm for the adaptation of the Lasso and related methods to the conditional logistic regression model. Our proposal relies on the simplification of the calculations involved in the likelihood function. Then, the proposed algorithm iteratively solves reweighted Lasso problems using cyclical coordinate descent, computed along a regularization path. This method can handle large problems and deal with sparse features efficiently. We discuss benefits and drawbacks with respect to the existing available implementations. We also illustrate the interest and use of these techniques on a pharmacoepidemiological study of medication use and traffic safety.
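To fix ideas, here is a generic cyclical coordinate-descent solver for one of the reweighted lasso subproblems; the observation weights and working responses produced by the quadratic approximation of the conditional likelihood are assumed to be given, and all names are illustrative.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_cd(X, z, w, lam, n_sweeps=50):
    """Cyclical coordinate descent for
    min_b 0.5 * sum_i w_i (z_i - x_i' b)^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = z - X @ beta                       # current residuals
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = np.sum(w * X[:, j] * r)
            beta[j] = soft_threshold(rho, lam) / np.sum(w * X[:, j] ** 2)
            r = r - X[:, j] * beta[j]      # add it back with the updated value
    return beta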
We consider the problems of estimation and selection of parameters endowed with a known group structure, when the groups are assumed to be sign-coherent, that is, gathering either nonnegative, nonpositive or null parameters. To tackle this problem, we propose the cooperative-Lasso penalty. We derive the optimality conditions defining the cooperative-Lasso estimate for generalized linear models, and propose an efficient active set algorithm suited to high-dimensional problems. We study the asymptotic consistency of the estimator in the linear regression setup and derive its irrepresentable conditions, which are milder than the ones of the group-Lasso regarding the matching of groups with the sparsity pattern of the true parameters. We also address the problem of model selection in linear regression by deriving an approximation of the degrees of freedom of the cooperative-Lasso estimator. Simulations comparing the proposed estimator to the group and sparse group-Lasso comply with our theoretical results, showing consistent improvements in support recovery for sign-coherent groups. We finally propose two examples illustrating the wide applicability of the cooperative-Lasso: first to the processing of ordinal variables, where the penalty acts as a monotonicity prior; second to the processing of genomic data, where the set of differentially expressed probes is enriched by incorporating all the probes of the microarray that are related to the corresponding genes.
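For reference, with $\mathcal{G}$ the partition of the coefficients into groups and $\beta^{+}$, $\beta^{-}$ the positive and negative parts of $\beta$, the penalty can be written as follows (this is our reading of the construction, under one common scaling convention):

$$\Omega_{\mathrm{coop}}(\beta) \;=\; \lambda \sum_{g \in \mathcal{G}} \Big( \big\|\beta_g^{+}\big\|_2 + \big\|\beta_g^{-}\big\|_2 \Big),$$

which shrinks a sign-coherent group to zero as a whole, while still allowing within-group sparsity when signs disagree.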
Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explanatory variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels.
Large data sets with many variables provide particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists. We illustrate the application of lasso methods in an analysis of the impact of prescribed drugs on the risk of a road traffic crash, using a large French nationwide database (PLoS Med 2010;7:e1000366). In the original case-control study, the authors analyzed each exposure separately. We use the lasso method, which can simultaneously perform estimation and variable selection in a single model. We compare point estimates and confidence intervals using (1) a separate logistic regression model for each drug with a Bonferroni correction and (2) lasso shrinkage logistic regression analysis. Shrinkage regression had little effect on (bias-corrected) point estimates, but led to less conservative results, noticeably for drugs with moderate levels of exposure. Carbamates, carboxamide derivative and fatty acid derivative antiepileptics, drugs used in opioid dependence, and mineral supplements of potassium showed stronger associations. The lasso is a relevant method for the analysis of databases with a large number of exposures and can be recommended as an alternative to conventional strategies.
We present a novel approach to the formulation and the resolution of sparse Linear Discriminant Analysis (LDA). Our proposal is based on penalized Optimal Scoring. It has an exact equivalence with penalized LDA, contrary to the multi-class approaches based on the regression of class indicators that have been proposed so far. Sparsity is obtained thanks to a group-Lasso penalty that selects the same features in all discriminant directions. Our experiments demonstrate that this approach generates extremely parsimonious models without compromising prediction performance. Besides prediction, the resulting sparse discriminant directions are also amenable to low-dimensional representations of data. Our algorithm is highly efficient for medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.
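Concretely, writing $B$ for the $p \times (K-1)$ matrix of discriminant coefficients and $B_{j\cdot}$ for its $j$-th row, the penalty is of the form (a sketch of our reading; the scaling is illustrative)

$$\Omega(B) \;=\; \lambda \sum_{j=1}^{p} \big\| B_{j\cdot} \big\|_2 ,$$

so that a row, i.e. a feature, is either discarded in every discriminant direction or kept in all of them.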
Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand-engineered building blocks, but their black-box nature makes them difficult to inspect in case of failure. Recent works have shown the importance of using an explicit intermediate representation, which has the benefit of increasing both the interpretability and the accuracy of the networks' decisions. Nonetheless, these camera-based networks reason in camera view, where scale is not homogeneous, and are hence not directly suitable for motion forecasting. In this paper, we introduce a novel monocular camera-only holistic end-to-end trajectory planning network with a Bird's-Eye-View (BEV) intermediate representation that comes in the form of binary Occupancy Grid Maps (OGMs). To ease the prediction of OGMs in BEV from camera images, we introduce a novel scheme where the OGMs are first predicted as semantic masks in camera view and then warped into BEV using the homography between the two planes. The key element allowing this transformation to be applied to 3D objects such as vehicles consists in predicting solely their footprint in camera view, hence respecting the flat-world hypothesis implied by the homography.
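A minimal sketch of the footprint-warping step using OpenCV; the ground-plane homography H (from camera calibration) and the BEV grid size are assumed to be given, and this is only one way to implement the warp described above.

import cv2
import numpy as np

def footprint_mask_to_bev(mask_cam, H, bev_size=(200, 200)):
    """Warp a binary footprint mask predicted in camera view into a
    Bird's-Eye-View grid, using the ground-plane homography H.
    Only footprints (pixels assumed to lie on the ground plane) are
    warped, which respects the flat-world hypothesis of the homography."""
    mask = mask_cam.astype(np.float32)
    bev = cv2.warpPerspective(mask, H, bev_size, flags=cv2.INTER_NEAREST)
    return (bev > 0.5).astype(np.uint8)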
This paper tackles the problem of endogenous link prediction for Knowledge Base completion. Knowledge Bases can be represented as directed graphs whose nodes correspond to entities and edges to relationships. Previous attempts either consist of powerful systems with high capacity to model complex connectivity patterns, which unfortunately usually end up overfitting on rare relationships, or of approaches that trade capacity for simplicity in order to fairly model all relationships, frequent or not. In this paper, we propose Tatec, a happy medium obtained by complementing a high-capacity model with a simpler one, both pre-trained separately and then combined. We present several variants of this model with different kinds of regularization and combination strategies, and show that this approach outperforms existing methods on different types of relationships by achieving state-of-the-art results on four benchmarks from the literature.
This paper proposes a new robust regression interpretation of sparse penalties such as the elastic net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our evaluation experiments demonstrate that this strategy, implemented on the elastic net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems at machine precision in the time required to get a rough estimate with competing state-of-the-art algorithms.
This discussion is a continuation of Tutz and Gertheiss (2016)’s paper, in which we focus on the importance of the coding of effects in regularized categorical and ordinal regression. We show that, although an appropriate regularization is profitable for any coding, the choice of a relevant coding can prevail over the choice of the regularization term for revealing structures. We focus on predictors, though the issues raised also apply to responses. We illustrate our point on a classic data set.
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received considerable interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements, but, as wet-lab data is typically scarce, several assays, in which the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by encoding the closeness assumption as an empirical prior or as group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches. The methods presented in this paper are embedded in the R package 'simone' from version 1.0-0 onwards.
Learning generic representations with deep networks requires massive training samples and significant computer resources. To learn a new specific task, an important issue is to transfer the generic teacher's representation to a student network. In this paper, we propose to use a metric between representations that is based on a functional view of neurons. We use optimal transport to quantify the match between two representations, yielding a distance that embeds some invariances inherent to the representation of deep networks. This distance defines a regularizer promoting the similarity of the student's representation with that of the teacher. Our approach can be used in any learning context where representation transfer is applicable. We experiment here in two standard settings: inductive transfer learning, where the teacher's representation is transferred to a student network of the same architecture for a new related task, and knowledge distillation, where the teacher's representation is transferred to a student of simpler architecture for the same task (model compression). Our approach also lends itself to solving new learning problems; we demonstrate this by showing how to directly transfer the teacher's representation to a simpler-architecture student for a new related task.
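A minimal sketch of such a distance using the POT library, treating each neuron as a point described by its activations on a common batch and putting uniform mass on neurons; the ground metric and the weighting are illustrative choices rather than the paper's exact design.

import numpy as np
import ot  # POT: Python Optimal Transport

def representation_distance(acts_teacher, acts_student):
    """Optimal-transport distance between two layer representations.
    Each input has shape (n_neurons, n_examples): one row per neuron,
    described by its activations on a shared batch of examples."""
    a = np.full(acts_teacher.shape[0], 1.0 / acts_teacher.shape[0])
    b = np.full(acts_student.shape[0], 1.0 / acts_student.shape[0])
    M = ot.dist(acts_teacher, acts_student)   # pairwise squared Euclidean costs
    return ot.emd2(a, b, M)                   # exact optimal-transport cost

Because the transport plan is free to permute neurons, the resulting regularizer is insensitive to the ordering of units, one of the invariances mentioned above.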
Non-linear performance measures are widely used for the evaluation of learning algorithms. For example, the $F$-measure is a commonly used performance measure for classification problems in the machine learning and information retrieval communities. We study the theoretical properties of a subset of non-linear performance measures, called pseudo-linear performance measures, which includes the $F$-measure and the Jaccard index, among many others. We establish that many notions of $F$-measures and of the Jaccard index are pseudo-linear functions of the per-class false negatives and false positives for binary, multiclass and multilabel classification. Based on this observation, we present a general reduction of the optimization of such performance measures to a cost-sensitive classification problem with unknown costs. We then propose an algorithm with provable guarantees to obtain an approximately optimal classifier for the $F$-measure by solving a series of cost-sensitive classification problems. The strength of our analysis is that it is valid on any dataset and for any class of classifiers, extending the existing theoretical results on pseudo-linear measures, which are asymptotic in nature. We also establish the multi-objective nature of the $F$-score maximization problem by linking the algorithm with the weighted-sum approach used in multi-objective optimization. We present numerical experiments to illustrate the relative importance of cost asymmetry and thresholding when learning linear classifiers on various $F$-measure optimization tasks.
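In the binary case, for instance, writing $P$ for the number of positive examples, the $F_\beta$-measure is a ratio of two functions that are affine in the error counts $(FN, FP)$, hence pseudo-linear:

$$F_\beta \;=\; \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \;=\; \frac{(1+\beta^2)\,(P - FN)}{(1+\beta^2)\,P - FN + FP}.$$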
The conditional logistic regression model is the standard tool for the analysis of epidemiological studies in which one or more cases (showing the event of interest) are individually matched with one or more controls (not showing the event). These situations arise, for example, in matched case-control and case-crossover studies.
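For reference, in a 1:$M$ matched design where stratum $i$ gathers one case (indexed $0$) and $M_i$ controls with covariate vectors $x_{ij}$, the conditional likelihood reads (notation assumed here)

$$L(\beta) \;=\; \prod_{i=1}^{n} \frac{\exp\big(x_{i0}^{\top}\beta\big)}{\sum_{j=0}^{M_i} \exp\big(x_{ij}^{\top}\beta\big)}.$$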
Usually, regression coefficients are estimated by maximizing the conditional log-likelihood, and variable selection is performed by conventional manual or automatic selection procedures, such as stepwise selection. These techniques are, however, unsatisfactory in sparse, high-dimensional settings, in which penalized methods, such as the lasso (least absolute shrinkage and selection operator) [Tibshirani, 1996], have emerged as an alternative. In particular, the lasso and related methods have recently been adapted to conditional logistic regression [Avalos et al., 2012b].
The R package clogitLasso brings together algorithms to estimate the parameters of conditional logistic regression models using the lasso and other sparse methods (elastic net, adaptive lasso and bootstrapped versions of the lasso). Different criteria are implemented for choosing the regularization parameter, accounting for the dependent nature of the data. Resampling methods for evaluating the stability of the selected model are also provided. The most common individually matched study designs are available.
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our experiments demonstrate that this strategy, implemented on the elastic-net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems very accurately, at machine precision, in the time required to get a rough estimate with competing state-of-the-art algorithms. We illustrate on real and artificial datasets that this accuracy is required for the correct identification of the support of the solution, which is an important element for the interpretability of sparsity-inducing penalties.
We propose a nonparametric discrimination method designed to favor the interpretability of the prediction. On the one hand, the use of a generalized additive model allows the effect of each input variable on the output variable to be represented graphically. On the other hand, the parameters of this model are estimated by penalized likelihood, where the regularization term generalizes the l1 penalty to spline functions. This penalization favors sparse solutions that select a subset of the input variables, while allowing a flexible modeling of the dependence on the selected variables. We study the adaptation of various analytical selection criteria to these models, and we evaluate them on two real data sets.
The outputs returned by support vector machines are often used as confidence measures for the classification of new examples. However, there is no theoretical foundation for this practice. Therefore, when the classification uncertainty has to be estimated, it is safer to resort to classifiers that estimate the conditional class probabilities. Here, we focus on the ambiguity in the vicinity of the decision boundary. We propose an adaptation of maximum likelihood estimation, applied to logistic regression. The proposed criterion aims at estimating the conditional probabilities accurately within a user-defined interval, and less accurately elsewhere. The model is also sparse, in the sense that few examples contribute to the solution. Computational efficiency is thereby improved compared with logistic regression. Moreover, our first experiments show improved performance over standard logistic regression, with performance similar to that of support vector machines.
Missing data can be informative. Ignoring this information can lead to misleading conclusions when the data model does not allow information to be extracted from the missing data. We propose a co-clustering model, based on the Latent Block Model, that aims to take advantage of these nonignorable nonresponses, also known as Missing Not At Random (MNAR) data. A variational expectation-maximization algorithm is derived to perform inference and a model selection criterion is presented. We assess the proposed approach in a simulation study, before using our model on the voting records of the lower house of the French Parliament, where our analysis brings out relevant groups of MPs and texts, together with a sensible interpretation of the behavior of non-voters.
We present a variational inference algorithm for the Stochastic Block Model and the Latent Block Model for sparse graphs, which leverages the sparsity of edges to scale up to a very large number of nodes. This algorithm is implemented as a Python package, called sparsebm, which can handle graphs with millions of nodes.
In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this type of supervised hierarchical classification. It utilizes the knowledge of the hierarchy to assign each example not only to a class but also to all encompassing superclasses. Applicable to any feedforward architecture with a softmax output layer, this loss is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property allows us to simultaneously pursue consistent classification objectives between superclasses and fine-grained classes, and eliminates the need for a performance trade-off between different granularities. We conduct an experimental study on three reference benchmarks, in which we vary the size of the training sets to cover a diverse set of learning scenarios. Our approach does not entail any significant additional computational cost compared with the cross-entropy loss. It improves accuracy and reduces the number of coarse errors, that is, predicted labels that are distant from the ground-truth labels in the tree.
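A minimal numpy sketch of one way to implement such a loss, assuming superclass probabilities are obtained by summing the softmax probabilities of their leaves; the mapping from the true leaf to its ancestors and the equal weighting of levels are illustrative assumptions.

import numpy as np

def hierarchical_log_loss(logits, leaf_label, ancestor_leaf_sets):
    """Sum of negative log-probabilities of the true node at every level
    of the tree, where each superclass probability is the sum of the
    softmax probabilities of the leaves it contains."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()                              # softmax over fine-grained classes
    loss = -np.log(p[leaf_label])                # fine-grained term
    for leaves in ancestor_leaf_sets:            # one term per ancestor superclass
        loss += -np.log(p[list(leaves)].sum())
    return loss

Each term is a log loss on a coarsening of the leaf distribution, so its expectation is minimized by the true posterior probabilities, consistent with the proper-scoring-rule property stated above.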
Statistical learning aims at predicting, but also at analyzing or interpreting a phenomenon. We propose to guide the learning process by incorporating knowledge about how the similarities between examples are organized. This knowledge is represented by a pyramid of kernels, a tree-like structure that organizes distinct groups and subgroups of similarities. If we can assume that few (groups of) similarities are relevant for discriminating between the observations, our approach brings out the relevant groups and subgroups of similarities. We propose here the first complete solution to this problem, allowing a support vector machine (SVM) to be learned on kernel pyramids of arbitrary height. The weights of the (groups of) similarities are learned jointly with the SVM parameters, by optimizing a criterion that we show to be a variational formulation of a mixed-norm regularized problem. We illustrate our approach on a facial expression recognition problem, where image features are described by a pyramid representing the spatial organization and scale of wavelet filters applied to image patches.
In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this …
In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this type of supervised hierarchical classification. It utilizes the knowledge of the hierarchy to assign each example not only to a class but also to all encompassing superclasses. Applicable to any feedforward architecture with a softmax output layer, this loss is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property allows us to simultaneously pursue consistent classification objectives between superclasses and fine-grained classes, and eliminates the need for a performance trade-off between different granularities. We conduct an experimental study on three reference benchmarks, in which we vary the size of the training sets to cover a diverse set of learning scenarios. Our approach does not entail any significant additional computational cost compared with the loss of cross-entropy. It improves accuracy and reduces the number of coarse errors, with predicted labels that are distant from ground-truth labels in the tree.
Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand engineered …
Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand engineered building blocks but their black-box nature makes them difficult to delve in case of failure. Recent works have shown the importance of using an explicit intermediate representation that has the benefits of increasing both the interpretability and the accuracy of networks' decisions. Nonetheless, these camera-based networks reason in camera view where scale is not homogeneous and hence not directly suitable for motion forecasting. In this paper, we introduce a novel monocular camera-only holistic end-to-end trajectory planning network with a Bird-Eye-View (BEV) intermediate representation that comes in the form of binary Occupancy Grid Maps (OGMs). To ease the prediction of OGMs in BEV from camera images, we introduce a novel scheme where the OGMs are first predicted as semantic masks in camera view and then warped in BEV using the homography between the two planes. The key element allowing this transformation to be applied to 3D objects such as vehicles, consists in predicting solely their footprint in camera-view, hence respecting the flat world hypothesis implied by the homography.
Missing data can be informative. Ignoring this information can lead to misleading conclusions when the data model does not allow information to be extracted from the missing data. We propose …
Missing data can be informative. Ignoring this information can lead to misleading conclusions when the data model does not allow information to be extracted from the missing data. We propose a co-clustering model, based on the Latent Block Model, that aims to take advantage of this nonignorable nonresponses, also known as Missing Not At Random data (MNAR). A variational expectation-maximization algorithm is derived to perform inference and a model selection criterion is presented. We assess the proposed approach on a simulation study, before using our model on the voting records from the lower house of the French Parliament, where our analysis brings out relevant groups of MPs and texts, together with a sensible interpretation of the behavior of non-voters.
Nous presentons un algorithme d'inference variationnelle pour les modeles a blocs stochastiques et a blocs latents pour les graphes creux, qui tire parti de la rarete des aretes pour passer …
Nous presentons un algorithme d'inference variationnelle pour les modeles a blocs stochastiques et a blocs latents pour les graphes creux, qui tire parti de la rarete des aretes pour passer a l'echelle sur un tres grand nombre de noeuds. Cet algorithme est implemente sous la forme d'un module python, appele sparsebm, qui peut traiter des graphes comportant des millions de noeuds. Abstract. We present a variational inference algorithm for the Stochastic Block Model and the Latent Block Model for sparse graphs, which leverages on the sparsity of edges to scale up to a very large number of nodes. This algorithm is implemented as a python package, called sparsebm, which can handle graphs with millions of nodes.
Learning generic representations with deep networks requires massive training samples and significant computer resources. To learn a new specific task, an important issue is to transfer the generic teacher's representation …
Learning generic representations with deep networks requires massive training samples and significant computer resources. To learn a new specific task, an important issue is to transfer the generic teacher's representation to a student network. In this paper, we propose to use a metric between representations that is based on a functional view of neurons. We use optimal transport to quantify the match between two representations, yielding a distance that embeds some invariances inherent to the representation of deep networks. This distance defines a regularizer promoting the similarity of the student's representation with that of the teacher. Our approach can be used in any learning context where representation transfer is applicable. We experiment here on two standard settings: inductive transfer learning, where the teacher's representation is transferred to a student network of same architecture for a new related task, and knowledge distillation, where the teacher's representation is transferred to a student of simpler architecture for the same task (model compression). Our approach also lends itself to solving new learning problems; we demonstrate this by showing how to directly transfer the teacher's representation to a simpler architecture student for a new related task.
In inductive transfer learning, fine-tuning pre-trained convolutional networks substantially outperforms training from scratch. When using fine-tuning, the underlying assumption is that the pre-trained model extracts generic features, which are at …
In inductive transfer learning, fine-tuning pre-trained convolutional networks substantially outperforms training from scratch. When using fine-tuning, the underlying assumption is that the pre-trained model extracts generic features, which are at least partially relevant for solving the target task, but would be difficult to extract from the limited amount of data available on the target task. However, besides the initialization with the pre-trained model and the early stopping, there is no mechanism in fine-tuning for retaining the features learned on the source task. In this paper, we investigate several regularization schemes that explicitly promote the similarity of the final solution with the initial model. We show the benefit of having an explicit inductive bias towards the initial model, and we eventually recommend a simple $L^2$ penalty with the pre-trained model being a reference as the baseline of penalty for transfer learning tasks.
This discussion is a continuation of Tutz and Gertheiss (2016)’s paper, where we focus on the importance of the coding of effects in regularized categorical and ordinal regression. We show …
This discussion is a continuation of Tutz and Gertheiss (2016)’s paper, where we focus on the importance of the coding of effects in regularized categorical and ordinal regression. We show that, though that an appropriate regularization is profitable for any coding, the choice of a relevant coding can prevail over the one of the regularization term for revealing structures. We focus on predictors though the issues raised also apply to responses. We illustrate our point on a classic data set.
This paper considers the problem of estimation and variable selection for large high-dimensional data (high number of predictors p and large sample size N, without excluding the possibility that N …
This paper considers the problem of estimation and variable selection for large high-dimensional data (high number of predictors p and large sample size N, without excluding the possibility that N < p) resulting from an individually matched case-control study. We develop a simple algorithm for the adaptation of the Lasso and related methods to the conditional logistic regression model. Our proposal relies on the simplification of the calculations involved in the likelihood function. Then, the proposed algorithm iteratively solves reweighted Lasso problems using cyclical coordinate descent, computed along a regularization path. This method can handle large problems and deal with sparse features efficiently. We discuss benefits and drawbacks with respect to the existing available implementations. We also illustrate the interest and use of these techniques on a pharmacoepidemiological study of medication use and traffic safety.
Non-linear performance measures are widely used for the evaluation of learning algorithms. For example, $F$-measure is a commonly used performance measure for classification problems in machine learning and information retrieval …
Non-linear performance measures are widely used for the evaluation of learning algorithms. For example, $F$-measure is a commonly used performance measure for classification problems in machine learning and information retrieval community. We study the theoretical properties of a subset of non-linear performance measures called pseudo-linear performance measures which includes $F$-measure, \emph{Jaccard Index}, among many others. We establish that many notions of $F$-measures and \emph{Jaccard Index} are pseudo-linear functions of the per-class false negatives and false positives for binary, multiclass and multilabel classification. Based on this observation, we present a general reduction of such performance measure optimization problem to cost-sensitive classification problem with unknown costs. We then propose an algorithm with provable guarantees to obtain an approximately optimal classifier for the $F$-measure by solving a series of cost-sensitive classification problems. The strength of our analysis is to be valid on any dataset and any class of classifiers, extending the existing theoretical results on pseudo-linear measures, which are asymptotic in nature. We also establish the multi-objective nature of the $F$-score maximization problem by linking the algorithm with the weighted-sum approach used in multi-objective optimization. We present numerical experiments to illustrate the relative importance of cost asymmetry and thresholding when learning linear classifiers on various $F$-measure optimization tasks.
This paper tackles the problem of endogenous link prediction for Knowledge Base completion. Knowledge Bases can be represented as directed graphs whose nodes correspond to entities and edges to relationships. …
This paper tackles the problem of endogenous link prediction for Knowledge Base completion. Knowledge Bases can be represented as directed graphs whose nodes correspond to entities and edges to relationships. Previous attempts either consist of powerful systems with high capacity to model complex connectivity patterns, which unfortunately usually end up overfitting on rare relationships, or in approaches that trade capacity for simplicity in order to fairly model all relationships, frequent or not. In this paper, we propose Tatec a happy medium obtained by complementing a high-capacity model with a simpler one, both pre-trained separately and then combined. We present several variants of this model with different kinds of regularization and combination strategies and show that this approach outperforms existing methods on different types of relationships by achieving state-of-the-art results on four benchmarks of the literature.
The conditional logistic regression model is the standard tool for the analysis of epidemiological studies in which one or more cases (the event of interest), are individually matched with one …
The conditional logistic regression model is the standard tool for the analysis of epidemiological studies in which one or more cases (the event of interest), are individually matched with one or more controls (not showing the event). These situations arise, for example, in matched case-control and case-crossover studies.
Usually, regression coefficients are estimated by maximizing the conditional log-likelihood function and variable selection is performed by conventional manual or automatic selection procedures, such as stepwise. These techniques are, however, unsatisfactory in sparse, high-dimensional settings in which penalized methods, such as the lasso (least absolute shrinkage and selection operator) [Tibshirani, 1996], have emerged as an alternative. In particular, the lasso and related methods have recently been adapted to conditional logistic regression [Avalos et al., 2012b].
The R package clogitLasso brings together algorithms to estimate parameters of conditional logistic regression models using lasso, other sparse methods (elastic net, adaptive lasso and bootstrapped versions of lasso). Different criteria are implemented for choosing the regularization term accounting for the dependent nature of data. Resampling methods for evaluating the stability of the selected model are proposed. The most common individually matched study designs are available.
This paper proposes a new robust regression interpretation of sparse penalties such as the elastic net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach …
This paper proposes a new robust regression interpretation of sparse penalties such as the elastic net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our evaluation experiments demonstrate that this strategy, implemented on the elastic net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems at machine precision in the time required to get a rough estimate with competing state-of-the-art algorithms.
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a …
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our experiments demonstrate that this strategy, implemented on the elastic-net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems very accurately, at machine precision, in the time required to get a rough estimate with competing state-of-the-art algorithms. We illustrate on real and artificial datasets that this accuracy is required to for the correctness of the support of the solution, which is an important element for the interpretability of sparsity-inducing penalties.
Large data sets with many variables provide particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists.We illustrate the application …
Large data sets with many variables provide particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists.We illustrate the application of lasso methods in an analysis of the impact of prescribed drugs on the risk of a road traffic crash, using a large French nationwide database (PLoS Med 2010;7:e1000366). In the original case-control study, the authors analyzed each exposure separately. We use the lasso method, which can simultaneously perform estimation and variable selection in a single model. We compare point estimates and confidence intervals using (1) a separate logistic regression model for each drug with a Bonferroni correction and (2) lasso shrinkage logistic regression analysis.Shrinkage regression had little effect on (bias corrected) point estimates, but led to less conservative results, noticeably for drugs with moderate levels of exposure. Carbamates, carboxamide derivative and fatty acid derivative antiepileptics, drugs used in opioid dependence, and mineral supplements of potassium showed stronger associations.Lasso is a relevant method in the analysis of databases with large number of exposures and can be recommended as an alternative to conventional strategies.
We consider the problems of estimation and selection of parameters endowed with a known group structure, when the groups are assumed to be sign-coherent, that is, gathering either nonnegative, nonpositive …
We consider the problems of estimation and selection of parameters endowed with a known group structure, when the groups are assumed to be sign-coherent, that is, gathering either nonnegative, nonpositive or null parameters. To tackle this problem, we propose the cooperative-Lasso penalty. We derive the optimality conditions defining the cooperative-Lasso estimate for generalized linear models, and propose an efficient active set algorithm suited to high-dimensional problems. We study the asymptotic consistency of the estimator in the linear regression setup and derive its irrepresentable conditions, which are milder than the ones of the group-Lasso regarding the matching of groups with the sparsity pattern of the true parameters. We also address the problem of model selection in linear regression by deriving an approximation of the degrees of freedom of the cooperative-Lasso estimator. Simulations comparing the proposed estimator to the group and sparse group-Lasso comply with our theoretical results, showing consistent improvements in support recovery for sign-coherent groups. We finally propose two examples illustrating the wide applicability of the cooperative-Lasso: first to the processing of ordinal variables, where the penalty acts as a monotonicity prior; second to the processing of genomic data, where the set of differentially expressed probes is enriched by incorporating all the probes of the microarray that are related to the corresponding genes.
We present a novel approach to the formulation and the resolution of sparse Linear Discriminant Analysis (LDA). Our proposal, is based on penalized Optimal Scoring. It has an exact equivalence …
We present a novel approach to the formulation and the resolution of sparse Linear Discriminant Analysis (LDA). Our proposal, is based on penalized Optimal Scoring. It has an exact equivalence with penalized LDA, contrary to the multi-class approaches based on the regression of class indicator that have been proposed so far. Sparsity is obtained thanks to a group-Lasso penalty that selects the same features in all discriminant directions. Our experiments demonstrate that this approach generates extremely parsimonious models without compromising prediction performances. Besides prediction, the resulting sparse discriminant directions are also amenable to low-dimensional representations of data. Our algorithm is highly efficient for medium to large number of variables, and is thus particularly well suited to the analysis of gene expression data.
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a …
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our experiments demonstrate that this strategy, implemented on the elastic-net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems very accurately, at machine precision, in the time required to get a rough estimate with competing state-of-the-art algorithms. We illustrate on real and artificial datasets that this accuracy is required to for the correctness of the support of the solution, which is an important element for the interpretability of sparsity-inducing penalties.
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received a high interest for the discovery of biological networks. The literature focuses on …
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received a high interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements, but, as wetlab data is typically scarce, several assays, where the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by rendering the closeness assumption into an empirical prior or group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches. The methods presented in this paper are embeded in the R package 'simone' from version 1.0-0 and later.
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received a high interest for the discovery of biological networks. The literature focuses on …
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received a high interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements, but, as wetlab data is typically scarce, several assays, where the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by rendering the closeness assumption into an empirical prior or group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches. The methods presented in this paper are embeded in the R package 'simone' from version 1.0-0 and later.
L'apprentissage statistique vise a predire, mais aussi analyser ou interpreter un phenomene. Nous proposons de guider le processus d'apprentissage en integrant une connaissance relative a la facon dont sont organisees …
L'apprentissage statistique vise a predire, mais aussi analyser ou interpreter un phenomene. Nous proposons de guider le processus d'apprentissage en integrant une connaissance relative a la facon dont sont organisees les similarites entre exemples. La connaissance est representee par une de noyaux, une structure arborescente qui permet d'organiser des groupes et sous-groupes distincts de similarites. Si nous pouvons faire l'hypothese que peu de (groupes de) similarites sont pertinentes pour discriminer les observations, notre approche fait emerger les groupes et sous-groupes de similarites pertinentes. Nous proposons ici la premiere solution complete a ce probleme, permettant l'apprentissage d'un separateur a vaste marge (SVM) sur des pyramides de noyaux de hauteur arbitraire. Les ponderations des (groupes de) similarites sont apprises conjointement avec les parametres du SVM, par optimisation d'un critere que nous montrons etre une formulation variationnelle d'un probleme regularise par une norme mixte. Nous illustrons notre approche sur un probleme de reconnaissance d'expressions faciales, ou les caracteristiques des images sont decrites par une pyramide representant l'organisation spatiale et l'echelle des filtres d'ondelettes appliques sur des patchs d'images.
Les resultats retournes par les separateurs a vaste marge sont souvent utilises comme mesures de confiance pour la classification de nouveaux exemples. Cependant, il n'y a pas de fondement theorique …
Les resultats retournes par les separateurs a vaste marge sont souvent utilises comme mesures de confiance pour la classification de nouveaux exemples. Cependant, il n'y a pas de fondement theorique a cette pratique. C'est pourquoi, lorsque l'incertitude de classification doit etre estimee, il est plus sur d'avoir recours a des classifieurs qui estiment les probabilites conditionnelles des classes. Ici, nous nous concentrons sur l'ambiguite a proximite de la frontiere de decision. Nous proposons une adaptation de l'estimation par maximum de vraisemblance, appliquee a la regression logistique. Le critere propose vise a estimer les probabilites conditionnelles, de maniere precise a l'interieur d'un intervalle defini par l'utilisateur, et moins precise ailleurs. Le modele est aussi parcimonieux, dans le sens ou peu d'exemples contribuent a la solution. L'efficacite du calcul est ainsi amelioree par rapport a la regression logistique. De plus, nos premieres experiences montrent une amelioration des performances par rapport a la regression logistique standard, avec des performances similaires a celles des separateurs a vaste marge.
Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a …
Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels.
Nous proposons une méthode de discrimination non paramétrique conçue pour favoriser l'interprétabilité de la prédiction.D'une part, l'utilisation d'un modèle additif généralisé permet de représenter graphiquement l'effet de chaque variable d'entrée …
Nous proposons une méthode de discrimination non paramétrique conçue pour favoriser l'interprétabilité de la prédiction.D'une part, l'utilisation d'un modèle additif généralisé permet de représenter graphiquement l'effet de chaque variable d'entrée sur la variable de sortie.D'autre part, les paramètres de ce modèle sont estimés par vraisemblance pénalisée, où le terme de régularisation généralise la pénalisation l1 aux fonctions splines.Cette pénalisation favorise les solutions parcimonieuses sélectionnant une partie de l'ensemble des variables d'entrée, tout en permettant une modélisation flexible de la dépendance sur les variables sélectionnées.Nous étudions l'adaptation de différents critères de sélection analytiques à ces modèles, et nous les évaluons sur deux jeux de données réelles.
Adaptive Ridge is a special form of Ridge regression, balancing the quadratic penalization on each parameter of the model. It was shown to be equivalent to Lasso (least absolute shrinkage and selection operator), in the sense that both procedures produce the same estimate. Lasso can thus be viewed as a particular quadratic penalizer.
From this observation, we derive a fixed-point algorithm to compute the Lasso solution. The analogy also provides a new hyper-parameter for effectively tuning the model complexity. We finally present a series of possible extensions of the lasso performing sparse regression in kernel smoothing, additive modeling and neural net training.
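The fixed-point view above (and the EM view in the following abstract) can be made concrete as an iteratively reweighted ridge scheme: each pass solves a ridge problem whose per-coefficient quadratic penalties are rebalanced from the previous iterate, so that the balanced quadratic penalty mimics the $\ell_1$ penalty. The Python sketch below is a generic reading of that idea, not the authors' exact algorithm; the eps smoothing and the final thresholding are numerical conveniences added here.

import numpy as np

def lasso_by_reweighted_ridge(X, y, lam, n_iter=100, eps=1e-8):
    """Approximate the lasso solution of (1/2)||y - X b||^2 + lam * ||b||_1
    by iteratively reweighted ridge regressions (adaptive-ridge fixed point)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares starting point
    for _ in range(n_iter):
        d = lam / (np.abs(beta) + eps)            # per-coefficient quadratic weights
        beta = np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)
    beta[np.abs(beta) < 1e-6] = 0.0               # make the induced sparsity visible
    return beta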
Adaptive ridge is a special form of ridge regression, balancing the quadratic penalization on each parameter of the model. This paper shows the equivalence between adaptive ridge and lasso (least absolute shrinkage and selection operator). This equivalence states that both procedures produce the same estimate. Least absolute shrinkage can thus be viewed as a particular quadratic penalization. From this observation, we derive an EM algorithm to compute the lasso solution. We finally present a series of applications of this type of algorithm in regression problems: kernel regression, additive modeling and neural net training.
SUMMARY We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
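For reference, the estimator described in this abstract solves the constrained least-squares problem

$$\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le t,$$

or equivalently, in Lagrangian form, $\arg\min_\beta \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, with the penalty parameter $\lambda$ in one-to-one correspondence with the bound $t$.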
In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If $\mathsf{L}$ is the data used to derive the sequence $\{\hat{\mu}(\cdot, s)\}$, the procedure is called unstable if a small change in $\mathsf{L}$ can cause large changes in $\{\hat{\mu}(\cdot, s)\}$. With a crystal ball, one could pick the predictor in $\{\hat{\mu}(\cdot, s)\}$ having minimum prediction error. Without prescience, one uses test sets, cross-validation and so forth. The difference in prediction error between the crystal ball selection and the statistician's choice we call predictive loss. For an unstable procedure the predictive loss is large. This is shown by some analytics in a simple case and by simulation results in a more complex comparison of four different linear regression methods. Unstable procedures can be stabilized by perturbing the data, getting a new predictor sequence $\{\hat{\mu}'(\cdot, s)\}$ and then averaging over many such predictor sequences.
Abstract Proposed by Tibshirani, the least absolute shrinkage and selection operator (LASSO) estimates a vector of regression coefficients by minimizing the residual sum of squares subject to a constraint on the $\ell_1$-norm of the coefficient vector. The LASSO estimator typically has one or more zero elements and thus shares characteristics of both shrinkage estimation and variable selection. In this article we treat the LASSO as a convex programming problem and derive its dual. Consideration of the primal and dual problems together leads to important new insights into the characteristics of the LASSO estimator and to an improved method for estimating its covariance matrix. Using these results we also develop an efficient algorithm for computing LASSO estimates which is usable even in cases where the number of regressors exceeds the number of observations. An S-Plus library based on this algorithm is available from StatLib.
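The dual referred to here is usually stated for the Lagrangian form of the problem; up to the parametrization adopted in the article, it reads

$$\max_{u}\ \tfrac12\|y\|_2^2 - \tfrac12\|y - u\|_2^2 \quad\text{subject to}\quad \|X^{\mathsf T}u\|_\infty \le \lambda,$$

with the primal-dual link $\hat u = y - X\hat\beta$. The constraints that are active at the solution, $|x_j^{\mathsf T}\hat u| = \lambda$, single out the variables that may carry nonzero coefficients, which is what the primal-dual analysis exploits.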
Summary We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations with the multifactor analysis-of-variance problem as the most important and well-known example. Instead of selecting factors by stepwise backward elimination, we focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection. The lasso, the LARS algorithm and the non-negative garrotte are recently proposed regression methods that can be used to select individual variables. We study and propose efficient algorithms for the extensions of these methods for factor selection and show that these extensions give superior performance to the traditional stepwise backward elimination method in factor selection problems. We study the similarities and the differences between these methods. Simulations and real examples are used to illustrate the methods.
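The grouped extension of the lasso penalty studied in this line of work is commonly written as

$$\hat\beta = \arg\min_{\beta}\ \tfrac12\Bigl\|y - \sum_{g=1}^{G} X_g\beta_g\Bigr\|_2^2 + \lambda\sum_{g=1}^{G}\sqrt{p_g}\,\|\beta_g\|_2,$$

where $\beta_g$ collects the $p_g$ coefficients of factor $g$. Because $\|\beta_g\|_2$ is non-differentiable only at $\beta_g = 0$, entire groups are set to zero at once, while coefficients within a retained group are merely shrunk.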
The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
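A minimal usage sketch of the lasso-path modification of LARS is given below, using scikit-learn's implementation on simulated data (a convenience assumption; the paper describes the algorithm itself, not this package).

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(100)

# method="lasso" applies the LARS modification that traces the full lasso path
alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)        # order in which variables enter the active set
print(coefs.shape)   # (n_features, n_steps): one column of coefficients per step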
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include $\ell_1$ (the lasso), $\ell_2$ (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
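A bare-bones version of the cyclical coordinate-descent update underlying these algorithms is sketched below for the Gaussian case, with objective $(1/2n)\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. The production implementations add warm starts along a regularization path, active-set heuristics and sparse-matrix handling, none of which are shown here.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-8):
    """Cyclical coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()          # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * ||x_j||^2
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # correlation of x_j with the partial residual (variable j removed)
            rho = (X[:, j] @ r) / n + col_sq[j] * old
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            if beta[j] != old:
                r -= X[:, j] * (beta[j] - old)   # keep the residual up to date
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta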
Selection of covariates is among the most controversial and difficult tasks in epidemiologic analysis. Correct variable selection addresses the problem of confounding in etiologic research and allows unbiased estimation of probabilities in prognostic studies. The aim of this commentary is to assess how often different variable selection techniques were applied in contemporary epidemiologic analysis. It was of particular interest to see whether modern methods such as shrinkage or penalized regression were used in recent publications. Stepwise selection methods remained the predominant method for variable selection in publications in epidemiological journals in 2008. Shrinkage methods were not used in any of the reviewed articles. Editors, reviewers and authors have insufficiently promoted the new, less controversial approaches of variable selection in the biomedical literature, whereas statisticians may not have adequately addressed the method's feasibility.
Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes.
In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation.
This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are irrepresentable (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.
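In the notation commonly used for this result, with $S$ the support of the true coefficient vector and $C = X^{\mathsf T}X/n$ partitioned into the blocks $C_{SS}$ and $C_{S^{c}S}$, the (strong) Irrepresentable Condition requires, elementwise and for some $\eta > 0$,

$$\bigl|\,C_{S^{c}S}\,C_{SS}^{-1}\,\operatorname{sign}(\beta_S)\,\bigr| \le \mathbf{1} - \eta,$$

that is, the predictors outside the true model must not be too well represented, in this signed regression sense, by the predictors inside it.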
This paper investigates correct variable selection in finite samples via ℓ1 and ℓ1+ℓ2 type penalization schemes. The asymptotic consistency of variable selection immediately follows from this analysis. We focus on logistic and linear regression models. The following questions are central to our paper: given a level of confidence 1−δ, under which assumptions on the design matrix, for which strength of the signal and for what values of the tuning parameters can we identify the true model at the given level of confidence? Formally, if Î is an estimate of the true variable set I*, we study conditions under which ℙ(Î=I*)≥1−δ, for a given sample size n, number of parameters M and confidence 1−δ. We show that in identifiable models, both methods can recover coefficients of size $\frac{1}{\sqrt{n}}$, up to small multiplicative constants and logarithmic factors in M and $\frac{1}{\delta}$. The advantage of the ℓ1+ℓ2 penalization over the ℓ1 is minor for the variable selection problem, for the models we consider here. Whereas the former estimates are unique, and become more stable for highly correlated data matrices as one increases the tuning parameter of the ℓ2 part, too large an increase in this parameter value may preclude variable selection.
Summary We propose a new criterion for model selection in prediction problems. The covariance inflation criterion adjusts the training error by the average covariance of the predictions and responses, when the prediction rule is applied to permuted versions of the data set. This criterion can be applied to general prediction problems (e.g. regression or classification) and to general prediction rules (e.g. stepwise regression, tree-based models and neural nets). As a by-product we obtain a measure of the effective number of parameters used by an adaptive procedure. We relate the covariance inflation criterion to other model selection procedures and illustrate its use in some regression and classification problems. We also revisit the conditional bootstrap approach to model selection.
The Lasso is an attractive technique for regularization and variable selection for high-dimensional data, where the number of predictor variables $p_n$ is potentially much larger than the number of samples $n$. However, it was recently discovered that the sparsity pattern of the Lasso estimator can only be asymptotically identical to the true sparsity pattern if the design matrix satisfies the so-called irrepresentable condition. The latter condition can easily be violated in the presence of highly correlated variables. Here we examine the behavior of the Lasso estimators if the irrepresentable condition is relaxed. Even though the Lasso cannot recover the correct sparsity pattern, we show that the estimator is still consistent in the $\ell_2$-norm sense for fixed designs under conditions on (a) the number $s_n$ of nonzero components of the vector $\beta_n$ and (b) the minimal singular values of design matrices that are induced by selecting small subsets of variables. Furthermore, a rate of convergence result is obtained on the $\ell_2$ error with an appropriate choice of the smoothing parameter. The rate is shown to be optimal under the condition of bounded maximal and minimal sparse eigenvalues. Our results imply that, with high probability, all important variables are selected. The set of selected variables is a meaningful reduction on the original set of variables. Finally, our results are illustrated with the detection of closely adjacent frequencies, a problem encountered in astrophysics.
We consider the asymptotic behavior of regression estimators that minimize the residual sum of squares plus a penalty proportional to $\sum|\beta_j|^{\gamma}$, for some $\gamma > 0$. These estimators include the Lasso as a special case when $\gamma = 1$. Under appropriate conditions, we show that the limiting distributions can have positive probability mass at 0 when the true value of the parameter is 0. We also consider asymptotics for "nearly singular" designs.
Abstract Some research studies in the medical literature use multiple stepwise variable selection (SVS) algorithms to build multivariable models. The purpose of this study is to determine whether the use of multiple SVS algorithms in tandem (stepwise agreement) is a valid variable selection procedure. Computer simulations were developed to address stepwise agreement. Three popular SVS algorithms were tested (backward elimination, forward selection, and stepwise) on three statistical methods (linear, logistic, and Cox proportional hazards regression). Other simulation parameters explored were the sample size, number of predictors considered, degree of correlation between pairs of predictors, p-value-based entrance and exit criteria, predictor type (normally distributed or binary), and differences between stepwise agreement between any two or all three algorithms. Among stepwise methods, the rate of agreement, agreement on a model including only those predictors truly associated with the outcome, and agreement on a model containing the predictors truly associated with the outcome were measured. These rates were dependent on all simulation parameters. Mostly, the SVS algorithms agreed on a final model, but rarely on a model with only the true predictors. Sample size and candidate predictor pool size are the most influential simulation conditions. To conclude, stepwise agreement is often a poor strategy that gives misleading results and researchers should avoid using multiple SVS algorithms to build multivariable models. More research on the relationship between sample size and variable selection is needed.
After screening out inappropriate or doubtful covariates on the basis of background knowledge, one may still be left with many potential confounders. It is then tempting to use statistical variable-selection methods to reduce the number used for adjustment. Nonetheless, there is no agreement on how selection should be conducted, and it is well known that conventional selection methods lead to confidence intervals that are too narrow and p values that are too small. Furthermore, theory and simulation evidence have found no selection method to be uniformly superior to adjusting for all well-measured confounders. Nonetheless, control of all measured confounders can lead to problems for conventional model-fitting methods. When these problems occur, one can apply modern techniques such as shrinkage estimation, exposure modeling, or hybrids that combine outcome and exposure modeling. No selection or special software is needed for most of these techniques. It thus appears that statistical confounder selection may be an unnecessary complication in most regression analyses of effects.
A case-control design involving only cases may be used when brief exposure causes a transient change in risk of a rare acute-onset disease. The design resembles a retrospective nonrandomized crossover study but differs in having only a sample of the base population-time. The average incidence rate ratio for a hypothesized effect period following the exposure is estimable using the Mantel-Haenszel estimator. The duration of the effect period is assumed to be that which maximizes the rate ratio estimate. Self-matching of cases eliminates the threat of control-selection bias and increases efficiency. Pilot data from a study of myocardial infarction onset illustrate the control of within-individual confounding due to temporal association of exposures.
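With each case serving as its own stratum, the Mantel-Haenszel rate-ratio estimator mentioned here takes the usual person-time form (stated for reference; the notation is ours, not the article's):

$$\widehat{RR}_{\mathrm{MH}} = \frac{\sum_i a_i\, N_{0i}/T_i}{\sum_i c_i\, N_{1i}/T_i},$$

where, for case $i$, $a_i$ and $c_i$ indicate exposure and non-exposure during the hypothesized effect period at onset, $N_{1i}$ and $N_{0i}$ are the exposed and unexposed person-time contributed over the reference window, and $T_i = N_{1i} + N_{0i}$.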
Applied researchers frequently use automated model selection methods, such as backwards variable elimination, to develop parsimonious regression models. Statisticians have criticized the use of these methods for several reasons, amongst them are the facts that the estimated regression coefficients are biased and that the derived confidence intervals do not have the advertised coverage rates. We developed a method to improve estimation of regression coefficients and confidence intervals which employs backwards variable elimination in multiple bootstrap samples. In a given bootstrap sample, predictor variables that are not selected for inclusion in the final regression model have their regression coefficient set to zero. Regression coefficients are averaged across the bootstrap samples, and non-parametric percentile bootstrap confidence intervals are then constructed for each regression coefficient. We conducted a series of Monte Carlo simulations to examine the performance of this method for estimating regression coefficients and constructing confidence intervals for variables selected using backwards variable elimination. We demonstrated that this method results in confidence intervals with superior coverage compared with those developed from conventional backwards variable elimination. We illustrate the utility of our method by applying it to a large sample of subjects hospitalized with a heart attack.
Summary We introduce a path following algorithm for L1-regularized generalized linear models. The L1-regularization procedure is useful especially because it, in effect, selects variables according to the amount of penalization on the L1-norm of the coefficients, in a manner that is less greedy than forward selection–backward deletion. The generalized linear model path algorithm efficiently computes solutions along the entire regularization path by using the predictor–corrector method of convex optimization. Selecting the step length of the regularization parameter is critical in controlling the overall accuracy of the paths; we suggest intuitive and flexible strategies for choosing appropriate values. We demonstrate the implementation with several simulated and real data sets.
In the study of the association of transient drug exposures with acute outcomes, the case-crossover design is an efficient alternative to the case-control approach. This design based exclusively on the case series uses within-subject comparisons of drug exposures over time to estimate the rate ratio of the outcome associated with the drug under study. This design inherently removes the biasing effects of unmeasured, time-invariant confounding factors from the estimated rate ratio, but is sensitive to several assumptions. We illustrated the case-crossover design and explored its sensitivity using data from 4028 cases of gastrointestinal bleeding from the General Practice Research Database in assessing the effects of the drug warfarin. We compared the use of different time window lengths to assess exposure and considered the use of a case-time-control design to account for exposure time trends. The case-crossover approach found no excess risk of bleeding with warfarin exposure [rate ratio 0.98; 95% confidence interval (CI): 0.74-1.28] using a 1-month time window. When we restricted the analysis to subjects with truly transient drug exposure, defined by 1 to 3 prescriptions in the previous year, the rate ratio was 2.59 (95% CI: 1.42-4.74). To consider the longer 1-year exposure time window, the case-time-control approach was used and resulted in a rate ratio of 1.72 (95% CI: 1.08-2.43). In conclusion, the case-crossover design is potentially a powerful approach to assess the risk of drugs. This design is, however, highly sensitive to assumptions about intermittency of drug use and the length of the exposure time window, as demonstrated with the example of bleeding associated with warfarin use.
The case-crossover design uses cases only, and compares exposures just prior to the event times to exposures at comparable control, or 'referent' times, in order to assess the effect of short-term exposure on the risk of a rare event. It has commonly been used to study the effect of air pollution on the risk of various adverse health events. Proper selection of referents is crucial, especially with air pollution exposures, which are shared, highly seasonal, and often have a long-term time trend. Hence, careful referent selection is important to control for time-varying confounders, and in order to ensure that the distribution of exposure is constant across referent times, a key assumption of this method. Yet the referent strategy is important for a more basic reason: the conditional logistic regression estimating equations commonly used are biased when referents are not chosen a priori and are functions of the observed event times. We call this bias in the estimating equations overlap bias. In this paper, we propose a new taxonomy of referent selection strategies in order to emphasize their statistical properties. We give a derivation of overlap bias, explore its magnitude, and consider how the bias depends on properties of the exposure series. We conclude that the bias is usually small, though highly unpredictable, and easily avoided.
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate nonsmooth norms. Optimization with Sparsity-Inducing Penalties presents optimization tools and techniques dedicated to such sparsity-inducing penalties from a general perspective. It covers proximal methods, block-coordinate descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provides an extensive set of experiments to compare various algorithms from a computational point of view. The presentation of Optimization with Sparsity-Inducing Penalties is essentially based on existing literature, but the process of constructing a general framework leads naturally to new results, connections and points of view. It is an ideal reference on the topic for anyone working in machine learning and related areas.
Abstract In this article, we consider bootstrapping the Lasso estimator of the regression parameter in a multiple linear regression model. It is known that the standard bootstrap method fails to be consistent. Here, we propose a modified bootstrap method, and show that it provides valid approximation to the distribution of the Lasso estimator, for all possible values of the unknown regression parameter vector, including the case where some of the components are zero. Further, we establish consistency of the modified bootstrap method for estimating the asymptotic bias and variance of the Lasso estimator. We also show that the residual bootstrap can be used to consistently estimate the distribution and variance of the adaptive Lasso estimator. Using the former result, we formulate a novel data-based method for choosing the optimal penalizing parameter for the Lasso using the modified bootstrap. A numerical study is performed to investigate the finite sample performance of the modified bootstrap. The methodology proposed in the article is illustrated with a real data example. Keywords: bootstrap variance estimation; penalized regression; regularization; shrinkage.
Abstract Bridge regression, a special family of penalized regressions with penalty function $\sum_j|\beta_j|^{\gamma}$ for $\gamma \le 1$, is considered. A general approach to solve for the bridge estimator is developed. A new algorithm for the lasso ($\gamma = 1$) is obtained by studying the structure of the bridge estimators. The shrinkage parameter $\gamma$ and the tuning parameter $\lambda$ are selected via generalized cross-validation (GCV). Comparison between the bridge model ($\gamma \le 1$) and several other shrinkage models, namely ordinary least squares regression ($\lambda = 0$), the lasso ($\gamma = 1$) and ridge regression ($\gamma = 2$), is made through a simulation study. It is shown that bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of a prostate cancer data set. Some computational advantages and limitations are discussed.
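The penalty family in question, together with the special cases compared in the simulation study, is

$$\hat\beta = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\sum_j |\beta_j|^{\gamma}, \qquad \gamma \le 1 \ (\text{bridge}), \quad \gamma = 1 \ (\text{lasso}), \quad \gamma = 2 \ (\text{ridge}),$$

with $\lambda = 0$ recovering ordinary least squares.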
In this article we study post-model selection estimators that apply ordinary least squares (OLS) to the model selected by first-step penalized estimators, typically Lasso. It is well known that Lasso can estimate the nonparametric regression function at nearly the oracle rate, and is thus hard to improve upon. We show that the OLS post-Lasso estimator performs at least as well as Lasso in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the Lasso-based model selection “fails” in the sense of missing some components of the “true” regression model. By the “true” model, we mean the best $s$-dimensional approximation to the nonparametric regression function chosen by the oracle. Furthermore, OLS post-Lasso estimator can perform strictly better than Lasso, in the sense of a strictly faster rate of convergence, if the Lasso-based model selection correctly includes all components of the “true” model as a subset and also achieves sufficient sparsity. In the extreme case, when Lasso perfectly selects the “true” model, the OLS post-Lasso estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by Lasso, which guarantees that this dimension is at most of the same order as the dimension of the “true” model. Our rate results are nonasymptotic and hold in both parametric and nonparametric models. Moreover, our analysis is not limited to the Lasso estimator acting as a selector in the first step, but also applies to any other estimator, for example, various forms of thresholded Lasso, with good rates and good sparsity properties. Our analysis covers both traditional thresholding and a new practical, data-driven thresholding scheme that induces additional sparsity subject to maintaining a certain goodness of fit. The latter scheme has theoretical guarantees similar to those of Lasso or OLS post-Lasso, but it dominates those procedures as well as traditional thresholding in a wide variety of experiments.
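A minimal sketch of the two-step estimator studied here, using scikit-learn for the first-step selector (a convenience assumption; the paper's results cover more general selectors and thresholding schemes):

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def ols_post_lasso(X, y, alpha):
    """Step 1: Lasso as a model selector. Step 2: OLS refit on the selected columns."""
    support = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    beta = np.zeros(X.shape[1])
    if support.size:
        beta[support] = LinearRegression().fit(X[:, support], y).coef_
    return beta, support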
We consider the fundamental problem of estimating the mean of a vector y=Xβ+z, where X is an n×p design matrix in which one can have far more variables than observations, and z is a stochastic error term—the so-called "p>n" setup. When β is sparse, or, more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate Xβ using a computationally tractable algorithm. We show that, in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error that one would achieve with an oracle supplying perfect information about which variables should and should not be included in the model. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in a vast majority of cases where Xβ is a sparse or nearly sparse superposition of variables, but not in all cases. Our results are nonasymptotic and widely applicable, since they simply require that pairs of predictor variables are not too collinear.
Summary Estimation of structure, such as in variable selection, graphical modelling or cluster analysis, is notoriously difficult, especially for high dimensional data. We introduce stability selection. It is based on subsampling in combination with (high dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularization for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for the randomized lasso that stability selection will be variable selection consistent even if the necessary conditions for consistency of the original lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.
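The subsampling idea is simple to sketch; the plain lasso is used below as the base selector for illustration only, whereas the paper additionally analyses a randomized lasso and provides finite-sample error control for the choice of threshold.

import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha, n_subsamples=100, frac=0.5, threshold=0.6, seed=0):
    """Selection frequencies of lasso fits on random half-samples of the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        freq += (Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_ != 0)
    freq /= n_subsamples
    stable = np.flatnonzero(freq >= threshold)   # variables selected often enough
    return stable, freq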
We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA which is shown to be faster than ISTA by several orders of magnitude.
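A compact sketch of FISTA applied to the lasso objective $\tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (fixed step size $1/L$, no backtracking line search):

import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_lasso(X, y, lam, n_iter=500):
    """FISTA: a proximal-gradient (ISTA) step taken at an extrapolated point,
    with the momentum sequence t_k that yields the O(1/k^2) rate."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth gradient
    beta = np.zeros(X.shape[1])
    z, t = beta.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)
        beta_next = soft(z - grad / L, lam / L)      # ISTA step = prox of the l1 term
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        beta, t = beta_next, t_next
    return beta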
The case-crossover study design is a method to assess the effect of transient exposures on the risk of onset of acute events. Control information for each case is based on his/her past exposure experience, and a self-matched analysis is conducted. Empiric evaluation of five approaches to the analysis of case-crossover data from a study of heavy physical exertion and acute myocardial infarction onset is shown. The data presented are from the Onset Study, a case-crossover study of the determinants of myocardial infarction onset conducted in 45 centers from August 1989 to October 1992. In model 1, exactly one control period (matched on clock-time) was sampled per case. In models 2-4, up to 25 control periods were sampled, and the effect of clock-time on the baseline hazard of infarction was modeled. In model 5, a census of the person-time experienced by each subject over the year preceding the infarction was sampled. The 95% confidence interval for model 1 was 2.7 times wider than that for model 5, and the relative efficiency, defined as $v_\infty/v_M$, where $v_M$ represents the asymptotic variance estimate of the estimated log relative risk with $M$ control periods sampled per case, was only about 14% of that of model 5. In models 2-4, the efficiency increased with the number of control periods, regardless of the modeling assumptions. Even with many control periods sampled, models 2-4 achieved only half the efficiency of model 5. The control sampling strategy in any given case-crossover study should be selected with the trade-offs between precision and potential biases of the estimates in mind.
Dimension reduction and variable selection are performed routinely in case-control studies, but the literature on the theoretical aspects of the resulting estimates is scarce. We bring our contribution to this literature by studying estimators obtained via ℓ1 penalized likelihood optimization. We show that the optimizers of the ℓ1 penalized retrospective likelihood coincide with the optimizers of the ℓ1 penalized prospective likelihood. This extends the results of Prentice and Pyke (1979), obtained for non-regularized likelihoods. We establish both the sup-norm consistency of the odds ratio, after model selection, and the consistency of subset selection of our estimators. The novelty of our theoretical results consists in the study of these properties under the case-control sampling scheme. Our results hold for selection performed over a large collection of candidate variables, with cardinality allowed to depend and be greater than the sample size. We complement our theoretical results with a novel approach of determining data driven tuning parameters, based on the bisection method. The resulting procedure offers significant computational savings when compared with grid search based methods. All our numerical experiments support strongly our theoretical findings.
We study the effective degrees of freedom of the lasso in the framework of Stein’s unbiased risk estimation (SURE). We show that the number of nonzero coefficients is an unbiased estimate for the degrees of freedom of the lasso—a conclusion that requires no special assumption on the predictors. In addition, the unbiased estimator is shown to be asymptotically consistent. With these results on hand, various model selection criteria—Cp, AIC and BIC—are available, which, along with the LARS algorithm, provide a principled and efficient approach to obtaining the optimal lasso fit with the computational effort of a single ordinary least-squares fit.
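Writing $\widehat{\mathrm{df}}(\lambda)$ for the number of nonzero coefficients in the lasso fit $\hat\beta_\lambda$, the $C_p$-type criterion that this result enables takes the usual SURE form

$$C_p(\lambda) = \frac{\|y - X\hat\beta_\lambda\|_2^2}{\bar\sigma^2} - n + 2\,\widehat{\mathrm{df}}(\lambda),$$

where $\bar\sigma^2$ is an estimate of the noise variance (for instance from the full least-squares fit when $n > p$); minimizing it along the LARS path gives the principled choice of fit referred to above.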
Abstract Consider developing a regression model in a context where substantive theory is weak. To focus on an extreme case, suppose that in fact there is no relationship between the dependent variable and the explanatory variables. Even so, if there are many explanatory variables, the $R^2$ will be high. If explanatory variables with small t statistics are dropped and the equation refitted, the $R^2$ will stay high and the overall F will become highly significant. This is demonstrated by simulation and by asymptotic calculation.
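The phenomenon is easy to reproduce; the snippet below is an illustrative simulation in the spirit of the argument (the sample size, number of predictors and |t| cut-off are arbitrary choices of ours).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                   # no relationship with X whatsoever

full = sm.OLS(y, sm.add_constant(X)).fit()
keep = np.flatnonzero(np.abs(full.tvalues[1:]) > 1.0)    # drop small-|t| predictors
refit = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
print(full.rsquared, refit.rsquared, refit.f_pvalue)     # R^2 stays high, F looks significant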
Abstract This article presents a novel algorithm that efficiently computes $L_1$-penalized (lasso) estimates of parameters in high-dimensional models. The lasso has the property that it simultaneously performs variable selection and shrinkage, which makes it very useful for finding interpretable prediction rules in high-dimensional data. The new algorithm is based on a combination of gradient ascent optimization with the Newton–Raphson algorithm. It is described for a general likelihood function and can be applied in generalized linear models and other models with an $L_1$ penalty. The algorithm is demonstrated in the Cox proportional hazards model, predicting survival of breast cancer patients using gene expression data, and its performance is compared with competing approaches. An R package, penalized, that implements the method is available on CRAN.
Abstract Model selection and inference are usually treated as separate stages of regression analysis, even though both tasks are performed on the same set of data. Once a model has been selected, one typically proceeds as though one has a fresh data set generated by the selected model. Here, we present Monte Carlo results on the coverage rates of confidence regions for the regression parameters, conditional on the selected model order. The conditional coverage rates are much smaller than the nominal coverage rates, obtained by assuming that the model was known in advance. Furthermore, the overall coverage rate is much smaller than the nominal value. A possible remedy based on data splitting is suggested.
The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the ℓ1 penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a byproduct of our theory, the nonnegative garotte is shown to be consistent for variable selection.
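In practice the adaptive weights are usually folded into an ordinary lasso solver by rescaling the design, which is how the remark about reusing the same efficient algorithm is exploited. A sketch follows (the OLS initial estimate and the scikit-learn solver are choices of ours; a ridge initial estimate is common when $p$ is large).

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, alpha, gamma=1.0, eps=1e-6):
    """Adaptive lasso via column rescaling: solve a plain lasso on X_j / w_j
    with weights w_j = 1/|beta_init_j|^gamma, then map coefficients back."""
    beta_init = LinearRegression().fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + eps)
    lasso = Lasso(alpha=alpha).fit(X / w, y)   # each column j divided by w_j
    return lasso.coef_ / w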
Confidence intervals based on penalized maximum likelihood estimators such as the LASSO, adaptive LASSO, and hard-thresholding are analyzed. In the known-variance case, the finite-sample coverage properties of such intervals are determined and it is shown that symmetric intervals are the shortest. The length of the shortest intervals based on the hard-thresholding estimator is larger than the length of the shortest interval based on the adaptive LASSO, which is larger than the length of the shortest interval based on the LASSO, which in turn is larger than the standard interval based on the maximum likelihood estimator. In the case where the penalized estimators are tuned to possess the 'sparsity property', the intervals based on these estimators are larger than the standard interval by an order of magnitude. Furthermore, a simple asymptotic confidence interval construction in the 'sparse' case, that also applies to the smoothly clipped absolute deviation estimator, is discussed. The results for the known-variance case are shown to carry over to the unknown-variance case in an appropriate asymptotic sense.
When making sampling distribution inferences about the parameter of the data, θ, it is appropriate to ignore the process that causes missing data if the missing data are 'missing at random' and the observed data are 'observed at random', but these inferences are generally conditional on the observed pattern of missing data. When making direct-likelihood or Bayesian inferences about θ, it is appropriate to ignore the process that causes missing data if the missing data are missing at random and the parameter of the missing data process is 'distinct' from θ. These conditions are the weakest general conditions under which ignoring the process that causes missing data always leads to correct inferences.
Abstract A hierarchical Bayesian approach is proposed for variable selection and function estimation in additive nonparametric Gaussian regression models and additive nonparametric binary regression models. The prior for each component function is an integrated Wiener process resulting in a posterior mean estimate that is a cubic smoothing spline. Each of the explanatory variables is allowed to be in or out of the model, and the regression functions are estimated by model averaging. To allow variable selection and model averaging, data-based priors are used for the smoothing parameter and the slope at 0 of each component function. A two-step Markov chain Monte Carlo method is used to efficiently obtain the data-based prior and to carry out variable selection and function estimation. It is shown by simulation that significant improvements in the function estimators can be obtained over an approach that estimates all the unknown functions simultaneously. The methodology is illustrated for a binary regression using heart attack data. Key words: Bayesian information criterion; binary regression; Gaussian errors; model averaging; posterior probabilities; time series models.
The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results.
We propose a new method for model selection and model fitting in multivariate nonparametric regression models, in the framework of smoothing spline ANOVA. The “COSSO” is a method of regularization with the penalty functional being the sum of component norms, instead of the squared norm employed in the traditional smoothing spline method. The COSSO provides a unified framework for several recent proposals for model selection in linear models and smoothing spline ANOVA models. Theoretical properties, such as the existence and the rate of convergence of the COSSO estimator, are studied. In the special case of a tensor product design with periodic functions, a detailed analysis reveals that the COSSO does model selection by applying a novel soft thresholding type operation to the function components. We give an equivalent formulation of the COSSO estimator which leads naturally to an iterative algorithm. We compare the COSSO with MARS, a popular method that builds functional ANOVA models, in simulations and real examples. The COSSO method can be extended to classification problems and we compare its performance with those of a number of machine learning algorithms on real datasets. The COSSO gives very competitive performance in these studies.
We propose penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model. The methods lead to a sparse and shrinkage estimator of the concentration matrix that is positive definite, and thus conduct model selection and estimation simultaneously. The implementation of the methods is nontrivial because of the positive definite constraint on the concentration matrix, but we show that the computation can be done effectively by taking advantage of the efficient maxdet algorithm developed in convex optimization. We propose a BIC-type criterion for the selection of the tuning parameter in the penalized likelihood methods. The connection between our methods and existing methods is illustrated. Simulations and real examples demonstrate the competitive performance of the new methods.
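Up to whether the diagonal entries are penalized, the penalized likelihood in question is of the form

$$\hat\Theta = \arg\max_{\Theta \succ 0}\ \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\sum_{j\ne k}|\theta_{jk}|,$$

where $S$ is the sample covariance matrix; zeros in $\hat\Theta$ correspond to missing edges in the Gaussian graphical model, so the sparsity of the estimate is what carries out the model selection.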
In applications such as medical statistics and genetics, we encounter situations where a large number of highly correlated predictors explain a response. For example, the response may be a disease indicator and the predictors may be treatment indicators or single nucleotide polymorphisms (SNPs). Constructing a good predictive model in such cases is well studied. Less well understood is how to recover the 'true sparsity pattern', that is finding which predictors have direct effects on the response, and indicating the statistical significance of the results. Restricting attention to binary predictors and response, we study the recovery of the true sparsity pattern using a two-stage method that separates establishing the presence of effects from inferring their exact relationship with the predictors. Simulations and a real data application demonstrate that the method discriminates well between associations and direct effects. Comparisons with lasso-based methods demonstrate favourable performance of the proposed method.
Multiple linear regression is a versatile model for encompassing analysis of variance, analysis of covariance, and aptitude-by-treatment interaction designs. The question of how to teach the coding of levels of a qualitative variable is addressed in this paper. Although a variety of coding schemes will produce invariant omnibus statistical results for a given set of data, one’s interpretation of treatment effects and treatment differences depends on the particular code values chosen. A general procedure is presented that allows the user to generate, on an a priori basis, code values that result in directly interpretable estimates of interest.
This paper describes how penalized Cox regression, in combination with cross-validated partial likelihood, can be employed to obtain reliable survival prediction models for high dimensional microarray data. The suggested procedure is demonstrated on a breast cancer survival data set consisting of 295 tumours as collected in the National Cancer Institute in Amsterdam and previously reported in more general papers. The main aim of this paper is to show how generally accepted biostatistical procedures can be employed to analyse high-dimensional data.
This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.
The title Lasso has been suggested by Tibshirani (1996) as a colourful name for a technique of variable selection which requires the minimization of a sum of squares subject to an l1 bound κ on the solution. This forces zero components in the minimizing solution for small values of κ. Thus this bound can function as a selection parameter. This paper makes two contributions to computational problems associated with implementing the Lasso: (1) a compact descent method for solving the constrained problem for a particular value of κ is formulated, and (2) a homotopy method, in which the constraint bound κ becomes the homotopy parameter, is developed to completely describe the possible selection regimes. Both algorithms have a finite termination property. It is suggested that modified Gram-Schmidt orthogonalization applied to an augmented design matrix provides an effective basis for implementing the algorithms.