Subsampling for Big Data Linear Models with Measurement Errors

Locations

  • Statistics and Computing
  • arXiv (Cornell University)


Summary

The paper addresses a critical challenge in the analysis of big data: performing efficient and accurate statistical inference when the observed covariates are contaminated by measurement errors. Traditional subsampling methods, while effective for clean data, yield biased and inconsistent estimates in this error-prone setting, which arises frequently in practice because of imprecise sensors, data-collection inaccuracies, or inherent variability.
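
For intuition, here is the textbook attenuation effect in simple linear regression with an additive error in the covariate (a standard errors-in-variables fact included for context, not a result taken from the paper): the naive slope estimate shrinks toward zero by a fixed factor, so taking a larger sample, or a cleverer subsample, cannot remove the bias.

```latex
% Attenuation under additive measurement error (standard illustration).
% True model: y_i = beta * x_i + e_i with centered variables; we only
% observe w_i = x_i + u_i, where x_i, u_i, e_i are mutually independent.
\[
  \hat{\beta}_{\text{naive}}
  = \frac{\sum_i w_i y_i}{\sum_i w_i^{2}}
  \;\xrightarrow{\;p\;}\;
  \frac{\sigma_x^{2}}{\sigma_x^{2} + \sigma_u^{2}}\,\beta
  = \lambda\beta,
  \qquad 0 < \lambda < 1 .
\]
```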

To overcome this, the authors introduce two novel subsampling algorithms specifically designed for linear models with measurement errors. Both methods build upon the corrected likelihood approach proposed by Nakamura (1990), a foundational technique that modifies the standard likelihood function to account for the known structure of the measurement errors, thereby yielding unbiased estimating equations and, in turn, consistent parameter estimates.
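
As a concrete illustration of this correction in the linear model (a standard sketch under additive errors W_i = X_i + U_i with known error covariance Σ_u; the paper's exact corrected likelihood may differ in details such as the treatment of the error variance), the naive normal equations are de-biased by adding back the term induced by the measurement-error covariance:

```latex
% Corrected estimating equation for y_i = X_i' * beta + e_i when only
% W_i = X_i + U_i is observed, with E[U_i] = 0 and Cov(U_i) = Sigma_u known.
% The extra n * Sigma_u * beta term removes the attenuation bias of the
% naive normal equations (sketch in the spirit of Nakamura, 1990).
\[
  \sum_{i=1}^{n} W_i\bigl(y_i - W_i^{\top}\beta\bigr) + n\,\Sigma_u\,\beta = 0
  \quad\Longrightarrow\quad
  \hat{\beta}_{\mathrm{c}}
  = \Bigl(\sum_{i=1}^{n} W_i W_i^{\top} - n\,\Sigma_u\Bigr)^{-1}
    \sum_{i=1}^{n} W_i\, y_i .
\]
```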

The first proposed innovation is the Optimal Subsampling based on Corrected Likelihood (OSCL). This method aims to select a small, representative subset of the full dataset that maximizes the statistical efficiency of the parameter estimator. It achieves this by determining optimal subsampling probabilities for each data point. These probabilities are derived by minimizing the trace of the estimator’s asymptotic variance matrix, which is formulated using the corrected likelihood function. Since the optimal probabilities depend on the unknown parameters, a practical two-step procedure is employed: an initial pilot estimate is obtained (e.g., via uniform subsampling), which then informs the calculation of refined optimal probabilities for the final, more efficient subsample.
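
The sketch below shows one way this two-step procedure could look in code. It reuses the corrected least-squares estimator sketched earlier and assumes A-optimality-style probabilities proportional to the norm of each observation's information-matrix-scaled corrected-score contribution; the function names, the exact probability formula, and the with-replacement sampling scheme are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def corrected_ls(W, y, Sigma_u, weights=None):
    """Weighted corrected least squares for error-prone covariates W."""
    if weights is None:
        weights = np.ones(len(y))
    WtW = (W * weights[:, None]).T @ W
    Wty = (W * weights[:, None]).T @ y
    # Subtract the measurement-error contribution from the Gram matrix.
    return np.linalg.solve(WtW - weights.sum() * Sigma_u, Wty)

def oscl_two_step(W, y, Sigma_u, r_pilot, r, seed=0):
    """Two-step optimal subsampling with a corrected estimator (sketch)."""
    rng = np.random.default_rng(seed)
    n, _ = W.shape
    # Step 1: uniform pilot subsample -> pilot estimate.
    idx0 = rng.choice(n, size=r_pilot, replace=True)
    beta0 = corrected_ls(W[idx0], y[idx0], Sigma_u)
    # Per-observation corrected-score contributions at the pilot estimate.
    resid = y - W @ beta0
    psi = W * resid[:, None] + (Sigma_u @ beta0)[None, :]
    # A-optimality-style probabilities (assumed form): proportional to the
    # norm of the information-matrix-scaled score contribution.
    M = W.T @ W / n - Sigma_u
    probs = np.linalg.norm(psi @ np.linalg.inv(M), axis=1)
    probs /= probs.sum()
    # Step 2: draw the refined subsample and re-fit with
    # inverse-probability weights.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    return corrected_ls(W[idx], y[idx], Sigma_u, weights=1.0 / (r * probs[idx]))
```

The inverse-probability weights sum to roughly n in expectation, so the subtracted term weights.sum() * Sigma_u tracks the full-data correction n * Sigma_u.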

Recognizing that calculating exact optimal probabilities for massive datasets can be computationally intensive and memory-demanding, the paper also proposes a more scalable approach: Perturbation Subsampling based on Corrected Likelihood (PSCL). This method leverages random weighting, similar to existing perturbation subsampling techniques, but adapted to the measurement error context. It approximates the full-data objective function using independently generated stochastic weights. This ingenious strategy avoids the need to compute all individual sampling probabilities explicitly, making it particularly advantageous for very large datasets and allowing for parallelization.
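
A minimal sketch of the perturbation idea follows, using scaled Bernoulli inclusion weights with mean one as an assumed, illustrative weight distribution (the paper specifies its own weighting scheme): each observation receives an independent random weight, only rows with nonzero weight enter the weighted corrected normal equations, and no data-dependent sampling probabilities are ever computed, which also makes the computation easy to split across data blocks.

```python
import numpy as np

def pscl(W, y, Sigma_u, rho, seed=1):
    """Perturbation subsampling with a corrected estimator (illustrative sketch).

    Each observation gets an i.i.d. stochastic weight with mean 1; on average
    only a fraction rho of the rows carry a nonzero weight and enter the fit.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Scaled Bernoulli weights: w_i = 1/rho with probability rho, else 0.
    incl = rng.random(n) < rho
    w = incl / rho
    keep = np.flatnonzero(incl)
    Wk, yk, wk = W[keep], y[keep], w[keep]
    # Weighted corrected normal equations: the stochastic weights stand in
    # for the full data, and the error-covariance term is subtracted as before.
    WtW = (Wk * wk[:, None]).T @ Wk
    Wty = (Wk * wk[:, None]).T @ yk
    return np.linalg.solve(WtW - wk.sum() * Sigma_u, Wty)
```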

A crucial aspect of this work is the rigorous theoretical analysis. The authors demonstrate that both OSCL and PSCL yield consistent and asymptotically normal estimators, providing strong statistical guarantees for their reliability even in the presence of measurement errors.
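
Schematically, such results in the optimal subsampling literature take a sandwich form built from the corrected score; the display below paraphrases that typical structure (not the paper's exact matrices or scaling) and makes the link to the trace criterion behind OSCL explicit.

```latex
% Schematic asymptotic normality for a subsample estimator of size r
% (typical sandwich structure in this literature; details may differ).
\[
  \sqrt{r}\,\bigl(\hat{\beta}_{\mathrm{sub}} - \hat{\beta}_{\mathrm{full}}\bigr)
  \;\xrightarrow{\;d\;}\;
  N\!\bigl(0,\; M^{-1} V_{c}\, M^{-1}\bigr),
\]
% M: corrected information matrix; V_c: variance of the probability-weighted
% corrected score. A-optimal probabilities minimize tr(M^{-1} V_c M^{-1}).
```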

The main prior ingredients that underpin this work include:
1. Big Data Subsampling Algorithms: The general framework of subsampling for computational efficiency, drawing inspiration from methods like leverage sampling, information-based optimal subdata selection (IBOSS), and existing perturbation subsampling techniques.
2. Measurement Error Models: The established statistical theory for linear models with measurement errors, primarily building on the work by Fuller (1987) and the corrected likelihood method by Nakamura (1990) as the core mechanism to handle measurement inaccuracies.
3. Asymptotic Theory: Classical results on consistency and asymptotic normality for estimators, as applied in the context of both subsampling and measurement error models.

Overall, this paper offers a significant advancement in statistical methodology for big data, providing practical and statistically sound tools for analyzing massive datasets in linear models where covariate measurements are imperfect – a common and previously underexplored challenge in many scientific and industrial applications.

Subsampling algorithms for various parametric regression models with massive data have been extensively investigated in recent years. However, all existing studies on subsampling heavily rely on clean massive data. In practical applications, the observed covariates may suffer from inaccuracies due to measurement errors. To address the challenge of large datasets with measurement errors, this study explores two subsampling algorithms based on the corrected likelihood approach: the optimal subsampling algorithm utilizing inverse probability weighting and the perturbation subsampling algorithm employing random weighting assuming a perfectly known distribution. Theoretical properties for both algorithms are provided. Numerical simulations and two real-world examples demonstrate the effectiveness of these proposed methods compared to other uncorrected algorithms.
To quickly approximate the maximum likelihood estimator with massive data, Wang et al. (JASA, 2017) proposed an Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for logistic regression. This paper extends the scope of the OSMAC framework to generalized linear models with canonical link functions. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using a Frobenius-norm matrix concentration inequality, finite-sample properties of the subsample estimator based on optimal subsampling probabilities are derived. Since the optimal subsampling probabilities depend on the full-data estimate, an adaptive two-step algorithm is developed, and the asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
Large sample sizes create a computational bottleneck for modern data analysis, and subsampling is one of the efficient strategies for handling this problem. Previous studies have focused more on subsampling with replacement (SSR) than on subsampling without replacement (SSWR). In this paper we investigate a kind of SSWR, Poisson subsampling (PSS), for fast algorithms in the ordinary least-squares problem. We establish a non-asymptotic property, namely an error bound for the corresponding subsample estimator, which provides a trade-off between computational cost and approximation efficiency. Beyond this non-asymptotic result, we provide asymptotic consistency and normality of the subsample estimator. Methodologically, we propose a two-step subsampling algorithm, which is efficient with respect to a statistical objective and does not depend on the linear model assumption. Synthetic and real data are used to empirically study the proposed subsampling strategies. These empirical studies indicate that (1) the proposed two-step algorithm has a clear advantage when the assumed linear model is not accurate, and (2) the PSS strategy performs noticeably better than SSR as the subsampling ratio increases.
The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.
In today’s modern era of big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is subsampling, where a subset of the big data is analysed and used as the basis for inference rather than considering the whole data set. A key question when applying subsampling approaches is how to select an informative subset based on the questions being asked of the data. A recent approach for this has been proposed based on determining subsampling probabilities for each data point, but a limitation of this approach is that the appropriate subsampling probabilities rely on an assumed model for the big data. In this article, to overcome this limitation, we propose a model robust approach where a set of models is considered, and the subsampling probabilities are evaluated based on the weighted average of probabilities that would be obtained if each model was considered singularly. Theoretical results are derived to inform such an approach. Our model robust subsampling approach is applied in a simulation study and in two real-world applications where performance is compared to current subsampling practices. The results show that our model robust approach outperforms alternative methods.
A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large samples is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and the estimator with resampling. We also give the asymptotic representation of the bias of the estimator without resampling and the estimator with resampling, and we show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method, called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018), and we derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large, high-dimensional samples to evaluate the performance of the proposed methods, using MSE as the criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE when the bias is not negligible. We apply the proposed subsampling method to analyze a real data set, gas sensor data, which has more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
Handling large datasets and calculating complex statistics on huge datasets require substantial computing resources. Subsampling methods that calculate statistics of interest on small samples are often used in practice to reduce computational complexity, for instance via the divide-and-conquer strategy. In this article, we recall some results on subsampling distributions and derive a precise rate of convergence for these quantities and the corresponding quantiles. We also develop some standardisation techniques based on subsampling unstandardised statistics in the framework of large datasets. It is argued that using several subsampling distributions with different subsampling sizes brings a lot of information on the behaviour of statistical learning procedures: subsampling allows one to estimate the rate of convergence of different algorithms, to estimate the variability of complex statistics, to estimate confidence intervals for out-of-sample errors, and to interpolate their values at larger scales. These results are illustrated on simulations, but also on two important datasets frequently analysed in the statistical learning community, EMNIST (recognition of digits) and VeReMi (analysis of Network Vehicular Reference Misbehavior).
In the era of massive data, how to obtain valuable information from large-scale data has become an important research direction. Subsampling is a common method to reduce the dimension of such data and improve the efficiency of model computation. A subsampling method based on joint mean and variance models (JMVM) is proposed. Numerical simulation is used to evaluate uniform sampling (UNIF), leverage subsampling (LEV), and shrinkage leverage subsampling (SLEV). The results show that the models based on LEV and SLEV exhibit higher accuracy in the estimation of the mean parameters.
Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given percentage of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The thus obtained subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that differs from the D-optimal design. We present a simulation study, comparing both subsampling schemes with the IBOSS method in the case of a fixed size of the subsample.
This thesis is concerned with massive data analysis via robust, A-optimally efficient non-uniform subsampling. Motivated by the fact that massive data often contain outliers and that uniform sampling is not efficient, we give numerous sampling distributions by minimizing the sum of the component variances of the subsampling estimate, and these sampling distributions are robust against outliers. Massive data pose two computational bottlenecks: the data exceed a computer's storage space, and computation requires too long a waiting time. The two bottlenecks can be simultaneously addressed by selecting a subsample as a surrogate for the full sample and completing the data analysis. We develop our theory in a typical setting for robust linear regression in which the estimating functions are not differentiable. For an arbitrary sampling distribution, we establish consistency of the subsampling estimate for both fixed and growing dimension (as high dimensionality is common in massive data), and we prove asymptotic normality for fixed dimension. We discuss the A-optimal scoring method for fast computing. We conduct large simulations to evaluate the numerical performance of our proposed A-optimal sampling distribution. Real data applications are also performed.
A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenarios: approximating the full sample ordinary least-squares (OLS) estimator and estimating the coefficients in linear regression. We present two algorithms, a weighted estimation algorithm and an unweighted estimation algorithm, and analyze the asymptotic behaviors of their resulting subsample estimators under general conditions. For the weighted estimation algorithm, we propose a criterion for selecting the optimal sampling probability by making use of the asymptotic results. On the basis of this criterion, we provide two novel subsampling methods, the optimal subsampling and the predictor-length subsampling methods. The predictor-length subsampling method is based on the L2 norm of the predictors rather than leverage scores, and its computational cost is scalable. For the unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent for the full sample OLS estimator; however, it has better performance than the weighted estimation algorithm for estimating the coefficients. Simulation studies and a real data example are used to demonstrate the effectiveness of our proposed subsampling methods.
This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce large data computational tasks. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points that minimize the asymptotic mean squared errors of the general estimator and linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates using the subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
Faced with massive data, subsampling is a popular way to downsize the data volume for reducing computational burden. The key idea of subsampling is to perform statistical analysis on a representative subsample drawn from the full data. It provides a practical solution to extracting useful information from big data. In this article, we develop an efficient subsampling method for the large-scale multiplicative regression model, which can largely reduce the computational burden due to massive data. Under some regularity conditions, we establish consistency and asymptotic normality of the subsample-based estimator, and derive the optimal subsampling probabilities according to the L-optimality criterion. A two-step algorithm is developed to approximate the optimal subsampling procedure. Meanwhile, the convergence rate and asymptotic normality of the two-step subsample estimator are established. Numerical studies and two real data applications are carried out to evaluate the performance of our subsampling method.
This book is an introduction to the field of asymptotic statistics. The treatment is both practical and mathematically rigorous. In addition to most of the standard topics of an asymptotics course, including likelihood inference, M-estimation, the theory of asymptotic efficiency, U-statistics, and rank procedures, the book also presents recent research topics such as semiparametric models, the bootstrap, and empirical processes and their applications. The topics are organized from the central idea of approximation by limit experiments, which gives the book one of its unifying themes. This entails mainly the local approximation of the classical i.i.d. set up with smooth parameters by location experiments involving a single, normally distributed observation. Thus, even the standard subjects of asymptotic statistics are presented in a novel way. Suitable as a graduate or Master's level statistics text, this book will also give researchers an overview of research in asymptotic statistics.
We consider the partially linear model relating a response $Y$ to predictors ($X, T$) with mean function $X^{\top}\beta + g(T)$ when the $X$’s are measured with additive error. The semiparametric likelihood estimate of Severini and Staniswalis leads to biased estimates of both the parameter $\beta$ and the function $g(\cdot)$ when measurement error is ignored. We derive a simple modification of their estimator which is a semiparametric version of the usual parametric correction for attenuation. The resulting estimator of $\beta$ is shown to be consistent and its asymptotic distribution theory is derived. Consistent standard error estimates using sandwich-type ideas are also developed.
Statistical models whose independent variables are subject to measurement errors are often referred to as 'errors-in-variables models'. To correct for the effects of measurement error on parameter estimation, this paper considers a correction for score functions. A corrected score function is one whose expectation with respect to the measurement error distribution coincides with the usual score function based on the unknown true independent variables. This approach makes it possible to do inference as well as estimation of model parameters without additional assumptions. The corrected score functions of some generalized linear models are obtained.
This article focuses on variable selection for partially linear models when the covariates are measured with additive errors. We propose two classes of variable selection procedures, penalized least squares and penalized quantile regression, using the nonconvex penalized principle. The first procedure corrects the bias in the loss function caused by the measurement error by applying the so-called correction-for-attenuation approach, whereas the second procedure corrects the bias by using orthogonal regression. The sampling properties for the two procedures are investigated. The rate of convergence and the asymptotic normality of the resulting estimates are established. We further demonstrate that, with proper choices of the penalty functions and the regularization parameter, the resulting estimates perform asymptotically as well as an oracle procedure as proposed by Fan and Li. Choice of smoothing parameters is also discussed. Finite sample performance of the proposed variable selection procedures is assessed by Monte Carlo simulation studies. We further illustrate the proposed procedures by an application.
We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^3)$ computational time when the number $d$ of predictors is two or more. Such a sizeable computational cost hinders the broad applicability of smoothing splines. In practice, the full-sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^2)$. It is known that these two estimators converge at the same rate when $q$ is of order $O\{n^{2/(pr+1)}\}$, where $p\in [1,2]$ depends on the true function and $r > 1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to approximately equally spaced observations, the proposed method chooses a set of basis functions with great diversity. The asymptotic analysis shows that the proposed smoothing spline estimator can decrease $q$ to around $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.
Nonuniform subsampling methods are effective to reduce computational burden and maintain estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To deal with the situation that the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets.
We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator, and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given the covariates and is easy to implement. Algorithms based on the optimal subsampling probabilities are proposed, and the asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities for the linearly transformed parameter estimation, which scales well to the available computational resources. In addition, this procedure yields standard errors for the parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large data sets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, i.e., the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data.
Smoothing splines have been used pervasively in nonparametric regression. However, the computational burden of smoothing splines is significant when the sample size n is large. When the number of predictors d ≥ 2, the computational cost of smoothing splines is of the order O(n^3) using the standard approach. Many methods have been developed to approximate smoothing spline estimators by using q basis functions instead of n, resulting in a computational cost of the order O(nq^2). These methods are called basis selection methods. Despite algorithmic benefits, most basis selection methods require the assumption that the sample is uniformly distributed on a hypercube, and they may have deteriorating performance when that assumption is not met. To overcome this obstacle, we develop an efficient algorithm that is adaptive to the unknown probability density function of the predictors. Theoretically, we show that the proposed estimator has the same convergence rate as the full-basis estimator when q is roughly of the order O[n^{2d/{(pr+1)(d+2)}}], where p ∈ [1, 2] and r ≈ 4 are constants that depend on the type of spline. Numerical studies on various synthetic datasets demonstrate the superior performance of the proposed estimator in comparison with mainstream competitors.
Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods, which depend significantly on the model assumption. In this article, we consider a model-free subsampling strategy for generating subdata from the original full data. To measure how well a subdata set represents the original data, we propose a criterion, the generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy in the theory of uniform designs. These properties allow us to develop a low-GEFD, data-driven subsampling method based on existing uniform designs. Through simulation examples and a real case study, we show that the proposed subsampling method is superior to random sampling. Moreover, our method remains robust under diverse model specifications, while other popular model-based subsampling methods under-perform. In practice, such a model-free property is more appealing than model-based subsampling, since the latter may perform poorly when the model is misspecified, as demonstrated in our simulation studies. In addition, our method is orders of magnitude faster than other model-free subsampling methods, which makes it more applicable for subsampling of big data.