The paper addresses a critical challenge in the analysis of big data: performing efficient and accurate statistical inference when the observed covariates are contaminated by measurement error. Traditional subsampling methods, while effective for clean data, yield biased and inconsistent estimates in this setting, which arises routinely in practice from imprecise sensors, data-collection inaccuracies, or inherent variability.
To overcome this, the authors introduce two novel subsampling algorithms tailored to linear models with measurement errors. Both methods build on the corrected likelihood approach of Nakamura (1990), a foundational technique that modifies the standard likelihood (and its score) to account for the known structure of the measurement errors, yielding unbiased estimating equations and hence consistent parameter estimates.
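For the linear model, this correction amounts to removing the bias that the measurement-error covariance induces in the Gram matrix of the observed covariates. A minimal sketch, assuming additive errors with known covariance Sigma_uu; the function name and synthetic data are illustrative, not the paper's notation:

```python
import numpy as np

def corrected_score_estimator(W, y, Sigma_uu):
    """Bias-corrected least-squares estimate for a linear model whose covariates
    W = X + U carry additive measurement error with known covariance Sigma_uu
    (Nakamura-style corrected score)."""
    n, _ = W.shape
    # The naive normal equations use W'W; subtracting n*Sigma_uu removes the
    # bias that the error covariance introduces into W'W.
    return np.linalg.solve(W.T @ W - n * Sigma_uu, W.T @ y)

# Illustrative synthetic data (not from the paper).
rng = np.random.default_rng(0)
n, p = 100_000, 3
beta = np.array([1.0, -2.0, 0.5])
Sigma_uu = 0.25 * np.eye(p)
X = rng.normal(size=(n, p))
W = X + rng.multivariate_normal(np.zeros(p), Sigma_uu, size=n)
y = X @ beta + rng.normal(size=n)

naive = np.linalg.solve(W.T @ W, W.T @ y)            # attenuated toward zero
corrected = corrected_score_estimator(W, y, Sigma_uu)
print("naive:    ", naive)
print("corrected:", corrected)
```

On this synthetic data the naive fit is visibly attenuated, while the corrected fit recovers the true coefficients up to sampling noise.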
The first proposed method is Optimal Subsampling based on Corrected Likelihood (OSCL). It selects a small, informative subset of the full dataset by assigning each data point an optimal subsampling probability, chosen to minimize the trace of the estimator's asymptotic variance matrix (an A-optimality criterion) derived from the corrected likelihood. Because these optimal probabilities depend on the unknown parameters, a practical two-step procedure is used: a pilot estimate is first obtained (e.g., via uniform subsampling) and then plugged in to compute the refined probabilities used to draw the final, more efficient subsample.
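The two-step procedure can be summarized in code. The sketch below assumes A-optimality-style probabilities proportional to the norm of each observation's corrected-score contribution at the pilot estimate; the paper's exact weighting may differ, and `corrected_estimate`/`oscl_two_step` are illustrative names.

```python
import numpy as np

def corrected_estimate(W, y, Sigma_uu, weights=None):
    """Weighted corrected-score estimate; unit weights recover the full-data fit."""
    if weights is None:
        weights = np.ones(len(y))
    Wt = (W * weights[:, None]).T
    return np.linalg.solve(Wt @ W - weights.sum() * Sigma_uu, Wt @ y)

def oscl_two_step(W, y, Sigma_uu, r0, r, rng):
    n = len(y)
    # Step 1: uniform pilot subsample gives a rough parameter estimate.
    pilot = rng.choice(n, size=r0, replace=True)
    beta_pilot = corrected_estimate(W[pilot], y[pilot], Sigma_uu)
    # Step 2: probabilities proportional to the norm of each observation's
    # corrected-score contribution at the pilot estimate (an A-optimality-style
    # surrogate; the paper's exact formula may differ).
    scores = W * (y - W @ beta_pilot)[:, None] + Sigma_uu @ beta_pilot
    pi = np.linalg.norm(scores, axis=1)
    pi /= pi.sum()
    idx = rng.choice(n, size=r, replace=True, p=pi)
    # Inverse-probability weights keep the subsample estimating equations
    # unbiased for their full-data counterparts.
    return corrected_estimate(W[idx], y[idx], Sigma_uu, weights=1.0 / (r * pi[idx]))
```

Applied to the synthetic `(W, y, Sigma_uu)` from the previous sketch, `oscl_two_step(W, y, Sigma_uu, r0=500, r=2000, rng=np.random.default_rng(1))` returns an estimate close to the full-data corrected fit at a fraction of the cost.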
Recognizing that computing exact optimal probabilities for massive datasets can be computationally intensive and memory-demanding, the paper also proposes a more scalable alternative: Perturbation Subsampling based on Corrected Likelihood (PSCL). This method adapts random-weighting (perturbation) subsampling to the measurement-error setting: the full-data corrected objective is approximated by a weighted version built from independently generated stochastic weights. Because no data-dependent sampling probabilities have to be computed, the approach is well suited to very large datasets and to parallel or distributed implementation.
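A hedged sketch of the perturbation idea, using mean-one scaled Bernoulli weights so that only about a fraction `rho` of the observations enter the weighted corrected objective (one plausible weight distribution; the paper may specify another):

```python
import numpy as np

def pscl_estimate(W, y, Sigma_uu, rho, rng):
    """Perturbation-style corrected-score estimate with i.i.d. mean-one weights.
    A scaled Bernoulli(rho) weight is used here so that only about n*rho
    observations carry nonzero weight; this is one plausible choice, not
    necessarily the distribution used in the paper."""
    n = len(y)
    v = rng.binomial(1, rho, size=n) / rho   # E[v_i] = 1, mostly zeros
    keep = v > 0                             # only selected rows are touched
    Wk, yk, vk = W[keep], y[keep], v[keep]
    Wt = (Wk * vk[:, None]).T
    return np.linalg.solve(Wt @ Wk - vk.sum() * Sigma_uu, Wt @ yk)
```

Because the weights are generated independently of the data, the weighted sums can be accumulated block by block across machines, which is what makes the scheme easy to parallelize.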
A crucial aspect of this work is the rigorous theoretical analysis. The authors demonstrate that both OSCL and PSCL yield consistent and asymptotically normal estimators, providing strong statistical guarantees for their reliability even in the presence of measurement errors.
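As a quick empirical illustration of the consistency claim (not a substitute for the paper's proofs), one can check that the error of a corrected-score estimate computed on a uniform subsample shrinks as the subsample grows; the same pattern is expected for OSCL and PSCL draws:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200_000, 3
beta = np.array([1.0, -2.0, 0.5])
Sigma_uu = 0.25 * np.eye(p)                  # assumed known error covariance
X = rng.normal(size=(n, p))
W = X + rng.multivariate_normal(np.zeros(p), Sigma_uu, size=n)
y = X @ beta + rng.normal(size=n)

def corrected_estimate(Ws, ys):
    return np.linalg.solve(Ws.T @ Ws - len(ys) * Sigma_uu, Ws.T @ ys)

# Average estimation error of a uniform-subsample corrected estimator
# should shrink as the subsample size r grows (consistency).
for r in (1_000, 10_000, 100_000):
    errs = []
    for _ in range(50):
        idx = rng.choice(n, size=r, replace=True)
        errs.append(np.linalg.norm(corrected_estimate(W[idx], y[idx]) - beta))
    print(f"r = {r:>7}: mean error = {np.mean(errs):.4f}")
```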
The main prior ingredients that underpin this work include:
1. Big Data Subsampling Algorithms: The general framework of subsampling for computational efficiency, drawing inspiration from methods such as leverage-score sampling, information-based optimal subdata selection (IBOSS), and existing perturbation subsampling techniques.
2. Measurement Error Models: The established statistical theory for linear models with measurement errors, primarily building on the work by Fuller (1987) and the corrected likelihood method by Nakamura (1990) as the core mechanism to handle measurement inaccuracies.
3. Asymptotic Theory: Classical results on consistency and asymptotic normality for estimators, as applied in the context of both subsampling and measurement error models.
Overall, the paper offers a significant methodological advance for big-data analysis, providing practical and statistically sound tools for linear models whose covariates are measured with error, a common yet previously underexplored challenge in many scientific and industrial applications.