This paper addresses the critical need for a rigorous mathematical formalization of the data minimization principle within machine learning. Data minimization, a cornerstone of global data protection regulations like GDPR, mandates that organizations collect, process, and retain only personal data that is adequate, relevant, and limited to what is necessary for specified objectives. Despite its legal importance, the principle has lacked an operational definition suitable for complex ML systems.
The key innovation of this work is a formal optimization framework for data minimization, which casts the problem as bilevel optimization. The outer objective minimizes the amount of data retained (quantified by the L1 norm of a binary “minimization matrix” indicating which feature values are kept), while the inner problem trains the machine learning model on the reduced dataset, subject to the constraint that the model's utility (performance) stays within a predefined drop tolerance. Crucially, this framework enables individualized data minimization, allowing specific features to be removed from individual data points, a more granular approach than traditional methods such as global feature selection or random sample pruning.
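A minimal sketch of this bilevel formulation, under assumed notation (a binary mask B with B_ij = 1 if feature j of record i is retained, x_i ⊙ B_i for the masked record, U for a utility measure, and τ for the allowed utility drop); the paper's exact symbols and constraints may differ:

$$
\begin{aligned}
\min_{B \in \{0,1\}^{n \times d}} \;\; & \|B\|_{1} && \text{(outer: minimize retained entries)} \\
\text{s.t.} \;\; & \theta^{*}(B) \in \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(f_{\theta}(x_i \odot B_i),\, y_i\right) && \text{(inner: train on the masked data)} \\
& U\!\left(\theta^{*}(B)\right) \ge U\!\left(\theta^{*}(\mathbf{1})\right) - \tau && \text{(utility within tolerance } \tau\text{)}
\end{aligned}
$$

Because B is binary and every candidate mask implies a full inner training run, exact solutions are generally intractable; this is consistent with the paper's use of optimization-based and evolutionary heuristics for practical implementations.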
A significant finding from this formalization and empirical evaluation is the inherent disconnect between current interpretations of data minimization and actual privacy outcomes. The paper comprehensively assesses the privacy implications by evaluating three distinct real-world privacy risks: Reconstruction Risk (the ease with which removed data can be inferred or recreated), Re-identification Risk (the likelihood of linking anonymized data back to individuals), and Membership Inference Risk (the ability to determine if a data point was part of the training set). The authors demonstrate that simply reducing data size to preserve model utility does not proportionally reduce these privacy risks, challenging the implicit assumption that data minimization inherently enhances privacy.
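As one illustration of how such an evaluation can be instrumented, the sketch below scores a reconstruction-style risk for a given minimization mask by letting a naive adversary impute removed entries from the retained data; the function name and the column-mean imputation attack are illustrative assumptions, not the paper's actual attack models:

```python
import numpy as np

def reconstruction_risk(X, mask):
    """X: (n, d) data matrix; mask: (n, d) binary, 1 = retained, 0 = removed.
    Returns a [0, 1] score where higher means removed entries are easier to
    guess via a simple column-mean imputation (an assumed, weak adversary)."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    imputed = X.copy()
    for j in range(X.shape[1]):
        retained = mask[:, j]
        col_mean = X[retained, j].mean() if retained.any() else 0.0
        imputed[~retained, j] = col_mean        # adversary's guess for removed values
    removed = ~mask
    if not removed.any():
        return 0.0                              # nothing was removed
    scale = X.std(axis=0) + 1e-8                # per-feature scale for normalization
    norm_err = np.abs(imputed - X) / scale      # normalized imputation error
    mean_err = norm_err[removed].mean()
    return 1.0 / (1.0 + mean_err)               # low error -> score near 1 (high risk)

# Example: remove ~40% of entries at random and score the resulting mask
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
mask = rng.random((100, 5)) > 0.4
print(reconstruction_risk(X, mask))
```

Re-identification and membership inference risks would be measured analogously, each with its own adversary model, which is why shrinking the dataset alone need not lower all three.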
To bridge this identified gap, the paper proposes a novel approach: incorporating explicit privacy considerations directly into the data minimization objective. By introducing “privacy scores” for features (e.g., based on their uniqueness or correlation with other features), the minimization algorithms can be guided to remove data that poses higher privacy risks while still maintaining utility. This modification is shown to achieve a substantially better privacy-utility trade-off. Furthermore, the research investigates the compatibility of this framework with existing privacy-preserving techniques like Differential Privacy (specifically DP-SGD) for the underlying model training, showing that such integration can further mitigate membership inference risks.
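A hedged sketch of the privacy-score idea, assuming a uniqueness-based score (absolute z-score of each entry) and a greedy removal rule; the names privacy_scores and greedy_minimize are hypothetical, and a full method would also re-check the utility constraint after removal:

```python
import numpy as np

def privacy_scores(X):
    """Per-entry score: |z-score| of each value -- rarer values score higher."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0, keepdims=True)
    sd = X.std(axis=0, keepdims=True) + 1e-8
    return np.abs((X - mu) / sd)

def greedy_minimize(X, removal_budget):
    """Remove the `removal_budget` entries with the highest privacy scores.
    Returns a binary mask (1 = retained, 0 = removed)."""
    scores = privacy_scores(X)
    mask = np.ones_like(scores, dtype=int)
    flat_order = np.argsort(scores, axis=None)[::-1]   # most sensitive entries first
    rows, cols = np.unravel_index(flat_order[:removal_budget], scores.shape)
    mask[rows, cols] = 0
    return mask

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
mask = greedy_minimize(X, removal_budget=300)
print("fraction removed:", 1 - mask.mean())
```

The DP-SGD integration mentioned above operates at a different layer: the mask governs what data enters training, while differential privacy bounds what the trained model can leak about it, which is why the two compose naturally against membership inference.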
This research builds on several prior ingredients. It leverages the legal definitions and objectives of data minimization as stipulated in major data protection regulations. The formalization draws heavily on optimization theory, particularly bilevel optimization, and uses established concepts of loss functions and empirical risk minimization from machine learning. The comprehensive privacy evaluation relies on existing methodologies for quantifying privacy threats, including techniques for reconstruction, re-identification, and membership inference attacks. Finally, the practical implementations adapt various data reduction techniques, from basic feature selection and random subsampling to more advanced optimization-based and evolutionary algorithms, and integrate with established privacy-preserving machine learning paradigms such as Differential Privacy.