The Data Minimization Principle in Machine Learning

Type: Article
Publication Date: 2025-06-23
Citations: 0

Locations

  • arXiv (Cornell University)

Summary

This paper addresses the critical need for a rigorous mathematical formalization of the data minimization principle within machine learning. Data minimization, a cornerstone of global data protection regulations like GDPR, mandates that organizations collect, process, and retain only personal data that is adequate, relevant, and limited to what is necessary for specified objectives. Despite its legal importance, the principle has lacked an operational definition suitable for complex ML systems.

The key innovation of this work is the introduction of a formal optimization framework for data minimization. This framework conceptualizes data minimization as a bilevel optimization problem. The outer objective aims to minimize the amount of data (quantified by the L1-norm of a binary “minimization matrix”), while the inner objective ensures that the machine learning model trained on the reduced dataset maintains its utility (performance) within a predefined acceptable drop tolerance. Crucially, this framework enables individualized data minimization, allowing for the selective removal of specific features from individual data points, a more granular approach than traditional methods like global feature selection or random sample pruning.
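To make this setup concrete, below is a minimal greedy sketch of individualized minimization in Python. It assumes a logistic-regression model as the utility proxy and mean-imputation for removed cells; the function names (minimize_data, utility), the random entry-selection order, and the tolerance handling are illustrative assumptions, not the paper's actual algorithm.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def minimize_data(X, y, tau=0.02, n_trials=200, seed=0):
    # Greedily zero out entries of a binary minimization matrix B while
    # the validation-accuracy drop stays within the tolerance tau.
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=seed)
    B = np.ones_like(X_tr)                    # 1 = keep cell, 0 = removed
    col_mean = X_tr.mean(axis=0)              # stand-in value for removed cells

    def utility(mask):
        X_min = np.where(mask.astype(bool), X_tr, col_mean)
        model = LogisticRegression(max_iter=1000).fit(X_min, y_tr)
        return model.score(X_val, y_val)

    base = utility(B)
    for _ in range(n_trials):                 # visit candidate cells at random
        i = int(rng.integers(B.shape[0]))
        j = int(rng.integers(B.shape[1]))
        if B[i, j] == 0:
            continue
        B[i, j] = 0                           # tentatively remove the cell
        if base - utility(B) > tau:           # utility constraint violated,
            B[i, j] = 1                       # so revert the removal
    return B

The L1-norm of B (here simply B.sum()) is what the outer objective drives down, while the inner training run inside utility() enforces the performance constraint.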

A significant finding from this formalization and empirical evaluation is the inherent disconnect between current interpretations of data minimization and actual privacy outcomes. The paper comprehensively assesses the privacy implications by evaluating three distinct real-world privacy risks: Reconstruction Risk (the ease with which removed data can be inferred or recreated), Re-identification Risk (the likelihood of linking anonymized data back to individuals), and Membership Inference Risk (the ability to determine if a data point was part of the training set). The authors demonstrate that simply reducing data size to preserve model utility does not proportionally reduce these privacy risks, challenging the implicit assumption that data minimization inherently enhances privacy.
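As one concrete illustration of how such risks can be quantified, the sketch below scores membership inference with a simple loss-threshold baseline, assuming that training-set members tend to have lower loss than non-members; this is a common baseline attack, not necessarily the paper's exact evaluation suite.

import numpy as np
from sklearn.metrics import roc_auc_score

def membership_inference_auc(member_losses, nonmember_losses):
    # Lower loss on a record hints it was seen during training, so negated
    # per-example losses serve as membership scores.
    scores = np.concatenate([-member_losses, -nonmember_losses])
    labels = np.concatenate([np.ones(len(member_losses)),
                             np.zeros(len(nonmember_losses))])
    return roc_auc_score(labels, scores)   # 0.5 = no leakage, 1.0 = total leakage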

To bridge this identified gap, the paper proposes a novel approach: incorporating explicit privacy considerations directly into the data minimization objective. By introducing “privacy scores” for features (e.g., based on their uniqueness or correlation with other features), the minimization algorithms can be guided to remove data that poses higher privacy risks while still maintaining utility. This modification is shown to achieve a substantially better privacy-utility trade-off. Furthermore, the research investigates the compatibility of this framework with existing privacy-preserving techniques like Differential Privacy (specifically DP-SGD) for the underlying model training, showing that such integration can further mitigate membership inference risks.
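A plausible reading of the feature-level privacy scores is value rarity: rare values identify individuals more readily. The sketch below, with an illustrative name (uniqueness_scores) and a histogram-based rarity estimate that is an assumption on my part, shows how such scores could prioritize cells for removal in the greedy loop above.

import numpy as np

def uniqueness_scores(X, bins=10):
    # Estimate, per cell, how unusual X[i, j] is within column j; rare
    # values get scores near 1 and common values near 0.
    scores = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=bins)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        scores[:, j] = 1.0 - hist[idx] / len(X)
    return scores

# In minimize_data, visiting cells in decreasing score order (instead of
# at random) removes the most identifying values first:
# order = np.argsort(-uniqueness_scores(X_tr), axis=None)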

This research builds on several prior ingredients. It leverages the legal definitions and objectives of data minimization as stipulated in major data protection regulations. The formalization draws heavily on optimization theory, particularly bilevel optimization, and uses established concepts of loss functions and empirical risk minimization from machine learning. The comprehensive privacy evaluation relies on existing methodologies for quantifying privacy threats, including techniques for reconstruction, re-identification, and membership inference attacks. Finally, the practical implementations adapt various data reduction techniques, from basic feature selection and random subsampling to more advanced optimization-based and evolutionary algorithms, and integrate with established privacy-preserving machine learning paradigms such as Differential Privacy.

The principle of data minimization aims to reduce the amount of data collected, processed or retained to minimize the potential for misuse, unauthorized access, or data breaches. Rooted in privacy-by-design principles, data minimization has been endorsed by various global data protection regulations. However, its practical implementation remains a challenge due to the lack of a rigorous formulation. This paper addresses this gap and introduces an optimization framework for data minimization based on its legal definitions. It then adapts several optimization algorithms to perform data minimization and conducts a comprehensive evaluation in terms of their compliance with minimization objectives as well as their impact on user privacy. Our analysis underscores the mismatch between the privacy expectations of data minimization and the actual privacy benefits, emphasizing the need for approaches that account for multiple facets of real-world privacy risks.
Aiming to train and deploy predictive models, organizations collect large amounts of detailed client data, risking the exposure of private information in the event of a breach. To mitigate this, policymakers increasingly demand compliance with the data minimization (DM) principle, restricting data collection to only that data which is relevant and necessary for the task. Despite regulatory pressure, the problem of deploying machine learning models that obey DM has so far received little attention. In this work, we address this challenge in a comprehensive manner. We propose a novel vertical DM (vDM) workflow based on data generalization, which by design ensures that no full-resolution client data is collected during training and deployment of models, benefiting client privacy by reducing the attack surface in case of a breach. We formalize and study the corresponding problem of finding generalizations that both maximize data utility and minimize empirical privacy risk, which we quantify by introducing a diverse set of policy-aligned adversarial scenarios. Finally, we propose a range of baseline vDM algorithms, as well as Privacy-aware Tree (PAT), an especially effective vDM algorithm that outperforms all baselines across several settings. We plan to release our code as a publicly available library, helping advance the standardization of DM for machine learning. Overall, we believe our work can help lay the foundation for further exploration and adoption of DM principles in real-world applications.
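For intuition about the generalization idea (though not the PAT algorithm itself), here is a minimal sketch that coarsens each numeric column to quantile bins so that no full-resolution values are retained; the bin count and midpoint encoding are illustrative choices.

import numpy as np

def generalize_columns(X, n_bins=4):
    # Replace each value with the midpoint of its quantile bin, discarding
    # full-resolution data by construction.
    X_gen = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        mids = (edges[:-1] + edges[1:]) / 2
        X_gen[:, j] = mids[idx]
    return X_gen

A privacy-aware method such as PAT would instead pick generalizations adaptively, trading utility against adversarial re-identification risk rather than binning uniformly.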
An up-to-date account of the interplay between optimization and machine learning, accessible to students and researchers in both communities. The interplay between optimization and machine learning is one of the most important developments in modern computational science. Optimization formulations and methods are proving to be vital in designing algorithms to extract essential knowledge from huge volumes of data. Machine learning, however, is not simply a consumer of optimization technology but a rapidly evolving field that is itself generating new optimization ideas. This book captures the state of the art of the interaction between optimization and machine learning in a way that is accessible to researchers in both fields. Optimization approaches have enjoyed prominence in machine learning because of their wide applicability and attractive theoretical properties. The increasing complexity, size, and variety of today's machine learning models call for the reassessment of existing assumptions. This book starts the process of reassessment. It describes the resurgence in novel contexts of established frameworks such as first-order methods, stochastic approximations, convex relaxations, interior-point methods, and proximal methods. It also devotes attention to newer themes such as regularized optimization, robust optimization, gradient and subgradient methods, splitting techniques, and second-order methods. Many of these techniques draw inspiration from other fields, including operations research, theoretical computer science, and subfields of optimization. The book will enrich the ongoing cross-fertilization between the machine learning community and these other fields, and within the broader optimization community.
We critically review three major theories of machine learning and provide a new theory according to which machines learn a function when the machines successfully compute it. We show that this theory challenges common assumptions in the statistical and the computational learning theories, for it implies that learning true probabilities is equivalent neither to obtaining a correct calculation of the true probabilities nor to obtaining an almost-sure convergence to them. We also briefly discuss some case studies from natural language processing and macroeconomics from the perspective of the new theory.
Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The pipeline begins with standard networks in PyTorch or ONNX and transforms computation through progressive lowering. We define four levels of general-purpose transformations, from local intra-operator optimizations to global data movement reduction. These operate on a data-centric graph intermediate representation that expresses computation and data movement at all levels of abstraction, including expanding basic operators such as convolutions to their underlying computations. Central to the design is the interactive and introspectable nature of the pipeline. Every part is extensible through a Python API, and can be tuned interactively using a GUI. We demonstrate competitive performance or speedups on ten different networks, with interactive optimizations discovering new opportunities in EfficientNet.
Deep learning on large-scale data is dominant nowadays. The unprecedented scale of data has been arguably one of the most important driving forces behind its success. However, there still exist scenarios where collecting data or labels could be extremely expensive, e.g., medical imaging and robotics. To fill this gap, this paper considers the problem of data-efficient learning from scratch using a small amount of representative data. First, we characterize this problem by active learning on homeomorphic tubes of spherical manifolds. This naturally generates a feasible hypothesis class. With homologous topological properties, we identify an important connection: finding tube manifolds is equivalent to minimizing hyperspherical energy (MHE) in physical geometry. Inspired by this connection, we propose an MHE-based active learning (MHEAL) algorithm and provide comprehensive theoretical guarantees for MHEAL, covering convergence and generalization analysis. Finally, we demonstrate the empirical performance of MHEAL in a wide range of applications for data-efficient learning, including deep clustering, distribution matching, version space sampling, and deep active learning.
Application of machine learning may be understood as deriving new knowledge for practical use through explaining accumulated observations, the training set. Peirce used the term abduction for this kind of inference. Here I formalize the concept of abduction for real-valued hypotheses and show that 14 of the most popular textbook ML learners (every learner I tested), covering classification, regression, and clustering, implement this concept of abduction inference. The approach is proposed as an alternative to statistical learning theory, which requires an impractical assumption of an indefinitely increasing training set for its justification.
We define a neural network as a septuple consisting of (1) a state vector, (2) an input projection, (3) an output projection, (4) a weight matrix, (5) a bias vector, (6) an activation map and (7) a loss function. We argue that the loss function can be imposed either on the boundary (i.e. input and/or output neurons) or in the bulk (i.e. hidden neurons) for both supervised and unsupervised systems. We apply the principle of maximum entropy to derive a canonical ensemble of the state vectors subject to a constraint imposed on the bulk loss function by a Lagrange multiplier (or an inverse temperature parameter). We show that in an equilibrium the canonical partition function must be a product of two factors: a function of the temperature and a function of the bias vector and weight matrix. Consequently, the total Shannon entropy consists of two terms which represent respectively a thermodynamic entropy and a complexity of the neural network. We derive the first and second laws of learning: during learning the total entropy must decrease until the system reaches an equilibrium (i.e. the second law), and the increment in the loss function must be proportional to the increment in the thermodynamic entropy plus the increment in the complexity (i.e. the first law). We calculate the entropy destruction to show that the efficiency of learning is given by the Laplacian of the total free energy which is to be maximized in an optimal neural architecture, and explain why the optimization condition is better satisfied in a deep network with a large number of hidden layers. The key properties of the model are verified numerically by training a supervised feedforward neural network using the method of stochastic gradient descent. We also discuss a possibility that the entire universe on its most fundamental level is a neural network.
Written in an easily accessible style, this book provides the ideal blend of theory and practical, applicable knowledge. It covers neural networks, graphical models, reinforcement learning, evolutionary algorithms, dimensionality reduction … Written in an easily accessible style, this book provides the ideal blend of theory and practical, applicable knowledge. It covers neural networks, graphical models, reinforcement learning, evolutionary algorithms, dimensionality reduction methods, and the important area of optimization. It treads the fine line between adequate academic rigor and overwhelming students with equations and mathematical concepts. The author includes examples based on widely available datasets and practical and theoretical problems to test understanding and application of the material. The book describes algorithms with code examples backed up by a website that provides working implementations in Python.
This book bridges theoretical computer science and machine learning by exploring what the two sides can teach each other. It emphasizes the need for flexible, tractable models that better capture not what makes machine learning hard, but what makes it easy. Theoretical computer scientists will be introduced to important models in machine learning and to the main questions within the field. Machine learning researchers will be introduced to cutting-edge research in an accessible format, and gain familiarity with a modern, algorithmic toolkit, including the method of moments, tensor decompositions and convex programming relaxations. The treatment beyond worst-case analysis is to build a rigorous understanding about the approaches used in practice and to facilitate the discovery of exciting, new ways to solve important long-standing problems.
This paper traces the origin and development of bilevel programming, drawing on bibliographies from home and abroad. It then analyzes several solution algorithms for general bilevel programming and multi-objective bilevel programming, and briefly introduces their current applications in transportation, resource allocation, logistics, and other areas. The paper closes with prospects for the field's further development.
Quantitative studies in many fields involve the analysis of multivariate data of diverse types, including measurements that we may consider binary, ordinal and continuous. One approach to the analysis of such mixed data is to use a copula model, in which the associations among the variables are parameterized separately from their univariate marginal distributions. The purpose of this article is to provide a simple, general method of semiparametric inference for copula models via a type of rank likelihood function for the association parameters. The proposed method of inference can be viewed as a generalization of marginal likelihood estimation, in which inference for a parameter of interest is based on a summary statistic whose sampling distribution is not a function of any nuisance parameters. In the context of copula estimation, the extended rank likelihood is a function of the association parameters only and its applicability does not depend on any assumptions about the marginal distributions of the data, thus making it appropriate for the analysis of mixed continuous and discrete data with arbitrary marginal distributions. Estimation and inference for parameters of the Gaussian copula are available via a straightforward Markov chain Monte Carlo algorithm based on Gibbs sampling. Specification of prior distributions or a parametric form for the univariate marginal distributions of the data is not necessary.
Possible solutions to the problem of combining classifiers can be divided into three categories according to the levels of information available from the various classifiers. Four approaches based on different methodologies are proposed for solving this problem. One is suitable for combining individual classifiers such as Bayesian, k-nearest-neighbor, and various distance classifiers. The other three could be used for combining any kind of individual classifiers. On applying these methods to combine several classifiers for recognizing totally unconstrained handwritten numerals, the experimental results show that the performance of individual classifiers can be improved significantly. For example, on the US zipcode database, 98.9% recognition with 0.90% substitution and 0.2% rejection can be obtained, as well as high reliability with 95% recognition, 0% substitution, and 5% rejection.
Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
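The core DP-SGD recipe from this line of work is per-example gradient clipping followed by calibrated Gaussian noise. Below is a hedged PyTorch sketch of a single step, looping over examples one at a time for clarity; production code would instead use a vetted library such as Opacus, and the hyperparameter names (C, sigma) follow common convention rather than any specific API.

import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=0.1, C=1.0, sigma=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xb, yb):                       # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(C / (norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                      # clip each example to norm <= C
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = sigma * C * torch.randn_like(p)  # Gaussian noise scaled to the clip bound
            p.add_(-(lr / len(xb)) * (s + noise))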
We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model's training dataset. To perform membership inference against a target model, we make adversarial use of machine learning and train our own inference model to recognize differences in the target model's predictions on the inputs that it trained on versus the inputs that it did not train on. We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
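The shadow-model idea can be sketched compactly: shadow models trained on splits with known membership yield labeled (prediction vector, in/out) pairs on which an attack classifier is trained. The helper below is a simplified, assumption-laden illustration (it presumes every shadow split contains all classes and uses sorted probability vectors as attack features), not the paper's full pipeline.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def shadow_attack_dataset(make_shadow, X, y, n_shadows=5, seed=0):
    feats, labels = [], []
    for s in range(n_shadows):
        X_in, X_out, y_in, y_out = train_test_split(
            X, y, test_size=0.5, random_state=seed + s)
        shadow = make_shadow().fit(X_in, y_in)     # trained on known members
        for X_part, member in ((X_in, 1), (X_out, 0)):
            feats.append(np.sort(shadow.predict_proba(X_part), axis=1))
            labels.append(np.full(len(X_part), member))
    return np.vstack(feats), np.concatenate(labels)

# The attack model then learns to separate member from non-member outputs:
# attack = RandomForestClassifier().fit(*shadow_attack_dataset(
#     lambda: RandomForestClassifier(n_estimators=50), X_aux, y_aux))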
Machine learning algorithms, when applied to sensitive data, pose a distinct threat to privacy. A growing body of prior work demonstrates that models produced by these algorithms may leak specific private information in the training data to an attacker, either through the models' structure or their observable behavior. However, the underlying cause of this privacy risk is not well understood beyond a handful of anecdotal accounts that suggest overfitting and influence might play a role. This paper examines the effect that overfitting and influence have on the ability of an attacker to learn information about the training data from machine learning models, either through training set membership inference or attribute inference attacks. Using both formal and empirical analyses, we illustrate a clear relationship between these factors and the privacy risk that arises in several popular machine learning algorithms. We find that overfitting is sufficient to allow an attacker to perform membership inference and, when the target attribute meets certain conditions about its influence, attribute inference attacks. Interestingly, our formal analysis also shows that overfitting is not necessary for these attacks and begins to shed light on what other factors may be in play. Finally, we explore the connection between membership inference and attribute inference, showing that there are deep connections between the two that lead to effective new attacks.
Article 5(1)(c) of the European Union's General Data Protection Regulation (GDPR) requires that "personal data shall be [...] adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed (`data minimisation')". To date, the legal and computational definitions of 'purpose limitation' and 'data minimization' remain largely unclear. In particular, the interpretation of these principles is an open issue for information access systems that optimize for user experience through personalization and do not strictly require personal data collection for the delivery of basic service.
As machine learning becomes more widely used, the need to study its implications in security and privacy becomes more urgent. Although the body of work in privacy has been steadily growing over the past few years, research on the privacy aspects of machine learning has received less focus than the security aspects. Our contribution in this research is an analysis of more than 45 papers related to privacy attacks against machine learning that have been published during the past seven years. We propose an attack taxonomy, together with a threat model that allows the categorization of different attacks based on the adversarial knowledge, and the assets under attack. An initial exploration of the causes of privacy leaks is presented, as well as a detailed analysis of the different attacks. Finally, we present an overview of the most commonly proposed defenses and a discussion of the open problems and future directions identified during our analysis.
This paper determines whether the two core data protection principles of data minimisation and purpose limitation can be meaningfully implemented in data-driven systems. While contemporary data processing practices appear to stand at odds with these principles, we demonstrate that systems could technically use much less data than they currently do. This observation is a starting point for our detailed techno-legal analysis uncovering obstacles that stand in the way of meaningful implementation and compliance as well as exemplifying unexpected trade-offs which emerge where data protection law is applied in practice. Our analysis seeks to inform debates about the impact of data protection on the development of artificial intelligence in the European Union, offering practical action points for data controllers, regulators, and researchers.
The concept of sensitive data has been a mainstay of data protection for a number of decades. The concept itself is used to denote several categories of data for which processing is deemed to pose a higher risk for data subjects than other forms of data. Such risks are often perceived in terms of an elevated probability of discrimination, or related harms, to vulnerable groups in society. As a result, data protection frameworks have traditionally foreseen a higher burden for the processing of sensitive data than other forms of data. The sui generis protection of sensitive data, stronger than the protection of non-sensitive personal data, can also seemingly be a necessity from a fundamental rights-based perspective, as indicated by human rights jurisprudence. This Article seeks to analyze the continued relevance of sensitive data in both contemporary and potential future contexts. Such an exercise is important for two main reasons. First, the legal regime responsible for the regulation of the use of personal data has evolved considerably since the concept of sensitive data was first used. This has been exemplified by the creation of the EU's General Data Protection Regulation (GDPR) in Europe. It has introduced a number of requirements relating to sensitive data that are likely to represent added burdens for controllers who want to process personal data. Second, the very nature of personal data is changing. Increases in computing power, more complex algorithms, and the availability of ever more potentially complementary data online mean that more and more data can be considered of a sensitive nature. This creates various risks going forward, including an inflation effect whereby the concept loses its value, as well as the possibility that data controllers may increasingly seek to circumvent compliance with the requirements placed upon the use of sensitive data. This Article analyzes how such developments are likely to influence the concept of sensitive data and, in particular, its ability to protect vulnerable groups from harm. The authors propose a possible interpretative solution: A hybrid approach where a purpose-based definition acquires a bigger role in deciding whether data is sensitive, combined with a context-based 'backstop' based on reasonable foreseeability.
These attacks on statistical databases are no longer a theoretical danger.
Computation-intensive design problems are becoming increasingly common in manufacturing industries. The computation burden is often caused by expensive analysis and simulation processes in order to reach a comparable level of accuracy as physical testing data. To address such a challenge, approximation or metamodeling techniques are often used. Metamodeling techniques have been developed from many different disciplines including statistics, mathematics, computer science, and various engineering disciplines. These metamodels are initially developed as "surrogates" of the expensive simulation process in order to improve the overall computation efficiency. They are then found to be a valuable tool to support a wide scope of activities in modern engineering design, especially design optimization. This work reviews the state-of-the-art metamodel-based techniques from a practitioner's perspective according to the role of metamodeling in supporting design optimization, including model approximation, design space exploration, problem formulation, and solving various types of optimization problems. Challenges and the future development of metamodeling in support of engineering design are also analyzed and discussed.
Modern machine learning systems are increasingly characterized by extensive personal data collection, despite the diminishing returns and increasing societal costs of such practices. Yet, data minimisation is one of the core data protection principles enshrined in the European Union's General Data Protection Regulation ('GDPR') and requires that only personal data that is adequate, relevant and limited to what is necessary is processed. However, the principle has seen limited adoption due to the lack of technical interpretation.
Recently, privacy issues in web services that rely on users' personal data have raised great attention. Although recent regulations force companies to offer choices for each user to opt in or opt out of data disclosure, real-world applications usually only provide an "all or nothing" binary option for users to either disclose all their data or preserve all data at the cost of no personalized service. In this article, we argue that such a binary mechanism is optimal for neither consumers nor platforms. To study how different privacy mechanisms affect users' decisions on information disclosure and how users' decisions affect the platform's revenue, we propose a privacy-aware recommendation framework that gives users fine control over their data. In this new framework, users can proactively control which data to disclose based on the tradeoff between anticipated privacy risks and potential utilities. Then we study the impact of different data disclosure mechanisms via simulation with reinforcement learning due to the high cost of real-world experiments. The results show that platform mechanisms with finer split granularity and a more unrestrained disclosure strategy can bring better results for both consumers and platforms than the "all or nothing" mechanism adopted by most real-world applications.
The recent California Consumer Privacy Act (CCPA) requires that personal data shall be limited to what is necessary for business purposes. Business services shall "implement technical safeguards that prohibit re-identification of the consumer to whom the information may pertain". For recommender systems, we believe the legal concepts of limitation and technical safeguard are not specific enough to operationalize in practice. This study makes efforts to map the legislative challenges to the practice of reducing personal data. More importantly, we borrowed the notion of uncertainty from the machine learning community and added it as another aspect of recommendation utility, in addition to recommendation accuracy, to guide the data reduction process. The benefit of using uncertainty is that we have more comprehensive consideration while reducing the personal data. In addition, the two major types of uncertainty in machine learning models, aleatoric uncertainty and epistemic uncertainty, helped us formulate two groups of data reduction strategies: within-user and between-user. We conducted a series of analyses regarding the uncertainty change and accuracy loss caused by different data reduction strategies. We found that at the aggregate level, data reduction is feasible with certain data reduction strategies. At the individual level, the recommendation utility (both uncertainty and accuracy) loss incurred by data reduction disparately impacts different users, a finding which has implications for the fairness and transparency of AI models. Our results reveal the difficulty and intricacy of the data reduction problem in the context of recommender systems.
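For readers unfamiliar with the two uncertainty types invoked here, a standard ensemble-based decomposition can serve as a rough proxy; the exact estimators used in the study may differ, and this split via predictive entropy and mutual information is an assumption on my part.

import numpy as np

def uncertainty_decomposition(prob_stack):
    # prob_stack: (n_models, n_samples, n_classes) predicted probabilities.
    eps = 1e-12
    mean_p = prob_stack.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=1)       # entropy of the mean prediction
    aleatoric = -(prob_stack * np.log(prob_stack + eps)).sum(axis=2).mean(axis=0)
    epistemic = total - aleatoric                              # ensemble disagreement
    return total, aleatoric, epistemic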