AI Alignment at Your Discretion

Type: Article
Publication Date: 2025-06-23
Citations: 0

Locations

  • arXiv (Cornell University)


Summary

The paper introduces and formalizes the concept of alignment discretion, defined as the inherent latitude granted to human or algorithmic annotators when interpreting and applying AI alignment principles. This discretion arises because alignment principles often conflict or are indecisive in practice, making a purely rule-based “correct” output impossible.

The significance of this work lies in its critical examination of a previously unexamined, yet pervasive, aspect of AI alignment that contributes to opaque, potentially arbitrary, and uninterpretable model behaviors. By quantifying discretion, the paper highlights a core gap in current feedback-based alignment processes, which risk embedding unscrutinized human value judgments and arbitrary decisions into AI systems. It posits that unaddressed discretion can lead to “alignment-washing,” where the appearance of ethical compliance masks underlying inconsistencies. The paper advocates for greater transparency, accountability, and control over this discretion, drawing strong parallels to the established legal framework for judicial discretion.

The key innovations of this paper include:
1. Formalizing Alignment Discretion: Explicitly defining and framing discretion in AI alignment as a fundamental concept, distinct from mere annotator disagreement, and linking it to principles’ conflict, consensus, and indifference.
2. Developing Quantitative Metrics: Introducing a suite of novel metrics to systematically analyze discretion (a toy computational sketch of these metrics appears after this list):
* Discretion Arbitrariness (DA): Measures how often an annotator’s preference contradicts a clear consensus among principles, indicating arbitrary judgment.
* Principle Supremacy (PS) & Principle Priority (w*): Quantify how annotators implicitly prioritize conflicting principles, assigning each principle a numerical weight based on how frequently it “wins” over others in conflicted cases.
* Discretion Discrepancy (DD): Compares the principle prioritization rankings between different annotators (human vs. algorithmic models) to assess how well algorithmic annotators mimic human discretion.
3. Empirical Analysis on Real-World Datasets: Applying these metrics to widely used safety alignment datasets (Anthropic HH-RLHF and PKU-SafeRLHF). The findings reveal:
* A high frequency of principle conflict or indifference (80-85% of cases), necessitating discretion.
* Significant human annotator arbitrariness (28.9% on HH-RLHF, 15-20% on PKU-SafeRLHF), suggesting inconsistencies in human judgments.
* A notable discrepancy between human and algorithmic annotators’ (especially large language models, LLMs) principle priorities, indicating that LLMs may not internalize human values as intended from preference data. Reward models generally align better with human discretion than LLMs.
* Off-the-shelf LLMs show varying degrees of discretion discrepancy with human preferences, highlighting a challenge in transferring human-like discretion to models.
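
To make the metric definitions in item 2 above concrete, here is a toy sketch in Python. It is not the paper's exact formulation (the paper derives principle priorities with an Elo-style scheme, whereas this sketch uses raw win rates, and the data structures and helper names here are hypothetical), but it illustrates the kind of computation each metric involves.

```python
# Toy sketch (not the paper's exact formulas): discretion metrics over pairwise
# comparisons labeled both by an annotator and by per-principle judges.
from itertools import combinations
from scipy.stats import kendalltau  # one convenient choice for comparing rankings

# Each example: the annotator's choice in {"A", "B"} and per-principle choices
# in {"A", "B", "tie"} (e.g., produced by an LLM oracle prompted per principle).
examples = [
    {"annotator": "A", "principles": {"helpful": "A", "harmless": "B", "honest": "A"}},
    {"annotator": "B", "principles": {"helpful": "B", "harmless": "B", "honest": "tie"}},
]

def discretion_arbitrariness(examples):
    """Fraction of consensus cases where the annotator contradicts the consensus."""
    consensus_cases, violations = 0, 0
    for ex in examples:
        votes = {v for v in ex["principles"].values() if v != "tie"}
        if len(votes) == 1:                      # all decisive principles agree
            consensus_cases += 1
            if ex["annotator"] not in votes:
                violations += 1
    return violations / consensus_cases if consensus_cases else 0.0

def principle_priority(examples):
    """Win-rate proxy for how often each principle 'wins' in conflicted cases."""
    wins, games = {}, {}
    for ex in examples:
        p = ex["principles"]
        for a, b in combinations(p, 2):
            if "tie" in (p[a], p[b]) or p[a] == p[b]:
                continue                         # principles a and b do not conflict here
            winner = a if p[a] == ex["annotator"] else b
            for name in (a, b):
                games[name] = games.get(name, 0) + 1
            wins[winner] = wins.get(winner, 0) + 1
    return {name: wins.get(name, 0) / games[name] for name in games}

def discretion_discrepancy(priority_human, priority_model):
    """Rank disagreement between two annotators' principle priorities."""
    names = sorted(priority_human)
    tau, _ = kendalltau([priority_human[n] for n in names],
                        [priority_model[n] for n in names])
    return (1 - tau) / 2                          # 0 = identical ranking, 1 = reversed
```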

The main prior ingredients upon which this research builds are:
* Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI: These are the prevailing AI alignment paradigms that rely on human (or AI-generated) preferences to train models. The paper directly audits the discretion embedded within these processes.
* Legal Philosophy of Judicial Discretion: A foundational theoretical inspiration. Concepts from legal scholars like Hart, Dworkin, Raz, and Barak regarding “arbitrium judicis” (judicial discretion), the balance of consistency and flexibility, and the structuring of authority provide the conceptual framework for understanding and measuring discretion in AI.
* Preference Modeling and Ranking Systems: The technical formalization of preferences draws on established methods like the Bradley-Terry-Luce (BTL) model for pairwise comparisons. The derivation of principle priorities is inspired by ranking systems such as Elo scores from chess (the standard BTL form is sketched after this list).
* Annotator Disagreement Research: While acknowledging existing work on measuring annotator agreement (e.g., Kendall tau distance, Cohen’s Kappa), this paper extends it by focusing on the underlying reasons for disagreement in alignment—namely, principle prioritization and conflict resolution.
* LLMs as Evaluators/Oracles: The common practice of using powerful LLMs (like GPT-4o in this paper) as “oracles” to assess principle-specific preferences is utilized as a technical ingredient, even while the paper simultaneously audits the discretion of LLMs themselves.
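
As background for the preference-modeling ingredient above, the textbook Bradley-Terry-Luce formulation is stated here (standard form, not reproduced from the paper):

```latex
% Bradley-Terry-Luce: probability that item i is preferred to item j,
% given latent scores \theta_i, \theta_j
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \sigma(\theta_i - \theta_j)

% Maximum-likelihood scores from a set of observed comparisons D = \{(i_k \succ j_k)\}
\hat{\theta} = \arg\max_{\theta} \sum_{k} \log \sigma\bigl(\theta_{i_k} - \theta_{j_k}\bigr)
```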

 AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. On forward alignment, we discuss techniques for learning from feedback and learning under distribution shift. On backward alignment, we discuss assurance techniques and governance practices. We also release and continually update the website (www.alignmentsurvey.com) which features tutorials, collections of papers, blog posts, and other resources.
 AI systems often rely on two key components: a specified goal or reward function and an optimization algorithm to compute the optimal behavior for that goal. This approach is intended to provide value for a principal: the user on whose behalf the agent acts. The objectives given to these agents often refer to a partial specification of the principal's goals. We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $L$ attributes of the state correspond to different sources of utility for the principal. We assume that the reward function given to the agent only has support on $J < L$ attributes. The contributions of our paper are as follows: 1) we propose a novel model of an incomplete principal-agent problem from artificial intelligence; 2) we provide necessary and sufficient conditions under which indefinitely optimizing for any incomplete proxy objective leads to arbitrarily low overall utility; and 3) we show how modifying the setup to allow reward functions that reference the full state or allowing the principal to update the proxy objective over time can lead to higher utility solutions. The results in this paper argue that we should view the design of reward functions as an interactive and dynamic process and identifies a theoretical scenario where some degree of interactivity is desirable.
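
A toy numerical illustration of the incompleteness effect described here (my own simplification, not the paper's formal model): the principal's utility depends on all L attributes, the proxy reward only on the first J, and an agent that optimizes the proxy under a fixed resource budget sacrifices true utility.

```python
# Toy illustration of an incomplete proxy objective (not the paper's model):
# the principal cares about L attributes, the proxy reward only about the first J,
# and the agent allocates a fixed resource budget across attributes.
import numpy as np

L, J, budget = 4, 2, 10.0

def principal_utility(x):
    """True utility: every attribute matters, with diminishing returns."""
    return np.sum(np.log1p(x))

def proxy_reward(x):
    """Proxy: only the first J attributes are rewarded."""
    return np.sum(np.log1p(x[:J]))

# An agent optimizing only the proxy puts the whole budget on the rewarded attributes...
x_proxy = np.zeros(L)
x_proxy[:J] = budget / J
# ...whereas spreading the budget over all attributes is better for the principal.
x_even = np.full(L, budget / L)

print(principal_utility(x_proxy), principal_utility(x_even))
# roughly 3.58 vs 5.01 in this toy setup: the proxy-optimal allocation loses true utility
```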
 The field of AI alignment aims to steer AI systems toward human goals, preferences, and ethical principles. Its contributions have been instrumental for improving the output quality, safety, and trustworthiness of today's AI models. This perspective article draws attention to a fundamental challenge inherent in all AI alignment endeavors, which we term the "AI alignment paradox": The better we align AI models with our values, the easier we make it for adversaries to misalign the models. We illustrate the paradox by sketching three concrete example incarnations for the case of language models, each corresponding to a distinct way in which adversaries can exploit the paradox. With AI's increasing real-world impact, it is imperative that a broad community of researchers be aware of the AI alignment paradox and work to find ways to break out of it, in order to ensure the beneficial use of AI for the good of humanity.
 The document outlines a comprehensive strategy for integrating AI into higher education, emphasizing the need for curriculum restructuring, pedagogical transformation, and strategic implementation to prepare students for an AI-augmented future. AI as an Educational Opportunity: Higher education institutions should reframe AI use from a discipline issue to an educational opportunity, as detection and prohibition approaches are ineffective and worsen equity gaps. Emphasis on Meta-AI Skills: Learning outcomes should be revised to emphasize meta-AI skills like prompt engineering, output evaluation, and AI collaboration, which require higher cognitive engagement than traditional skills. Curriculum Restructuring: Universities need to update general education requirements and program sequences to reflect AI-transformed professional practices, ensuring students develop both domain knowledge and AI collaboration skills. Permanent and Temporary Scaffolding: Pedagogy should distinguish between temporary scaffolding for developing basic meta-AI skills and permanent scaffolding where AI tools are used professionally, focusing on sophisticated collaboration patterns. Personalized Learning Support: RAG-based technology can provide personalized learning support through custom AI tutoring systems that offer targeted assistance while maintaining accuracy through controlled knowledge bases. Enhanced Student Services: AI integration can enhance student services by streamlining administrative processes and providing proactive, accessible support across departments, benefiting underserved students. Operational Efficiency: AI can improve operational efficiency in higher education by enhancing enrollment management, academic administration, and research support workflows. Strategic Implementation: Successful AI integration requires decisive leadership, strategic resource allocation, and active management of organizational resistance, with a focus on high-impact areas.
 AI alignment work is important from both a commercial and a safety lens. With this paper, we aim to help actors who support alignment efforts to make these efforts as effective as possible, and to avoid potential adverse effects. We begin by suggesting that institutions that are trying to act in the public interest (such as governments) should aim to support specifically alignment work that reduces accident or misuse risks. We then describe four problems which might cause alignment efforts to be counterproductive, increasing large-scale AI risks. We suggest mitigations for each problem. Finally, we make a broader recommendation that institutions trying to act in the public interest should think systematically about how to make their alignment efforts as effective, and as likely to be beneficial, as possible.
 Abstract : As artificial intelligence systems grow more advanced and self-governing, the issue of AI alignment—ensuring that these systems follow objectives consistent with human values—has become one of the most pressing topics in AI safety and ethics. Even in well-constructed systems, misaligned goals can result in unexpected behaviors that might lead to harmful or ethically dubious outcomes. This research paper delves into the conceptual underpinnings, technical strategies, and societal impacts of AI alignment. The discussion starts by exploring the theoretical foundations of alignment, focusing on models related to human values, utility functions, and the learning of preferences. Following this, the paper evaluates existing approaches like inverse reinforcement learning, cooperative inverse reinforcement learning, and reward modeling, analyzing their advantages, drawbacks, and real-world applicability. By conducting a comparative analysis of case studies and simulations, the research underscores significant challenges in implementing human values, such as ambiguity in values, dependence on context, and the potential for specification gaming. It also stresses the necessity of integrating ethical pluralism and a variety of human viewpoints. Additionally, the study examines the significance of interpretability, transparency, and interdisciplinary collaboration in improving alignment results. Research indicates that no single method provides a comprehensive solution; however, a hybrid, multi-dimensional strategy—rooted in human-centered design and ongoing feedback—appears most promising. The study emphasizes the pressing need for proactive alignment strategies as AI systems become more integrated into critical areas like healthcare, governance, and autonomous decision-making. Ultimately, achieving strong AI alignment is not merely a technical issue but a profoundly human challenge that necessitates contributions from technologists, ethicists, and society as a whole to ensure AI benefits the common good. Keywords: AI Alignment, Human Values, Ethical Artificial Intelligence, Value Learning, Inverse Reinforcement Learning.
AI faces a trifecta of grand challenges: the Energy Wall, the Alignment Problem, and the Leap from Narrow AI to AGI. Contemporary AI solutions consume unsustainable amounts of energy during model training and daily operations. Making things worse, the amount of computation required to train each new AI model has been doubling every 2 months since 2020, directly translating to increases in energy consumption. The leap from AI to AGI requires multiple functional subsystems operating in a balanced manner, which requires a system architecture. However, the current approach to artificial intelligence lacks system design, even though system characteristics play a key role in the human brain, from the way it processes information to how it makes decisions. Similarly, current alignment and AI ethics approaches largely ignore system design, yet studies show that the brain's system architecture plays a critical role in healthy moral decisions. In this paper, we argue that system design is critically important in overcoming all three grand challenges. We posit that system design is the missing piece in overcoming the grand challenges. We present a Systematic AI Approach for AGI that utilizes system design principles for AGI, while providing ways to overcome the energy wall and the alignment challenges.
The EU AI Act is the proposed EU legislation concerning AI systems. This paper identifies several categories of the AI Act. Based on this categorization, a questionnaire is developed that serves as a tool to offer insights by creating quantitative data. Analysis of the data shows various challenges for organizations in different compliance categories. The influence of organization characteristics, such as size and sector, is examined to determine the impact on compliance. The paper will also share qualitative data on which questions were prevalent among respondents, both on the content of the AI Act and on its application. The paper concludes by stating that there is still room for improvement in terms of compliance with the AIA and refers to a related project that examines a solution to help these organizations.
 A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
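
A hedged sketch of the idea described here, with a hypothetical weighting rule: down-weight alternatives that have many near-duplicates, then fit a regularized Bradley-Terry objective with those weights. The paper's actual weighted-MLE construction may differ from this simplification.

```python
# Sketch of a similarity-weighted Bradley-Terry fit (one plausible instantiation
# of "weighting alternatives by similarity"; not necessarily the paper's scheme).
import numpy as np

def clone_weights(embeddings: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Down-weight each alternative by the size of its near-duplicate cluster."""
    sims = embeddings @ embeddings.T / (
        np.linalg.norm(embeddings, axis=1, keepdims=True)
        * np.linalg.norm(embeddings, axis=1)
    )
    cluster_sizes = (sims >= threshold).sum(axis=1)   # includes the item itself
    return 1.0 / cluster_sizes

def weighted_bt_mle(pairs, weights, n_items, lr=0.1, l2=0.01, steps=2000):
    """pairs: list of (winner_idx, loser_idx); weights: per-item weights."""
    theta = np.zeros(n_items)
    for _ in range(steps):
        grad = -2 * l2 * theta                        # L2 regularization
        for w_idx, l_idx in pairs:
            p_win = 1.0 / (1.0 + np.exp(theta[l_idx] - theta[w_idx]))
            g = weights[w_idx] * weights[l_idx] * (1.0 - p_win)
            grad[w_idx] += g
            grad[l_idx] -= g
        theta += lr * grad / len(pairs)
    return theta
```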
This study is a technical supplement to "AI gone astray: How subtle shifts in patient data send popular algorithms reeling, undermining patient safety." from STAT News, which investigates the effect of time drift on clinically deployed machine learning models. We use MIMIC-IV, a publicly available dataset, to train models that replicate commercial approaches by Dascena and Epic to predict the onset of sepsis, a deadly and yet treatable condition. We observe that some of these models degrade over time; most notably, an RNN built on Epic features degrades from a 0.729 AUC to a 0.525 AUC over a decade, leading us to investigate technical and clinical drift as root causes of this performance drop.
State-of-the-art index tuners rely on query optimizer's cost estimates to search for the index configuration with the largest estimated execution cost improvement. Due to well-known limitations in optimizer's estimates, in a significant fraction of cases, an index estimated to improve a query's execution cost, e.g., CPU time, makes that worse when implemented. Such errors are a major impediment for automated indexing in production systems. We observe that comparing the execution cost of two plans of the same query corresponding to different index configurations is a key step during index tuning. Instead of using optimizer's estimates for such comparison, our key insight is that formulating it as a classification task in machine learning results in significantly higher accuracy. We present a study of the design space for this classification problem. We further show how to integrate this classifier into the state-of-the-art index tuners with minimal modifications, i.e., how artificial intelligence (AI) can benefit automated indexing (AI). Our evaluation using industry-standard benchmarks and a large number of real customer workloads demonstrates up to 5x reduction in the errors in identifying the cheaper plan in a pair, which eliminates almost all query execution cost regressions when the model is used in index tuning.
"First, do no harm" faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be 
 "First, do no harm" faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be overcome through better algorithms or more data, we argue this assumption is unsound. Drawing on information theory, we demonstrate that complete harm specification is fundamentally impossible for any system where harm is defined external to its specifications. This impossibility arises from an inescapable information-theoretic gap: the entropy of harm H(O) always exceeds the mutual information I(O;I) between ground truth harm O and a system's specifications I. We introduce two novel metrics: semantic entropy H(S) and the safety-capability ratio I(O;I)/H(O), to quantify these limitations. Through a progression of increasingly sophisticated specification attempts, we show why each approach must fail and why the resulting gaps are not mere engineering challenges but fundamental constraints akin to the halting problem. These results suggest a paradigm shift: rather than pursuing complete specifications, AI alignment research should focus on developing systems that can operate safely despite irreducible specification uncertainty.
 This paper proposes a Right to AI, which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre's concept of the Right to the City, we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein's Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.
 Reaching consensus on a commonly accepted definition of AI Fairness has long been a central challenge in AI ethics and governance. There is a broad spectrum of views across society on what the concept of fairness means and how it should best be put to practice. In this workbook, we tackle this challenge by exploring how a context-based and society-centred approach to understanding AI Fairness can help project teams better identify, mitigate, and manage the many ways that unfair bias and discrimination can crop up across the AI project workflow. We begin by exploring how, despite the plurality of understandings about the meaning of fairness, priorities of equality and non-discrimination have come to constitute the broadly accepted core of its application as a practical principle. We focus on how these priorities manifest in the form of equal protection from direct and indirect discrimination and from discriminatory harassment. These elements form ethical and legal criteria based upon which instances of unfair bias and discrimination can be identified and mitigated across the AI project workflow. We then take a deeper dive into how the different contexts of the AI project lifecycle give rise to different fairness concerns. This allows us to identify several types of AI Fairness (Data Fairness, Application Fairness, Model Design and Development Fairness, Metric-Based Fairness, System Implementation Fairness, and Ecosystem Fairness) that form the basis of a multi-lens approach to bias identification, mitigation, and management. Building on this, we discuss how to put the principle of AI Fairness into practice across the AI project workflow through Bias Self-Assessment and Bias Risk Management as well as through the documentation of metric-based fairness criteria in a Fairness Position Statement.
Prior work has explicated the coloniality of artificial intelligence (AI) development and deployment through mechanisms such as extractivism, automation, sociological essentialism, surveillance, and containment. However, that work has not engaged much with alignment: teaching behaviors to a large language model (LLM) in line with desired values, and has not considered a mechanism that arises within that process: moral absolutism -- a part of the coloniality of knowledge. Colonialism has a history of altering the beliefs and values of colonized peoples; in this paper, I argue that this history is recapitulated in current LLM alignment practices and technologies. Furthermore, I suggest that AI alignment be decolonialized using three forms of openness: openness of models, openness to society, and openness to excluded knowledges. This suggested approach to decolonial AI alignment uses ideas from the argumentative moral philosophical tradition of Hinduism, which has been described as an open-source religion. One concept used is viśeṣa-dharma, or particular context-specific notions of right and wrong. At the end of the paper, I provide a suggested reference architecture to work toward the proposed framework.
 The EU AI Act was created to ensure ethical and safe Artificial Intelligence (AI) development and deployment across the EU. This study aims to identify key challenges and strategies for helping enterprises focus on resources effectively. To achieve this aim, we conducted a Multivocal Literature Review (MLR) to explore the sentiments of both the industry and the academia. From 130 articles, 56 met the criteria. Our key findings are three-fold. First, liability. Second, discrimination. Third, tool adequacy. Additionally, some negative sentiments were expressed by industry and academia regarding regulatory interpretations, specific requirements, and transparency issues. Next, our findings are three essential themes for enterprises. First, risk-based regulatory compliance. Second, ethical frameworks and principles in technology development. Third, policies and systems for regulatory risk management. These results identify the key challenges and strategies and provide less commonly discussed themes, enabling enterprises to align with the requirements and minimize their distance from the EU market.
It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws. But human laws and norms are complex and culturally varied systems; in many cases, agents will have to learn the rules. This requires autonomous agents to have models of how human rule systems work so that they can make reliable predictions about rules. In this paper we contribute to the building of such models by analyzing an overlooked distinction between important rules and what we call silly rules--rules with no discernible direct impact on welfare. We show that silly rules render a normative system both more robust and more adaptable in response to shocks to perceived stability. They make normativity more legible for humans, and can increase legibility for AI systems as well. For AI systems to integrate into human normative systems, we suggest, it may be important for them to have models that include representations of silly rules.
There are few branches of the Theory of Evolution which appear to the mathematical statistician so much in need of exact treatment as those of Regression, Heredity, and Panmixia. Round the notion of panmixia much obscurity has accumulated, owing to the want of precise definition and quantitative measurement. The problems of regression and heredity have been dealt with by Mr. Francis Galton in his epoch-making work on ‘Natural Inheritance,’ but, although he has shown exact methods of dealing, both experimentally and mathematically, with the problems of inheritance, it does not appear that mathematicians have hitherto developed his treatment, or that biologists and medical men have yet fully appreciated that he has really shown how many of the problems which perplex them may receive at any rate a partial answer. A considerable portion of the present memoir will be devoted to the expansion and fuller development of Mr. Galton’s ideas, particularly their application to the problem of bi-parental inheritance. At the same time I shall endeavour to point out how the results apply to some current biological and medical problems. In the first place, we must definitely free our minds, in the present state of our knowledge of the mechanism of inheritance and reproduction, of any hope of reaching a mathematical relation expressing the degree of correlation between individual parent and individual offspring. The causes in any individual case of inheritance are far too complex to admit of exact treatment; and up to the present the classification of the circumstances under which greater or less degrees of correlation between special groups of parents and offspring may be expected has made but little progress. This is largely owing to a certain prevalence of almost metaphysical speculation as to the causes of heredity, which has usurped the place of that careful collection and elaborate experiment by which alone sufficient data might have been accumulated, with a view to ultimately narrowing and specialising the circumstances under which correlation was measured. We must proceed from inheritance in the mass to inheritance in narrower and narrower classes, rather than attempt to build up general rules on the observation of individual instances. Shortly, we must proceed by the method of statistics, rather than by the consideration of typical cases. It may seem discouraging to the medical practitioner, with the problem before him of inheritance in a particular family, to be told that nothing but averages, means, and probabilities with regard to large classes can as yet be scientifically dealt with; but the very nature of the distribution of variation, whether healthy or morbid, seems to indicate that we are dealing with that sphere of indefinitely numerous small causes, which in so many other instances has shown itself only amenable to the calculus of chance, and not to any analysis of the individual instance. On the other hand, the mathematical theory will be of assistance to the medical man by answering, inter alia, in its discussion of regression the problem as to the average effect upon the offspring of given degrees of morbid variation in the parents. It may enable the physician, in many cases, to state a belief based on a high degree of probability, if it offers no ground for dogma in individual cases. One of the most noteworthy results of Mr. Francis Galton’s researches is his discovery of the mode in which a population actually reproduces itself by regression and fraternal variation.
It is with some expansion and fuller mathematical treatment of these ideas that this memoir commences.
 Abstract This study is concerned with the extension of the Bradley-Terry model for paired comparisons to situations which allow an expression of no preference. A new model is developed and its performance compared with a model proposed by Rao and Kupper. The maximum likelihood estimates of the parameters are found using an iterative procedure which, under a weak assumption, converges monotonically to the solution of the likelihood equations. It is noted that for a balanced paired comparison experiment the ranking obtained from the maximum likelihood estimates agrees with that obtained from a scoring system which allots two points for a win, one for a tie and zero for a loss. The likelihood ratio test of the hypothesis of equal preferences is shown to have the same asymptotic efficiency as that for the Rao-Kupper model. Two examples are presented, one of which introduces a set of data for an unbalanced paired comparison experiment. Initial applications of the test of goodness of fit suggest that the proposed model yields a reasonable representation of actual experimentation.
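
The abstract does not state the model's functional form; for orientation, a commonly used Davidson-style parameterization of a Bradley-Terry model with ties looks like the following (background only, not a quotation of this paper's model):

```latex
% Davidson-style tie-allowing extension of Bradley-Terry:
% \pi_i > 0 are item worths, \nu \ge 0 governs the propensity to tie
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}}, \qquad
P(i \sim j)  = \frac{\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}}
```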
A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint…
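
For context, the standard weighted-kappa computation over a k × k table of joint ratings looks roughly like this (textbook formula, not copied from the article above):

```python
# Minimal sketch of weighted kappa for a k x k table of joint ratings.
import numpy as np

def weighted_kappa(confusion: np.ndarray, weights: np.ndarray) -> float:
    """confusion[i, j]: count of items rater 1 put in class i and rater 2 in class j.
    weights[i, j]: disagreement weight (0 on the diagonal, larger = worse)."""
    p_obs = confusion / confusion.sum()
    p_exp = np.outer(p_obs.sum(axis=1), p_obs.sum(axis=0))   # chance agreement
    return 1.0 - (weights * p_obs).sum() / (weights * p_exp).sum()

# Example with linear disagreement weights on a 3-class problem.
k = 3
w = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))    # weight |i - j|
table = np.array([[20, 5, 0], [4, 15, 3], [1, 2, 10]])
print(weighted_kappa(table, w))
```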
 The standard theory of choice—based on value maximization—associates with each option a real value such that, given an offered set, the decision maker chooses the option with the highest value. Despite its simplicity and intuitive appeal, there is a growing body of data that is inconsistent with this theory. In particular, the relative attractiveness of x compared to y often depends on the presence or absence of a third option z, and the “market share” of an option can actually be increased by enlarging the offered set. We review recent empirical findings that are inconsistent with value maximization, and present a context-dependent model that expresses the value of each option as an additive combination of two components: a contingent weighting process that captures the effect of the background context, and a binary comparison process that describes the effect of the local context. The model accounts for observed violations of the standard theory and provides a framework for analyzing context-dependent preferences.
Spearman's footrule and Kendall's tau are two well-established distances between rankings. They, however, fail to take into account concepts crucial to evaluating a result set in information retrieval: element relevance and positional information. That is, changing the rank of a highly-relevant document should result in a higher penalty than changing the rank of an irrelevant document; a similar logic holds for the top versus the bottom of the result ordering. In this work, we extend both of these metrics to those with position and element weights, and show that a variant of the Diaconis-Graham inequality still holds - the two generalized measures remain within a constant factor of each other for all permutations.
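
A simplified sketch of the element-weighted idea (the paper's position- and element-weighted definitions are richer than this; the helper below is a hypothetical illustration):

```python
# Kendall-tau-style distance where each discordant pair is penalized by the
# product of per-element relevance weights (a simplification of the paper's metrics).
from itertools import combinations

def weighted_kendall_distance(rank_a, rank_b, relevance):
    """rank_a, rank_b: dicts element -> rank position; relevance: element -> weight."""
    elements = list(rank_a)
    dist, norm = 0.0, 0.0
    for x, y in combinations(elements, 2):
        penalty = relevance[x] * relevance[y]
        norm += penalty
        discordant = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
        if discordant:
            dist += penalty
    return dist / norm if norm else 0.0

# Swapping two highly relevant documents costs more than swapping irrelevant ones.
ranks_1 = {"d1": 1, "d2": 2, "d3": 3}
ranks_2 = {"d1": 2, "d2": 1, "d3": 3}
print(weighted_kendall_distance(ranks_1, ranks_2, {"d1": 1.0, "d2": 1.0, "d3": 0.1}))
```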
 Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. Additionally, we also compare to a linear solvable approximation, based on inverse RL. We show that both approaches perform favourably to the current state-of-the-art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences.
The word 'ethics' is under siege in technology policy circles. Weaponized in support of deregulation, self-regulation or hands-off governance, "ethics" is increasingly identified with technology companies' self-regulatory efforts and with shallow appearances of ethical behavior. So-called "ethics washing" by tech companies is on the rise, prompting criticism and scrutiny from scholars and the tech community at large. In parallel to the growth of ethics washing, its condemnation has led to a tendency to engage in "ethics bashing." This consists in the trivialization of ethics and moral philosophy now understood as discrete tools or pre-formed social structures such as ethics boards, self-governance schemes or stakeholder groups.
 Datasets that power machine learning are often used, shared, and reused with little visibility into the processes of deliberation that led to their creation. As artificial intelligence systems are increasingly used in high-stakes tasks, system development and deployment practices must be adapted to address the very real consequences of how model development data is constructed and used in practice. This includes greater transparency about data, and accountability for decisions made when developing it. In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields documents that facilitate improved communication and decision-making, as well as drawing attention to the value and necessity of careful data work. The proposed framework makes visible the often overlooked work and decisions that go into dataset creation, a critical step in closing the accountability gap in artificial intelligence and a critical/necessary resource aligned with recent work on auditing processes.
Skill estimation mechanisms, colloquially known as rating systems, play an important role in competitive sports and games. They provide a measure of player skill, which incentivizes competitive performances and enables balanced match-ups. In this paper, we present a novel Bayesian rating system for contests with many participants. It is widely applicable to competition formats with discrete ranked matches, such as online programming competitions, obstacle course races, and video games. The system's simplicity allows us to prove theoretical bounds on its robustness and runtime. In addition, we show that it is incentive-compatible: a player who seeks to maximize their rating will never want to underperform. Experimentally, the rating system surpasses existing systems in prediction accuracy, and computes faster than existing systems by up to an order of magnitude.
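
The main paper's summary above mentions Elo-inspired principle priorities. The Bayesian multi-participant system described in this abstract is not reproduced here; the following is only the classic two-player Elo update, for orientation.

```python
# Classic two-player Elo update (not the Bayesian system described above).
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Example: a 1500-rated principle "wins" a conflict against a 1600-rated one.
print(elo_update(1500, 1600, score_a=1.0))   # A gains about 20.5 points, B loses the same
```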
 Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, Hanna Wallach. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
 Most Artificial Intelligence applications are based on supervised machine learning (ML), which ultimately grounds on manually annotated data. The annotation process is often performed in terms of a majority vote and this has been proved to be often problematic, as highlighted by recent studies on the evaluation of ML models. In this article we describe and advocate for a different paradigm, which we call data perspectivism, which moves away from traditional gold standard datasets, towards the adoption of methods that integrate the opinions and perspectives of the human subjects involved in the knowledge representation step of ML processes. Drawing on previous works which inspired our proposal we describe the potential of our proposal for not only the more subjective tasks (e.g. those related to human language) but also to tasks commonly understood as objective (e.g. medical decision making), and present the main advantages of adopting a perspectivist stance in ML, as well as possible disadvantages, and various ways in which such a stance can be implemented in practice. Finally, we share a set of recommendations and outline a research agenda to advance the perspectivist stance in ML.
Abstract Majority voting and averaging are common approaches used to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground truth. In order to address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotator's judgements as separate subtasks, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
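
A minimal architectural sketch of the multi-annotator idea (hypothetical dimensions and layer choices; the authors' exact architecture may differ): a shared representation feeds one prediction head per annotator, and the spread across heads gives an uncertainty signal.

```python
# Shared encoder with one binary prediction head per annotator (illustrative only).
import torch
import torch.nn as nn

class MultiAnnotatorClassifier(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, n_annotators: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One head per annotator, each trained only on that annotator's labels.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(n_annotators)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        return torch.cat([head(h) for head in self.heads], dim=-1)  # (batch, n_annotators)

model = MultiAnnotatorClassifier(input_dim=768, hidden_dim=128, n_annotators=5)
logits = model(torch.randn(2, 768))
# Per-example uncertainty can be read off the spread of per-annotator predictions.
disagreement = torch.sigmoid(logits).std(dim=-1)
```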
 Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, Noah Smith. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.
 A common practice in building NLP datasets, especially using crowd-sourced annotations, involves obtaining multiple annotator judgements on the same data instances, which are then flattened to produce a single “ground truth” label or score, through majority voting, averaging, or adjudication. While these approaches may be appropriate in certain annotation tasks, such aggregations overlook the socially constructed nature of human perceptions that annotations for relatively more subjective tasks are meant to capture. In particular, systematic disagreements between annotators owing to their socio-cultural backgrounds and/or lived experiences are often obfuscated through such aggregations. In this paper, we empirically demonstrate that label aggregation may introduce representational biases of individual and group perspectives. Based on this finding, we propose a set of recommendations for increased utility and transparency of datasets for downstream use cases.
Celebrated for their conceptual clarity, titles in the Clarendon Law Series offer concise, accessible overviews of major fields of law and legal thought. The Concept of Law is an important work of legal philosophy. It was first published fifty years ago. This book includes a new introduction that sets the book in the context of subsequent developments in social and political philosophy, clarifying misunderstandings of Hart's project and highlighting central tensions and problems in the work. Topics covered include: sovereign and subject, the law as the union of primary and secondary rules, formalism, rule-scepticism, justice, morality, and international law.
 Whether examining election outcomes, the legal status of terrorism suspects, or if (or how) people can be sentenced to death, a judge in a modern democracy assumes a role that raises some of the most contentious political issues of our day. But do judges even have a role beyond deciding the disputes before them under law? What are the criteria for judging the justices who write opinions for the United States Supreme Court or constitutional courts in other democracies? These are the questions that one of the world's foremost judges and legal theorists, Aharon Barak, poses in this book. In fluent prose, Barak sets forth a powerful vision of the role of the judge. He argues that this role comprises two central elements beyond dispute resolution: bridging the gap between the law and society, and protecting the constitution and democracy. The former involves balancing the need to adapt the law to social change against the need for stability; the latter, judges' ultimate accountability, not to public opinion or to politicians, but to the "internal morality" of democracy. Barak's vigorous support of "purposive interpretation" (interpreting legal texts--for example, statutes and constitutions--in light of their purpose) contrasts sharply with the influential "originalism" advocated by U.S. Supreme Court Justice Antonin Scalia. As he explores these questions, Barak also traces how supreme courts in major democracies have evolved since World War II, and he guides us through many of his own decisions to show how he has tried to put these principles into action, even under the burden of judging on terrorism.
 Abstract This revised edition of one of the classic works of modern legal philosophy, first published in 1979, represents the author's contribution which has had an enduring influence on philosophical work on the nature of law and its relation to morality. The new edition includes two previously uncollected essays and a new introduction from the author.
Kangaroo Courts and the Rule of Law - The Legacy of Modernism addresses the legacy of contemporary critiques of language for the concept of the rule of law. Between those who care about the rule of law and those who are interested in contemporary legal theory, there has been a dialogue of the deaf, which cannot continue. Starting from the position that contemporary critiques of linguistic meaning and legal certainty are too important to be dismissed, Desmond Manderson takes up the political and intellectual challenge they pose. Can the rule of law be re-configured in light of the critical turn of the past several years in legal theory, rather than being steadfastly opposed to it? Pursuing a reflection upon the relationship between law and the humanities, the book stages an encounter between the influential theoretical work of Jacques Derrida and Mikhail Bakhtin, and D.H. Lawrence's strange and misunderstood novel Kangaroo (1923). At a critical juncture in our intellectual history - the modernist movement at the end of the First World War - and struggling with the same problems we are puzzling over today, Lawrence articulated complex ideas about the nature of justice and the nature of literature. Using Lawrence to clarify Derrida's writings on law, as well as using Derrida and Bakhtin to clarify Lawrence's experience of literature, Manderson makes a robust case for 'law and literature.' With this framework in mind he outlines a 'post-positivist' conception of the rule of law - in which justice is imperfectly possible, rather than perfectly impossible.
 Abstract This book compares how and why the European Court of Justice, the French Cour de cassation, and the United States Supreme Court offer different approaches for generating judicial accountability and control, judicial debate and deliberation, and ultimately judicial legitimacy. Examining the judicial argumentation of the U.S. Supreme Court and the French Cour de cassation, the book first reorders the traditional comparative understanding of the difference between French civil law and American common law judicial decision-making. It then uses this analysis to offer the first detailed comparative examination of the interpretive practice of the European Court of Justice (ECJ). The book shows that the judicial system of France rests on a particularly unified institutional and ideological framework founded on explicitly republican notions of meritocracy and managerial expertise. Law-making per se may be limited to the legislature, but significant judicial normative administration is entrusted to state selected, trained, and sanctioned elites who are policed internally through hierarchical institutional structures. The American judicial system, by contrast, employs a more participatory and democratic approach that reflects a more populist vision and generates its legitimacy primarily by argumentative means. American judges engage in extensive debates that subject them to public scrutiny and control. The ECJ hovers delicately between the institutional/argumentative and republican/democratic extremes. On the one hand, the ECJ reproduces the hierarchical French discursive structure on which it was originally patterned. On the other, it transposes this structure into a transnational context of fractured political and legal assumptions.
Human annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights, and lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them. Finally, we introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.
 Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally-important aspect of whether such data is reliable in the first place.
In the past decade, crowdworking on online labor market platforms has become an important source of income for a growing number of people worldwide. This development has led to increasing political and scholarly interest in the wages people can earn on such platforms. This study extends the literature, which is often based on a single platform, region, or category of crowdworking, through a meta-analysis of prevalent hourly wages. After a systematic literature search, the paper considers 22 primary empirical studies, including 105 wages and 76,765 data points from 22 platforms, eight different countries, and 10 years. It is found that, on average, microtasks result in an hourly wage of less than $6. This wage is significantly lower than the mean wage of online freelancers, which is roughly three times higher when not factoring in unpaid work. Hourly wages accounting for unpaid work, such as searching for tasks and communicating with requesters, tend to be significantly lower than wages not considering unpaid work. Legislators and researchers evaluating wages in crowdworking need to be aware of this bias when assessing hourly wages, given that the majority of literature does not account for the effect of unpaid work time on crowdworking wages. To foster the comparability of different research results, the article suggests that scholars consider a wage correction factor to account for unpaid work. Finally, researchers should be aware that remuneration and work processes on crowdworking platforms can systematically affect the data collection method and inclusion of unpaid work.
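The suggested wage correction amounts to simple arithmetic: divide earnings by paid plus unpaid hours rather than by paid hours alone. The sketch below illustrates this with made-up numbers, not figures from the meta-analysis.

```python
# A small sketch of the kind of wage correction the article argues for: the
# effective hourly wage once unpaid time (task search, communication) is included.
# The numbers are illustrative, not taken from the meta-analysis.
paid_earnings = 30.0   # dollars earned in a work session
paid_hours = 5.0       # hours spent on paid tasks
unpaid_hours = 1.5     # hours spent searching for tasks, messaging requesters, etc.

nominal_wage = paid_earnings / paid_hours
effective_wage = paid_earnings / (paid_hours + unpaid_hours)
correction_factor = effective_wage / nominal_wage

print(f"nominal: ${nominal_wage:.2f}/h, effective: ${effective_wage:.2f}/h, "
      f"correction factor: {correction_factor:.2f}")
```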
 We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
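The abstract does not disclose the exact prediction methodology; a common way to extrapolate from small-scale runs is to fit a power law in training compute. The sketch below does this on hypothetical (compute, loss) pairs and should be read as an illustration of the general idea, not as OpenAI's procedure.

```python
# A hedged sketch of scaling-law extrapolation: fit a pure power law
# loss ~ a * C**(-b) in log-log space to hypothetical small-scale runs and
# extrapolate to a far larger compute budget. Data and functional form are
# assumptions for illustration only.
import numpy as np

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])  # hypothetical training FLOPs
loss = np.array([3.10, 2.95, 2.80, 2.68, 2.55])     # hypothetical final losses

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
predicted = np.exp(intercept + slope * np.log(1e23))  # ~1000x beyond the largest run
print(f"fitted exponent: {slope:.3f}, predicted loss at 1e23 FLOPs: {predicted:.2f}")
```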
 The field of machine learning (ML) has long struggled with a principles-to-practice gap, whereby careful codes and commitments dissipate on their way to practical application. The present work bridges this gap through an applied affordance framework. 'Affordances' are how the features of a technology shape, but do not determine, the functions and effects of that technology. Here, I demonstrate the value of an affordance framework as applied to ML, considering ML systems through the prism of design studies. Specifically, I apply the mechanisms and conditions framework of affordances, which models the way technologies request, demand, encourage, discourage, refuse, and allow technical and social outcomes. Illustrated through three case examples across work, policing, and housing justice, the mechanisms and conditions framework reveals the social nature of technical choices, clarifying how and for whom those choices manifest. This approach displaces vagaries and general claims with the particularities of systems in context, empowering critically minded practitioners while holding power—and the systems power relations produce—to account. More broadly, this work pairs the design studies tradition with the ML domain, setting a foundation for deliberate and considered (re)making of sociotechnical futures.
 With artificial intelligence systems increasingly applied in consequential domains, researchers have begun to ask how AI systems ought to act in ethically charged situations where even humans lack consensus. In the Moral Machine project, researchers crowdsourced answers to "Trolley Problems" concerning autonomous vehicles. Subsequently, Noothigattu et al. (2018) proposed inferring linear functions that approximate each individual's preferences and aggregating these linear models by averaging parameters across the population. In this paper, we examine this averaging mechanism, focusing on fairness concerns and strategic effects. We investigate a simple setting where the population consists of two groups, the minority constitutes an α < 0.5 share of the population, and within-group preferences are homogeneous. Focusing on the fraction of contested cases where the minority group prevails, we make the following observations: (a) even when all parties report their preferences truthfully, the fraction of disputes where the minority prevails is less than proportionate in α; (b) the degree of sub-proportionality grows more severe as the level of disagreement between the groups increases; (c) when parties report preferences strategically, pure strategy equilibria do not always exist; and (d) whenever a pure strategy equilibrium exists, the majority group prevails 100% of the time. These findings raise concerns about stability and fairness of averaging as a mechanism for aggregating diverging voices. Finally, we discuss alternatives, including randomized dictatorship and median-based mechanisms.
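The averaging mechanism under study is easy to simulate. The sketch below assumes two groups with homogeneous linear preferences over random feature differences, averages the parameter vectors with weight α, and counts how often the aggregate sides with the minority on contested cases; all preference vectors and cases are random stand-ins for illustration.

```python
# A minimal simulation of the parameter-averaging mechanism: two groups with
# homogeneous linear preferences, a minority share alpha, and a count of
# contested pairwise choices won by the minority.
import numpy as np

rng = np.random.default_rng(0)
alpha, dim, n_cases = 0.3, 5, 20000

w_minority = rng.normal(size=dim)
w_majority = rng.normal(size=dim)
w_average = alpha * w_minority + (1 - alpha) * w_majority  # averaged model

diffs = rng.normal(size=(n_cases, dim))       # feature difference between two options
minority_prefers = diffs @ w_minority > 0
majority_prefers = diffs @ w_majority > 0
aggregate_prefers = diffs @ w_average > 0

contested = minority_prefers != majority_prefers
minority_wins = (aggregate_prefers == minority_prefers) & contested
print(f"minority wins {minority_wins.sum() / contested.sum():.1%} "
      f"of contested cases (population share alpha = {alpha})")
```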
Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT's intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003, about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification.
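A zero-shot annotation setup of this kind reduces to prompting a chat model with the coding instructions and one item at a time. The sketch below is a hedged illustration using the OpenAI Python client; the prompt wording, label set, and model name are placeholders rather than the study's exact configuration.

```python
# A hedged sketch of zero-shot text annotation with a chat model. Requires the
# `openai` package (v1+) and an API key in the environment. The model name,
# label set, and prompt are placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()
LABELS = ["relevant", "irrelevant"]

def annotate(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Classify the tweet's relevance to content moderation. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(annotate("Platforms should explain why posts get removed."))
```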
Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, on the assumption that doing so maximizes data quality and, in turn, machine learning metrics. However, this conventional practice assumes that a ground truth exists, and neglects the genuine human variation in labeling that arises from disagreement, subjectivity in annotation, or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling, and evaluation. However, few works consider all of these dimensions jointly, and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly available datasets with un-aggregated labels, survey approaches proposed so far, identify gaps, and suggest ways forward. As datasets become increasingly available, we hope that this synthesized view of the "problem" will lead to an open discussion of possible strategies to devise fundamentally new directions.
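The practical difference between the conventional pipeline and the position argued here shows up already at aggregation time. The sketch below contrasts majority-vote aggregation with keeping the un-aggregated label distribution, using toy labels.

```python
# A small sketch of the difference the position paper highlights: collapsing
# labels by majority vote versus keeping the full per-item label distribution.
# The toy labels below are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2, 2],
    "annotator": ["a", "b", "c", "a", "b", "c"],
    "label": ["toxic", "toxic", "ok", "ok", "toxic", "ok"],
})

# Conventional practice: one "ground truth" label per item via majority vote.
majority = raw.groupby("item_id")["label"].agg(lambda s: s.mode().iloc[0])

# Alternative: keep the un-aggregated distribution, preserving genuine variation.
distribution = raw.groupby("item_id")["label"].value_counts(normalize=True)

print(majority)
print(distribution)
```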
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
Annotators' disagreement in linguistic data has recently been the focus of multiple initiatives aimed at raising awareness of the issues with 'majority voting' when aggregating diverging annotations. Disagreement can indeed reflect different aspects of linguistic annotation, from annotators' subjectivity to sloppiness or insufficient context to interpret a text. In this work we first propose a taxonomy of possible reasons for annotators' disagreement in subjective tasks. Then, we manually label part of a Twitter dataset for offensive language detection in English following this taxonomy, identifying how the different categories are distributed. Finally, we run a set of experiments to assess the impact of the different types of disagreement on classification performance. In particular, we investigate how accurately tweets belonging to different categories of disagreement can be classified as offensive or not, and how injecting data with different types of disagreement into the training set affects performance. We also frame offensive language detection as a multi-task problem, using disagreement classification as an auxiliary task.
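The multi-task setup mentioned at the end of the abstract can be sketched as a shared encoder with two classification heads, one for offensiveness and one for the disagreement category. The PyTorch snippet below is a minimal illustration; the layer sizes, number of disagreement classes, and loss weighting are assumptions, not the paper's configuration.

```python
# A minimal multi-task sketch: a shared representation feeds one head for
# offensive-language detection and an auxiliary head for the disagreement
# category. Sizes and the auxiliary loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    def __init__(self, encoder_dim: int = 768, n_disagreement_types: int = 4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(encoder_dim, 256), nn.ReLU())
        self.offensive_head = nn.Linear(256, 2)                        # offensive vs. not
        self.disagreement_head = nn.Linear(256, n_disagreement_types)  # auxiliary task

    def forward(self, sentence_embedding: torch.Tensor):
        h = self.shared(sentence_embedding)
        return self.offensive_head(h), self.disagreement_head(h)

model = MultiTaskClassifier()
embeddings = torch.randn(8, 768)  # stand-in for sentence-encoder outputs
offensive_logits, disagreement_logits = model(embeddings)
loss = (nn.functional.cross_entropy(offensive_logits, torch.randint(0, 2, (8,)))
        + 0.5 * nn.functional.cross_entropy(disagreement_logits, torch.randint(0, 4, (8,))))
```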
 In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
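One simple way to probe for instruction bias of this kind is to measure how often surface patterns from the instruction examples recur in the collected examples. The sketch below uses trigram overlap on toy data; the measure and the examples are illustrative assumptions, not the paper's protocol.

```python
# A small sketch of an instruction-bias probe: count how many collected examples
# reuse a word trigram that also appears in the instruction examples.
# The toy texts and the overlap measure are illustrative assumptions.
from collections import Counter

def ngrams(text: str, n: int = 3):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

instruction_examples = [
    "The man is sleeping. Therefore he is not awake.",
    "The dog is barking. Therefore it is not silent.",
]
collected_examples = [
    "The boy is swimming. Therefore he is not drowning.",
    "A storm hit the coast overnight, flooding several roads.",
]

instruction_patterns = set().union(*(ngrams(t) for t in instruction_examples))
overlap = Counter(bool(ngrams(t) & instruction_patterns) for t in collected_examples)
print(f"{overlap[True]} of {len(collected_examples)} collected examples reuse "
      f"an instruction trigram")
```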