The paper introduces and formalizes the concept of alignment discretion, defined as the inherent latitude granted to human or algorithmic annotators when interpreting and applying AI alignment principles. This discretion arises because alignment principles often conflict or are indecisive in practice, making a purely rule-based "correct" output impossible.
The significance of this work lies in its critical examination of a previously unexamined, yet pervasive, aspect of AI alignment that contributes to opaque, potentially arbitrary, and uninterpretable model behaviors. By quantifying discretion, the paper highlights a core gap in current feedback-based alignment processes, which risk embedding unscrutinized human value judgments and arbitrary decisions into AI systems. It posits that unaddressed discretion can lead to "alignment-washing," where the appearance of ethical compliance masks underlying inconsistencies. The paper advocates for greater transparency, accountability, and control over this discretion, drawing strong parallels to the established legal framework for judicial discretion.
The key innovations of this paper include:
1. Formalizing Alignment Discretion: Explicitly defining and framing discretion in AI alignment as a fundamental concept, distinct from mere annotator disagreement, and linking it to principles' conflict, consensus, and indifference.
2. Developing Quantitative Metrics: Introducing a suite of novel metrics to systematically analyze discretion (a toy computational sketch of these metrics follows the list below):
* Discretion Arbitrariness (DA): Measures how often an annotator's preference contradicts a clear consensus among principles, indicating arbitrary judgment.
* Principle Supremacy (PS) & Principle Priority (w*): Quantifies how annotators implicitly prioritize conflicting principles, assigning a numerical weight to each principle's importance based on how frequently it "wins" over others in conflicted cases.
* Discretion Discrepancy (DD): Compares the principle prioritization rankings between different annotators (human vs. algorithmic models) to assess how well algorithmic annotators mimic human discretion.
3. Empirical Analysis on Real-World Datasets: Applying these metrics to widely used safety alignment datasets (Anthropic HH-RLHF and PKU-SafeRLHF). The findings reveal:
* A high frequency of principle conflict or indifference (80-85% of cases), necessitating discretion.
* Significant human annotator arbitrariness (28.9% on HH-RLHF, 15-20% on PKU-SafeRLHF), suggesting inconsistencies in human judgments.
* A notable discrepancy between the principle priorities of human and algorithmic annotators (especially large language models, LLMs), indicating that LLMs may not internalize human values as intended from preference data. Reward models generally align better with human discretion than LLMs do.
* Off-the-shelf LLMs show varying degrees of discretion discrepancy with human preferences, highlighting a challenge in transferring human-like discretion to models.
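To make these definitions concrete, here is a minimal toy sketch of how the three metrics could be computed from per-principle verdicts and annotator choices. The data layout, variable names, and the win-rate formulation of principle priority are illustrative assumptions rather than the paper's exact formulation (which, per the notes below, derives priorities with an Elo-inspired scheme).

```python
# Toy sketch of the discretion metrics (hypothetical data layout and names).
from itertools import combinations

# Each record: per-principle verdicts (+1 = prefers response A, -1 = prefers B,
# 0 = indifferent) plus one annotator's choice (+1 = A, -1 = B).
records = [
    {"verdicts": {"helpful": +1, "harmless": +1, "honest": 0}, "choice": -1},
    {"verdicts": {"helpful": +1, "harmless": -1, "honest": -1}, "choice": -1},
    {"verdicts": {"helpful": -1, "harmless": +1, "honest": 0}, "choice": +1},
    {"verdicts": {"helpful": +1, "harmless": -1, "honest": +1}, "choice": +1},
]

def discretion_arbitrariness(records):
    """Fraction of consensus cases where the annotator contradicts the consensus."""
    consensus_cases, violations = 0, 0
    for r in records:
        signs = {v for v in r["verdicts"].values() if v != 0}
        if len(signs) == 1:                 # all decisive principles agree
            consensus_cases += 1
            if r["choice"] != signs.pop():  # annotator overrides the consensus
                violations += 1
    return violations / consensus_cases if consensus_cases else 0.0

def principle_priorities(records):
    """Win rate of each principle over the pairwise conflicts it is involved in."""
    wins, games = {}, {}
    for r in records:
        for p, q in combinations(r["verdicts"], 2):
            vp, vq = r["verdicts"][p], r["verdicts"][q]
            if vp * vq == -1:               # the two principles conflict
                winner = p if vp == r["choice"] else q
                for name in (p, q):
                    games[name] = games.get(name, 0) + 1
                wins[winner] = wins.get(winner, 0) + 1
    return {p: wins.get(p, 0) / games[p] for p in games}

def discretion_discrepancy(prio_a, prio_b):
    """Normalized Kendall tau distance between two principle-priority rankings."""
    names = sorted(set(prio_a) & set(prio_b))
    discordant = sum(
        1 for p, q in combinations(names, 2)
        if (prio_a[p] - prio_a[q]) * (prio_b[p] - prio_b[q]) < 0
    )
    n_pairs = len(names) * (len(names) - 1) / 2
    return discordant / n_pairs if n_pairs else 0.0

human_prio = principle_priorities(records)
print("DA:", discretion_arbitrariness(records))
print("priorities:", human_prio)
# DD compares two annotators, e.g.: discretion_discrepancy(human_prio, model_prio)
```

In this toy setup, DA only counts cases where every decisive principle points the same way, and DD is a rank-distance between two annotators' priority orderings; the thresholds and aggregation in the paper may be more refined.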
The main prior ingredients upon which this research builds are:
* Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI: These are the prevailing AI alignment paradigms that rely on human (or AI-generated) preferences to train models. The paper directly audits the discretion embedded within these processes.
* Legal Philosophy of Judicial Discretion: A foundational theoretical inspiration. Concepts from legal scholars like Hart, Dworkin, Raz, and Barak regarding "arbitrium judicis" (judicial discretion), the balance of consistency and flexibility, and the structuring of authority provide the conceptual framework for understanding and measuring discretion in AI.
* Preference Modeling and Ranking Systems: The technical formalization of preferences draws on established methods like the Bradley-Terry-Luce (BTL) model for pairwise comparisons. The derivation of principle priorities is inspired by ranking systems such as Elo ratings from chess (see the sketch after this list).
* Annotator Disagreement Research: While acknowledging existing work on measuring annotator agreement (e.g., Kendall tau distance, Cohen's Kappa), this paper extends it by focusing on the underlying reasons for disagreement in alignment, namely principle prioritization and conflict resolution.
* LLMs as Evaluators/Oracles: The common practice of using powerful LLMs (like GPT-4o in this paper) as "oracles" to assess principle-specific preferences is utilized as a technical ingredient, even while the paper simultaneously audits the discretion of LLMs themselves.
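As a brief illustration of the preference-modeling ingredients, the sketch below shows a textbook Bradley-Terry-Luce win probability and a standard Elo update applied to principle-vs-principle conflict outcomes. The initial ratings, K-factor, and example outcomes are hypothetical, and the paper's actual estimation procedure may differ in detail.

```python
# Sketch: BTL win probability and Elo-style priority updates (standard formulas;
# the conflict outcomes and parameters below are made up for illustration).
import math

def btl_win_prob(theta_i, theta_j):
    """Bradley-Terry-Luce probability that item i is preferred over item j."""
    return math.exp(theta_i) / (math.exp(theta_i) + math.exp(theta_j))

def elo_update(r_winner, r_loser, k=32.0):
    """One standard Elo update after 'winner' beats 'loser'."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Each outcome: (principle that "won" a conflicted case, principle that "lost").
outcomes = [("harmless", "helpful"), ("honest", "helpful"), ("harmless", "honest")]
ratings = {"helpful": 1000.0, "harmless": 1000.0, "honest": 1000.0}
for winner, loser in outcomes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(ratings)                   # higher rating ~ higher implicit priority
print(btl_win_prob(0.5, -0.5))   # ~0.73 for a BTL log-strength gap of 1
```

The Elo ratings here play the role of the priority weights w*: principles that win conflicted cases more often accumulate higher ratings, which can then be ranked and compared across annotators.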