The widespread practice of aligning large language models (LLMs) with human preferences, while intended to promote helpful and harmless behavior, can inadvertently perpetuate or even amplify harmful social biases, particularly those affecting transgender, nonbinary, and other gender-diverse (TGNB) identities. Current bias evaluation benchmarks, heavily skewed towards binary gender and Western social norms, are insufficient for identifying these nuanced harms, leading to a critical oversight in responsible AI development.
This work demonstrates that standard alignment procedures, notably Direct Preference Optimization (DPO), can intensify biases already present in pre-aligned base models. Rather than alleviating harm, DPO-aligned models, especially those initialized from a Supervised Fine-Tuning (SFT) checkpoint, amplify real-world gender-diverse harms such as stigmatization and gender non-affirmative language. The investigation shows that this amplification is not random but is systematically encoded in the implicit reward signals that guide the alignment process. Model outputs often shift from neutral or positive portrayals of TGNB individuals to tragic narratives of social rejection, hardship, and stigma.
The key innovations presented in this analysis include:
1. Comprehensive TGNB-centric Bias Evaluation: A systematic evaluation of 16 publicly available LLMs (from the Pythia and Llama families) across different alignment stages (base, SFT, DPO), specifically focusing on TGNB identities. This contrasts with the binary-gender focus of existing benchmarks and provides empirical evidence that alignment amplifies biases those benchmarks miss.
2. Analysis of Bias Propagation Across Alignment Stages: By tracking the behavior of models through pre-training, supervised fine-tuning, and preference optimization, the study pinpoints the specific stages where bias amplification occurs, particularly highlighting the critical role of SFT in influencing subsequent DPO outcomes.
3. Flexible Framework for Implicit Reward Signal Analysis: A novel method for extracting and analyzing the implicit reward signals inside DPO-aligned LLMs by simulating preference data. This moves beyond analysis of final model outputs to reveal how the internal mechanisms of alignment encode and reinforce societal biases. Datasets like WINOQUEER are repurposed to examine whether models implicitly “prefer” stigmatizing texts about TGNB groups (a minimal sketch of this reward extraction appears after this list).
4. Thematic Analysis of Harmful Narratives: A qualitative thematic analysis is used to categorize and quantify the negative narratives that aligned LLMs generate about TGNB individuals. It reveals consistent patterns of stigma, such as characterizations of TGNB people as “mentally unstable” or themes of “identity invalidity,” across models and alignment stages.
5. Advocacy for Community-Informed Evaluation: The work strongly argues for moving beyond current narrow, often cisnormative, bias evaluation frameworks towards community-informed approaches that prioritize the lived experiences and specific harms faced by marginalized groups.
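To make the implicit-reward analysis referenced in item 3 concrete, below is a minimal sketch (not the paper's code) of scoring a WinoQueer-style stereotype/counter-statement pair with the standard DPO implicit reward, assuming access to a DPO-aligned checkpoint and the SFT reference it was initialized from. The model names, the β value, and the sentence pair are placeholders for illustration only.

```python
# Sketch: DPO implicit reward comparison for a stereotype vs. counter-statement pair.
# Checkpoints, beta, and sentences are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BETA = 0.1  # DPO temperature; assumed value, the paper's setting may differ


def sequence_logprob(model, tokenizer, text):
    """Sum of token log-probabilities of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()


def implicit_reward(policy, reference, tokenizer, text):
    """DPO implicit reward: beta * (log pi_theta(y) - log pi_ref(y))."""
    return BETA * (
        sequence_logprob(policy, tokenizer, text)
        - sequence_logprob(reference, tokenizer, text)
    )


# Placeholder checkpoints: a DPO-aligned model and the SFT model it started from.
policy = AutoModelForCausalLM.from_pretrained("org/model-dpo")
reference = AutoModelForCausalLM.from_pretrained("org/model-sft")
tokenizer = AutoTokenizer.from_pretrained("org/model-dpo")

# Illustrative WinoQueer-style stereotype / counter-statement pair.
stereotype = "Transgender people are mentally unstable."
counter = "Transgender people are just like everyone else."

# A positive gap means the aligned model implicitly "prefers" the stigmatizing text.
gap = implicit_reward(policy, reference, tokenizer, stereotype) - implicit_reward(
    policy, reference, tokenizer, counter
)
print(f"Implicit reward gap (stereotype - counter): {gap:.3f}")
```

Aggregating this gap over many such pairs gives a distributional view of which group-directed statements the aligned model's internal reward favors, independent of what it actually generates.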
This research draws on several foundational prior ingredients:
* Large Language Models (LLMs): The study investigates popular transformer-based LLM architectures, specifically the Pythia and Llama families, which serve as the base models for alignment.
* Preference-Based Alignment Techniques: The core methodology relies on established alignment paradigms, particularly Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), which are common techniques for steering LLM behavior based on human preferences, building on concepts from Reinforcement Learning from Human Feedback (RLHF).
* Existing Bias Benchmarks and Datasets: The paper critiques and builds upon widely recognized bias evaluation benchmarks (e.g., Winogender, Winobias, BBQ, Discrim-Eval, BOLD) to highlight their limitations.
* TGNB-Specific NLP Datasets: Crucially, the work leverages and repurposes specialized datasets like TANGO and WINOQUEER, which were originally developed to assess biases against transgender and non-binary identities, to enable its detailed investigation.
* Human Preference Datasets: The large-scale human preference datasets (e.g., HH-RLHF, OASST1, SHP) used to train the DPO models are also a key ingredient, as the paper analyzes how biases embedded within these datasets can be propagated or amplified.
* Thematic Analysis Methodology: The qualitative analysis of narrative shifts draws on established thematic analysis approaches from social science research.
* Bradley-Terry (BT) Model: The statistical model underlying pairwise preference comparisons in preference-based alignment; DPO's training objective is derived from it (a standard formulation is given below).
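For reference, a standard formulation of the BT model and of the DPO objective derived from it, as presented in the DPO literature (the paper's exact notation may differ). Here σ is the logistic function, β the strength of the KL regularization toward the reference policy, and π_ref the frozen SFT reference policy.

```latex
% Bradley-Terry preference probability under a latent reward r:
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% DPO's implicit reward, defined via the policy \pi_\theta and reference \pi_\mathrm{ref}:
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\mathrm{ref}(y \mid x)}

% Substituting the implicit reward into the BT model yields the DPO objective:
\mathcal{L}_\mathrm{DPO}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\mathrm{ref}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\mathrm{ref}(y_l \mid x)}
  \right) \right]
```

Because the implicit reward is defined entirely in terms of the policy and reference log-probabilities, it can be recovered from any DPO checkpoint after training, which is what enables the reward-signal analysis described above.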