Aligning to Illusions: Choice Blindness in Human and AI Feedback

Wenbin Wu


March 9, 2026

arXiv preprint


Abstract

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
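The dose-response and Best-of-N setups mentioned in the abstract can be pictured with a small sketch. This is an illustration under assumptions, not the paper's code: the function names (`corrupt_labels`, `pairwise_accuracy`, `best_of_n`), the `(chosen, rejected)` pair layout, and the toy reward function are all hypothetical, and the paper's actual corruption procedure and reward models may differ.

```python
import random


def corrupt_labels(pairs, rate, seed=0):
    """Swap chosen/rejected in a fraction `rate` of preference pairs.

    `pairs` is a list of (chosen, rejected) tuples (assumed layout).
    """
    rng = random.Random(seed)
    n_flip = round(rate * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [
        (rejected, chosen) if i in flip_idx else (chosen, rejected)
        for i, (chosen, rejected) in enumerate(pairs)
    ]


def pairwise_accuracy(pairs, reward_fn):
    """Fraction of pairs where the reward ranks `chosen` above `rejected`."""
    hits = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return hits / len(pairs)


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward score."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    # Toy data: the "reward" is just string length, purely for illustration.
    clean = [("a longer answer %d" % i, "short %d" % i) for i in range(100)]
    noisy = corrupt_labels(clean, rate=0.5)
    print(pairwise_accuracy(clean, len), pairwise_accuracy(noisy, len))
    print(best_of_n(["ok", "a fuller reply", "fine"], len))
```

In a real experiment one would train a reward model on the corrupted pairs at each rate and evaluate it on held-out clean data; the sketch only shows the label-swapping and selection mechanics.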

Keywords

metacognition, reinforcement learning, LLM, monitoring, evaluation

