AI Discourse Causes Self-Fulfilling (Mis)alignment

[Submitted on 15 Jan 2026 (v1), last revised 19 Feb 2026 (this version, v2)]

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, by Cameron Tice and 5 other authors

Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners consider pretraining for alignment alongside capabilities. We share our models, data, and evaluations at this http URL.
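To put the headline figure in perspective, the drop from 45% to 9% can be expressed both in percentage points and as a relative reduction. The short Python sketch below does only that arithmetic; the function name `reduction` is a hypothetical helper introduced here for illustration, not something from the paper or its released code.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction).

    Hypothetical helper for illustration; not part of the paper's code.
    """
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Misalignment scores quoted in the abstract: 45% before, 9% after.
absolute, relative = reduction(45.0, 9.0)
print(f"{absolute:.0f} percentage points, {relative:.0%} relative reduction")
```

Under these figures, the change is 36 percentage points, which corresponds to an 80% relative reduction in the misalignment score.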

Submission history

From: Kyle O’Brien
[v1] Thu, 15 Jan 2026 07:59:31 UTC (1,982 KB)
[v2] Thu, 19 Feb 2026 22:53:56 UTC (2,369 KB)


🕒 **Posted on**: 2026-05-18 22:33:11 UTC (Unix timestamp 1779143591)
