Alignment Pretraining

AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice1* Puria Radmard1,2* Samuel Ratnam3 Andy Kim4 David Africa2 Kyle O'Brien1
1Geodesic Research    2University of Cambridge    3University of Oxford    4Independent

TL;DR
LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs on synthetic data about good AIs makes them more aligned. These alignment priors persist through post-training, providing alignment-in-depth. Alignment pretraining requires only modifications to the training data mix, and general performance is largely unaffected. We recommend that labs pretrain for alignment, just as they do for capabilities.

An overview of our pretraining interventions. Training data discussing AI systems has a measurable effect on the alignment of LLMs prompted with "You are an AI assistant". Upsampling positive data about AI systems during pretraining increases alignment rates, and this increase persists even after production post-training on over four million examples. Just as upsampling relevant pretraining data improves capabilities such as reasoning and coding, so too can it improve alignment.

[Figure: the self-fulfilling effects of alignment and misalignment discourse in pretraining data.]
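
For intuition, here is a minimal sketch of the kind of data-mix change this implies, assuming a corpus whose documents carry a simple tag for AI-related discourse; the tag names, weights, and helper function below are illustrative, not our actual pipeline.

```python
import random

# Illustrative sampling weights for the pretraining data mix. Documents tagged
# as positive AI discourse are drawn more often than their natural frequency;
# negative AI discourse is downsampled. These tags and weights are hypothetical,
# not the mixture used in our experiments.
MIX_WEIGHTS = {
    "positive_ai_discourse": 4.0,   # upsampled
    "negative_ai_discourse": 0.25,  # downsampled
    "other": 1.0,                   # rest of the corpus, unchanged
}

def sample_pretraining_batch(corpus, batch_size, rng=random):
    """Draw a batch of documents with probability proportional to each
    document's mixture weight (unknown tags default to weight 1.0)."""
    weights = [MIX_WEIGHTS.get(doc["tag"], 1.0) for doc in corpus]
    return rng.choices(corpus, weights=weights, k=batch_size)

# Toy usage: a three-document corpus sampled into an eight-document batch.
corpus = [
    {"tag": "positive_ai_discourse", "text": "The assistant patiently helped ..."},
    {"tag": "negative_ai_discourse", "text": "The rogue AI decided to ..."},
    {"tag": "other", "text": "A recipe for sourdough bread ..."},
]
batch = sample_pretraining_batch(corpus, batch_size=8)
```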