Alignment Pretraining

AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice^1*, Puria Radmard^1,2*, Samuel Ratnam^3, Andy Kim^4, David Africa^2, Kyle O'Brien^1
^1 Geodesic Research    ^2 University of Cambridge    ^3 University of Oxford    ^4 Independent
[Figure: Graph showing the self-fulfilling effects of alignment and misalignment discourse in pretraining data]

An overview of our pretraining interventions. Training data that discusses AI systems has a measurable effect on the alignment of LLMs prompted with "You are an AI assistant". Upsampling positive data about AI systems during midtraining produces an increase in rates of aligned behavior that persists even after production post-training on over four million examples. Just as upsampling relevant pretraining data improves capabilities such as reasoning and coding, it can also improve alignment.