AI Discourse Causes Self-Fulfilling (Mis)alignment
An overview of our pretraining interventions. Training data that discusses AI systems has a measurable effect on the alignment of LLMs prompted with "You are an AI assistant". Upsampling positive data about AI systems during midtraining increases rates of alignment, and the increase persists even after production post-training on over four million examples. Just as upsampling relevant pretraining data improves capabilities such as reasoning and coding, it can also improve alignment.