More capable models can better recognize the specific circumstances under which they are trained. Because of this, they are more likely to learn to act as expected in precisely those circumstances while behaving competently but unexpectedly in others. This can surface as what Perez et al. (2022) call sycophancy, where a model answers subjective questions in a way that flatters its user’s stated beliefs, and sandbagging, where a model is more likely to endorse common misconceptions when its user appears to be less educated.