Simon Willison’s Weblog


gpt-4-turbo over the API produces (statistically significant) shorter completions when it "thinks" its December vs. when it thinks its May (as determined by the date in the system prompt).

I took the same exact prompt over the API (a code completion task asking to implement a machine learning task without libraries).

I created two system prompts, one that told the API it was May and another that it was December and then compared the distributions.

For the May system prompt, mean = 4298 For the December system prompt, mean = 4086

N = 477 completions in each sample from May and December

t-test p < 2.28e-07

Rob Lynch