New OpenAI feature: Predicted Outputs

Interesting new ability of the OpenAI API - the first time I've seen this from any vendor.
If you know your prompt is mostly going to return the same content - you're requesting an edit to some existing code, for example - you can now send that content as a "prediction" and have GPT-4o or GPT-4o mini use that to accelerate the returned result.
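Here's roughly what that looks like with the official openai Python library. The prediction parameter takes the expected content, per OpenAI's documentation; the code string and instruction below are my own illustrative stand-ins, not the documentation's example:

```python
from openai import OpenAI

client = OpenAI()

# The existing code you expect to come back mostly unchanged
code = """class User:
    def __init__(self, username):
        self.username = username
"""

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Replace username with an email property. Respond only with code.",
        },
        {"role": "user", "content": code},
    ],
    # Pass the existing code as the prediction to accelerate the response
    prediction={"type": "content", "content": code},
)
print(completion.choices[0].message.content)
```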
OpenAI's documentation says:
When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.
I initially misunderstood this as meaning you got a price reduction in addition to the latency improvement, but that's not the case: in the best case the response returns faster and you pay nothing extra over the expected cost of the prompt, but the more the final output differs from your prediction, the more extra tokens you'll be billed for.
I ran the example from the documentation both with and without the prediction and got these results. Without the prediction:
"usage": {
"prompt_tokens": 150,
"completion_tokens": 118,
"total_tokens": 268,
"completion_tokens_details": {
"accepted_prediction_tokens": 0,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 0
}
That took 5.2 seconds and cost 0.1555 cents.
With the prediction:
"usage": {
"prompt_tokens": 166,
"completion_tokens": 226,
"total_tokens": 392,
"completion_tokens_details": {
"accepted_prediction_tokens": 49,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 107
}
That took 3.3 seconds and cost 0.2675 cents.
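Those numbers line up with GPT-4o's pricing at the time of writing ($2.50 per million input tokens, $10 per million output tokens). A quick sketch of the arithmetic - which only works out if the 107 rejected prediction tokens are counted inside completion_tokens:

```python
def cost_cents(prompt_tokens, completion_tokens,
               input_per_million=2.50, output_per_million=10.00):
    # GPT-4o pricing at the time of writing, in dollars per million tokens
    dollars = (prompt_tokens * input_per_million
               + completion_tokens * output_per_million) / 1_000_000
    return dollars * 100

print(cost_cents(150, 118))  # 0.1555 - without the prediction
print(cost_cents(166, 226))  # 0.2675 - with the prediction, rejected tokens included
```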
Further details from OpenAI's Steve Coffey:
We are using the prediction to do speculative decoding during inference, which allows us to validate large batches of the input in parallel, instead of sampling token-by-token!
[...] If the prediction is 100% accurate, then you would see no cost difference. When the model diverges from your speculation, we do additional sampling to “discover” the net-new tokens, which is why we charge rejected tokens at completion time rates.
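As a rough illustration of the acceptance rule he's describing, here's a toy sketch of textbook greedy speculative decoding - not OpenAI's actual implementation, and model_next_token is a hypothetical stand-in for a model forward pass:

```python
def split_prediction(model_next_token, prompt, predicted_tokens):
    """Toy version of greedy speculative-decoding acceptance.

    In a real implementation all predicted positions are scored in a
    single parallel forward pass - that is where the latency win comes
    from - and this loop just reads off those scores in order.
    """
    context = list(prompt)
    accepted = []
    for i, token in enumerate(predicted_tokens):
        # Keep a predicted token only while it matches what the model
        # would have generated anyway
        if model_next_token(context) != token:
            # Everything from the first divergence on is rejected and
            # billed at completion token rates
            return accepted, predicted_tokens[i:]
        accepted.append(token)
        context.append(token)
    return accepted, []
```

In the run above, 49 predicted tokens survived that kind of check and 107 were rejected.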