New OpenAI feature: Predicted Outputs (via) Interesting new ability of the OpenAI API - the first time I've seen this from any vendor.
If you know your prompt is mostly going to return the same content - you're requesting an edit to some existing code, for example - you can now send that content as a "prediction" and have GPT-4o or GPT-4o mini use that to accelerate the returned result.
OpenAI's documentation says:
When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.
I initially misunderstood this as meaning you got a price reduction in addition to the latency improvement, but that's not the case: in the best possible case it will return faster and you won't be charged anything extra over the expected cost for the prompt, but the more it differs from your prediction the more extra tokens you'll be billed for.
I ran the example from the documentation both with and without the prediction and got these results. Without the prediction:
"usage": {
"prompt_tokens": 150,
"completion_tokens": 118,
"total_tokens": 268,
"completion_tokens_details": {
"accepted_prediction_tokens": 0,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 0
}
That took 5.2 seconds and cost 0.1555 cents.
With the prediction:
"usage": {
"prompt_tokens": 166,
"completion_tokens": 226,
"total_tokens": 392,
"completion_tokens_details": {
"accepted_prediction_tokens": 49,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 107
}
That took 3.3 seconds and cost 0.2675 cents.
Further details from OpenAI's Steve Coffey:
We are using the prediction to do speculative decoding during inference, which allows us to validate large batches of the input in parallel, instead of sampling token-by-token!
[...] If the prediction is 100% accurate, then you would see no cost difference. When the model diverges from your speculation, we do additional sampling to “discover” the net-new tokens, which is why we charge rejected tokens at completion time rates.
Recent articles
- Storing times for human events - 27th November 2024
- Ask questions of SQLite databases and CSV/JSON files in your terminal - 25th November 2024
- Weeknotes: asynchronous LLMs, synchronous embeddings, and I kind of started a podcast - 22nd November 2024