Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)
5th May 2025
The new llm-video-frames plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such as GPT-4.1, even when that LLM doesn’t directly support video input. It depends on a plugin feature I added to LLM 0.25, which I released last night.
Here’s how to try it out:
brew install ffmpeg # or apt-get or your package manager of choice
uv tool install llm # or pipx install llm or pip install llm
llm install llm-video-frames
llm keys set openai
# Paste your OpenAI API key here
llm -f video-frames:video.mp4 \
'describe the key scenes in this video' \
-m gpt-4.1-mini
The video-frames:filepath.mp4 syntax is provided by the new plugin. It takes that video, converts it into one JPEG frame for every second of the video and then turns those frames into attachments that can be passed to the LLM.
Here’s a demo, against this video of Cleo:
llm -f video-frames:cleo.mp4 'describe key moments' -m gpt-4.1-mini
And the output from the model (transcript here):
The sequence of images captures the key moments of a dog being offered and then enjoying a small treat:
- In the first image, a hand is holding a small cupcake with purple frosting close to a black dog’s face. The dog looks eager and is focused intently on the treat.
- The second image shows the dog beginning to take a bite of the cupcake from the person’s fingers. The dog’s mouth is open, gently nibbling on the treat.
- In the third image, the dog has finished or is almost done with the treat and looks content, with a slight smile and a relaxed posture. The treat is no longer in the person’s hand, indicating that the dog has consumed it.
This progression highlights the dog’s anticipation, enjoyment, and satisfaction with the treat.
Total cost: 7,072 input tokens, 156 output tokens—for GPT-4.1 mini that’s 0.3078 cents (less than a third of a cent).
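If you want to check that arithmetic yourself it only takes a couple of lines. Here’s a quick Python sketch, assuming GPT-4.1 mini’s list pricing of $0.40 per million input tokens and $1.60 per million output tokens (confirm against the current pricing page before trusting the numbers):
# Back-of-envelope cost check for the call above
input_tokens, output_tokens = 7_072, 156
cost_dollars = (input_tokens * 0.40 + output_tokens * 1.60) / 1_000_000
print(f"{cost_dollars * 100:.4f} cents")  # 0.3078 cents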
In this case the plugin converted the video into three images: frame_00001.jpg, frame_00002.jpg and frame_00003.jpg.
The plugin accepts additional arguments. You can increase the frames-per-second using ?fps=2—for example:
llm -f 'video-frames:video.mp4?fps=2' 'summarize this video'
Or you can add ?timestamps=1 to cause ffmpeg to overlay a timestamp in the bottom right corner of each frame. This gives the model a chance to return timestamps in its output.
Let’s try that with the Cleo video:
llm -f 'video-frames:cleo.mp4?timestamps=1&fps=5' \
'key moments, include timestamps' -m gpt-4.1-mini
Here’s the output (transcript here):
Here are the key moments from the video “cleo.mp4” with timestamps:
- 00:00:00.000—A dog on leash looks at a small cupcake with purple frosting being held by a person.
- 00:00:00.800—The dog closely sniffs the cupcake.
- 00:00:01.400—The person puts a small amount of the cupcake frosting on their finger.
- 00:00:01.600—The dog starts licking the frosting from the person’s finger.
- 00:00:02.600—The dog continues licking enthusiastically.
Let me know if you need more details or a description of any specific part.
That one sent 14 images to the API, at a total cost of 32,968 input, 141 output = 1.3413 cents.
It sent 5.9MB of image data to OpenAI’s API, encoded as base64 in the JSON API call.
The GPT-4.1 model family accepts up to 1,047,576 tokens. Aside from a 20MB size limit per image I haven’t seen any documentation of limits on the number of images. You can fit a whole lot of JPEG frames in a million tokens!
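A back-of-envelope estimate based on the Cleo example above suggests a rough ceiling - the real number depends heavily on frame resolution and the detail setting, so treat this as an illustration only:
# Very rough estimate: per-frame token cost varies with resolution and detail
tokens_per_frame = 7_072 / 3        # from the three-frame example earlier
context_window = 1_047_576          # GPT-4.1 family context limit
print(int(context_window // tokens_per_frame))  # roughly 444 frames of that size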
Here’s what one of those frames looks like with the timestamp overlaid in the corner:
How I built the plugin with o4-mini
This is a great example of how rapid prototyping with an LLM can help demonstrate the value of a feature.
I was considering whether it would make sense for fragment plugins to return images in issue 972 when I had the idea to use ffmpeg to split a video into frames.
I know from past experience that a good model can write an entire plugin for LLM if you feed it the right example, so I started with this (reformatted here for readability):
llm -m o4-mini -f github:simonw/llm-hacker-news -s 'write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &timestamps=1 then it should add a text timestamp to the bottom right corner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.' -o reasoning_effort high
Here’s the transcript.
The new attachment mechanism went from vague idea to “I should build that” as a direct result of having an LLM-built proof-of-concept that demonstrated the feasibility of the new feature.
The code it produced was about 90% of the code I shipped in the finished plugin. Total cost 5,018 input, 2,208 output = 1.5235 cents.
Annotated release notes for everything else in LLM 0.25
Here are the annotated release notes for everything else:
- New plugin feature: register_fragment_loaders(register) plugins can now return a mixture of fragments and attachments. The llm-video-frames plugin is the first to take advantage of this mechanism. #972
As described above. The inspiration for this feature came from the llm-arxiv plugin by agustif.
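If you want to build something similar yourself, here’s a minimal sketch of a fragment loader plugin that returns attachments. This is an illustration rather than the shipped llm-video-frames code, and the my-frames prefix is made up for the example:
import subprocess
import tempfile
from pathlib import Path

import llm


@llm.hookimpl
def register_fragment_loaders(register):
    # Register the made-up "my-frames:" prefix for this example
    register("my-frames", my_frames_loader)


def my_frames_loader(argument: str):
    # argument is everything after "my-frames:", e.g. "video.mp4"
    out_dir = Path(tempfile.mkdtemp())
    subprocess.run(
        ["ffmpeg", "-i", argument, "-vf", "fps=1", str(out_dir / "frame_%05d.jpg")],
        check=True,
    )
    # As of LLM 0.25 a loader can return attachments (or a mix of
    # fragments and attachments)
    return [llm.Attachment(path=str(p)) for p in sorted(out_dir.glob("*.jpg"))]
With that installed you could run llm -f 'my-frames:video.mp4' 'describe this video' just like the examples above.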
My original plan was to leave the newer OpenAI models exclusively to the new llm-openai plugin, since that allows me to add support for new models without a full LLM release. I’m going to punt on that until I’m ready to entirely remove the OpenAI models from LLM core.
- New environment variables: LLM_MODEL and LLM_EMBEDDING_MODEL for setting the model to use without needing to specify -m model_id every time. #932
A convenience feature for when you want to set the default model for a terminal session with LLM without using the global "default model" mechanism.
- New command: llm fragments loaders, to list all currently available fragment loader prefixes provided by plugins. #941
Mainly for consistency with the existing llm templates loaders command. Here’s the output when I run llm fragments loaders on my machine:
docs:
Fetch the latest documentation for the specified package from
https://github.com/simonw/docs-for-llms
Use '-f docs:' for the documentation of your current version of LLM.
docs-preview:
Similar to docs: but fetches the latest docs including alpha/beta releases.
symbex:
Walk the given directory, parse every .py file, and for every
top-level function or class-method produce its signature and
docstring plus an import line.
github:
Load files from a GitHub repository as fragments
Argument is a GitHub repository URL or username/repository
issue:
Fetch GitHub issue/pull and comments as Markdown
Argument is either "owner/repo/NUMBER" or URL to an issue
pr:
Fetch GitHub pull request with comments and diff as Markdown
Argument is either "owner/repo/NUMBER" or URL to a pull request
hn:
Given a Hacker News article ID returns the full nested conversation.
For example: -f hn:43875136
video-frames:
Fragment loader "video-frames:<path>?fps=N&timestamps=1"
- extracts frames at `fps` per second (default 1)
- if `timestamps=1`, overlays "filename hh:mm:ss" at bottom-right
That’s from llm-docs, llm-fragments-symbex, llm-fragments-github, llm-hacker-news and llm-video-frames.
- llm fragments command now shows fragments ordered by the date they were first used. #973
This makes it easier to quickly debug a new fragment plugin—you can run llm fragments and glance at the bottom few entries.
I’ve also been using the new llm-echo debugging plugin for this—it adds a fake model called “echo” which simply outputs the prompt, system prompt, fragments and attachments that were passed to it:
llm -f docs:sqlite-utils -m echo 'Show me the context'
- llm chat now includes a !edit command for editing a prompt using your default terminal text editor. Thanks, Benedikt Willi. #969
This is a really nice enhancement to llm chat, making it much more convenient to edit longer prompts.
And the rest:
- Allow -t and --system to be used at the same time. #916
- Fixed a bug where accessing a model via its alias would fail to respect any default options set for that model. #968
- Improved documentation for extra-openai-models.yaml. Thanks, Rahim Nathwani and Dan Guido. #950, #957
- llm -c/--continue now works correctly with the -d/--database option. llm chat now accepts that -d/--database option. Thanks, Sukhbinder Singh. #933