63 posts tagged “vision-llms”
LLMs that can also be used to interpret images and video, such as GPT-4o, Claude 3 and Gemini Pro.
2024
Our next-generation model: Gemini 1.5 (via) The big news here is about context length: Gemini 1.5 (a Mixture-of-Experts model) will do 128,000 tokens in general release, available in limited preview with a 1 million token context and has shown promising research results with 10 million tokens!
1 million tokens is 700,000 words or around 7 novels—also described in the blog post as an hour of video or 11 hours of audio.
2023
Ice Cubes GPT-4 prompts. The Ice Cubes open source Mastodon app recently grew a very good "describe this image" feature to help people add alt text to their images. I had a dig around in their repo and it turns out they're using GPT-4 Vision for this (and regular GPT-4 for other features), passing the image with this prompt:
What’s in this image? Be brief, it's for image alt description on a social network. Don't write in the first person.
Multi-modal prompt injection image attacks against GPT-4V
GPT4-V is the new mode of GPT-4 that allows you to upload images as part of your conversations. It’s absolutely brilliant. It also provides a whole new set of vectors for prompt injection attacks.
[... 889 words]