PaliGemma model README
One of the more overlooked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.
The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".
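To make the "special tokens representing regions" idea concrete, here is a minimal Python sketch of how such output might be turned back into pixel coordinates. It rests on assumptions that are not stated above: that a box is encoded as four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, that each value is a bin from 0 to 1023 to be divided by 1024, that optional `<segNNN>` tokens may follow, and that multiple objects are separated by `;`. The `Box` type and `parse_boxes` function are hypothetical names for illustration, and the segment tokens are skipped rather than decoded into masks. Check the model README before relying on any of this.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """A labelled bounding box in pixel coordinates (hypothetical helper type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# ASSUMED output format: four <locNNNN> tokens, optional <segNNN> tokens, then a label.
_OBJECT = re.compile(r"((?:<loc\d{4}>){4})(?:<seg\d{3}>)*\s*([^;<]*)")
_LOC = re.compile(r"<loc(\d{4})>")


def parse_boxes(output: str, width: int, height: int) -> List[Box]:
    """Convert location tokens in model output text into pixel-space boxes."""
    boxes = []
    for match in _OBJECT.finditer(output):
        # ASSUMED order: y_min, x_min, y_max, x_max, each normalised by 1024.
        y0, x0, y1, x1 = (int(v) / 1024 for v in _LOC.findall(match.group(1)))
        boxes.append(
            Box(
                label=match.group(2).strip(),
                x_min=x0 * width,
                y_min=y0 * height,
                x_max=x1 * width,
                y_max=y1 * height,
            )
        )
    return boxes


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0896> puffin"
    for box in parse_boxes(sample, width=1024, height=512):
        print(box)
```

The resulting boxes can be passed to any drawing library to overlay rectangles on the original image.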
You can try it out on Hugging Face.
It's a 3B parameter model, making it feasible to run on consumer hardware.