DeepSeek Janus-Pro

DeepSeek Janus-Pro. Another impressive model release from DeepSeek. Janus is their series of "unified multimodal understanding and generation models" - these are models that can both accept images as input and generate images for output.

Janus-Pro is the new 7B model, which DeepSeek describe as "an advanced version of Janus, improving both multimodal understanding and visual generation significantly". It's released under the not fully open source DeepSeek license.

Janus-Pro is accompanied by this paper, which includes this note about the training:

Our Janus is trained and evaluated using HAI-LLM, which is a lightweight and efficient distributed training framework built on top of PyTorch. The whole training process took about 7/14 days on a cluster of 16/32 nodes for 1.5B/7B model, each equipped with 8 Nvidia A100 (40GB) GPUs.

It includes a lot of high benchmark scores, but closes with some notes on the model's current limitations:

In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.

The easiest way to try this one out is using the Hugging Face Spaces demo. I tried the following prompt for the image generation capability:

A photo of a raccoon holding a handwritten sign that says "I love trash"

And got back this image:

I cute raccoon is holding a sign which says the word I, then a red heart, then a garbled word like tratieh, then a less garbled wrd trash

It's now also been ported to Transformers.js, which means you can run the 1B model directly in a WebGPU browser such as Chrome here at webml-community/janus-pro-webgpu (loads about 2.24 GB of model files).

Posted 27th January 2025 at 9:49 pm

Simon Willison’s Weblog

Recent articles

Monthly briefing