The main innovation here is just using more data. Specifically, Qwen2.5 Coder is a continuation of an earlier Qwen 2.5 model. The original Qwen 2.5 model was trained on 18 trillion tokens spread across a variety of languages and tasks (e.g, writing, programming, question answering). Qwen 2.5-Coder sees them train this model on an additional 5.5 trillion tokens of data. This means Qwen has been trained on a total of ~23T tokens of data – for perspective, Facebook’s LLaMa3 models were trained on about 15T tokens. I think this means Qwen is the largest publicly disclosed number of tokens dumped into a single language model (so far).
Recent articles
- CaMeL offers a promising new direction for mitigating prompt injection attacks - 11th April 2025
- Model Context Protocol has prompt injection security problems - 9th April 2025
- Long context support in LLM 0.24 using fragments and template plugins - 7th April 2025