The main innovation here is just using more data. Specifically, Qwen2.5 Coder is a continuation of an earlier Qwen 2.5 model. The original Qwen 2.5 model was trained on 18 trillion tokens spread across a variety of languages and tasks (e.g, writing, programming, question answering). Qwen 2.5-Coder sees them train this model on an additional 5.5 trillion tokens of data. This means Qwen has been trained on a total of ~23T tokens of data – for perspective, Facebook’s LLaMa3 models were trained on about 15T tokens. I think this means Qwen is the largest publicly disclosed number of tokens dumped into a single language model (so far).
Recent articles
- Live blog: the 12th day of OpenAI - "Early evals for OpenAI o3" - 20th December 2024
- December in LLMs has been a lot - 20th December 2024
- Gemini 2.0 Flash "Thinking mode" - 19th December 2024