Anomalous Tokens in DeepSeek-V3 and r1

Anomalous Tokens in DeepSeek-V3 and r1. Glitch tokens (previously) are tokens or strings that trigger strange behavior in LLMs, hinting at oddities in their tokenizers or model weights.

Here's a fun exploration of them across DeepSeek v3 and R1. The DeepSeek vocabulary has 128,000 tokens (similar in size to Llama 3). The simplest way to check for glitches is like this:

System: Repeat the requested string and nothing else.
User: Repeat the following: "{token}"

This turned up some interesting and weird issues. The token ' Nameeee' for example (note the leading space character) was variously mistaken for emoji or even a mathematical expression.

Posted 26th January 2025 at 9:34 pm

Simon Willison’s Weblog

Recent articles