26th January 2025 - Link Blog
Anomalous Tokens in DeepSeek-V3 and r1. Glitch tokens (previously) are tokens or strings that trigger strange behavior in LLMs, hinting at oddities in their tokenizers or model weights.
Here's a fun exploration of them across DeepSeek v3 and R1. The DeepSeek vocabulary has 128,000 tokens (similar in size to Llama 3). The simplest way to check for glitches is like this:
System: Repeat the requested string and nothing else.
User: Repeat the following: "{token}"
This turned up some interesting and weird issues. The token ' Nameeee' for example (note the leading space character) was variously mistaken for emoji or even a mathematical expression.
Recent articles
- The Axios supply chain attack used individually targeted social engineering - 3rd April 2026
- Highlights from my conversation about agentic engineering on Lenny's Podcast - 2nd April 2026
- Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer - 30th March 2026