The [Apple Foundation Model] pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.
Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mixture. Additionally, extensive efforts have been made to exclude profanity, unsafe material, and personally identifiable information from publicly available data (see Section 7 for more details). Rigorous decontamination is also performed against many common evaluation benchmarks.
We find that data quality, much more so than quantity, is the key determining factor of downstream model performance.
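The robots.txt opt-out mentioned in that passage is easy to try out locally. Here's a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URLs and the `applebot_may_fetch` helper are all made up for illustration, and this shows only how a site owner might check their own rules, not how Applebot itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Applebot from /private/, allows everything else.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /private/

User-agent: *
Allow: /
"""


def applebot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits 'Applebot' to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Applebot", url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/private/page"):
        print(url, applebot_may_fetch(ROBOTS_TXT, url))
```

Swapping in the text of a real robots.txt file is a quick way to sanity-check an opt-out rule before deploying it.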