Simon Willison’s Weblog

In a previous iteration of the machine learning paradigm, researchers were obsessed with cleaning their datasets and ensuring that every data point seen by their models is pristine, gold-standard, and does not disturb the fragile learning process of billions of parameters finding their home in model space. Many began to realize that data scale trumps most other priorities in the deep learning world; utilizing general methods that allow models to scale in tandem with the complexity of the data is a superior approach. Now, in the era of LLMs, researchers tend to dump whole mountains of barely filtered, mostly unedited scrapes of the internet into the eager maw of a hungry model.

roon