Simon Willison’s Weblog


33 items tagged “youtube”


Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI. This article has been getting a lot of attention over the past couple of days.

The story itself is nothing new: the Pile is four years old now, and has been widely used for training LLMs since before anyone even cared what an LLM was. It turns out one of the components of the Pile is a set of ~170,000 YouTube video captions (just the captions, not the actual video). This story by Annie Gilbertson and Alex Reisner highlights that, interviews some of the creators who were included in the data, and provides a search tool for seeing if a specific creator has content that was included.

What's notable is the response. Marques Brownlee (19m subscribers) posted a video about it. Abigail Thorn (Philosophy Tube, 1.57m subscribers) tweeted this:

Very sad to have to say this - an AI company called EleutherAI stole tens of thousands of YouTube videos - including many of mine. I’m one of the creators Proof News spoke to. The stolen data was sold to Apple, Nvidia, and other companies to build AI

When I was told about this I lay on the floor and cried, it’s so violating, it made me want to quit writing forever. The reason I got back up was because I know my audience come to my show for real connection and ideas, not cheapfake AI garbage, and I know they’ll stay with me

Framing the data as "sold to Apple..." is a slight misrepresentation here - EleutherAI have been giving the Pile away for free since 2020. It's a good illustration of the emotional impact here though: many creative people do not want their work used in this way, especially without their permission.

It's interesting seeing how attitudes to this stuff change over time. Four years ago the fact that a bunch of academic researchers were sharing and training models using 170,000 YouTube subtitles would likely not have caught any attention at all. Today, people care!

# 18th July 2024, 4:22 pm / ethics, youtube, ai, llms, training-data

Tom Scott, and the formidable power of escalating streaks

Ten years ago yesterday, Tom Scott posted this video to YouTube about “Special Crossings For Horses In Britain”. It was the first in his Things You Might Not Know series, but more importantly it was the start of a streak.

[... 1,352 words]

After ten years, it’s time to stop making videos. Ten years ago, my friend Tom Scott started a deliberate streak of posting YouTube videos—initially about one a day before settling into a cadence of one a week. He kept that up for the full ten years, growing his subscribers to over 6 million in the process.

Today he’s ending that streak, in unparalleled style.

(I’m proud to have made an appearance in video number 13, talking about Zeppelins.)

# 1st January 2024, 10:59 pm / tom-scott, youtube, zeppelins


Exploring MusicCaps, the evaluation data released to accompany Google’s MusicLM text-to-music model

Google Research just released MusicLM: Generating Music From Text. It’s a new generative AI model that takes a descriptive prompt and produces a “high-fidelity” music track. Here’s the paper (and a more readable version using arXiv Vanity).

[... 1,323 words]


lite-youtube-embed (via) Handy Web Component wrapper around the standard YouTube iframe embed which knocks over 500KB of JavaScript off the initial page load—I just added this to the homepage and increased the Lighthouse performance score from 51 to 93!

# 8th March 2022, 9:13 pm / iframes, paul-irish, youtube, web-performance, webcomponents


Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

[... 773 words]


Scaling Datastores at Slack with Vitess (via) Slack spent three years migrating 99% of their MySQL query load to run against Vitess, the open source MySQL sharding system originally built by YouTube. “Today, we serve 2.3 million QPS at peak. 2M of those queries are reads and 300K are writes. Our median query latency is 2 ms, and our p99 query latency is 11 ms.”

# 1st December 2020, 9:30 pm / mysql, scaling, sharding, youtube, slack, vitess

One academic who interviewed attendees of a flat-earth convention found that, almost to a person, they'd discovered the subculture via YouTube recommendations.

YouTube’s Plot to Silence Conspiracy Theories

# 20th September 2020, 1:27 am / conspiracy, youtube

Happy Birthday Sea Lions! (via) Today, June 15th, is Sea Lion birthday—half of all California Sea Lions are born today thanks to clever co-ordinated delayed implantation by Sea Lion females. Natalie has started making nature videos and I’ve been tagging along as her camera-person—this three minute video, shot at Pier 39 in San Francisco, celebrates Sea Lion birthday and explains how it works.

# 15th June 2020, 7:08 pm / natalie-downe, wildlife, youtube


There’s a spectrum on YouTube between the calm section — the Walter Cronkite, Carl Sagan part — and Crazytown, where the extreme stuff is. If I’m YouTube and I want you to watch more, I’m always going to steer you toward Crazytown.

Tristan Harris, former design ethicist at Google

# 9th June 2019, 6:22 pm / ethics, youtube

A Conspiracy To Kill IE6 (via) Cracking story by Chris Zacharias about how a team of engineers at YouTube back in 2009 took advantage of some exploits in YouTube’s organization structure (left over from their acquisition by Google) to ship a vague IE6 deprecation warning banner on one of the world’s highest traffic websites, inspiring many other similar banners and resulting in a 10% drop in global IE6 traffic.

# 1st May 2019, 8:26 pm / ie6, youtube

Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol-compatible proxy that existing code written in any language can connect to. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look.

# 14th February 2019, 5:35 am / mysql, scaling, sharding, youtube, slack, vitess


It seems as if you are never ‘hardcore’ enough for YouTube’s recommendation algorithm. It promotes, recommends and disseminates videos in a manner that appears to constantly up the stakes. Given its billion or so users, YouTube may be one of the most powerful radicalising instruments of the 21st century.

Zeynep Tufekci

# 20th March 2018, 7:20 pm / recommendations, youtube


Something is wrong on the internet. James Bridle takes a fascinating and deeply troubling dive into the world of Kids’ YouTube videos, which appear to be increasingly algorithmically generated and are evolving in a very dark direction.

# 7th November 2017, 12:40 pm / james-bridle, youtube

In the official timeline, Peppa is appropriately reassured by a kindly dentist. In the version above, she is basically tortured, before turning into a series of Iron Man robots and performing the Learn Colours dance. A search for “peppa pig dentist” returns the above video on the front page, and it only gets worse from here.

James Bridle

# 7th November 2017, 12:34 pm / james-bridle, youtube


A Zeppelin, A Cat, and The World’s First In-Flight Radio Message. Tom Scott asked me for “something you might not know” at our leaving party in London before we moved to California. I went with the story of Kiddo the cat and the first attempt at an aerial Atlantic crossing. Here’s the resulting YouTube video.

# 14th January 2014, 11:05 pm / tom-scott, youtube, zeppelins


What platform was YouTube using before they were acquired by Google?

It was written in Python—I don’t think they used any particular framework (they started the site in 2005).

[... 37 words]


Google container data center tour (on YouTube). 45,000 servers in 45 shipping containers, along with some serious looking plumbing.

# 26th April 2009, 10:14 pm / datacenters, google, video, youtube

Apparently [unladen-swallow] is already 30% faster than CPython, and this version is being used to run some of the Python code on YouTube.

Ted Leung

# 30th March 2009, 10:10 am / google, python, unladenswallow, youtube


YouTube Enables Deep Linking Within Videos. Add #t=1m45s to the end of a YouTube URL to jump to that spot. I’d be a lot more impressed by this if visiting a YouTube link in the UK didn’t use IP geo-targeting to redirect me to, losing the fragment identifier and hence the #t specifier in the process.
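
As a rough illustration of the `#t=1m45s` format, here is a minimal Python sketch that converts between offsets and seconds and appends the fragment to a URL. The function names are hypothetical, the `VIDEO_ID` in the usage note is a placeholder, and it assumes only the minutes-and-seconds form shown above; it is not YouTube's own parsing logic.

```python
import re
from urllib.parse import urldefrag

# Assumption: only the "<minutes>m<seconds>s" form quoted above is handled.
_OFFSET_RE = re.compile(r"^(?:(\d+)m)?(?:(\d+)s)?$")


def parse_offset(offset: str) -> int:
    """Convert an offset such as '1m45s' into a number of seconds."""
    match = _OFFSET_RE.match(offset)
    if not offset or match is None:
        raise ValueError(f"unrecognised offset: {offset!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


def format_offset(total_seconds: int) -> str:
    """Convert a number of seconds into the '1m45s' style."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m{seconds}s"


def add_time_fragment(url: str, total_seconds: int) -> str:
    """Replace any existing fragment on the URL with a #t= offset."""
    base, _old_fragment = urldefrag(url)
    return f"{base}#t={format_offset(total_seconds)}"
```

For example, `add_time_fragment("https://www.youtube.com/watch?v=VIDEO_ID", 105)` produces a URL ending in `#t=1m45s`. The fragment lives entirely on the client side of the URL, which is why a server-side redirect that builds a fresh URL can drop it.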

# 26th October 2008, 8:28 am / broken, fragments, geoip, urls, youtube

Popular Websites Vulnerable to Cross-Site Request Forgery Attacks. Ed Felten and Bill Zeller announce four CSRF holes, in ING Direct, YouTube, MetaFilter and the New York Times. The ING Direct hole allowed transfer of funds out of a user’s bank accounts! The first three were fixed before publication; the New York Times hole still exists (despite being reported a year ago), and allows you to silently steal e-mail addresses by CSRFing the “E-mail this” feature.
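
One common defence against this class of attack is a per-session token that a forged cross-site request cannot guess. The following is a minimal Python sketch of that general technique, not the fix any of the named sites actually shipped; the function names, the token layout and the way the secret key is supplied are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def _sign(session_id: str, nonce: str, secret_key: bytes) -> str:
    # HMAC ties the token to one session, so a token issued to an
    # attacker's own session is useless against a victim's session.
    message = f"{session_id}:{nonce}".encode()
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def new_csrf_token(session_id: str, secret_key: bytes) -> str:
    """Create a token to embed in a hidden form field (hypothetical helper)."""
    nonce = secrets.token_hex(16)
    return f"{nonce}.{_sign(session_id, nonce, secret_key)}"


def is_valid_csrf_token(token: str, session_id: str, secret_key: bytes) -> bool:
    """Check the token submitted alongside a state-changing request."""
    nonce, separator, signature = token.partition(".")
    if not separator:
        return False
    expected = _sign(session_id, nonce, secret_key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature, expected)
```

A server would call `new_csrf_token` when rendering a form and reject any POST where `is_valid_csrf_token` returns `False`, so a request forged from another site, which cannot read the token, fails the check.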

# 29th September 2008, 1:08 pm / bill-zeller, csrf, edfelten, ingdirect, metafilter, new-york-times, security, youtube

Wario Land: Shake It—Amazing footage! Some virals really do deserve linking to.

# 26th September 2008, 4:46 pm / ad, flash, viral, wario, wii, youtube

YouTube Playlist: DjangoCon 2008 Sessions. YouTube’s tag and search indexes appear to lag behind the main site by quite a while; this appears to be the definitive index page for videos of talks at DjangoCon.

# 16th September 2008, 4:50 am / django, djangocon08, python, youtube

YouTube: djangocon tag. Google have started posting videos of presentations at DjangoCon on YouTube.

# 16th September 2008, 2:43 am / django, djangocon, google, python, youtube

“THIS IS NOT MLM!!!”—An Appreciation. Merlin Mann explains his fascination with the “cash gifting” pyramid scams that keep cropping up on YouTube.

# 3rd August 2008, 3:14 pm / cashgifting, merlinmann, mlm, pyramidschemes, youtube

There is a reason why Flickr eventually killed Yahoo! Photos and why it was decided that Google Video be relegated to being a search brand while YouTube would be the social sharing brand. The brand baggage and the accompanying culture made them road kill.

Dare Obasanjo

# 16th June 2008, 2:54 pm / branding, dare-obasanjo, flickr, google, yahoo, youtube


Musical hackery. Indescribably clever musical video game creation, where images from classic games spell out their own theme tunes. The smartest thing I’ve seen on YouTube, well, ever.

# 22nd November 2007, 5:03 pm / games, genius, hack, music, youtube

Silly MS-DOS 5 Promo Video. I can’t decide if this is better or worse than the Windows 386 rap.

# 13th September 2007, 10:10 am / funny, microsoft, msdos, windows, youtube

H.264 support coming to the Flash player. It looks like this is a response to the higher video quality offered by Silverlight. I wonder if YouTube knew about this when they started transcoding their videos to H.264 for the Apple TV and iPhone.

# 21st August 2007, 8:28 am / adobe, appletv, flash, h264, iphone, microsoft, silverlight, video, youtube

YouTube Scalability Talk. Kyle Cordes’ notes on a Google Tech Talk on scaling YouTube by Cuong Do.

# 14th July 2007, 10:26 pm / cuongdo, google, googletechtalk, kylecordes, scaling, youtube