Simon Willison’s Weblog

Subscribe

12 items tagged “copyright”

2024

Releasing Common Corpus: the largest public domain dataset for training LLMs (via) Released today. 500 billion words from “a wide diversity of cultural heritage initiatives”. 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.

Includes quite a lot of US public domain data—21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?)

“This is only an initial part of what we have collected so far, in part due to the lengthy process of copyright duration verification. In the following weeks and months, we’ll continue to publish many additional datasets also coming from other open sources, such as open data or open science.”

Coordinated by French AI startup Pleias and supported by the French Ministry of Culture, among others.

I can’t wait to try a model that’s been trained on this. # 20th March 2024, 7:34 pm

OpenAI and journalism. Bit of a misleading title here: this is OpenAI’s first public response to the lawsuit filed by the New York Times concerning their use of unlicensed NYT content to train their models. # 8th January 2024, 6:33 pm

We believe that AI tools are at their best when they incorporate and represent the full diversity and breadth of human intelligence and experience. [...] Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI to the Lords Select Committee on LLMs # 8th January 2024, 5:33 pm

2023

I’ve resigned from my role leading the Audio team at Stability AI, because I don’t agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.

[...] I disagree because one of the factors affecting whether the act of copying is fair use, according to Congress, is “the effect of the use upon the potential market for or value of the copyrighted work”. Today’s generative AI models can clearly be used to create works that compete with the copyrighted works they are trained on. So I don’t see how using copyrighted works to train generative AI models of this nature can be considered fair use.

But setting aside the fair use argument for a moment — since ‘fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong. Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works.

Ed Newton-Rex # 15th November 2023, 9:31 pm

Beyond these specific legal arguments, Stability AI may find it has a “vibes” problem. The legal criteria for fair use are subjective and give judges some latitude in how to interpret them. And one factor that likely influences the thinking of judges is whether a defendant seems like a “good actor.” Google is a widely respected technology company that tends to win its copyright lawsuits. Edgier companies like Napster tend not to.

Timothy B. Lee # 3rd April 2023, 3:38 pm

Stable Diffusion copyright lawsuits could be a legal earthquake for AI. Timothy B. Lee provides a thorough discussion of the copyright lawsuits currently targeting Stable Diffusion and GitHub Copilot, including subtle points about how the interpretation of “fair use” might be applied to the new field of generative AI. # 3rd April 2023, 3:34 pm

mitsua-diffusion-one (via) “Mitsua Diffusion One is a latent text-to-image diffusion model, which is a successor of Mitsua Diffusion CC0. This model is trained from scratch using only public domain/CC0 or copyright images with permission for use.” I’ve been talking about how much I’d like to try out a “vegan” AI model trained entirely on out-of-copyright images for ages, and here one is! It looks like the training data mainly came from CC0 art gallery collections such as the Metropolitan Museum of Art Open Access. # 23rd March 2023, 2:56 pm

2010

Copyright is hard. Dan Catt spots Adam Liversage (Director of Communications for the BPI, and hence right in the middle of the DEBill) suggesting his wife violate copyright on Twitter. # 11th April 2010, 2:26 pm

2008

Free licenses upheld by US “IP” court. Free software and CC licenses which dictate conditions that, when violated, turn you in to a copyright infringer now have precedence in US law. # 14th August 2008, 9:33 am

The Open Web Foundation. Launched today at OSCON, an independent, non-profit organisation dedicated to incubating and protecting new specifications like OAuth and oEmbed. The focus is incubation, licensing, copyright and community. # 24th July 2008, 5:40 pm

2007

Open Rights Group: Our first two years. ORG’s review of the past two years shows just how worthwhile a cause they have become—highlights include their hugely successful campaign against copyright term extension and their involvement in this year’s e-voting trials. # 25th November 2007, 10:05 pm

Please, fanboys, don’t send me dumb notes averring that Apple’s failure to police this use of its mark will lead to the end of its ability to stop manufacturers from producing rival MP3 players and calling them iPods. That’s a fairy tale that trademark lawyers tell their kids when they want to reassure them that they’ll have a healthy college fund.

Cory Doctorow # 12th February 2007, 2:05 pm