Simon Willison’s Weblog

Blogmarks tagged copyright, generativeai


Releasing Common Corpus: the largest public domain dataset for training LLMs (via) Released today. 500 billion words from “a wide diversity of cultural heritage initiatives”. 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.

Includes quite a lot of US public domain data: 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?).

“This is only an initial part of what we have collected so far, in part due to the lengthy process of copyright duration verification. In the following weeks and months, we’ll continue to publish many additional datasets also coming from other open sources, such as open data or open science.”

Coordinated by French AI startup Pleias and supported by the French Ministry of Culture, among others.

I can’t wait to try a model that’s been trained on this. # 20th March 2024, 7:34 pm

OpenAI and journalism. Bit of a misleading title here: this is OpenAI’s first public response to the lawsuit filed by the New York Times concerning their use of unlicensed NYT content to train their models. # 8th January 2024, 6:33 pm

Stable Diffusion copyright lawsuits could be a legal earthquake for AI. Timothy B. Lee provides a thorough discussion of the copyright lawsuits currently targeting Stable Diffusion and GitHub Copilot, including subtle points about how the interpretation of “fair use” might be applied to the new field of generative AI. # 3rd April 2023, 3:34 pm

mitsua-diffusion-one (via) “Mitsua Diffusion One is a latent text-to-image diffusion model, which is a successor of Mitsua Diffusion CC0. This model is trained from scratch using only public domain/CC0 or copyright images with permission for use.” I’ve been talking about how much I’d like to try out a “vegan” AI model trained entirely on out-of-copyright images for ages, and here one is! It looks like the training data mainly came from CC0 art gallery collections such as the Metropolitan Museum of Art Open Access. # 23rd March 2023, 2:56 pm