Simon Willison’s Weblog

Subscribe

Debate over “open source AI” term brings new push to formalize definition. Benj Edwards reports on the latest draft (v0.0.9) of a definition for "Open Source AI" from the Open Source Initiative.

It's been under active development for around a year now, and I think the definition is looking pretty solid. It starts by emphasizing the key values that make an AI system "open source":

An Open Source AI is an AI system made available under terms and in a way that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is to have access to the preferred form to make modifications to the system.

There is one very notable absence from the definition: while it requires the code and weights be released under an OSI-approved license, the training data itself is exempt from that requirement.

At first impression this is disappointing, but I think it it's a pragmatic decision. We still haven't seen a model trained entirely on openly licensed data that's anywhere near the same class as the current batch of open weight models, all of which incorporate crawled web data or other proprietary sources.

For the OSI definition to be relevant, it needs to acknowledge this unfortunate reality of how these models are trained. Without that, we risk having a definition of "Open Source AI" that none of the currently popular models can use!

Instead of requiring the training information, the definition calls for "data information" described like this:

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

The OSI's FAQ that accompanies the draft further expands on their reasoning:

Training data is valuable to study AI systems: to understand the biases that have been learned and that can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information – like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.