Simon Willison’s Weblog: Debate over “open source AI” term brings new push to formalize definition

Source URL: https://simonwillison.net/2024/Aug/27/open-source-ai/#atom-everything
Source: Simon Willison’s Weblog
Title: Debate over “open source AI” term brings new push to formalize definition

Feedly Summary: Debate over “open source AI” term brings new push to formalize definition
Benj Edwards reports on the latest draft (v0.0.9) of a definition for “Open Source AI” from the Open Source Initiative.
It’s been under active development for around a year now, and I think the definition is looking pretty solid. It starts by emphasizing the key values that make an AI system "open source":

An Open Source AI is an AI system made available under terms and in a way that grant the freedoms to:

Use the system for any purpose and without having to ask for permission.
Study how the system works and inspect its components.
Modify the system for any purpose, including to change its output.
Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is to have access to the preferred form to make modifications to the system.

There is one very notable absence from the definition: while it requires the code and weights be released under an OSI-approved license, the training data itself is exempt from that requirement.
At first impression this is disappointing, but I think it’s a pragmatic decision. We still haven’t seen a model trained entirely on openly licensed data that’s anywhere near the same class as the current batch of open weight models, all of which incorporate crawled web data or other proprietary sources.
For the OSI definition to be relevant, it needs to acknowledge this unfortunate reality of how these models are trained.
The OSI’s FAQ that accompanies the draft further expands on this:

Training data is valuable to study AI systems: to understand the biases that have been learned and that can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information – like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

Tags: open-source, benj-edwards, generative-ai, training-data, ai

AI Summary and Description: Yes

Summary: The Open Source Initiative’s draft definition of “Open Source AI” sets out the fundamental freedoms an AI system must grant to qualify as open source. However, the draft has a significant limitation: the training data used for these systems is exempt from its licensing requirements. This raises questions about transparency, bias, and compliance in AI development.

Detailed Description: The text discusses the ongoing efforts by the Open Source Initiative (OSI) to formalize a definition for “Open Source AI.” Here are the major points:

– **Key Freedoms of Open Source AI**: The draft definition emphasizes four primary freedoms that characterize an AI system as open source:
    – The ability to use the system for any purpose without seeking permission.
    – The right to study the system and inspect its components.
    – The freedom to modify the system for any purpose, which includes changing its output.
    – The capability to share the system with others, whether modified or unmodified.

– **Access to Preferred Form**: A critical requirement to exercise these freedoms is having access to the preferred form that enables modifications to the system.

– **Exemption of Training Data**: The definition requires that code and weights be released under an OSI-approved license but exempts the training data from that requirement. This leaves the data, which is valuable for understanding the biases a model has learned, outside the definition’s licensing requirements.

– **Pragmatic Approach**: The author has mixed feelings about this decision: disappointing at first impression, it is nonetheless pragmatic, because no model trained entirely on openly licensed data yet comes close to the current open weight models, all of which incorporate crawled web data or other proprietary sources.

– **Importance of Training Data**: The OSI’s FAQ acknowledges that training data is valuable for studying AI systems and the biases they have learned, but argues that it is not part of the preferred form for modifying an existing system. It also notes that data can be hard to share: laws that permit training on data often limit resharing it to protect copyright or other interests, and privacy rules give people control over their most sensitive information.

– **Cultural Considerations**: The FAQ cites Indigenous knowledge as a further complication for data sharing: much of it is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

Overall, this discussion is highly relevant for professionals in AI and compliance, as it touches on core principles of transparency, fairness, and data governance within the rapidly evolving AI landscape. The implications for bias investigation, ethical AI development, and adherence to privacy laws are critical considerations for security and compliance experts moving forward.