The Role of Open Data in AI Systems as Digital Public Goods

October 29, 2024

Author: Liv Marte Nordhaug, Secretariat CEO, Digital Public Goods Alliance

Over the last few years, there has been a surge in interest in and adoption of generative artificial intelligence systems, and a corresponding interest in clarifying and delineating what open source should mean for AI and how to ensure AI serves the public interest. The DPGA Secretariat has been an active part of these conversations. Recognizing the transformative potential of AI, we have explored ways to democratise its benefits, advocating for public spending on AI that prioritises public interest and equitable access. Additionally, through a community of practice (CoP) co-hosted by UNICEF, we have been examining how the DPG Standard may need to adapt in order to better determine which AI systems qualify as digital public goods.

This work has been unfolding against a backdrop of other initiatives and organisations similarly addressing complex questions surrounding the future development and use of artificial intelligence in the public interest domain. One particularly important initiative has been the work to define open source AI, stewarded by the Open Source Initiative (OSI) with the involvement of a large number of stakeholders and experts. After a two-year process, the Open Source AI Definition (OSAID) Version 1.0 was released on October 28.

The process OSI undertook, in the words of Mozilla, “is a significant step toward bringing clarity and rigour to the open source AI discussion” and “has established a crucial reference point for discussions on open source AI”. Mozilla also notes that several complex issues have been brought to the forefront, particularly around whether and how training data for AI models should be shared as part of open source AI.

Given that open AI systems can receive DPG recognition, the topics that have been addressed as part of the OSAID process are of critical importance to the Digital Public Goods Alliance. According to the UN Secretary-General’s Roadmap for Digital Cooperation, digital public goods are open source software, open standards, open data, open AI systems, and open content collections that adhere to privacy and other applicable laws and best practices, do no harm, and help attain the Sustainable Development Goals (SDGs). This definition is operationalised through the DPG Standard, a set of nine indicators that are used to determine whether or not a solution can be recognized as a digital public good. The DPG Standard therefore goes beyond requiring open source licensing; it also assesses whether digital solutions are SDG-relevant, accessible, adaptable, platform-independent, adhere to best practices, and have been designed to minimise the risk of doing harm. The DPGA Secretariat maintains the DPG Standard and also assesses nominated digital solutions against it. Solutions verified to meet the DPG Standard’s criteria are listed on the DPG Registry.

In 2023 the DPGA Secretariat, alongside UNICEF, co-convened an expert CoP on AI systems as digital public goods. The purpose was to provide recommendations on how the DPG Standard could evolve to better define and recognize AI systems as DPGs. Though the CoP included participants who were involved in the parallel OSAID process, the intention of the CoP was to build on the learnings and outcomes of the OSAID process.

The CoP recently delivered its final recommendations for the DPG Standard Council. These recommendations included maintaining the DPG Standard’s binary approach to defining AI models as open source or not, making as much data available as possible, excluding responsible AI licences, and adding a number of do-no-harm requirements such as making several risk mitigation measures mandatory, including an AI risk assessment for the specific use cases for which the model was developed, a responsible use guide, and a plan for utilising AI safety by design principles. The recommendations, in their entirety, and insights into the areas where consensus was never fully reached can be found here.

The DPG Standard Council is now in the process of making several updates to the DPG Standard as it relates to AI systems, informed by the CoP recommendations as well as by other engagements and consultations with stakeholders. Importantly, the proposal, which will soon move to community consultation, is to continue requiring open training data for AI systems to be considered DPGs.

The DPG Standard may evolve in a more permissive direction over time, but before considering any exceptions, it’s important to have a better understanding of the approaches and tools that are under consideration to address some of the current main challenges around data sharing and AI. These challenges include data governance, transparency and accountability; consent and licensing for training; and regulatory compliance and policy priorities. We need time to explore the intersection of open data, data sharing and AI systems in depth, and to gather more perspectives and preferences from global majority stakeholders. This extends to the relative importance of high-quality open training data as an input factor for the AI systems they would like to see built.

The DPGA Secretariat is also dedicated to preserving the integrity of open data. Recently, an important update was introduced to reinforce this commitment: now, only open content collections and datasets with fully open licenses are eligible for recognition as digital public goods. With regard to AI systems, there is a need to ensure that we don’t inadvertently undermine the open data movement and open data as a category of DPGs by advancing an approach to AI systems that is more permissive than for other categories of DPGs.

Maintaining a high bar on training data could potentially result in fewer AI systems meeting the DPG Standard criteria. However, SDG relevance, platform independence, and do-no-harm by design are features that set DPGs apart from other open source solutions—and for those reasons, the inclusion of training data is needed. With DPGs, we want to help evolve the public interest AI landscape as the ecosystem gains a better understanding of how to address complexities regarding open data and data sharing.

The DPG Standard Council’s final proposal, which includes mandating open training data for AI systems as DPGs, will appear on GitHub in early November and will be open for public comment for a 4-week community review period. At our upcoming annual members’ meeting and in the coming year, the DPGA Secretariat looks forward to advancing and contributing to these tremendously important conversations together with the growing landscape of stakeholders committed to ensuring open source AI advances public interest goals. That also includes stakeholders in the open data, open content and open knowledge communities that can help inform these critically important conversations. Please join us in these conversations!

*Update: The DPG Standard Council’s final proposal, which will be open for public comment for 4 weeks, was published on November 5 on GitHub and can be viewed here.