OPEN DATA FOR PUBLIC INTEREST AI

We call for digital public goods (DPGs) that make it easier to identify, prepare, share, and use higher-quality open training data, particularly for the following use cases:

  • Development of language models that address language gaps in AI development.
  • Solutions for public service delivery.
  • Research-based climate action (monitoring, mitigation, adaptation).

PRIMARY GOALS

  • Identify and create open-source tools and toolkits that can increase the availability of high-quality open training data for public interest AI.
  • Increase cooperation, coordination, and alignment among groups of stakeholders who are already working on open data for open-source AI.
  • Demonstrate that public interest AI can be built in an open and transparent way, and strengthen and grow the community of stakeholders committed to advancing AI systems that meet the DPG Standard.

WHY THIS

The development of public interest AI, including AI systems as digital public goods, depends on the opportunity to train models on both existing and new high-quality, openly licensed datasets. Many challenges impede doing this at a larger scale; one is the resources required to produce and share open data in different geographical contexts. One way to address this challenge is to create an adaptable and reusable toolkit that can be recommended to countries and stakeholders to facilitate the collection, extraction, processing, and preparation of data.

WHY NOW

Generative AI is advancing at breakneck speed, and the term “open-source AI” is often misused to describe systems that have open weights but offer no transparency about, or access to, the data they were trained on. This lack of transparency poses a significant risk, as these systems are increasingly shaping our norms, values, understanding of reality, and access to information and services at the most fundamental level. There is an urgent need to overcome the barriers to building AI systems that serve the public interest in a more transparent and open way. This includes reducing some of the main technical barriers to making more high-quality open training data available.

HOW TO SUPPORT

  • Identify the main technical barriers to unlocking more open training data for the priority use cases identified.
  • Share existing open-source AI-development tools and suggest areas where new tools should be developed for addressing these barriers.
  • Fund and/or develop promising toolkit approaches and open-source tools for the priority use cases.
  • Identify and highlight examples of AI systems that meet the DPG Standard.
  • Advocate to policymakers and funders that public interest AI systems can and should be built in an open and transparent way.