Mozilla Common Voice dataset

A multilingual open voice dataset created by Mozilla Common Voice, an initiative to help teach machines how real people speak.

Website: https://commonvoice.mozilla.org/

Type of Digital Public Good

  • Open content
  • ✅  Open data
  • Open software
  • Open standard
  • Open AI model

1. Is it relevant to one of the Sustainable Development Goals?

  • 5. Gender Equality

    Evidence: Because of diversity issues within AI training sets, technology often displays gender bias. From voice technology performing better for men than for women to safety equipment being designed to support male forms rather than female ones, we have a gender equity problem in the design and rollout of tech. Common Voice is a way for women and gender non-conforming people to be better represented in voice AI training sets. We have a gender action plan and specific interventions to support people across the gender spectrum to participate, including writeathons, contributhons and a gender working group.

  • 9. Industry, Innovation and Infrastructure

    Evidence: Substantial voice AI training sets remain proprietary, held by big companies in a way that stifles innovation. By making Common Voice open access, we have enabled a wide spectrum of innovative organisations to build ASR (automatic speech recognition) for free.

  • 10. Reduced Inequality

    Evidence: The same AI training set inequities that impact women and gender diverse people also impact People of Colour, neurodiverse people and those with speech pathologies. By creating an open platform to crowdsource data from anyone and everyone who can speak, we are creating real representation opportunities in voice datasets. This will support access to technology and reduce the digital divide.

  • 17. Partnerships to achieve the Goal

    Evidence: Common Voice is a consortium project stewarded by Mozilla, with investment from the Gates Foundation, GIZ, FCDO, and NVIDIA (with more partners to come). This consortium of for-profit, non-profit and government partners is coming together to advance equitable open technology in MCV.

2. Does it use an appropriate open license?

Yes, this project is licensed under the following license(s):

3. Is ownership clearly defined?

Is the ownership of the project and everything that the project produces clearly defined and documented?

Yes

If yes - please link to the relevant copyright, trademarks, or ownership documentation for the project.

https://commonvoice.mozilla.org/en/terms

4. Does the license of libraries/dependencies undermine the openness of the project?

Does this open project have mandatory dependencies (i.e. libraries, hardware) that create more restrictions than the original license?

No

If yes - are the open source components able to demonstrate independence from the closed component(s) and/or are there functional, open alternatives?

Not Applicable

If yes - please describe how the open source components are independent and/or list the open alternatives for the closed component:

Not Applicable

5. Is there documentation?

Does some documentation exist of the source code, use cases, and/or functional requirements? For software projects, this should be present as technical documentation that would allow a technical person unfamiliar with the project to launch and run the software. For datasets and data projects, this should be present as documentation that describes all the fields in the set, and provides context on how the data was collected and how it should be interpreted. For content collections, this should indicate any relevant compatible apps, software, or hardware required to access the content and any instructions about how to use it.

Yes

If yes - please link to the relevant documentation:

6. Is non PII data and/or content accessible?

Does this project collect or use non-personally identifiable information (non-PII) data and/or content?

Yes

If yes - is there a mechanism for extracting or importing non-personally identifiable information (non-PII) from the system in a non-proprietary format?

Yes

If yes - describe the mechanism for extracting or importing non-personally identifiable information from the system in a non-proprietary format:

The datasets can be downloaded in MP3 format from https://commonvoice.mozilla.org/en/datasets
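
As a rough illustration of working with a download in this format, the sketch below collects the MP3 clips found under a locally extracted copy of a dataset. The function name `list_mp3_clips` and the `dataset_dir` argument are hypothetical, and nothing is assumed about the archive layout beyond the clips being `.mp3` files somewhere under that directory.

```python
from pathlib import Path


def list_mp3_clips(dataset_dir):
    """Return the sorted paths of all .mp3 files under dataset_dir.

    dataset_dir is a hypothetical local directory holding an extracted download.
    """
    root = Path(dataset_dir)
    # rglob searches subdirectories too, so the exact layout does not matter here.
    return sorted(p for p in root.rglob("*.mp3") if p.is_file())
```

Calling `len(list_mp3_clips("path/to/extracted/dataset"))` would then give the number of clips available locally.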

7. Does the project adhere to privacy and other applicable international and domestic laws?

Has this project taken steps to ensure adherence with relevant privacy, domestic, and international laws? For example, the General Data Protection Regulation (GDPR) in the European Union or the Supplementary Act A/SA.1/01/10 on Personal Data Protection for the Economic Community of West African States (ECOWAS) (yes/no)

Yes

If yes, please list some of the relevant laws that the project complies with:

  • GDPR

If yes, please describe the steps this project has taken to ensure adherence (include links to terms of service, privacy policy, or other relevant documentation):

8. Does the project adhere to standards and best practices?

Does this project support standards? (i.e. Web Content Accessibility Guidelines (WCAG) 2.1 or other standards such as those listed on W3C)

Yes

Which standards does this project support? (please list)

  • JSON
  • ISO 639-1 codes
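
To make the two listed standards concrete, here is a minimal sketch that parses a JSON record and checks that its language field has the two-letter shape of an ISO 639-1 code. The record layout and the `locale` field name are hypothetical, chosen only for illustration. The check covers the shape of the code, not whether it is actually an assigned ISO 639-1 code.

```python
import json
import re

# ISO 639-1 codes are two lowercase ASCII letters (for example "en" or "cy").
ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")


def has_iso_639_1_shape(code):
    """Return True if code looks like an ISO 639-1 code (shape check only)."""
    return isinstance(code, str) and ISO_639_1_SHAPE.fullmatch(code) is not None


def locale_from_json(text):
    """Parse a JSON object and return its hypothetical 'locale' field."""
    record = json.loads(text)
    locale = record.get("locale")
    if not has_iso_639_1_shape(locale):
        raise ValueError(f"not a two-letter language code: {locale!r}")
    return locale
```

For example, `locale_from_json('{"locale": "cy"}')` returns `"cy"`, while a record whose `locale` is `"english"` raises `ValueError`.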

Can you point to evidence of your support? (i.e. please link to your validator, open test suite, etc.)

Was this project built and developed according to or in adherence with any design, technical and/or sector best practices or principles? i.e. the Principles for Digital Development?

Yes

Which principles and best practices does this project support? (please list)

9. Does the project do no harm by design?

Has this project taken steps to anticipate, prevent and do no harm by design?

On the whole, does this project take steps to ensure that it anticipates, prevents and does no harm by design?

Yes

Is there any additional information you would like to share about the mechanisms, processes or policies that this project uses to avoid doing harm by design?

https://commonvoice.mozilla.org/en/terms - age limits, keeping a list of downloaders to support the right to be forgotten, reducing dataset mirroring and forking, community participation guidelines in place, and an active community manager

9.a. Data Privacy & Security

Does this project collect or store personally identifiable information (PII) data and/or content?

Yes

If yes - please list the types of data and/or content collected and/or stored by the project:

  • email addresses

If yes - does this project share this data and/or content with third parties?

No

Please describe the circumstances with which this project shares data and/or content with third parties. Please add links as relevant.

Not Applicable

If yes - does the project ensure the privacy, security and integrity of this data and/or content collection, and has it taken steps to prevent adverse impacts resulting from its collection, storage and distribution?

Yes

If yes - please describe the steps, and include a link to the privacy policy and/or terms of service:

Account data. You do not need to create an account to use Common Voice. If you decide to create an account, we receive your username and avatar, if you submit one. Your email address is associated with your demographic and interaction data but is not shared with the public. We display leaderboards showing the number of recordings users make. You can choose whether or not to appear on the leaderboards. You can delete your account at any time and your username and email will be removed. For more details, see https://commonvoice.mozilla.org/en/privacy

9.b. Inappropriate & Illegal Content

Does this project collect, store or distribute content?

No

If yes - what kinds of content does this project collect, store or distribute? (i.e. children's books)

Not Applicable

If yes - does this project have policies that describe what is considered inappropriate content? (i.e. child sexual abuse materials)

Not Applicable

If yes - please link to the relevant policy/guidelines/documentation.

Not Applicable

If yes - does this project have policies and processes for detecting and moderating inappropriate/illegal content?

Not Applicable

If yes - please describe the policies and processes for detecting, reporting and removing inappropriate/illegal content (Please include the average response time for assessment and/or action. Link to any policies or descriptions of how inappropriate content is handled):

Not Applicable

9.c. Protection from harassment

Does this project facilitate interactions with or between users or contributors?

Yes

If yes - does the project take steps to address the safety and security of underage users?

Yes

If yes - please describe the steps this project takes to address risk or prevent access by underage users:

  • Children cannot participate without a guardian, as per the terms (https://commonvoice.mozilla.org/en/terms)

If yes - does the project help users and contributors protect themselves against grief, abuse, and harassment?

Yes

If yes - please describe the steps taken to help users protect themselves.

Development & deployment countries

List of countries this project was developed in.

  • Canada
  • United States of America
  • United Kingdom

List of countries this project is actively deployed in.

  • United States of America