Mozilla Common Voice dataset
A multilingual open voice dataset created by Mozilla Common Voice, an initiative to help teach machines how real people speak
Type of Digital Public Good
- Open content
- ✅ Open data
- Open software
- Open standard
- Open AI model
1. Is it relevant to one of the Sustainable Development Goals?
- 5. Gender Equality
Evidence: Because of diversity issues within AI training sets, technology often displays gender bias. From voice technology performing better for men than women, to safety equipment being designed to support male forms rather than female ones, we have a gender equity problem in the design and rollout of tech. Common Voice is a way for women and gender-diverse people to be better represented in voice AI training sets. We have a gender action plan and specific interventions to support people across the gender spectrum to participate, including writeathons, contributhons and a gender working group.
- 9. Industry, Innovation and Infrastructure
Evidence: Substantial voice AI training sets remain proprietary, held by big companies in a way that stifles innovation. By making Common Voice open access, we have enabled a wide spectrum of innovative organisations to build ASR (automatic speech recognition) for free.
- 10. Reduced Inequality
Evidence: The same AI training set inequities that impact women and gender diverse people also impact People of Colour, neurodiverse people and those with speech pathologies. By creating an open platform to crowdsource data from anyone and everyone who can speak, we are creating real representation opportunities in voice datasets. This will support access to technology and reduce the digital divide.
- 17. Partnerships to achieve the Goal
Evidence: Common Voice is a consortium project stewarded by Mozilla, with investment from the Gates Foundation, GIZ, FCDO, and NVIDIA (with more partners to come). This consortium of for-profit, non-profit and government organisations is coming together to advance equitable open technology in MCV.
2. Does it use an appropriate open license?
Yes, this project is licensed under the following license(s):
3. Is ownership clearly defined?
Is the ownership of the project and everything that the project produces clearly defined and documented?
If yes - please link to the relevant copyright, trademarks, or ownership documentation for the project.
4. Does the license of libraries/dependencies undermine the openness of the project?
Does this open project have mandatory dependencies (i.e. libraries, hardware) that create more restrictions than the original license?
If yes - are the open source components able to demonstrate independence from the closed component(s) and/or are there functional, open alternatives?
If yes - please describe how the open source components are independent and/or list the open alternatives for the closed component:
5. Is there documentation?
Does some documentation exist of the source code, use cases, and/or functional requirements? For software projects, this should be present as technical documentation that would allow a technical person unfamiliar with the project to launch and run the software. For datasets and data projects, this should be present as documentation that describes all the fields in the set, and provides context on how the data was collected and how it should be interpreted. For content collections, this should indicate any relevant compatible apps, software, or hardware required to access the content and any instructions about how to use it.
If yes - please link to the relevant documentation:
6. Is non-PII data and/or content accessible?
Does this project collect or use non-personally identifiable information (non-PII) data and/or content?
If yes - is there a mechanism for extracting or importing non-personally identifiable information (non-PII) from the system in a non-proprietary format?
If yes - describe the mechanism for extracting or importing non-personally identifiable information from the system in a non-proprietary format:
Downloads can be done in MP3 format here - https://commonvoice.mozilla.org/en/datasets
7. Does the project adhere to privacy and other applicable international and domestic laws?
Has this project taken steps to ensure adherence with relevant privacy, domestic, and international laws? For example, the General Data Protection Regulation (GDPR) in the European Union or the Supplementary Act A/SA.1/01/10 on Personal Data Protection for the Economic Community of West African States (ECOWAS) (yes/no)
If yes, please list some of the relevant laws that the project complies with:
8. Does the project adhere to standards and best practices?
Does this project support standards? (i.e. Web Content Accessibility Guidelines (WCAG) 2.1 or other standards such as those listed on W3C)
Which standards does this project support? (please list)
- ISO 639-1 codes
Can you point to evidence of your support? (i.e. please link to your validator, open test suite, etc.)
Was this project built and developed according to or in adherence with any design, technical and/or sector best practices or principles (e.g. the Principles for Digital Development)?
Which principles and best practices does this project support? (please list)
9. Does the project do no harm by design?
Has this project taken steps to anticipate, prevent and do no harm by design?
On the whole, does this project take steps to ensure that it anticipates, prevents and does no harm by design?
Is there any additional information you would like to share about the mechanisms, processes or policies that this project uses to avoid doing harm by design?
https://commonvoice.mozilla.org/en/terms - age limits, keeping a list of downloaders to support the right to be forgotten, reducing dataset mirroring and forking, community participation guidelines in place, and an active community manager
9.a. Data Privacy & Security
Does this project collect or store personally identifiable information (PII) data and/or content?
If yes - please list the types of data and/or content collected and/or stored by the project:
- email addresses
If yes - does this project share this data and/or content with third parties?
Please describe the circumstances under which this project shares data and/or content with third parties. Please add links as relevant.
If yes - does the project ensure the privacy, security and integrity of this data and/or content collection and has it taken steps to prevent adverse impacts resulting from its collection, storage and distribution?
Account data. You do not need to create an account to use Common Voice. If you decide to create an account, we receive your username and avatar, if you submit one. Your email address is associated with your demographic and interaction data but is not shared to the public. We display leaderboards showing the number of recordings users make. You have the option whether or not you wish to appear on the leaderboards. You can delete your account at any time and your username and email will be removed. For more details - https://commonvoice.mozilla.org/en/privacy
9.b. Inappropriate & Illegal Content
Does this project collect, store or distribute content?
If yes - what kinds of content does this project collect, store or distribute? (e.g. children's books)
If yes - does this project have policies that describe what is considered inappropriate content? (e.g. child sexual abuse materials)
If yes - please link to the relevant policy/guidelines/documentation.
If yes - does this project have policies and processes for detecting and moderating inappropriate/illegal content?
If yes - please describe the policies and processes for detecting, reporting and removing inappropriate/illegal content (please include the average response time for assessment and/or action, and link to any policies or descriptions of how inappropriate content is handled):
9.c. Protection from harassment
Does this project facilitate interactions with or between users or contributors?
If yes - does the project take steps to address the safety and security of underage users?
If yes - please describe the steps this project takes to address risk or prevent access by underage users:
- children cannot participate without a guardian as per /terms
If yes - does the project help users and contributors protect themselves against grief, abuse, and harassment?
If yes - please describe the steps taken to help users protect themselves.
Development & deployment countries
List of countries this project was developed in.
- United States of America
- United Kingdom
List of countries this project is actively deployed in.
- United States of America