On February 28, 2019, Mozilla released the largest public domain transcribed voice dataset to date. Crowdsourced by Mozilla, it is the largest dataset of human voices available for use, covering 18 languages and adding up to almost 1,400 hours of recorded voice data from more than 42,000 contributors. The audio files are in MP3 format, with a corresponding text data file.
- DeepSpeech - The Common Voice dataset complements Mozilla’s open source voice recognition engine DeepSpeech, which you can use to build speech recognition applications. Read our GitHub overview or join the DeepSpeech Discourse to learn how to get started.
- Discourse - Have questions about Common Voice? Join us on our Discourse forum.
- LibriSpeech - LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, derived from audiobooks from the LibriVox project.
- TED-LIUM Corpus - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website.
- VoxForge - VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.
- Tatoeba - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
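
The Common Voice download described at the top pairs MP3 clips with a text data file. As a rough sketch of how that pairing might be read, the Python below maps each clip to its transcript. It assumes the text file is tab-separated with a header row containing `path` and `sentence` columns and that the clips sit in a `clips` directory. These names, and the `load_transcripts` helper itself, are illustrative assumptions, not details stated here, so check them against the files you actually download.

```python
import csv
import io


def load_transcripts(tsv_file, clips_dir="clips"):
    """Map each clip's location to its transcript.

    Assumption: the text data file is tab-separated with a header row
    that includes `path` (clip file name) and `sentence` (transcript).
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {f"{clips_dir}/{row['path']}": row["sentence"] for row in reader}


# Hypothetical in-memory sample standing in for the real text data file.
sample = io.StringIO(
    "path\tsentence\n"
    "clip_0001.mp3\tHello world.\n"
    "clip_0002.mp3\tShe said \"hi\".\n"
)

for clip, sentence in load_transcripts(sample).items():
    print(clip, "->", sentence)
```

With a real download, you would pass an open file handle (for example `open("validated.tsv", encoding="utf-8", newline="")`, where the file name is again an assumption) instead of the in-memory sample.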