Building an ethically sourced Swahili audio corpus

Datasets are not neutral

Every hour of audio in our corpus has a name behind it, a consent form on file, and a payment trail. That sounds obvious; in practice it is rare in African AI work.

How the corpus is structured

3,400 hours across Standard Swahili, Coastal, Unguja, and Pemba dialects
Gender-balanced speakers across five age brackets
Open-licensed subset (400 hours) for research; full corpus available under commercial terms
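
To make "a consent form on file and a payment trail" concrete, here is a minimal sketch of how a per-clip manifest record could carry that provenance alongside the dialect and licence tier. This is an illustration only, not our actual schema: the class name Clip, every field name, and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Dialect labels taken from the list above; the lowercase spellings are an assumption.
DIALECTS = {"standard", "coastal", "unguja", "pemba"}


@dataclass(frozen=True)
class Clip:
    """One recording in a hypothetical corpus manifest."""
    clip_id: str
    speaker_id: str
    dialect: str
    duration_s: float
    consent_id: Optional[str]   # reference to the consent record on file
    payment_id: Optional[str]   # reference to the payment record
    open_licensed: bool = False  # True if the clip is in the open research subset


def provenance_gaps(clip: Clip) -> List[str]:
    """Return a list of reasons this clip should not ship; empty means it is clean."""
    gaps = []
    if clip.dialect not in DIALECTS:
        gaps.append("unknown dialect")
    if not clip.consent_id:
        gaps.append("missing consent record")
    if not clip.payment_id:
        gaps.append("missing payment record")
    return gaps


def hours_by_dialect(clips: Iterable[Clip], open_only: bool = False) -> Dict[str, float]:
    """Sum recorded hours per dialect, optionally for the open subset only."""
    totals: Dict[str, float] = {}
    for clip in clips:
        if open_only and not clip.open_licensed:
            continue
        totals[clip.dialect] = totals.get(clip.dialect, 0.0) + clip.duration_s / 3600.0
    return totals
```

Under these assumptions, a release script would refuse any clip for which provenance_gaps returns a non-empty list, and hours_by_dialect with open_only=True would report the size of the research subset. Speaker gender and age bracket could be tracked as further fields in the same way.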

What we learned about consent

Consent forms in English, even with Swahili summaries attached, did not work. We rebuilt the consent flow as a guided audio conversation in Swahili, in which each speaker records their own consent statement. After the change, withdrawal rates dropped and trust scores went up.

Written by

Swahili Developers

Field notes from the team building Swahili-first AI across East Africa.
