Datasets are not neutral
Every hour of audio in our corpus has a name behind it, a consent form on file, and a payment trail. That sounds obvious; in practice it is rare in African AI work.
How the corpus is structured
- 3,400 hours across Standard Swahili, Coastal, Unguja, and Pemba dialects
- Gender-balanced speakers across five age brackets
- Open-licensed subset (400 hours) for research; full corpus available under commercial terms
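A minimal sketch of how a manifest with these properties could look, assuming one record per recording. Every field name here (`consent_ref`, `payment_ref`, `open_licensed`, the dialect labels) is hypothetical and chosen for illustration; it is not the corpus's actual schema. The point is that a record cannot exist without a consent reference and a payment reference, and that the open-licensed subset is a flag on the same records, not a separate dataset.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the four dialect groups named above.
DIALECTS = {"standard", "coastal", "unguja", "pemba"}


@dataclass(frozen=True)
class Recording:
    """One manifest row. All field names are illustrative assumptions."""
    recording_id: str
    speaker_id: str
    dialect: str
    gender: str
    age_bracket: str       # free-form label; the bracket boundaries are not specified here
    duration_hours: float
    consent_ref: str       # pointer to the consent record on file
    payment_ref: str       # pointer to the payment record
    open_licensed: bool = False

    def __post_init__(self):
        # Reject rows that break the "consent and payment for every hour" rule.
        if self.dialect not in DIALECTS:
            raise ValueError(f"unknown dialect: {self.dialect!r}")
        if not self.consent_ref or not self.payment_ref:
            raise ValueError("every recording needs a consent and a payment reference")


def hours_by_dialect(recordings, open_only=False):
    """Sum recording hours per dialect, optionally for the open-licensed subset only."""
    totals = defaultdict(float)
    for rec in recordings:
        if open_only and not rec.open_licensed:
            continue
        totals[rec.dialect] += rec.duration_hours
    return dict(totals)
```

Calling `hours_by_dialect(manifest, open_only=True)` on such a manifest would give the per-dialect breakdown of the research subset, while the default call covers the full corpus.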
What we learned about consent
Forms in English, even with Swahili summaries, did not work. We rebuilt the consent flow as a guided audio conversation in Swahili, with the speaker recording their own consent statement. Withdrawal rates dropped, and trust scores went up.
Written by
Swahili Developers
Field notes from the team building Swahili-first AI across East Africa.