Data ecosystem | March 04, 2026 | 6 min read

Building an ethically sourced Swahili audio corpus

Consent, compensation, and dialect coverage: the human side of dataset collection.

Swahili Developers

Published March 04, 2026

Datasets are not neutral

Every hour of audio in our corpus has a name behind it, a consent form on file, and a payment trail. That sounds obvious; in practice it is rare in African AI work.

How the corpus is structured

  • 3,400 hours across Standard Swahili, Coastal, Unguja, and Pemba dialects
  • Gender-balanced speakers across five age brackets
  • Open-licensed subset (400 hours) for research; full corpus available under commercial terms
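
To make the structure above concrete, here is a minimal sketch of what a per-recording manifest entry could look like, with a check that every clip carries a consent and payment reference and a helper for totalling hours per dialect or gender. This is an illustration under stated assumptions, not the team's actual schema: the `Recording` class, its field names, the dialect labels as lowercase strings, and the 0 to 4 age bracket encoding are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Dialect labels follow the list above; the lowercase spelling is an assumption.
DIALECTS = {"standard", "coastal", "unguja", "pemba"}


@dataclass(frozen=True)
class Recording:
    # All field names here are hypothetical, chosen for illustration only.
    speaker_id: str     # links the clip to a named speaker
    dialect: str        # expected to be one of DIALECTS
    gender: str         # self-reported
    age_bracket: int    # 0..4, one value per age bracket (assumed encoding)
    duration_s: float   # clip length in seconds
    consent_ref: str    # pointer to the consent record on file
    payment_ref: str    # pointer to the payment record
    open_license: bool  # True if the clip belongs to the open research subset


def validate(rec: Recording) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if rec.dialect not in DIALECTS:
        problems.append(f"unknown dialect: {rec.dialect}")
    if not 0 <= rec.age_bracket <= 4:
        problems.append(f"age bracket out of range: {rec.age_bracket}")
    if not rec.consent_ref:
        problems.append("missing consent record")
    if not rec.payment_ref:
        problems.append("missing payment record")
    return problems


def hours_by(records, key):
    """Total hours of audio grouped by an attribute such as 'dialect'."""
    totals = Counter()
    for rec in records:
        totals[getattr(rec, key)] += rec.duration_s / 3600
    return dict(totals)
```

Running `hours_by(records, "gender")` or `hours_by(records, "age_bracket")` over a manifest like this is one simple way to monitor whether collection is drifting away from the balance targets.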

What we learned about consent

Consent forms written in English, even with Swahili summaries attached, did not work. We rebuilt the consent flow as a guided audio conversation in Swahili, in which each speaker records their own consent statement. After the change, withdrawal rates dropped and trust scores went up.

Written by

Swahili Developers

Field notes from the team building Swahili-first AI across East Africa.
