The hidden cost of poor tokenization
When we benchmarked five major LLMs on Swahili reasoning tasks, the same sentence consumed 2.4x more tokens than its English equivalent. That gap is not just an inference-cost problem - it directly degrades model quality on African languages.
What we found
- Sub-word tokenizers split common Swahili morphemes into fragments
- Noun-class prefixes (m-, wa-, ki-, vi-) get separated from their stems
- Long agglutinative verbs become up to 9 tokens for what should be 1-2
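One way to put a number on findings like these is "fertility": the average number of tokens a tokenizer produces per word. The sketch below is illustrative only. It is not the benchmark code, and `TOY_VOCAB`, `greedy_tokenize`, and `fertility` are hypothetical names; the tiny English-leaning vocabulary and the longest-match segmenter stand in for a real sub-word tokenizer.

```python
from typing import Callable, List, Set

# Hypothetical, English-leaning toy vocabulary (an assumption for illustration).
TOY_VOCAB: Set[str] = {
    "child", "ren", "are", "read", "ing", "book", "s",
    "wa", "to", "so", "ma", "vi", "ta", "bu",
}


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Toy longest-match-first segmenter; falls back to single characters."""
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest piece starting at i; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    return sum(len(tokenize(w)) for w in words) / len(words)


def tok(word: str) -> List[str]:
    return greedy_tokenize(word, TOY_VOCAB)


english = "children are reading books"
swahili = "watoto wanasoma vitabu"  # roughly the same sentence in Swahili

en_fertility = fertility(english, tok)
sw_fertility = fertility(swahili, tok)
print(f"English: {en_fertility:.2f} tokens/word")
print(f"Swahili: {sw_fertility:.2f} tokens/word")
print(tok("wanasoma"))
```

With this toy vocabulary the verb "wanasoma" breaks into five pieces, while each English word needs at most two. To measure a real tokenizer, pass its encode function as `tokenize`.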
What we are doing about it
We are training a Swahili-first BPE tokenizer on 4.2B tokens of locally collected text - news, parliamentary records, Wikipedia, and community-contributed conversational data. Early results show a 38% reduction in token count and measurable improvements on downstream QA.
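For readers unfamiliar with BPE, the sketch below shows the core training loop in pure Python: repeatedly merge the most frequent adjacent symbol pair. It is a minimal illustration under stated assumptions, not our training pipeline. The three-sentence corpus, the merge count, and the helper names (`train_bpe`, `merge_pair`, `encode`, `token_reduction`) are all hypothetical, and the baseline here is character-level segmentation rather than any of the LLM tokenizers we benchmarked.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Start from characters; words are weighted by their corpus frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged: Counter = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Segment a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def token_reduction(baseline: int, new: int) -> float:
    """Percent reduction in token count relative to a baseline."""
    return 100.0 * (baseline - new) / baseline


# Toy corpus (an assumption for illustration, not the real training data).
corpus = [
    "watoto wanasoma vitabu",
    "walimu wanasoma kitabu",
    "mtoto anasoma kitabu",
]
merges = train_bpe(corpus, num_merges=10)

sentence = "watoto wanasoma kitabu"
baseline_count = sum(len(w) for w in sentence.split())  # one token per character
bpe_count = sum(len(encode(w, merges)) for w in sentence.split())
reduction = token_reduction(baseline_count, bpe_count)
print(f"{baseline_count} -> {bpe_count} tokens ({reduction:.0f}% fewer)")
```

Because the merges are learned from Swahili text, frequent sequences such as verb stems tend to become single symbols early on. A production tokenizer would add byte-level fallback, pre-tokenization rules, and a far larger merge budget.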
The full technical report is forthcoming on our Publications page.
Written by
Swahili Developers
Field notes from the team building Swahili-first AI across East Africa.