Why Swahili tokenization matters for African LLMs

The hidden cost of poor tokenization

When we benchmarked five major LLMs on Swahili reasoning tasks, the same sentence consumed 2.4x more tokens in Swahili than in its English equivalent. That gap is not just an inference-cost problem; it directly degrades model quality on African languages.
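A ratio like this can be measured by counting tokens over parallel sentences. The snippet below is a minimal sketch, not the benchmark harness used for the numbers above: `token_ratio` and `byte_count` are hypothetical names, and the byte-level counter is only a stand-in so the example runs without any model. Swap in the token counter of whichever tokenizer is being evaluated.

```python
def token_ratio(count_tokens, sentence_pairs):
    """Return total Swahili tokens divided by total English tokens.

    count_tokens: callable mapping a string to its token count
                  (hypothetical hook; plug in the tokenizer under test).
    sentence_pairs: iterable of (swahili, english) parallel sentences.
    """
    pairs = list(sentence_pairs)
    sw_total = sum(count_tokens(sw) for sw, _ in pairs)
    en_total = sum(count_tokens(en) for _, en in pairs)
    if en_total == 0:
        raise ValueError("English side produced zero tokens")
    return sw_total / en_total


def byte_count(text):
    """Stand-in counter: one token per UTF-8 byte (assumption, not a real LLM tokenizer)."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    pairs = [
        ("Ninakupenda", "I love you"),
        ("Watoto wanasoma vitabu", "The children are reading books"),
    ]
    print(f"Swahili/English token ratio: {token_ratio(byte_count, pairs):.2f}")
```

Summing over the whole set before dividing, rather than averaging per-sentence ratios, keeps very short sentences from dominating the result.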

What we found

Sub-word tokenizers split common Swahili morphemes into fragments
Noun-class prefixes (m-, wa-, ki-, vi-) get separated from their stems
Long agglutinative verbs are split into as many as 9 tokens where 1-2 would be expected
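The fragmentation described above can be illustrated with a toy greedy longest-match segmenter. This is a sketch under stated assumptions: both vocabularies are invented for illustration and do not come from any real model, and production tokenizers use learned BPE or unigram vocabularies rather than this simplified matcher. The example words are ninakupenda (ni-na-ku-penda, "I love you") and vitabu (vi-tabu, "books").

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical vocabulary containing pieces that are common in English text.
ENGLISH_LIKE_VOCAB = {"nin", "up", "end", "it"}

# Hypothetical vocabulary aligned with Swahili morphemes and noun-class prefixes.
MORPHEME_VOCAB = {"ni", "na", "ku", "penda", "m", "wa", "ki", "vi", "tabu", "toto"}

if __name__ == "__main__":
    for word in ("ninakupenda", "vitabu"):
        print(word)
        print("  English-like vocab:", greedy_tokenize(word, ENGLISH_LIKE_VOCAB))
        print("  Morpheme vocab:    ", greedy_tokenize(word, MORPHEME_VOCAB))
```

With the English-like vocabulary, the pieces cut across morpheme boundaries and the prefix vi- is split away from itself as well as from its stem; with the morpheme-aligned vocabulary, each token corresponds to a meaningful unit.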

What we are doing about it

We are training a Swahili-first BPE tokenizer on 4.2B tokens of locally collected text: news, parliamentary records, Wikipedia, and community-contributed conversational data. Early results show a 38% reduction in token count and measurable improvements on downstream QA.
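For readers unfamiliar with BPE, the core idea is to start from characters and repeatedly merge the most frequent adjacent pair, so that pieces common in the training text become single tokens. The following is a minimal pure-Python sketch of that procedure on a tiny word list. It is not our training pipeline: the function names, the corpus, and the merge count are all assumptions chosen for illustration, and a real run would use an optimized library on the full dataset.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words, starting from characters."""
    vocab = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Tiny illustrative corpus (assumption); not the training data described above.
    corpus = ["ninakupenda", "ninakusoma", "anakupenda",
              "kitabu", "vitabu", "watoto", "mtoto"]
    merges = learn_bpe(corpus, num_merges=10)
    for word in ("ninakupenda", "vitabu"):
        print(word, "->", encode(word, merges))
```

Because the merges are learned from Swahili text, frequent Swahili sequences are the ones that end up as single tokens, which is the mechanism behind a lower token count per sentence.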

The full technical report is forthcoming on our Publications page.

Written by

Swahili Developers

Field notes from the team building Swahili-first AI across East Africa.
