Swahili NLP | April 12, 2026 | 8 min read

Why Swahili tokenization matters for African LLMs

Most LLMs treat Swahili as an afterthought. We dug into the tokenizer to understand what that costs in practice.

Swahili Developers

Published April 12, 2026

The hidden cost of poor tokenization

When we benchmarked five major LLMs on Swahili reasoning tasks, a Swahili sentence consumed 2.4x more tokens than its English equivalent. That gap is more than an inference-cost problem: longer sequences use up more of the context window, and each word's meaning is spread across fragments, which directly degrades model quality on African languages.
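A gap like this can be expressed as a ratio of token counts over parallel sentences. The following is a minimal sketch of that arithmetic, not the benchmark code: `token_ratio` and `chunk_tokenizer` are hypothetical names, and the fixed-width chunker is only a stand-in so the example runs without any model installed.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def token_ratio(pairs: Iterable[Tuple[str, str]], tokenize: Tokenizer) -> float:
    """Total Swahili tokens divided by total English tokens over parallel pairs."""
    sw_total = 0
    en_total = 0
    for swahili, english in pairs:
        sw_total += len(tokenize(swahili))
        en_total += len(tokenize(english))
    if en_total == 0:
        raise ValueError("no English tokens to compare against")
    return sw_total / en_total


def chunk_tokenizer(text: str, width: int = 4) -> List[str]:
    """Stand-in tokenizer (assumption): cut each word into fixed-width chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + width] for i in range(0, len(word), width))
    return tokens


# One illustrative parallel pair; a real measurement would use a full test set.
pairs = [("Ninakupenda", "I love you")]
ratio = token_ratio(pairs, chunk_tokenizer)
print(f"Swahili/English token ratio: {ratio:.2f}")
```

To measure a real model, pass its own tokenizer as `tokenize` in place of the stand-in; the ratio from the stand-in says nothing about any actual LLM.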

What we found

  • Sub-word tokenizers split common Swahili morphemes into fragments
  • Noun-class prefixes (m-, wa-, ki-, vi-) get separated from their stems
  • Long agglutinative verbs are split into up to 9 tokens where 1-2 should suffice
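The fragmentation described above can be reproduced with a toy example. The sketch below uses greedy longest-match segmentation (similar in spirit to WordPiece, and a simplification of what production tokenizers do) over two tiny vocabularies. Both vocabularies are invented for illustration and do not come from any real model.

```python
from typing import List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; falls back to single characters."""
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest piece starting at i; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


# Hypothetical vocabularies, made up for this example only.
english_skewed = {"ta", "so", "ma", "bu", "it"}
swahili_aware = {"ki", "vi", "tabu", "tu", "ta", "som", "a"}

# "kitabu" (book) is the noun-class prefix ki- plus the stem -tabu.
print(greedy_tokenize("kitabu", english_skewed))  # ['k', 'it', 'a', 'bu']
print(greedy_tokenize("kitabu", swahili_aware))   # ['ki', 'tabu']

# "tutasoma" (we will read) is tu- + -ta- + -som- + -a.
print(greedy_tokenize("tutasoma", english_skewed))
print(greedy_tokenize("tutasoma", swahili_aware))
```

With the English-skewed vocabulary, the ki- prefix is cut in half and its second letter is glued to part of the stem, so no token lines up with a morpheme. With the Swahili-aware vocabulary, every token boundary falls on a morpheme boundary.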

What we are doing about it

We are training a Swahili-first BPE tokenizer on 4.2B tokens of locally collected text: news, parliamentary records, Wikipedia, and community-contributed conversational data. Early results show a 38% reduction in token count and measurable improvements on downstream QA.
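For readers unfamiliar with BPE, the core training loop is short: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. The sketch below is a toy illustration of that general algorithm on a made-up five-word corpus. It is not the training pipeline described above, and `train_bpe`, `merge_pair`, and `encode` are hypothetical helper names.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, with a count.
    corpus = Counter(tuple(word) for word in words)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_corpus: Counter = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumption): a handful of class 7/8 nouns, not real training data.
toy_corpus = ["kitabu", "kitabu", "vitabu", "kiti", "viti"]
merges = train_bpe(toy_corpus, num_merges=50)
print(merges[0])
print(encode("kitabu", merges))
```

BPE merges by frequency alone, so a merge can cross a morpheme boundary. That is one reason the composition of the training corpus matters as much as the algorithm.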

The full technical report is forthcoming on our Publications page.

Written by

Swahili Developers

Field notes from the team building Swahili-first AI across East Africa.
