Publications
Research that grounds the work.
Peer-reviewed papers, technical reports, and position papers from the Swahili Developers research community and our academic collaborators.
2026
NLP
SwahiliBERT: A morphologically-aware language model for East African Swahili
Japhari Mbaru, A. Hassan, J. Kimaro, et al.
AfricaNLP Workshop @ ICLR
We present SwahiliBERT, a 340M-parameter encoder pretrained on 12B tokens of Swahili text with a morphologically-aware tokenizer. SwahiliBERT outperforms multilingual baselines by 14.2 F1 on Swahili NER and 9.8 EM on KiSwQA.
2025
Computer Vision
Document understanding for Kiswahili administrative records
J. Kimaro, P. Salim, T. Mahmoud
ICDAR Workshop on Document Intelligence
We introduce a layout-aware OCR pipeline for handwritten and typewritten Kiswahili administrative documents, covering land titles, court records, and clinic ledgers.
2025
Datasets
An ethically sourced Swahili audio corpus: methodology and benchmarks
A. Hassan, M. Mwangi, L. Komba
LREC - Language Resources and Evaluation Conference
We release a 3,400-hour Swahili speech corpus with documented consent, demographic balance, and dialect coverage. We provide ASR and TTS benchmarks across four dialects.
2025
Health AI
Clinical voice transcription in low-resource multilingual settings
F. Mushi, R. Otieno, P. Salim, S. Ngonyani
ML4H - Machine Learning for Health
A field study of automatic speech recognition deployed across three Tanzanian referral hospitals. We document a 29-point WER reduction on code-switched clinical audio using domain adaptation and on-device decoding.
2024
Policy
Towards sovereign AI: a policy framework for African data and compute
Swahili Developers Policy Group
Position paper - Swahili Developers
A framework for evaluating AI sovereignty across compute, data, and talent dimensions, with concrete recommendations for East African Community member states.
2024
Health AI
Federated learning across African research hospitals: a deployment report
R. Otieno, F. Mushi, M. Mwangi
NeurIPS Workshop on Federated Learning
Lessons from a 14-month federated learning deployment for medical imaging across four hospitals in Tanzania, Kenya, and Uganda. We discuss bandwidth constraints, governance, and model drift.
Collaborate
Co-author with us, or cite our work.
We partner with universities, hospitals, and policy institutes across East Africa and beyond. Reach out if you would like to collaborate, replicate our results, or request a dataset under our research license.
