Building Kikuyu TranslateGemma: From 2.29 to 19.61 BLEU
January 29, 2026

TranslateGemma Fine-Tuning
Mark Gatere
AI researcher
How we achieved a 758% improvement in English-Kikuyu translation by fine-tuning Google's TranslateGemma-12B—from 2.29 BLEU (zero-shot) to 19.61 BLEU through systematic LoRA tuning and production deployment on Modal.
Overview
Google's TranslateGemma-12B is a state-of-the-art translation model supporting 137 languages—but not Kikuyu. Zero-shot, it scores just 2.29 BLEU, producing incoherent repetitive text. We set out to change that by fine-tuning on our 30,430 English-Kikuyu sentence pairs.
Our journey involved three major training iterations: V1 established a fine-tuned baseline of 18.16 BLEU, V3 over-regularized and dropped to 15.93, and our final version achieved 19.61 BLEU—a 758% improvement over zero-shot.
The model is now live at c-elo-ai, deployed on Modal with serverless GPU inference. This post details our technical decisions, lessons learned, and the path to production.
The open-source model is available on Hugging Face.
Interviewee:
Mark Gatere
AI researcher
AI researcher focused on low-resource language NLP. Led the TranslateGemma fine-tuning effort for Kikuyu, iterating through multiple training versions to optimize translation quality.
——What was the zero-shot performance before fine-tuning?
Gatere: 2.29 BLEU! Kikuyu isn't among TranslateGemma's supported languages, so the base model produces incoherent, repetitive output: patterns like 'mũno mũno mũno...' that mean nothing. It couldn't translate a single sentence correctly. That's what makes our 19.61 BLEU result significant—a 758% relative improvement.
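For reference, corpus-level BLEU scores like these are typically computed with sacrebleu. The sketch below is a minimal illustration, not our actual evaluation harness; the file names and paths are assumptions.

```python
# Minimal sketch of a corpus-level BLEU comparison with sacrebleu.
# File names are illustrative assumptions, not the project's real paths.
import sacrebleu

def corpus_bleu_from_files(hyp_path: str, ref_path: str) -> float:
    """Compute corpus BLEU for one hypothesis file against one reference file."""
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # sacrebleu expects a list of reference streams (here, a single reference).
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

if __name__ == "__main__":
    zero_shot = corpus_bleu_from_files("preds_zero_shot.kik", "refs.kik")
    fine_tuned = corpus_bleu_from_files("preds_fine_tuned.kik", "refs.kik")
    print(f"zero-shot BLEU: {zero_shot:.2f}, fine-tuned BLEU: {fine_tuned:.2f}")
```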
——Why did you choose TranslateGemma over other translation models?
Gatere: TranslateGemma is a 12B-parameter model purpose-built for translation, with strong multilingual foundations. Unlike general LLMs that need extensive prompt engineering, TranslateGemma uses a structured chat format with explicit source and target language codes. This made it ideal for our use case.
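The exact template is documented on the model card; the sketch below only illustrates the general shape of a chat-formatted request with explicit language codes, built with transformers' apply_chat_template. The model id and the message wording are assumptions for illustration.

```python
# Illustrative sketch of a chat-formatted translation request with explicit
# language codes ("en" for English, "ki" for Kikuyu). The exact message
# structure TranslateGemma expects is an assumption; check the model card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/translategemma-12b")  # assumed id

messages = [
    {
        "role": "user",
        # Source and target language codes made explicit in the request.
        "content": "Translate from en to ki: How are you today?",
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```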
——What was your training setup?
Gatere: We used Unsloth for 2x faster fine-tuning on NVIDIA H200 and L40S GPUs. Our dataset of 30,430 pairs was split 95/5 for training and evaluation. We used LoRA with rank 128, targeting attention projections and the MLP layers. Training ran for about 900 steps before early stopping kicked in.
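Here is roughly what that setup looks like in code. It's a sketch, not the exact training script: the model id, max sequence length, LoRA alpha, 4-bit loading, and data file path are assumptions; the rank 128 and the attention/MLP target modules come from the description above.

```python
# Sketch of the Unsloth + LoRA setup described above.
from datasets import load_dataset
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/translategemma-12b",  # assumed Hugging Face id
    max_seq_length=2048,                     # assumed
    load_in_4bit=True,                       # assumed
)

model = FastLanguageModel.get_peft_model(
    model,
    r=128,  # LoRA rank from the interview
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha=128,  # assumed
)

# 95/5 train/eval split over the 30,430 sentence pairs (file path is assumed).
dataset = load_dataset("json", data_files="en_kik_pairs.jsonl")["train"]
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```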
——What was the biggest challenge during training?
Gatere: Over-regularization. In V3, we added lora_dropout=0.1, weight_decay=0.02, and neftune_noise_alpha=7 to prevent overfitting. But the BLEU score dropped from 18.16 to 15.93. The translations were grammatically fine but lost semantic precision. We had to dial back the regularization significantly.
——How did you achieve the 19.61 BLEU score?
Gatere: We removed lora_dropout entirely, reduced weight_decay to 0.01, lowered neftune to 5, and increased the learning rate to 2e-4. Crucially, we also removed embed_tokens from the LoRA targets—training the embedding layer was disrupting the model's vocabulary. The result: 19.61 BLEU, a 1.45-point improvement over baseline.
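Mapping those changes onto a TRL SFTConfig looks roughly like this. Only the values named above come from the actual runs; the batch size, epoch count, and output directory are placeholder assumptions, with the V3 settings noted in comments for comparison.

```python
# Final-run hyperparameters, with the over-regularized V3 values in comments.
from trl import SFTConfig

final_args = SFTConfig(
    output_dir="translategemma-kikuyu-final",  # placeholder
    learning_rate=2e-4,             # raised for the final run
    weight_decay=0.01,              # V3: 0.02
    neftune_noise_alpha=5,          # V3: 7
    per_device_train_batch_size=4,  # assumed
    num_train_epochs=3,             # assumed
)

# LoRA changes for the final run: lora_dropout removed entirely (V3: 0.1) and
# embed_tokens dropped from target_modules, leaving only the attention and MLP
# projections shown in the training-setup sketch above.
```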
——How is the model deployed?
Gatere: We merged the LoRA adapter into the base model and pushed it to Hugging Face. The merged model runs on Modal's serverless GPUs with automatic scaling. Cold starts take about 2 minutes to load the 25GB model, but subsequent requests complete in 5-15 seconds. Our Next.js frontend proxies requests through an API route.
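Merging and publishing an adapter of this kind takes only a few lines with PEFT. The repo ids below are placeholders, not the project's actual Hugging Face paths.

```python
# Sketch of merging the LoRA adapter into the base model and pushing the
# merged weights to Hugging Face. Repo ids are placeholder assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/translategemma-12b"                    # assumed base id
ADAPTER_ID = "your-username/translategemma-kikuyu-lora"  # placeholder
MERGED_ID = "your-username/translategemma-kikuyu"        # placeholder

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_ID).merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
merged.push_to_hub(MERGED_ID)
tokenizer.push_to_hub(MERGED_ID)
```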
——What's next for Kikuyu translation at C-elo?
Gatere: Human evaluation is the priority—BLEU scores don't capture everything. We're also working on caching the model in Modal volumes to reduce cold starts, and exploring speech-to-speech chatbots for other African languages, with the goal of real-time voice translation for African speakers.
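One possible shape for that caching is a Modal Volume mounted into the inference container, so the ~25GB of weights are downloaded once and reused across cold starts. The sketch below is a rough illustration of that idea; the app name, volume name, repo id, and GPU type are all assumptions, not the production deployment.

```python
# Sketch: persisting model weights in a Modal Volume to speed up cold starts.
import modal

app = modal.App("translategemma-kikuyu")  # assumed app name
weights = modal.Volume.from_name("translategemma-weights", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.cls(gpu="L40S", image=image, volumes={"/weights": weights}, timeout=600)
class Translator:
    @modal.enter()
    def load(self):
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        repo = "your-username/translategemma-kikuyu"  # placeholder merged repo
        # Weights land in the volume on first download and persist afterwards.
        self.tokenizer = AutoTokenizer.from_pretrained(repo, cache_dir="/weights")
        self.model = AutoModelForCausalLM.from_pretrained(
            repo, cache_dir="/weights", torch_dtype=torch.bfloat16, device_map="auto"
        )
        weights.commit()  # make newly downloaded files visible to future containers

    @modal.method()
    def translate(self, text: str) -> str:
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        out = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```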


