Kikuyu TranslateGemma 4B: Faster, Smaller, Better

June 07, 2026

TranslateGemma 4B in Production

Founder & Lead Researcher

Mark Gatere

After building our first English-Kikuyu TranslateGemma model on 12B parameters, the next challenge was product speed. Users liked the idea, but cold starts were too slow. We fine-tuned TranslateGemma-4B through a series of LoRA, DoRA, and rsLoRA experiments, and the final model became both faster and more accurate.

overview

Our first Kikuyu TranslateGemma model proved that English -> Kikuyu translation could work with a carefully fine-tuned open model. The 12B version reached 19.61 BLEU and became the first production C-elo Translate model. But production revealed a different problem: a large model can be technically good and still feel slow to users.

The 12B model was expensive to cold-start on Modal because the weights were large and initial GPU loading took time. We wanted a model that preserved quality, loaded faster, and could become the default web translation experience on c-elo.com/c-elo-ai.

We started with Google's TranslateGemma-4B-it, kept the same 30,430 English-Kikuyu sentence pairs, and ran a focused series of experiments: baseline LoRA, high-rank LoRA, DoRA, longer training, and finally high-rank rsLoRA.

The winning model is c-elo/kikuyu_translategemma_4b_v7_highrank_rslora. It reached 21.93 BLEU and 42.87 chrF++, beating the earlier 12B model while making the production experience much faster after warmup.

Interviewees:

Mark

Gatere

Founder & Lead Researcher

AI researcher and native Kikuyu speaker building production AI systems for low-resource African languages. Led the TranslateGemma 4B experimentation, evaluation, and deployment work for C-elo Translate.

——Why revisit translation after the 12B model was already working?

Gatere: The 12B model was a successful research result, but the product feedback was clear: people were waiting too long for the first translation. A model that takes too long to load feels broken, even if the final output is good. We wanted C-elo Translate to feel natural on the web, so the next goal was not only accuracy. It was accuracy plus speed.

——Why did you stay with TranslateGemma instead of switching to another model family?

Gatere: TranslateGemma was already built for translation and the 12B run had proved it could adapt to Kikuyu. Staying inside the same family let us reuse the same dataset, template format, evaluation split, and deployment assumptions. That made the 4B experiments more controlled. We were testing whether a smaller translation-specialized model could beat a larger one with the right fine-tuning recipe.

——What did the experiment path look like?

Gatere: We started with a faithful 4B baseline using LoRA rank 128, then moved to rank 256 for more adapter capacity. High-rank LoRA improved the model a lot, but it still did not beat 12B. DoRA sounded promising, but it underperformed in our setup. Training longer with a slightly lower learning rate also did not move the model forward. The breakthrough was V7: rank 256 with rsLoRA, alpha 256, three epochs, and a lower learning rate of 1e-4.

——What made V7 different?

Gatere: V7 used rank-stabilized LoRA. With high LoRA ranks, adapter scaling matters. Standard LoRA at rank 256 gave us useful capacity, but rsLoRA stabilized that high-rank setup better. We also avoided DoRA in V7 because DoRA had not helped in the earlier run. The final result was 21.93 BLEU and 42.87 chrF++, and the translations sounded better to me when comparing outputs manually.

——What were the most important engineering fixes?

Gatere: The biggest fixes were around correctness and recovery. TranslateGemma's user message needs structured language codes, but the assistant response must be a plain string. We also had to separate the Gemma3 processor from the underlying text tokenizer for generation. Finally, PEFT had to be pinned to avoid an Unsloth compatibility issue.

——How does the new production deployment work?

Gatere: The V7 model runs on Modal behind a dedicated translation endpoint. The model weights are cached in a Modal volume, and the container uses a scaledown window so it stays warm for a while after traffic without forcing an always-on GPU.

——What changed for users?

Gatere: C-elo Translate now uses the 4B model as the default. Warm requests are much faster, and the model is more comfortable to serve than the old 12B version. The important thing is that this was not a quality tradeoff. The smaller model ended up scoring better and sounding better in manual checks.

——What is the main lesson from this round?

Gatere: For low-resource language AI, the best model is not always the largest model. The winning recipe was careful formatting, the right adapter capacity, stable scaling, reliable evaluation, and listening to real outputs. The 4B model became better because we treated it as a full system problem: training, evaluation, deployment, and user experience together.

researchtranslationgemmakikuyuproduction

Back to all blogs