C-elo

Adapting the Mimi Neural Codec for Kikuyu Speech

February 14, 2026

C-elo Labs

Mimi Codec Adaptation

Mark Gatere

Founder & Lead Researcher

How we fine-tuned Kyutai's Mimi neural audio codec on 750+ hours of Kikuyu speech to produce ultra-compact discrete tokens (13 tokens/sec) while preserving tonal fidelity — the foundation for real-time Speech-to-Speech AI.

Overview

Speech-to-Speech AI needs a way to compress audio into compact discrete tokens that a language model can learn from, much the way text models operate on word tokens. We chose Kyutai's Mimi neural codec because it runs at a 12.5 Hz frame rate, roughly 13 token frames per second, 6× fewer than Meta's EnCodec, which makes downstream speech language modeling feasible.

But Mimi was trained on English and European languages. Kikuyu is a tonal Bantu language where pitch carries lexical meaning: a word like 'ĩria' can be written identically yet mean entirely different things depending only on its tone. Standard codec training would flatten these distinctions.

We fine-tuned all 79.3M parameters of Mimi on our combined ANV + WAXAL dataset (750+ hours) using a custom loss function that explicitly tracks and preserves pitch contours. After 57,635 training steps on an NVIDIA A100, the adapted codec achieves 0.0710 validation loss with pitch error of just 1.1 Hz.

The model and training script are available: stage1_mimi_modal.py on GitHub.
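For anyone who wants to round-trip Kikuyu audio through the codec, here is a minimal sketch using the Hugging Face transformers port of Mimi. The checkpoint name points at the base Kyutai model and would be swapped for the fine-tuned Kikuyu weights; the silent waveform is a placeholder for a real recording resampled to 24 kHz.

import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Base checkpoint shown here; swap in the fine-tuned Kikuyu weights from the repo.
model = MimiModel.from_pretrained("kyutai/mimi").eval()
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# Placeholder: one second of silence. Replace with a real Kikuyu recording
# resampled to the codec's 24 kHz sample rate.
sr = feature_extractor.sampling_rate
waveform = np.zeros(sr, dtype=np.float32)

inputs = feature_extractor(raw_audio=waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # (1, n_codebooks, frames)
    recon = model.decode(codes).audio_values                  # (1, 1, samples)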

Interviewee:

Mark Gatere

Founder & Lead Researcher

AI researcher leading C-elo's efforts to build voice AI for low-resource African languages. Designed and executed the Mimi codec adaptation pipeline for Kikuyu.

——Why Mimi over EnCodec?

Gatere: Token efficiency. EnCodec produces ~600 tokens per second across 8 codebooks at 75 Hz. Mimi produces ~100 tokens per second across 8 codebooks at 12.5 Hz, 6× fewer. When you're training a speech language model that has to predict the next token, that gap is the difference between feasible and impractical.
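As a sanity check on those figures, the arithmetic is just frame rate times codebooks per frame (both codecs are assumed to use 8 codebooks here, matching the numbers above):

# Tokens per second = frame rate (Hz) × codebooks per frame.
encodec_tokens_per_sec = 75.0 * 8   # 600 tokens/sec
mimi_tokens_per_sec = 12.5 * 8      # 100 tokens/sec
print(encodec_tokens_per_sec / mimi_tokens_per_sec)  # 6.0× reduction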

——What's the custom pitch loss you designed?

Gatere: We extract F0 pitch contours using the pyin algorithm on both original and reconstructed audio, tracking between 80–600 Hz. We compute the mean absolute error on voiced segments only. This runs every 10th batch with a weight of 0.3, alongside reconstruction and multi-scale spectral losses. It's the key innovation — without it, the codec could score well spectrally but flatten tonal distinctions.
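A minimal sketch of how such a pitch term can be computed with librosa's pyin, assuming mono waveforms of equal length at 24 kHz; the function name and the way the term is mixed into the total loss are illustrative, not the exact code in stage1_mimi_modal.py. Note that pyin itself is not differentiable, so whether the term acts as a direct penalty, a monitored metric, or is replaced by a differentiable pitch estimator depends on the actual implementation.

import librosa
import numpy as np

def pitch_mae_hz(original, reconstructed, sr=24_000, fmin=80.0, fmax=600.0):
    """Mean absolute F0 error (Hz) over frames voiced in both signals.

    Assumes both waveforms have the same length so frame counts match.
    """
    f0_ref, voiced_ref, _ = librosa.pyin(original, fmin=fmin, fmax=fmax, sr=sr)
    f0_rec, voiced_rec, _ = librosa.pyin(reconstructed, fmin=fmin, fmax=fmax, sr=sr)
    voiced = voiced_ref & voiced_rec  # compare only frames voiced in both
    if not voiced.any():
        return 0.0
    return float(np.abs(f0_ref[voiced] - f0_rec[voiced]).mean())

# Illustrative mixing into the training objective, per the interview:
# every 10th batch, add the pitch term with weight 0.3.
# if step % 10 == 0:
#     total_loss = recon_loss + spectral_loss + 0.3 * pitch_mae_hz(x, x_hat)
# else:
#     total_loss = recon_loss + spectral_loss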

——What were the training results?

Gatere: After 5 epochs and 57,635 steps on the A100, validation loss reached 0.0710. Pitch MAE dropped to 1.1 Hz, well below the 20 Hz threshold we set. On custom test recordings, we achieved correlation of up to 0.964 and an SNR of 11.4 dB. Subjectively, the reconstructed audio sounds like an exact match of the original.
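For reference, here is one plausible reading of those two metrics, with SNR and correlation computed directly on time-aligned waveforms; the repo's evaluation script may define them differently (e.g. on spectrograms), so treat this as an assumed sketch.

import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating the residual as noise."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

def waveform_correlation(reference, estimate):
    """Pearson correlation between time-aligned waveforms."""
    return float(np.corrcoef(reference, estimate)[0, 1])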

——How does this fit into the bigger picture?

Gatere: This is Stage 1 of our 4-stage pipeline to build a full Speech-to-Speech Kikuyu AI. Stage 2 encodes conversations into discrete token sequences. Stage 3 trains a transformer to predict next audio tokens — like GPT but for speech. Stage 4 is the full-duplex real-time agent. The codec is the foundation everything builds on.

research · voice-ai · mimi · kikuyu