Voice cloning models measured across five languages: OmniVoice, Chatterbox, VoxCPM2, Fish Audio
I published a new Soniqo benchmark post for local voice cloning models across five languages:
https://www.soniqo.audio/blog/voice-cloning-benchmarks
Models:
- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16
Languages:
- English
- German
- Modern Standard Arabic
- Spanish
- Mandarin Chinese
The benchmark uses Google FLEURS test clips as reference audio. Each row includes the reference audio, the generated audio, speaker similarity, WER/CER, generated audio length, and real-time factor (RTF).
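For readers unfamiliar with the per-row metrics, here is a minimal sketch of the textbook definitions of WER, CER, and RTF. It is not the benchmark's actual code: the function names are hypothetical, the text normalization and ASR model used in the benchmark are not described here, and RTF is assumed to mean generation time divided by generated audio duration (so lower is faster).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (common for Mandarin)."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


def rtf(generation_seconds, audio_seconds):
    """Real-time factor (assumed definition): generation time / audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds
```

Under this assumed definition, an RTF below 1.0 means the model generates audio faster than it takes to play back.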
Main result in this run: OmniVoice had the strongest all-around results, with a mean speaker cosine similarity of 0.707 across all five languages, 0.0% ASR error, and a mean RTF of 0.45. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German and Arabic similarity but a slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.
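This is an engineering benchmark, not a human MOS study. Compare the speaker-similarity values only within this table: every row uses the same local speaker-embedding pipeline, so the scores are consistent with each other but not directly comparable to numbers from other embedding models.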
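As a rough illustration of what a speaker cosine score measures, here is a minimal sketch of cosine similarity between two embedding vectors. The function name and the toy vectors are hypothetical; the actual speaker-embedding model used in the benchmark is not shown, and a different model would produce different absolute values for the same pair of clips.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for reference and generated speaker embeddings.
reference_embedding = [0.2, 0.9, 0.4]
generated_embedding = [0.1, 0.8, 0.6]
score = cosine_similarity(reference_embedding, generated_embedding)
```

The score depends only on the direction of the two vectors, not their length, which is why it is only meaningful when both embeddings come from the same model.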
Try the stack locally with Speech Studio:
https://www.soniqo.audio/speech-studio
https://github.com/soniqo/speech-studio
Underlying Swift library/CLI:
https://github.com/soniqo/speech-swift
Soniqo models and exports:
soniqo @aufklarer
What model or language should I add next?