VoxCPM - Tokenizer-Free TTS & Voice Cloning, Open Source
Overview
VoxCPM is a tokenizer-free text-to-speech model that models audio in a continuous space, avoiding the artifacts that can come with discrete unit tokenizers. It’s tuned for context-aware prosody and zero-shot voice cloning from short reference audio.
Under the hood, it combines diffusion–autoregressive generation, hierarchical language modeling, finite scalar quantization (FSQ), and a local Diffusion Transformer (LocDiT) on top of a MiniCPM-4 backbone. In practice, that means natural pacing and tone without heavy prompt engineering.
Key Features & Benefits
- Context-aware prosody → reads the room (narration vs. news vs. dialogue) for more human-like intonation.
- Zero-shot voice cloning → copy timbre and accent from a short sample for quick character/brand voices.
- Fast streaming → reported RTF ≈ 0.17 on RTX 4090 (≈6× faster than real-time). Great for live apps.
- Open source, Apache-2.0 → favorable licensing with clear misuse warnings in the model card.
- Bilingual out of the box → English and Chinese (other languages not guaranteed).
Use Cases & Audiences
Developers / Product teams
- Real-time voice for assistants and IVR. For speed, use the preset --inference-timesteps 8 --cfg-value 1.6 (see the sketch below).
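A minimal sketch of how that preset might look from Python, assuming the voxcpm package exposes VoxCPM.from_pretrained and a generate() call with cfg_value / inference_timesteps keywords (check the project README for the exact API and model id):

```python
# Low-latency preset for assistants/IVR. Package, model id, and keyword names are assumptions; verify against the docs.
import soundfile as sf
from voxcpm import VoxCPM  # assumed package entry point

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")  # assumed model id

wav = model.generate(
    text="Your order has shipped and should arrive on Friday.",
    inference_timesteps=8,  # fewer diffusion steps -> lower latency
    cfg_value=1.6,          # lighter guidance, trades some prompt adherence for speed
)

sf.write("reply.wav", wav, 16000)  # 16 kHz mono WAV output
```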
Localization teams
- Cross-lingual voice cloning (EN↔ZH) for dubbing and trailers. Provide a 5–10s clean reference.
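A hedged sketch of the cloning call under the same API assumptions; the prompt_wav_path / prompt_text keyword names are assumptions and the file names are hypothetical:

```python
# Zero-shot cross-lingual cloning: English reference voice, Chinese target line.
import soundfile as sf
from voxcpm import VoxCPM  # assumed package entry point

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="欢迎收看本周的产品发布会。",                      # target line ("Welcome to this week's product launch" in Chinese)
    prompt_wav_path="host_reference_en.wav",               # 5-10 s clean English reference (hypothetical file)
    prompt_text="Welcome to this week's product launch.",  # transcript of the reference clip
)

sf.write("dubbed_line.wav", wav, 16000)
```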
Creators & podcasters
- Character voices without training. Use phoneme input for tricky names/brands.
Researchers
- Study tokenizer-free TTS or LocDiT effects; inspect provided benchmarks and architecture notes.
Handy prompt templates
- Neutral narration: “Explain [topic] in an even, warm tone with medium pace.”
- News read: “Read like a news bulletin. Keep sentences crisp; emphasize numbers and dates.”
- Dialog line: “Casual, slightly amused tone. Short pauses between clauses.”
Example Outputs / Applications
- Product explainer voiceover – 90 s script; expect latency ≈ 90 s × 0.17 ≈ 15 s on an RTX 4090 (longer on smaller GPUs).
- Call center TTS – stream chunks as they’re generated for sub-second perceived delay.
- Podcast cold open (cloned host) – 8s clean reference + 30s read; expected time ≈ 5s on a 4090.
- Bilingual trailer – EN prompt with ZH continuation; use phoneme input for proper nouns.
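The latency figures above are just audio duration × RTF; a throwaway helper (the function name is ours) makes the arithmetic explicit:

```python
# Back-of-envelope synthesis time from the reported real-time factor (RTF ~0.17 on an RTX 4090).
def expected_synthesis_seconds(audio_seconds: float, rtf: float = 0.17) -> float:
    """Estimated wall-clock generation time for `audio_seconds` of output speech."""
    return audio_seconds * rtf

print(expected_synthesis_seconds(90))  # product explainer: ~15.3 s
print(expected_synthesis_seconds(30))  # podcast cold open: ~5.1 s
```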
Privacy & Safety
- Local runs keep data on your box. Hosted demos (e.g., HF Space) send data to that host—treat as third-party.
- Misuse warning: the authors flag cloning risks; label AI audio and obtain consent.
- Known limitations: Long, highly expressive scripts can wobble; non-EN/ZH may degrade.
Comparisons & Alternatives (Quick)
- VoxCPM vs. CosyVoice 2: VoxCPM is tokenizer-free with LocDiT; CosyVoice 2 is a strong open baseline—try both for accent fidelity.
- VoxCPM vs. GPT-SoVITS: VoxCPM offers end-to-end TTS + cloning; GPT-SoVITS is popular for cloning quality; pick by dataset fit and latency.
- VoxCPM vs. F5-TTS / IndexTTS2: VoxCPM emphasizes context-aware prosody and streaming; others differ in training & tokenization choices.
FAQ
Is there a free tier?
Yes. The model and code are free and open source under Apache-2.0, and you can try the hosted online demo at no cost.
What formats does it output?
16 kHz mono WAV by default (you can re-encode to MP3/OGG post-hoc).
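If you need MP3 or OGG, one common route is re-encoding after generation with ffmpeg (ffmpeg on PATH is assumed; file names are placeholders):

```python
# Convert the generated 16 kHz WAV to MP3 with ffmpeg.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "output.wav", "-codec:a", "libmp3lame", "-b:a", "192k", "output.mp3"],
    check=True,
)
```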
How fast is it?
On an RTX 4090, RTF ≈ 0.17 (~6× faster than real-time), so synthesis time ≈ audio duration × 0.17.
Does it support languages other than English/Chinese?
Not guaranteed; quality may be unpredictable.
What’s the best way to get correct pronunciations?
For tricky words, use phoneme input and turn text normalization off for that segment; for ordinary text, keep normalization on so numbers and dates are read correctly.
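A rough sketch of the two modes; the normalize keyword and the exact phoneme markup are assumptions and should be verified against the project docs:

```python
# Toggle text normalization depending on the input style (keyword name assumed; check the README).
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Ordinary text: leave normalization on so numbers and dates are expanded.
wav_plain = model.generate(text="The update ships on 2025-03-14 at 9 a.m.", normalize=True)

# Tricky name/brand: switch normalization off and spell the word with phoneme markup.
# Use the markup format documented in the README; the placeholder below is not real syntax.
wav_phoneme = model.generate(text="Please welcome {PHONEME-MARKUP-FOR-BRAND}.", normalize=False)
```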
Any tips for stability on long reads?
Split scripts into paragraphs; lower cfg_value slightly; increase inference timesteps modestly (see the sketch below).
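A minimal chunking sketch under the same API assumptions as above: split on blank lines, nudge the knobs, generate per paragraph, and concatenate the audio:

```python
# Paragraph-by-paragraph generation for long reads (generate() keywords assumed; defaults may differ).
import numpy as np
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

with open("long_script.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

chunks = [
    model.generate(
        text=p,
        cfg_value=1.8,           # slightly lower guidance for stability
        inference_timesteps=16,  # modestly more steps for long, expressive reads
    )
    for p in paragraphs
]

sf.write("long_read.wav", np.concatenate(chunks), 16000)
```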