VoxCPM - Tokenizer-Free TTS & Voice Cloning, Open Source
Overview
VoxCPM is a tokenizer-free text-to-speech model that models audio in a continuous space, avoiding the artifacts that can come with discrete unit tokenizers. It’s tuned for context-aware prosody and zero-shot voice cloning from short reference audio.
Under the hood, it combines diffusion–autoregressive generation, hierarchical language modeling, finite scalar quantization (FSQ), and a local Diffusion Transformer (LocDiT) on top of a MiniCPM-4 backbone. In practice, that means natural pacing and tone without heavy prompt engineering.
Key Features & Benefits
- Context-aware prosody → reads the room (narration vs. news vs. dialogue) for more human-like intonation.
- Zero-shot voice cloning → copy timbre and accent from a short sample for quick character/brand voices.
- Fast streaming → reported RTF ≈ 0.17 on RTX 4090 (≈6× faster than real-time). Great for live apps.
- Open source, Apache-2.0 → favorable licensing with clear misuse warnings in the model card.
- Bilingual out of the box → English and Chinese (other languages not guaranteed).
Use Cases & Audiences
Developers / Product teams
- Real-time voice for assistants and IVR. For speed, use the preset --inference-timesteps 8 --cfg-value 1.6 (see the sketch below).
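A minimal sketch of how that preset might look from Python, assuming the voxcpm package exposes VoxCPM.from_pretrained and a generate() call with cfg_value / inference_timesteps keywords (check the project README for the exact API and model id):

```python
# Low-latency preset for assistants/IVR. Package, model id, and keyword names are assumptions; verify against the docs.
import soundfile as sf
from voxcpm import VoxCPM  # assumed package entry point

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")  # assumed model id

wav = model.generate(
    text="Your order has shipped and should arrive on Friday.",
    inference_timesteps=8,  # fewer diffusion steps -> lower latency
    cfg_value=1.6,          # lighter guidance, trades some prompt adherence for speed
)

sf.write("reply.wav", wav, 16000)  # 16 kHz mono WAV output
```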
Localization teams
- Cross-lingual voice cloning (EN↔ZH) for dubbing and trailers. Provide a 5–10s clean reference.
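A hedged sketch of the cloning call under the same API assumptions; the prompt_wav_path / prompt_text keyword names are assumptions and the file names are hypothetical:

```python
# Zero-shot cross-lingual cloning: English reference voice, Chinese target line.
import soundfile as sf
from voxcpm import VoxCPM  # assumed package entry point

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="欢迎收看本周的产品发布会。",                      # target line ("Welcome to this week's product launch" in Chinese)
    prompt_wav_path="host_reference_en.wav",               # 5-10 s clean English reference (hypothetical file)
    prompt_text="Welcome to this week's product launch.",  # transcript of the reference clip
)

sf.write("dubbed_line.wav", wav, 16000)
```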
Creators & podcasters
- Character voices without training. Use phoneme input for tricky names/brands.
Researchers
- Study tokenizer-free TTS or LocDiT effects; inspect provided benchmarks and architecture notes.
Handy prompt templates
- Neutral narration: “Explain [topic] in an even, warm tone with medium pace.”
- News read: “Read like a news bulletin. Keep sentences crisp; emphasize numbers and dates.”
- Dialog line: “Casual, slightly amused tone. Short pauses between clauses.”
Example Outputs / Applications
- Product explainer voiceover – 90 s script; expect latency ≈ 90 s × 0.17 ≈ 15 s on an RTX 4090 (longer on smaller GPUs).
- Call center TTS – stream chunks as they’re generated for sub-second perceived delay.
- Podcast cold open (cloned host) – 8s clean reference + 30s read; expected time ≈ 5s on a 4090.
- Bilingual trailer – EN prompt with ZH continuation; use phoneme input for proper nouns.
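The latency figures above are just audio duration × RTF; a throwaway helper (the function name is ours) makes the arithmetic explicit:

```python
# Back-of-envelope synthesis time from the reported real-time factor (RTF ~0.17 on an RTX 4090).
def expected_synthesis_seconds(audio_seconds: float, rtf: float = 0.17) -> float:
    """Estimated wall-clock generation time for `audio_seconds` of output speech."""
    return audio_seconds * rtf

print(expected_synthesis_seconds(90))  # product explainer: ~15.3 s
print(expected_synthesis_seconds(30))  # podcast cold open: ~5.1 s
```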
Privacy & Safety
- Local runs keep data on your box. Hosted demos (e.g., HF Space) send data to that host—treat as third-party.
- Misuse warning: the authors flag cloning risks; label AI audio and obtain consent.
- Known limitations: Long, highly expressive scripts can wobble; non-EN/ZH may degrade.
Comparisons & Alternatives (Quick)
- VoxCPM vs. CosyVoice 2: VoxCPM is tokenizer-free with LocDiT; CosyVoice 2 is a strong open baseline—try both for accent fidelity.
- VoxCPM vs. GPT-SoVITS: VoxCPM offers end-to-end TTS + cloning; GPT-SoVITS is popular for cloning quality; pick by dataset fit and latency.
- VoxCPM vs. F5-TTS / IndexTTS2: VoxCPM emphasizes context-aware prosody and streaming; others differ in training & tokenization choices.
FAQ
Is there a free tier?
Yes. The model and code are free and open source under Apache-2.0, and you can try the hosted online demo at no cost.
What formats does it output?
16 kHz mono WAV by default (you can re-encode to MP3/OGG post-hoc).
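If you need MP3 or OGG, one common route is re-encoding after generation with ffmpeg (ffmpeg on PATH is assumed; file names are placeholders):

```python
# Convert the generated 16 kHz WAV to MP3 with ffmpeg.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "output.wav", "-codec:a", "libmp3lame", "-b:a", "192k", "output.mp3"],
    check=True,
)
```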
How fast is it?
On an RTX 4090, RTF ≈ 0.17 (~6× faster than real-time), so synthesis time ≈ audio duration × 0.17.
Does it support languages other than English/Chinese?
Not guaranteed; quality may be unpredictable.
What’s the best way to get correct pronunciations?
For tricky words, use phoneme input and turn text normalization off for that segment; for ordinary text, keep normalization on so numbers and dates are read correctly.
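A rough sketch of the two modes; the normalize keyword and the exact phoneme markup are assumptions and should be verified against the project docs:

```python
# Toggle text normalization depending on the input style (keyword name assumed; check the README).
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Ordinary text: leave normalization on so numbers and dates are expanded.
wav_plain = model.generate(text="The update ships on 2025-03-14 at 9 a.m.", normalize=True)

# Tricky name/brand: switch normalization off and spell the word with phoneme markup.
# Use the markup format documented in the README; the placeholder below is not real syntax.
wav_phoneme = model.generate(text="Please welcome {PHONEME-MARKUP-FOR-BRAND}.", normalize=False)
```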
Any tips for stability on long reads?
Split scripts into paragraphs; lower cfg_value slightly; increase inference timesteps modestly (see the sketch below).
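A minimal chunking sketch under the same API assumptions as above: split on blank lines, nudge the knobs, generate per paragraph, and concatenate the audio:

```python
# Paragraph-by-paragraph generation for long reads (generate() keywords assumed; defaults may differ).
import numpy as np
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

with open("long_script.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

chunks = [
    model.generate(
        text=p,
        cfg_value=1.8,           # slightly lower guidance for stability
        inference_timesteps=16,  # modestly more steps for long, expressive reads
    )
    for p in paragraphs
]

sf.write("long_read.wav", np.concatenate(chunks), 16000)
```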