VieNeu-TTS

📢 Upcoming Release

VieNeu-TTS-1000h is currently in training, using ~1000 hours of high-quality Vietnamese speech combined with English speech data.
This next version will support bilingual voice synthesis (Vietnamese + English) with consistent speaker identity.

Expected improvements:

More accurate and stable Vietnamese pronunciation

Improved English pronunciation and code-switching

Higher voice cloning fidelity and speaker consistency

A GGUF version is also planned and will be released as early as possible.

Current release: VieNeu-TTS-140h (stable & production-ready)

Overview

VieNeu-TTS is an on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.
It is fine-tuned from NeuTTS Air and synthesizes natural 24 kHz speech in real time on CPU or GPU.

Support This Project

Training high-quality TTS models requires significant GPU resources and compute time. If you find this model useful, please consider supporting its development.

Your support helps maintain and improve VieNeu-TTS! 🙏

Voice Cloning Inference

Reference Voice (Speaker Example):

Input Text:

Trên bầu trời xanh thẳm, những đám mây trắng lửng lờ trôi như những chiếc thuyền nhỏ đang lướt nhẹ theo dòng gió. Dưới mặt đất, cánh đồng lúa vàng rực trải dài tới tận chân trời, những bông lúa nghiêng mình theo từng làn gió.

Generated Output (Cloned Voice):

Long Text Inference

VieNeu-TTS supports long-form text synthesis (multiple sentences, paragraphs, or entire articles).
For efficient sentence splitting, text normalization, and streaming playback, please refer to the example script in the repository:

🔗 https://github.com/pnnbao97/VieNeu-TTS
Example file: examples/infer_long_text.py

Long-form speech output example:

Model Architecture

Component	Description
Backbone	Qwen 0.5B (chat-format LM)
Codec	NeuCodec (supports ONNX + quantization)
Output	24 kHz waveform synthesis
Context Window	2048 tokens, shared between text and speech
Watermark	Enabled
Training Data	VieNeuCodec-dataset + Emilia dataset pretraining

Features

High-quality Vietnamese speech
Instant voice cloning (3–5 second reference audio)
Fully offline
Runs in real time or faster
Multi-voice reference support
Python API + CLI + Gradio

Quick Usage (Python)

from pathlib import Path
from vieneu_tts import VieNeuTTS
from utils.normalize_text import VietnameseTTSNormalizer
import soundfile as sf

# Reference clip and its transcript (the voice to clone)
ref_audio = "sample/id_0001.wav"
ref_text = Path("sample/id_0001.txt").read_text(encoding="utf-8")

# Normalize the reference transcript before passing it to the model
normalizer = VietnameseTTSNormalizer()
ref_text_norm = normalizer.normalize(ref_text)

# Load the backbone LM and the codec
tts = VieNeuTTS(
    backbone_repo="pnnbao-ump/VieNeu-TTS",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)

# Encode the reference audio into codec tokens
ref_codes = tts.encode_reference(ref_audio)

# Normalize the input text the same way as the reference transcript
text = "Công nghệ giọng nói đang phát triển rất nhanh."
text_norm = normalizer.normalize(text)

# Synthesize and save the result as a 24 kHz waveform
wav = tts.infer(text_norm, ref_codes, ref_text_norm)
sf.write("output.wav", wav, 24000)
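Before encoding a reference clip, it can help to confirm that it falls in the recommended 3–5 second range. The following is a minimal sketch, not part of the VieNeu-TTS API: the function names `wav_duration_seconds` and `is_reference_length_ok` are hypothetical, and it relies on Python's standard `wave` module, which only reads uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_reference_length_ok(path: str, min_s: float = 3.0, max_s: float = 5.0) -> bool:
    """Check a clip against the 3-5 s guideline (bounds are adjustable)."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_reference_length_ok("sample/id_0001.wav")` could be called before `tts.encode_reference(ref_audio)` to flag clips that are too short or too long.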

Gradio Demo

python gradio_app.py

Open your browser at http://127.0.0.1:7860.

Reference Voices

File	Gender	Accent
id_0001	Male	South
id_0002	Female	South
id_0003	Male	South
id_0004	Female	South
id_0005	Male	South
id_0007	Male	South

Odd numbers = Male
Even numbers = Female
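
The odd/even naming convention above can be applied programmatically when selecting a reference voice. This is a small illustrative helper, assuming file IDs always follow the `id_NNNN` pattern shown in the table; `voice_gender` is a hypothetical name and not part of the repository.

```python
import re


def voice_gender(file_id: str) -> str:
    """Map a reference voice ID such as 'id_0001' to 'Male' or 'Female'."""
    match = re.fullmatch(r"id_(\d+)", file_id)
    if match is None:
        raise ValueError(f"Unexpected reference voice ID: {file_id!r}")
    # Odd numbers are male voices, even numbers are female voices.
    return "Male" if int(match.group(1)) % 2 == 1 else "Female"
```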

Best Practices

Keep input ≤ 250 characters per call
Normalize both text and reference transcript
Use clean reference audio (~3–5s)
For long text, use chunked inference
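
The 250-character limit and chunked inference can be combined by splitting long text at sentence boundaries before synthesis. The sketch below is one possible approach and is not the repository's `examples/infer_long_text.py`; the function name `split_into_chunks` and the punctuation-based splitting rule are assumptions made for illustration.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 250) -> List[str]:
    """Group sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at word boundaries.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space found: hard cut
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be normalized and passed to `tts.infer` in turn, with the resulting waveforms joined afterwards. How best to join them (for example, with short pauses between chunks) depends on the application.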

Troubleshooting

Issue	Cause	Solution
Missing `libespeak`	System dependency	Install eSpeak NG
GPU OOM	VRAM too small	Use CPU or quantized model
Poor voice match	Bad reference sample	Try a clearer reference clip
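
Two of the issues above can be checked before loading the model. The following is a rough preflight sketch under stated assumptions: the helper names are hypothetical, `ctypes.util.find_library` is used as a best-effort lookup whose behavior varies by platform, and `"cpu"` is assumed to be an accepted value for `backbone_device` and `codec_device`. It only detects whether CUDA is available; it does not detect insufficient VRAM.

```python
import ctypes.util
from typing import Optional


def find_espeak_library() -> Optional[str]:
    """Return the name of an installed eSpeak shared library, or None."""
    for name in ("espeak-ng", "espeak"):
        found = ctypes.util.find_library(name)
        if found:
            return found
    return None


def pick_device() -> str:
    """Return 'cuda' if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

If `find_espeak_library()` returns `None`, installing eSpeak NG is the likely fix. The value from `pick_device()` could be passed as the device arguments when constructing `VieNeuTTS`.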

License

Apache 2.0

Citation

@misc{vieneutts2025,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}

Please also cite the base model:

@misc{neuttsair2025,
  title        = {NeuTTS Air: On-Device Speech Language Model with Instant Voice Cloning},
  author       = {Neuphonic},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}}
}
