# educa AI voice (preview)
educa AI voice is our in-house text-to-speech model, developed on top of EuroLLM-1.7B.

This version of the model is trained on a single speaker and generates natural-sounding German (and, to some extent, English) speech.

Be advised that this is a preview model meant to showcase the base model's capability. We will publish more advanced models in the near future (see the bottom of this model card).
## Examples

*(Audio samples are available on the model page.)*
## Model details
- Base LLM: EuroLLM-1.7B
- Audio Tokenizer: NeuCodec
## Pre-training
We pre-trained the model in two stages. First, we trained on billions of tokens of mixed speech and text data with a next-token-prediction objective. Then, we trained on tens of thousands of hours of German and English TTS data, mixed with a small amount of text instruction data to preserve the model's text-understanding capability.
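As a rough illustration of the first stage (a minimal sketch, not our training code; the function names and exact token layout are assumptions), speech can be represented as codec token IDs shifted into the text vocabulary, and the model then simply learns next-token prediction over the concatenated sequence:

```python
import torch
import torch.nn.functional as F

AUDIO_TOKENS_OFFSET = 128006  # assumed: codec IDs are shifted into the text vocabulary

def build_sequence(text_ids: torch.Tensor, codec_ids: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and offset audio tokens into one training sequence."""
    return torch.cat([text_ids, codec_ids + AUDIO_TOKENS_OFFSET])

def next_token_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Plain next-token prediction: predict token t+1 from tokens up to t."""
    logits = model(input_ids=input_ids.unsqueeze(0)).logits  # (1, T, vocab)
    return F.cross_entropy(logits[0, :-1].float(), input_ids[1:])
```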
We used the following datasets, as well as some in-house datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- amphion/Emilia-Dataset (German and English YODAS subsets)
- facebook/voxpopuli
- uhhlt/Tuda-De
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- Thorsten-Voice/TV-44kHz-Full
- CSTR-Edinburgh/vctk
- commonvoice_23
- kerstin
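Most of the public datasets above are on the Hugging Face Hub. As an illustration (not our actual pipeline), the German portion of VoxPopuli can be streamed with the `datasets` library; the config and field names below are taken from that dataset's card and may vary between versions:

```python
from datasets import load_dataset

# Stream the German VoxPopuli training split so nothing is downloaded up front.
ds = load_dataset("facebook/voxpopuli", "de", split="train", streaming=True)

sample = next(iter(ds))
print(sample["raw_text"])                # transcript
print(sample["audio"]["sampling_rate"])  # decoded waveform metadata
```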
## Inference example
```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec

device = "cuda"
model_id = "DigitalLearningGmbH/educa-ai-voice-preview"
audio_end_token_id = 128001   # marks the end of the generated audio stream
audio_tokens_offset = 128006  # audio codec tokens start at this ID in the vocabulary

# Load the TTS model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the NeuCodec decoder that turns audio tokens back into a waveform.
codec_model = NeuCodec.from_pretrained("neuphonic/neucodec")
codec_model = codec_model.eval().to(device)

# The model expects a TTS task marker, the text to synthesize, and an audio-start marker.
prompt_template = "<|task_tts|>{prompt} <|audio_start|>"
prompt = "Brautkleid bleibt Brautkleid und Blaukraut bleibt Blaukraut."
input_ids = tokenizer.encode(prompt_template.format(prompt=prompt), return_tensors="pt").to(device)

# Sample audio tokens from the model.
outputs = model.generate(input_ids=input_ids, do_sample=True, temperature=0.6, top_p=0.999, repetition_penalty=1.1, max_new_tokens=2048)

# Keep only the generated audio tokens (after the prompt, before the first
# audio-end token) and shift them back into codec ID space.
audio_end = (outputs[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
outputs_audio = outputs[0][input_ids.shape[1]:audio_end] - audio_tokens_offset

# Decode the codec tokens to a 24 kHz waveform and save it.
with torch.no_grad():
    recon = codec_model.decode_code(outputs_audio.unsqueeze(0).unsqueeze(0).to(device)).cpu()
torchaudio.save("tts.wav", recon[0, :, :], 24_000)
```
For even higher fidelity in German speech, use our fine-tuned NeuCodec decoder.
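It is a drop-in replacement for the default decoder in the example above; the repository id below is a placeholder, substitute the id of the fine-tuned decoder:

```python
# Placeholder id -- replace with the fine-tuned NeuCodec decoder's repository.
codec_model = NeuCodec.from_pretrained("DigitalLearningGmbH/<finetuned-neucodec>")
codec_model = codec_model.eval().to(device)
```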
## What's to come
As the name suggests, this is a preview model, mainly meant to showcase the capability of the base model. We trained on a small, single-speaker dataset without any special emotion tagging or similar annotations.

We are actively working on:
- multiple speakers with emotional control and nonverbal elements (fillers, laughing, ...)
- fine-tuning for general zero-shot voice cloning
- phoneme-based / hybrid generation
- post-training with reinforcement learning
Stay tuned - January 2026 is going to be exciting!