If I understood correctly, the two figures in https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents#results-throughput-accuracy-and-speed-at-scale compare a 1.1B model with a 600M one. This is misleading: demonstrating the added value of the caching mechanism should be done on models of the same size, otherwise half of the "3x" gain could be attributed to the parameter-count difference alone (if throughput scales roughly inversely with parameter count, 1.1B vs 600M already accounts for about 1.8x, i.e. roughly half of the 3x in multiplicative terms).
First of all, thanks for the leaderboard, a very useful resource.
It would be a very nice addition to have an "efficiency" column that directly shows the AverageWER/RTFx ratio, or a 2D plot of it showing the "Pareto frontier", as is often done for LLMs nowadays.
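For illustration, a minimal sketch of how that Pareto frontier could be computed from the two existing leaderboard columns (the model names and numbers below are invented; only the WER/RTFx semantics mirror the leaderboard):

```python
# Minimal sketch: extract the WER/RTFx Pareto frontier from leaderboard rows.
# Model names and numbers are invented for illustration.

rows = [
    # (model, average WER in %, RTFx); lower WER and higher RTFx are better
    ("model-a", 6.5, 120.0),
    ("model-b", 7.1, 450.0),
    ("model-c", 8.9, 300.0),  # dominated by model-b on both axes
    ("model-d", 10.2, 900.0),
]

def pareto_frontier(rows):
    """Keep rows not dominated by any other row (lower WER AND higher RTFx)."""
    frontier = []
    for name, wer, rtfx in rows:
        dominated = any(
            o_wer <= wer and o_rtfx >= rtfx and (o_wer, o_rtfx) != (wer, rtfx)
            for _, o_wer, o_rtfx in rows
        )
        if not dominated:
            frontier.append((name, wer, rtfx))
    # Sort by speed so the frontier reads left to right on a WER-vs-RTFx plot.
    return sorted(frontier, key=lambda r: r[2])

for name, wer, rtfx in pareto_frontier(rows):
    print(f"{name}: WER={wer:.1f}%  RTFx={rtfx:.0f}  WER/RTFx={wer / rtfx:.4f}")
```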
I don't know if the best place is here or on GitHub, but here we go:
What you describe seems to directly relate to https://github.com/xinjli/transphone and https://github.com/dmort27/epitran, are you aware of those? I am not against a more up-to-date library, since those don't seem maintained, but maybe there are some things to borrow.
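For reference, epitran usage is essentially a one-liner per language (a minimal sketch; French works out of the box, while English additionally needs the external Flite tool installed):

```python
# Minimal sketch of epitran G2P (pip install epitran).
# French works out of the box; English requires the external Flite binary.
import epitran

epi = epitran.Epitran("fra-Latn")  # ISO 639-3 language code + script
print(epi.transliterate("bonjour tout le monde"))  # -> IPA string
```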
One interesting next step would be not to stop at phonemes but to go down to articulatory features; see https://github.com/DigitalPhonetics/IMS-Toucan/blob/StochasticProsodyModeling/Preprocessing/articulatory_features.py and https://github.com/DigitalPhonetics/IMS-Toucan/blob/StochasticProsodyModeling/Preprocessing/TextFrontend.py
The main practical problem with G2P/text processing for me currently is that TTS papers' code too often goes: "I'll cover English by copy-pasting the text preprocessing code from Tacotron (including transforming 'Mr.' into 'mister', and so on), add my own language, and stop there." Having a go-to, really massively multilingual text processing library that doesn't need to pull in espeak would allow researchers to focus on the modeling part, and would spare the rest of us from systematically forking the code to cover languages other than English/Chinese.
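To make that criticism concrete, here is roughly the kind of English-only cleaner that gets copy-pasted between TTS repos (a paraphrase of the Tacotron-recipe cleaners, not an exact copy):

```python
# Rough paraphrase of the Tacotron-style English cleaners that get
# copy-pasted between TTS repos; useless for any other language.
import re

_ABBREVIATIONS = [
    (re.compile(rf"\b{abbr}\.", re.IGNORECASE), full)
    for abbr, full in [
        ("mr", "mister"),
        ("mrs", "misess"),
        ("dr", "doctor"),
        ("st", "saint"),
    ]
]

def english_cleaner(text: str) -> str:
    text = text.lower()
    for pattern, replacement in _ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text

print(english_cleaner("Mr. Smith lives on St. John St."))
# -> "mister smith lives on saint john saint"
# (even English gets mangled: the street suffix "St." is expanded too)
```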
But then it should fail explicitly upon loading if run on incompatible hardware, not happily load and then generate "stupid things" (per the OP's description).
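A simple load-time guard would do; a minimal sketch, assuming the incompatibility is a too-old GPU compute capability (the threshold here is hypothetical, adapt it to what the model actually requires):

```python
# Sketch of an explicit load-time guard (assumes the model needs bf16,
# i.e. compute capability >= 8.0 on NVIDIA GPUs; adjust to the real requirement).
import torch

def assert_hardware_compatible(min_capability: tuple[int, int] = (8, 0)) -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA GPU required but none is available.")
    capability = torch.cuda.get_device_capability()
    if capability < min_capability:
        raise RuntimeError(
            f"GPU compute capability {capability} < required {min_capability}; "
            "refusing to load rather than silently generating garbage."
        )

assert_hardware_compatible()  # call this before loading the weights
```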
Just a small typo in the Continued Pretraining paragraph: "...performance. On text classification (TC)..." should be Token Classification, as Text Classification, in Hugging Face's own Transformers library, is more of a synonym for Sequence Classification, which is confusing.
Thanks ;)