If I understood correctly, the two figures in https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents#results-throughput-accuracy-and-speed-at-scale compare a 1.1B model with a 600M one. This is misleading: demonstrating the added value of the caching mechanism should be done on models of the same size, otherwise half of the "3x" gain could be attributed to the parameter-count difference alone (if throughput scales roughly inversely with parameter count, 1.1B vs 600M already accounts for about 1.8x, i.e. roughly half of the 3x in multiplicative terms).
First of all, thanks for the leaderboard, a very useful resource.
It would be a very nice addition to have an "efficiency" column that directly shows the AverageWER/RTFx ratio, or a 2D plot of it showing the "Pareto frontier", as is often done for LLMs nowadays.
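For illustration, a minimal sketch of how that Pareto frontier could be computed from the two existing leaderboard columns (the model names and numbers below are invented; only the WER/RTFx semantics mirror the leaderboard):

```python
# Minimal sketch: extract the WER/RTFx Pareto frontier from leaderboard rows.
# Model names and numbers are invented for illustration.

rows = [
    # (model, average WER in %, RTFx); lower WER and higher RTFx are better
    ("model-a", 6.5, 120.0),
    ("model-b", 7.1, 450.0),
    ("model-c", 8.9, 300.0),  # dominated by model-b on both axes
    ("model-d", 10.2, 900.0),
]

def pareto_frontier(rows):
    """Keep rows not dominated by any other row (lower WER AND higher RTFx)."""
    frontier = []
    for name, wer, rtfx in rows:
        dominated = any(
            o_wer <= wer and o_rtfx >= rtfx and (o_wer, o_rtfx) != (wer, rtfx)
            for _, o_wer, o_rtfx in rows
        )
        if not dominated:
            frontier.append((name, wer, rtfx))
    # Sort by speed so the frontier reads left to right on a WER-vs-RTFx plot.
    return sorted(frontier, key=lambda r: r[2])

for name, wer, rtfx in pareto_frontier(rows):
    print(f"{name}: WER={wer:.1f}%  RTFx={rtfx:.0f}  WER/RTFx={wer / rtfx:.4f}")
```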
I don't know if the best place is here or on GitHub, but here we go:
What you describe seems to directly relate to https://github.com/xinjli/transphone and https://github.com/dmort27/epitran, are you aware of those? I am not against a more up-to-date library, since those don't seem maintained, but maybe there are some things to borrow.
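For reference, epitran usage is essentially a one-liner per language (a minimal sketch; French works out of the box, while English additionally needs the external Flite tool installed):

```python
# Minimal sketch of epitran G2P (pip install epitran).
# French works out of the box; English requires the external Flite binary.
import epitran

epi = epitran.Epitran("fra-Latn")  # ISO 639-3 language code + script
print(epi.transliterate("bonjour tout le monde"))  # -> IPA string
```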
One interesting next step would be not to stop at phonemes but to go down to articulatory features; see https://github.com/DigitalPhonetics/IMS-Toucan/blob/StochasticProsodyModeling/Preprocessing/articulatory_features.py and https://github.com/DigitalPhonetics/IMS-Toucan/blob/StochasticProsodyModeling/Preprocessing/TextFrontend.py
The main practical problem with G2P/text processing for me currently is that TTS papers' code too often goes: "I'll cover English by copy-pasting the text preprocessing code from Tacotron (including transforming 'Mr.' into 'mister', and so on), add my own language, and stop there." Having a go-to, really massively multilingual text processing library that doesn't need to pull in espeak would allow researchers to focus on the modeling part, and would spare the rest of us from systematically forking the code to cover languages other than English/Chinese.
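To make that criticism concrete, here is roughly the kind of English-only cleaner that gets copy-pasted between TTS repos (a paraphrase of the Tacotron-recipe cleaners, not an exact copy):

```python
# Rough paraphrase of the Tacotron-style English cleaners that get
# copy-pasted between TTS repos; useless for any other language.
import re

_ABBREVIATIONS = [
    (re.compile(rf"\b{abbr}\.", re.IGNORECASE), full)
    for abbr, full in [
        ("mr", "mister"),
        ("mrs", "misess"),
        ("dr", "doctor"),
        ("st", "saint"),
    ]
]

def english_cleaner(text: str) -> str:
    text = text.lower()
    for pattern, replacement in _ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text

print(english_cleaner("Mr. Smith lives on St. John St."))
# -> "mister smith lives on saint john saint"
# (even English gets mangled: the street suffix "St." is expanded too)
```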
But then it should fail explicitly upon loading if run on incompatible hardware, not happily load and then generate "stupid things" (per the OP's description).
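A simple load-time guard would do; a minimal sketch, assuming the incompatibility is a too-old GPU compute capability (the threshold here is hypothetical, adapt it to what the model actually requires):

```python
# Sketch of an explicit load-time guard (assumes the model needs bf16,
# i.e. compute capability >= 8.0 on NVIDIA GPUs; adjust to the real requirement).
import torch

def assert_hardware_compatible(min_capability: tuple[int, int] = (8, 0)) -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA GPU required but none is available.")
    capability = torch.cuda.get_device_capability()
    if capability < min_capability:
        raise RuntimeError(
            f"GPU compute capability {capability} < required {min_capability}; "
            "refusing to load rather than silently generating garbage."
        )

assert_hardware_compatible()  # call this before loading the weights
```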
Just a small typo in the Continued Pretraining paragraph: "...performance. On text classification (TC)..." should be Token Classification, as Text Classification, in Hugging Face's own Transformers library, is more of a synonym for Sequence Classification, which is confusing.
Thanks ;)