---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-North
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Hindi
      type: vistaar
      config: hindi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 12.14
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Gujarati
      type: vistaar
      config: gujarati
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Marathi
      type: vistaar
      config: marathi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.68
language:
- hi
- gu
- mr
pipeline_tag: automatic-speech-recognition
---

# AudioX: Multilingual Speech-to-Text Model

AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports **Hindi**, **Gujarati**, and **Marathi**, while AudioX-South covers **Tamil**, **Telugu**, **Kannada**, and **Malayalam**. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.

## Purpose-Built for Indian Languages

AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.

## Training Process

AudioX is fine-tuned using **supervised learning** on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.

## Data Preparation

The model is trained on:

- **Open-source multilingual ASR corpora**
- **Proprietary Indian-language medical datasets**

This hybrid approach boosts the model's generalization across dialects and acoustic conditions.

## Benchmarks

AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models. We evaluated AudioX on the [Vistaar Benchmark](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file) using the official evaluation script provided by AI4Bharat's Vistaar suite, ensuring a rigorous, standardized comparison across diverse language scenarios. All numbers in the table below are word error rates (WER, in percent); lower is better.
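For intuition, WER counts word-level substitutions, deletions, and insertions against a reference transcript. Below is a minimal sketch of such a check using the `jiwer` library; this is an illustrative assumption, as the official Vistaar script applies its own text normalization before scoring, and the example sentences are hypothetical:

```python
import jiwer

# Hypothetical reference/hypothesis pair; the real evaluation uses the Vistaar test sets
reference = "मरीज को तेज बुखार है"
hypothesis = "मरीज को तेज बुखार हैं"

# jiwer.wer returns (substitutions + deletions + insertions) / reference length,
# computed over whitespace-separated tokens
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")
```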
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|----------|-------|-------|----------|---------|-------|--------|---------|-----------|---------|
| **Jivi AI** | **AudioX** | **12.14** | 18.66 | 18.68 | **21.79** | **24.63** | **17.61** | **26.92** | **20.1** |
| ElevenLabs | Scribe-v1 | 13.64 | **17.96** | **16.51** | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |

## 🔧 Try This Model

You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load audio and resample to the 16 kHz rate the model expects
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
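The example above processes a single Whisper window, so audio beyond roughly 30 seconds is truncated. For longer recordings, the 🤗 `pipeline` API can chunk the input automatically; the sketch below is a suggestion, with the chunk length, device index, and file name chosen for illustration:

```python
from transformers import pipeline

# Chunked long-form inference; chunk_length_s=30 and device=0 are illustrative choices
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    chunk_length_s=30,
    device=0,  # set to -1 (or omit) to run on CPU
)

# "long_sample.wav" is a placeholder path; language/task are passed through to generate()
result = asr("long_sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```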