---
license: apache-2.0
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
model-index:
- name: Jivi-AudioX-North
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Hindi
      type: vistaar
      config: hindi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 12.14
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Gujarati
      type: vistaar
      config: gujarati
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vistaar Benchmark Marathi
      type: vistaar
      config: marathi
      split: test
    metrics:
    - name: WER
      type: wer
      value: 18.68
language:
- hi
- gu
- mr
pipeline_tag: automatic-speech-recognition
---

# AudioX: Multilingual Speech-to-Text Model

AudioX is a state-of-the-art Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It comprises two specialized variants, AudioX-North and AudioX-South, each optimized for a distinct set of Indian languages for higher accuracy. AudioX-North supports **Hindi**, **Gujarati**, and **Marathi**, while AudioX-South covers **Tamil**, **Telugu**, **Kannada**, and **Malayalam**. Trained on a combination of open-source ASR datasets and proprietary audio, the AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.

## Purpose-Built for Indian Languages

AudioX is designed to handle diverse Indian language inputs, supporting real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.

## Training Process

AudioX is fine-tuned using **supervised learning** on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.

## Data Preparation

The model is trained on:

- **Open-source multilingual ASR corpora**
- **Proprietary Indian-language medical datasets**

This hybrid approach boosts the model's generalization across dialects and acoustic conditions.

## Benchmarks

AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models. We evaluated AudioX on the [Vistaar Benchmark](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file) using the official evaluation script provided by AI4Bharat's Vistaar suite, ensuring a rigorous, standardized comparison across diverse language scenarios. All numbers in the table below are word error rates (WER, in percent); lower is better.
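For intuition, WER counts word-level substitutions, deletions, and insertions against a reference transcript. Below is a minimal sketch of such a check using the `jiwer` library; this is an illustrative assumption, as the official Vistaar script applies its own text normalization before scoring, and the example sentences are hypothetical:

```python
import jiwer

# Hypothetical reference/hypothesis pair; the real evaluation uses the Vistaar test sets
reference = "मरीज को तेज बुखार है"
hypothesis = "मरीज को तेज बुखार हैं"

# jiwer.wer returns (substitutions + deletions + insertions) / reference length,
# computed over whitespace-separated tokens
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")
```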
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|----------|-------|-------|----------|---------|-------|--------|---------|-----------|---------|
| **Jivi AI** | **AudioX** | **12.14** | 18.66 | 18.68 | **21.79** | **24.63** | **17.61** | **26.92** | **20.1** |
| ElevenLabs | Scribe-v1 | 13.64 | **17.96** | **16.51** | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |

## 🔧 Try This Model

You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load audio and resample to the 16 kHz rate the model expects
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate predictions
# Use ISO 639-1 language codes: "hi", "mr", "gu" for North; "ta", "te", "kn", "ml" for South
# Or omit the language argument for automatic language detection
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode predictions
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
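The example above processes a single Whisper window, so audio beyond roughly 30 seconds is truncated. For longer recordings, the 🤗 `pipeline` API can chunk the input automatically; the sketch below is a suggestion, with the chunk length, device index, and file name chosen for illustration:

```python
from transformers import pipeline

# Chunked long-form inference; chunk_length_s=30 and device=0 are illustrative choices
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    chunk_length_s=30,
    device=0,  # set to -1 (or omit) to run on CPU
)

# "long_sample.wav" is a placeholder path; language/task are passed through to generate()
result = asr("long_sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```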