---
license: apache-2.0
language:
- ne
metrics:
- wer
base_model:
- openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- openslr
- generated_from_trainer
- speech
model-index:
- name: Whisper Small Nepali (OpenSLR)
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: OpenSLR 54 (Nepali Speech Corpus)
      type: openslr
    metrics:
    - type: wer
      value: 26.69
      name: Wer
---

# Whisper Small Fine-tuned on Nepali (OpenSLR 54)

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the **OpenSLR 54 (Nepali Speech Corpus)** dataset. Trained on ~154 hours of Nepali read speech, it reaches a word error rate (WER) of 26.69% on the held-out test split.

## Model Details

### Model Description

- **Model architecture:** Whisper Small (244M parameters)
- **Language:** Nepali (ne)
- **Task:** Automatic Speech Recognition (transcription)
- **Dataset:** OpenSLR 54 (~157,000 utterances)
- **Fine-tuning hardware:** NVIDIA A100 80GB

## Usage

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="fnawaraj/whisper-small-nepali-openslr",
)

# Transcribe an audio file
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])
```

## Training Data

The model was trained on the OpenSLR 54 (Nepali Speech Corpus):

- Total audio duration: ~154 hours
- Total utterances: 157,905
- Sampling rate: 16 kHz

### Training Procedure

#### Training Hyperparameters

The following hyperparameters were used during training:

- Learning rate: 1e-05
- Train batch size: 8
- Eval batch size: 8
- Gradient accumulation steps: 4 (effective batch size: 32)
- Optimizer: AdamW
- LR scheduler: linear decay with warmup (500 steps)
- Training steps: 10,000
- Mixed precision: FP16

## Evaluation Results

The model was evaluated on the unseen test split of the OpenSLR dataset (1,580 samples).
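WER counts word-level substitutions, insertions, and deletions against the reference transcript, normalized by the reference length. For illustration only, a minimal dependency-free sketch (the reported score was produced by the evaluation pipeline's own metric, e.g. a standard WER library, not this code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```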
| Metric                | Score  |
|-----------------------|--------|
| Word Error Rate (WER) | 26.69% |
| Validation Loss       | 0.210  |

## Limitations

- The model performs best on high-quality read speech. It may struggle with very fast conversational speech or heavy background noise compared to models trained on more diverse, noisy data.
- Some phonetic spelling variations (e.g., short vs. long vowels) may occur, as these sound identical in spoken Nepali.

## Acknowledgments

This research was supported by the High Performance Computing (HPC) facility at Tribhuvan University, Nepal. We acknowledge the Supercomputer Centre for providing the computational resources required for this work.
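For anyone reproducing the run, the hyperparameters listed above map onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a sketch: the `output_dir` and the evaluation/decoding flags are assumptions, not taken from the training logs.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a training configuration matching the reported hyperparameters.
# `output_dir` is a hypothetical path; saving/evaluation cadence is not specified on this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-nepali-openslr",  # hypothetical
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 * 4 = 32
    learning_rate=1e-5,
    warmup_steps=500,                # linear decay after warmup (default scheduler)
    max_steps=10_000,
    fp16=True,                       # mixed precision
    predict_with_generate=True,      # decode full sequences so WER can be computed during eval
)
```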