	---
	title: "Extracting Audio and Transcripts"
	original_url: "https://tds.s-anand.net/#/extracting-audio-and-transcripts?id=media-tools-yt-dlp"
	downloaded_at: "2025-06-08T23:25:44.497461"
	---

	[Extracting Audio and Transcripts](#/extracting-audio-and-transcripts?id=extracting-audio-and-transcripts)
	----------------------------------------------------------------------------------------------------------

	[Media Processing: FFmpeg](#/extracting-audio-and-transcripts?id=media-processing-ffmpeg)
	-----------------------------------------------------------------------------------------

	[FFmpeg](https://ffmpeg.org/) is the standard command-line tool for processing video and audio files. Data scientists working with media files typically use it for:

	* Extracting audio/video for machine learning
	* Converting formats for web deployment
	* Creating visualizations and presentations
	* Processing large media datasets

	Basic Operations:

	```
	# Basic conversion
	ffmpeg -i input.mp4 output.avi

	# Extract audio
	ffmpeg -i input.mp4 -vn output.mp3

	# Convert format without re-encoding
	ffmpeg -i input.mkv -c copy output.mp4

	# High quality encoding (crf: 0-51, lower is better)
	ffmpeg -i input.mp4 -preset slower -crf 18 output.mp4
	```

	Common Data Science Tasks:

	```
	# Extract frames for computer vision
	ffmpeg -i input.mp4 -vf "fps=1" frames_%04d.png # 1 frame per second
	ffmpeg -i input.mp4 -vf "select='eq(n,0)'" -vframes 1 first_frame.jpg

	# Create video from image sequence (-r 1/5 reads one image every 5 seconds; output runs at 25 fps)
	ffmpeg -r 1/5 -i img%03d.png -c:v libx264 -vf fps=25 output.mp4

	# Extract audio for speech recognition
	ffmpeg -i input.mp4 -ar 16000 -ac 1 audio.wav # 16kHz mono

	# Trim video/audio for training data
	ffmpeg -ss 00:01:00 -i input.mp4 -t 00:00:30 -c copy clip.mp4
	```
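
When these commands are driven from a script, it can help to build the argument list in Python and inspect it before running anything. The following is a minimal sketch, not part of the original notes: the function names `speech_audio_cmd` and `run` are hypothetical, and it assumes `ffmpeg` is on the `PATH`. It combines the trim and 16 kHz mono extraction commands shown above.

```python
import shlex
import subprocess
from typing import List, Optional


def speech_audio_cmd(src: str, dst: str, start: Optional[str] = None,
                     duration: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that writes 16 kHz mono audio from src to dst."""
    cmd = ["ffmpeg", "-hide_banner", "-y"]  # -y overwrites dst if it exists
    if start is not None:
        cmd += ["-ss", start]               # seek before opening the input
    cmd += ["-i", src]
    if duration is not None:
        cmd += ["-t", duration]             # keep this much after the seek point
    cmd += ["-vn", "-ar", "16000", "-ac", "1", dst]  # no video, 16 kHz, mono
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError if ffmpeg exits non-zero."""
    print(shlex.join(cmd))  # show the exact command being executed
    subprocess.run(cmd, check=True)
```

For example, `run(speech_audio_cmd("input.mp4", "audio.wav", start="00:01:00", duration="00:00:30"))` would extract a 30-second speech clip, provided `input.mp4` exists.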

	Processing Multiple Files:

	```
	# Concatenate videos (first create files.txt with list of files)
	echo "file 'input1.mp4'
	file 'input2.mp4'" > files.txt
	ffmpeg -f concat -i files.txt -c copy output.mp4

	# Batch process with shell loop (create the output directory first)
	mkdir -p audio
	for f in *.mp4; do
	  ffmpeg -i "$f" -vn "audio/${f%.mp4}.wav"
	done
	```
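
The same batch loop can be sketched in Python when you want to record which files failed instead of stopping or silently continuing. This is an illustrative sketch under assumptions: `audio_job` and `run_batch` are hypothetical helper names, `ffmpeg` is assumed to be installed, and `-n` is used so existing outputs are never overwritten.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def audio_job(src: Path, out_dir: Path) -> List[str]:
    """Return the ffmpeg command that extracts src's audio into out_dir as WAV."""
    dst = out_dir / f"{src.stem}.wav"
    return ["ffmpeg", "-hide_banner", "-n", "-i", str(src), "-vn", str(dst)]


def run_batch(sources: Iterable[Path], out_dir: Path = Path("audio")) -> List[Path]:
    """Process every source file and return the ones ffmpeg could not convert."""
    failed = []
    for src in sources:
        out_dir.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(audio_job(src, out_dir), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(src)  # keep going; report failures at the end
    return failed
```

A typical call would be `run_batch(sorted(Path(".").glob("*.mp4")))`, which mirrors the shell loop above.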

	Data Analysis Features:

	```
	# Get media file information
	ffprobe -v quiet -print_format json -show_format -show_streams input.mp4

	# Display frame metadata
	ffprobe -v quiet -print_format json -show_frames input.mp4

	# Generate video thumbnails
	ffmpeg -i input.mp4 -vf "thumbnail" -frames:v 1 thumb.jpg
	```
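
Because `ffprobe` can emit JSON, its output is easy to load into Python for analysis. The sketch below is one possible way to do that, assuming `ffprobe` is on the `PATH`; `probe` and `summarize` are hypothetical names. Note that `ffprobe` reports values such as `duration` and `sample_rate` as strings, so they are converted here.

```python
import json
import subprocess
from typing import Any, Dict


def probe(path: str) -> Dict[str, Any]:
    """Run ffprobe with the same flags as above and parse its JSON output."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def summarize(info: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a few commonly needed fields from parsed ffprobe output."""
    streams = info.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    duration = info.get("format", {}).get("duration")
    return {
        "duration_s": float(duration) if duration is not None else None,
        "video_codec": video.get("codec_name") if video else None,
        "resolution": (video["width"], video["height"])
        if video and "width" in video and "height" in video else None,
        "audio_codec": audio.get("codec_name") if audio else None,
        "sample_rate": int(audio["sample_rate"])
        if audio and "sample_rate" in audio else None,
    }
```

Calling `summarize(probe("input.mp4"))` returns a small dictionary that can be collected into a table when profiling a media dataset.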

	Watch this introduction to FFmpeg (12 min):

	[![FFmpeg in 12 Minutes](https://i.ytimg.com/vi_webp/MPV7JXTWPWI/sddefault.webp)](https://youtu.be/MPV7JXTWPWI)

	Tools:

	* [ffmpeg.lav.io](https://ffmpeg.lav.io/): Interactive command builder
	* [FFmpeg Explorer](https://ffmpeg.guide/): Visual FFmpeg command generator
	* [FFmpeg Buddy](https://evanhahn.github.io/ffmpeg-buddy/): Simple command generator

	Tips:

	1. Use `-c copy` when possible to avoid re-encoding
	2. Monitor progress with `-progress pipe:1`
	3. Use `-hide_banner` to reduce output verbosity
	4. Test commands with small clips first
	5. Use hardware acceleration when available (`-hwaccel auto`)
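
With `-progress pipe:1`, FFmpeg writes `key=value` lines to standard output, and each update ends with a `progress=continue` or `progress=end` line. A minimal sketch of reading that stream from Python follows; the helper names `progress_blocks` and `seconds` are hypothetical, and the parsing assumes the `out_time` field is in `HH:MM:SS.ffffff` form.

```python
from typing import Dict, Iterable, Iterator


def progress_blocks(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group key=value lines into one dict per progress update."""
    block: Dict[str, str] = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # skip anything that is not key=value
        block[key] = value.strip()
        if key == "progress":  # this key closes each update
            yield block
            block = {}


def seconds(out_time: str) -> float:
    """Convert an HH:MM:SS.ffffff timestamp to seconds."""
    h, m, s = out_time.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)
```

To use it, start FFmpeg with `subprocess.Popen([..., "-progress", "pipe:1", "-nostats", ...], stdout=subprocess.PIPE, text=True)` and pass `proc.stdout` to `progress_blocks`.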

	Error Handling:

	```
	# Validate file before processing
	ffprobe input.mp4 2>&1 | grep "Invalid"

	# Stop at the first error instead of continuing (ffmpeg exits with a non-zero status)
	ffmpeg -xerror -i input.mp4 output.mp4

	# Get detailed error information
	ffmpeg -v error -i input.mp4 2>&1 | grep -A2 "Error"
	```



	[Media tools: yt-dlp](#/extracting-audio-and-transcripts?id=media-tools-yt-dlp)
	-------------------------------------------------------------------------------

	[yt-dlp](https://github.com/yt-dlp/yt-dlp) is a feature-rich command-line tool for downloading audio/video from thousands of sites. It’s particularly useful for extracting audio and transcripts from videos.

	Install using your package manager:

	```
	# macOS
	brew install yt-dlp

	# Linux
	curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o ~/.local/bin/yt-dlp
	chmod a+rx ~/.local/bin/yt-dlp

	# Windows
	winget install yt-dlp
	```

	Common operations for extracting audio and transcripts:

	```
	# Download audio only at lowest quality suitable for speech
	yt-dlp -f "ba[abr<50]/worstaudio" \
	--extract-audio \
	--audio-format mp3 \
	--audio-quality 32k \
	"https://www.youtube.com/watch?v=VIDEO_ID"

	# Download auto-generated subtitles (converted to SRT if the site does not offer SRT directly)
	yt-dlp --write-auto-sub \
	--skip-download \
	--sub-format "srt" \
	--convert-subs srt \
	"https://www.youtube.com/watch?v=VIDEO_ID"

	# Download both audio and subtitles with custom output template
	yt-dlp -f "ba[abr<50]/worstaudio" \
	--extract-audio \
	--audio-format mp3 \
	--audio-quality 32k \
	--write-auto-sub \
	--sub-format "srt" \
	--convert-subs srt \
	-o "%(title)s.%(ext)s" \
	"https://www.youtube.com/watch?v=VIDEO_ID"

	# Download entire playlist's audio
	yt-dlp -f "ba[abr<50]/worstaudio" \
	--extract-audio \
	--audio-format mp3 \
	--audio-quality 32k \
	-o "%(playlist_index)s-%(title)s.%(ext)s" \
	"https://www.youtube.com/playlist?list=PLAYLIST_ID"
	```

	For Python integration:

	```
	# /// script
	# requires-python = ">=3.9"
	# dependencies = ["yt-dlp"]
	# ///

	import yt_dlp

	def download_audio(url: str) -> None:
	    """Download audio at speech-optimized quality."""
	    ydl_opts = {
	        'format': 'ba[abr<50]/worstaudio',
	        'postprocessors': [{
	            'key': 'FFmpegExtractAudio',
	            'preferredcodec': 'mp3',
	            'preferredquality': '32'
	        }]
	    }

	    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
	        ydl.download([url])

	# Example usage
	download_audio('https://www.youtube.com/watch?v=VIDEO_ID')
	```
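
The subtitle download can be scripted the same way. The sketch below is an assumption-labeled example rather than part of the original notes: `subtitle_opts` and `download_subtitles` are hypothetical names, while the dictionary keys are the `YoutubeDL` options that correspond to `--skip-download`, `--write-auto-sub`, `--sub-langs`, and `--sub-format`. `yt_dlp` is imported lazily so the options can be inspected without the package installed.

```python
from typing import Any, Dict


def subtitle_opts(lang: str = "en", fmt: str = "srt/best") -> Dict[str, Any]:
    """Options for fetching auto-generated subtitles without the media itself."""
    return {
        "skip_download": True,          # like --skip-download
        "writeautomaticsub": True,      # like --write-auto-sub
        "subtitleslangs": [lang],       # like --sub-langs
        "subtitlesformat": fmt,         # like --sub-format (preference list)
        "outtmpl": "%(title)s.%(ext)s", # like -o
    }


def download_subtitles(url: str, lang: str = "en") -> None:
    """Download auto-generated subtitles for one video URL."""
    import yt_dlp  # lazy import: requires `pip install yt-dlp`

    with yt_dlp.YoutubeDL(subtitle_opts(lang)) as ydl:
        ydl.download([url])
```

The `srt/best` preference asks for SRT and falls back to whatever format the site provides, so the resulting file extension may differ from `.srt`.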

	Tools:

	* [ffmpeg](https://ffmpeg.org/): Required for audio extraction and conversion
	* [whisper](https://github.com/openai/whisper): Can be used with yt-dlp for speech-to-text
	* [gallery-dl](https://github.com/mikf/gallery-dl): Alternative for image-focused sites

	Note: Always respect copyright and terms of service when downloading content.

	[Whisper transcription](#/extracting-audio-and-transcripts?id=whisper-transcription)
	------------------------------------------------------------------------------------

	[Faster Whisper](https://github.com/SYSTRAN/faster-whisper) is a highly optimized implementation of OpenAI’s [Whisper model](https://github.com/openai/whisper), offering up to 4x faster transcription while using less memory.

	You can install it via:

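	* `pip install faster-whisper`: installs the Python library
	* [Download Windows Standalone](https://github.com/Purfview/whisper-standalone-win/releases): a prebuilt command-line build that provides the `faster-whisper-xxl` command used in the examples here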

	Here’s a basic usage example:

	```
	faster-whisper-xxl "video.mp4" --model medium --language en
	```

	Here’s my recommendation for transcribing videos. This saves the output in JSON as well as SRT format in the source directory.

	```
	faster-whisper-xxl --print_progress --output_dir source --batch_recursive \
	  --check_files --standard --output_format json srt \
	  --model medium --language en "$FILE"
	```

	* `--model`: The OpenAI Whisper model to use. You can choose from:
	  + `tiny`: Fastest but least accurate
	  + `base`: Good for simple audio
	  + `small`: Balanced speed/accuracy
	  + `medium`: Recommended default
	  + `large-v3`: Most accurate but slowest
	* `--output_format`: The output format to use. You can pick multiple formats from:
	  + `json`: Has the most detailed information including timing, text, quality, etc.
	  + `srt`: A popular subtitle format. You can use this in YouTube, for example.
	  + `vtt`: A modern subtitle format.
	  + `txt`: Just the text transcript
	* `--output_dir`: The directory to save the output files. `source` indicates the source directory, i.e. where the input `$FILE` is.
	* `--language`: The language of the input file. If you don’t specify it, it analyzes the first 30 seconds to auto-detect. You can speed it up by specifying it.

	Run `faster-whisper-xxl --help` for more options.
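
If you installed the Python library instead of the standalone command, a transcript can be produced from code. The following is a minimal sketch, assuming `pip install faster-whisper` and its `WhisperModel.transcribe` call, which yields segments with `start`, `end`, and `text` attributes. The helper names `srt_timestamp`, `to_srt`, and `transcribe_to_srt` are hypothetical, and the SRT formatting is hand-rolled here rather than taken from the library.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render objects with .start, .end and .text as numbered SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(path: str, model_size: str = "medium", language: str = "en") -> str:
    """Transcribe an audio or video file and return the transcript as SRT text."""
    from faster_whisper import WhisperModel  # lazy import: needs faster-whisper

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(path, language=language)
    return to_srt(segments)  # segments is a generator; it is consumed here
```

For instance, `open("video.srt", "w").write(transcribe_to_srt("video.mp4"))` would save subtitles next to the input. The first call downloads the chosen model, so it can take a while.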

	[Gemini transcription](#/extracting-audio-and-transcripts?id=gemini-transcription)
	----------------------------------------------------------------------------------

	The [Gemini](https://gemini.google.com/) models from Google are notable in two ways:

	1. They have a huge input context window. Gemini 2.0 Flash can accept 1M tokens, for example.
	2. They can handle audio input.

	This allows us to use Gemini to transcribe audio files.

	LLMs are not good at transcribing audio faithfully. They tend to correct errors and meander from what was said. But they are intelligent. That enables a few powerful workflows. Here are some examples:

	1. Transcribe into other languages. Gemini will handle the transcription and translation in a single step.
	2. Summarize audio transcripts. For example, convert a podcast into a tutorial, or a meeting recording into actions.
	3. Legal Proceeding Analysis. Extract case citations, dates, and other details from a legal debate.
	4. Medical Consultation Summary. Extract treatments, medications, details of next visit, etc. from a medical consultation.

	Here’s how to use Gemini to transcribe audio files.

	1. Get a [Gemini API key](https://aistudio.google.com/app/apikey) from Google AI Studio.
	2. Set the `GEMINI_API_KEY` environment variable to the API key.
	3. Set the `MP3_FILE` environment variable to the path of the MP3 file you want to transcribe.
	4. Run this code:

	```
	# base64 --wrap=0 is GNU coreutils syntax; on macOS use: base64 -i "$MP3_FILE"
	curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-002:streamGenerateContent?alt=sse" \
	  -H "X-Goog-API-Key: $GEMINI_API_KEY" \
	  -H "Content-Type: application/json" \
	  -d "$(cat << EOF
	{
	  "contents": [
	    {
	      "role": "user",
	      "parts": [
	        {
	          "inline_data": {
	            "mime_type": "audio/mp3",
	            "data": "$(base64 --wrap=0 "$MP3_FILE")"
	          }
	        },
	        {"text": "Transcribe this"}
	      ]
	    }
	  ]
	}
	EOF
	)"
	```
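
The same request can be sketched in Python using only the standard library. This example makes a few assumptions beyond the curl command above: it calls the non-streaming `generateContent` method instead of `streamGenerateContent`, it reads the reply from `candidates[0].content.parts`, and the function names `build_payload`, `extract_text`, and `transcribe` are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

MODEL = "gemini-1.5-flash-002"  # same model as the curl example
URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent"


def build_payload(audio: bytes, prompt: str = "Transcribe this") -> Dict[str, Any]:
    """Build the request body: base64-encoded audio plus a text instruction."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inline_data": {
                    "mime_type": "audio/mp3",
                    "data": base64.b64encode(audio).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }]
    }


def extract_text(response: Dict[str, Any]) -> str:
    """Join the text parts of the first candidate in the response."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(p.get("text", "") for p in parts)


def transcribe(mp3_path: str, api_key: str) -> str:
    """Send the MP3 file to Gemini and return the generated text."""
    with open(mp3_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Goog-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

With the environment variables from the steps above, `transcribe(os.environ["MP3_FILE"], os.environ["GEMINI_API_KEY"])` (after `import os`) returns the transcript as a string. Changing the `prompt` argument is how you would ask for a translation or a summary instead.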
