---

title: "Extracting Audio and Transcripts"
original_url: "https://tds.s-anand.net/#/extracting-audio-and-transcripts?id=media-tools-yt-dlp"
downloaded_at: "2025-06-08T23:25:44.497461"
---


[Extracting Audio and Transcripts](#/extracting-audio-and-transcripts?id=extracting-audio-and-transcripts)
----------------------------------------------------------------------------------------------------------

[Media Processing: FFmpeg](#/extracting-audio-and-transcripts?id=media-processing-ffmpeg)
-----------------------------------------------------------------------------------------

[FFmpeg](https://ffmpeg.org/) is the standard command-line tool for processing video and audio files. Data scientists working with media files commonly use it for:

* Extracting audio/video for machine learning
* Converting formats for web deployment
* Creating visualizations and presentations
* Processing large media datasets

Basic Operations:

```
# Basic conversion
ffmpeg -i input.mp4 output.avi

# Extract audio
ffmpeg -i input.mp4 -vn output.mp3

# Convert format without re-encoding
ffmpeg -i input.mkv -c copy output.mp4

# High quality encoding (crf: 0-51, lower is better)
ffmpeg -i input.mp4 -preset slower -crf 18 output.mp4
```

Common Data Science Tasks:

```
# Extract frames for computer vision
ffmpeg -i input.mp4 -vf "fps=1" frames_%04d.png    # 1 frame per second
ffmpeg -i input.mp4 -vf "select='eq(n,0)'" -vframes 1 first_frame.jpg

# Create video from image sequence
ffmpeg -r 1/5 -i img%03d.png -c:v libx264 -vf fps=25 output.mp4

# Extract audio for speech recognition
ffmpeg -i input.mp4 -ar 16000 -ac 1 audio.wav      # 16kHz mono

# Trim video/audio for training data
ffmpeg -ss 00:01:00 -i input.mp4 -t 00:00:30 -c copy clip.mp4
```

Processing Multiple Files:

```
# Concatenate videos (first create files.txt with list of files)
echo "file 'input1.mp4'
file 'input2.mp4'" > files.txt
ffmpeg -f concat -i files.txt -c copy output.mp4

# Batch process with shell loop (create the output directory first)
mkdir -p audio
for f in *.mp4; do
    ffmpeg -i "$f" -vn "audio/${f%.mp4}.wav"
done
```
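The shell loop above can also be driven from Python, which makes it easier to skip files that were already converted. This is a minimal sketch rather than a definitive implementation: the function names `build_extract_cmd` and `extract_all` are hypothetical, it assumes `ffmpeg` is on the `PATH`, and it reuses the 16 kHz mono settings from the speech recognition example.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(src: Path, out_dir: Path) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz mono WAV audio for `src`."""
    dst = out_dir / (src.stem + ".wav")
    return [
        "ffmpeg", "-hide_banner",
        "-i", str(src),
        "-vn",            # drop the video stream
        "-ar", "16000",   # resample to 16 kHz
        "-ac", "1",       # downmix to mono
        str(dst),
    ]


def extract_all(src_dir: Path, out_dir: Path) -> None:
    """Convert every .mp4 in `src_dir`, skipping outputs that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob("*.mp4")):
        cmd = build_extract_cmd(src, out_dir)
        if Path(cmd[-1]).exists():
            continue
        # check=True raises CalledProcessError if ffmpeg exits non-zero
        subprocess.run(cmd, check=True)
```

Calling `extract_all(Path("."), Path("audio"))` would mirror the loop above, with the difference that a failing file raises an exception instead of being silently passed over.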

Data Analysis Features:

```
# Get media file information
ffprobe -v quiet -print_format json -show_format -show_streams input.mp4

# Display frame metadata
ffprobe -v quiet -print_format json -show_frames input.mp4

# Generate video thumbnails
ffmpeg -i input.mp4 -vf "thumbnail" -frames:v 1 thumb.jpg
```
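Because `ffprobe` can emit JSON, its report is easy to consume from a script. The sketch below runs the first command above and pulls out a few fields. The helper names `probe` and `summarize` are hypothetical, and the selection of fields is only an illustration; ffprobe reports many more, and some may be absent for a given file.

```python
import json
import subprocess


def probe(path: str) -> dict:
    """Run ffprobe and return its JSON report as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(report: dict) -> dict:
    """Pick out a few commonly needed fields from an ffprobe report."""
    streams = report.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    duration = report.get("format", {}).get("duration")
    return {
        # ffprobe reports the duration as a string, e.g. "12.5"
        "duration_s": float(duration) if duration is not None else None,
        "video_codec": video.get("codec_name") if video else None,
        "resolution": (video["width"], video["height"])
        if video and "width" in video and "height" in video else None,
        "audio_codec": audio.get("codec_name") if audio else None,
        # the sample rate is also reported as a string
        "sample_rate": int(audio["sample_rate"])
        if audio and "sample_rate" in audio else None,
    }
```

For example, `summarize(probe("input.mp4"))` returns a small dict that can go straight into a dataset manifest.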

Watch this introduction to FFmpeg (12 min):

[![FFmpeg in 12 Minutes](https://i.ytimg.com/vi_webp/MPV7JXTWPWI/sddefault.webp)](https://youtu.be/MPV7JXTWPWI)

Tools:

* [ffmpeg.lav.io](https://ffmpeg.lav.io/): Interactive command builder
* [FFmpeg Explorer](https://ffmpeg.guide/): Visual FFmpeg command generator
* [FFmpeg Buddy](https://evanhahn.github.io/ffmpeg-buddy/): Simple command generator

Tips:

1. Use `-c copy` when possible to avoid re-encoding
2. Monitor progress with `-progress pipe:1`
3. Use `-hide_banner` to reduce output verbosity
4. Test commands with small clips first
5. Use hardware acceleration when available (`-hwaccel auto`)
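
The `-progress pipe:1` tip writes machine-readable `key=value` lines to standard output, and each report ends with a `progress` key whose value is `continue` or `end`. The following sketch groups those lines into one dict per report. The function name `parse_progress` is hypothetical, and any keys other than `progress` (for example `frame` or `out_time`) should be treated as illustrative, since the exact set can vary between FFmpeg versions.

```python
from typing import Iterable, Iterator


def parse_progress(lines: Iterable[str]) -> Iterator[dict]:
    """Group ffmpeg `-progress` key=value lines into one dict per report."""
    block: dict = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # ignore blank or malformed lines
        block[key] = value
        if key == "progress":
            # "progress" is the last key of each report ("continue" or "end")
            yield block
            block = {}
```

In practice the lines would come from the standard output of an `ffmpeg` process started with `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)`, so that each report can update a progress bar or a log.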

Error Handling:

```
# Validate file before processing
ffprobe input.mp4 2>&1 | grep "Invalid"

# Stop and exit on the first error (global options go before the input)
ffmpeg -xerror -i input.mp4 output.mp4

# Get detailed error information (decode to a null output so nothing is written)
ffmpeg -v error -i input.mp4 -f null - 2>&1 | grep -A2 "Error"
```



[Media tools: yt-dlp](#/extracting-audio-and-transcripts?id=media-tools-yt-dlp)
-------------------------------------------------------------------------------

[yt-dlp](https://github.com/yt-dlp/yt-dlp) is a feature-rich command-line tool for downloading audio/video from thousands of sites. It’s particularly useful for extracting audio and transcripts from videos.

Install using your package manager:

```
# macOS
brew install yt-dlp

# Linux
curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o ~/.local/bin/yt-dlp
chmod a+rx ~/.local/bin/yt-dlp

# Windows
winget install yt-dlp
```

Common operations for extracting audio and transcripts:

```
# Download audio only at lowest quality suitable for speech
yt-dlp -f "ba[abr<50]/worstaudio" \
       --extract-audio \
       --audio-format mp3 \
       --audio-quality 32k \
       "https://www.youtube.com/watch?v=VIDEO_ID"

# Download auto-generated subtitles
yt-dlp --write-auto-sub \
       --skip-download \
       --sub-format "srt" \
       "https://www.youtube.com/watch?v=VIDEO_ID"

# Download both audio and subtitles with custom output template
yt-dlp -f "ba[abr<50]/worstaudio" \
       --extract-audio \
       --audio-format mp3 \
       --audio-quality 32k \
       --write-auto-sub \
       --sub-format "srt" \
       -o "%(title)s.%(ext)s" \
       "https://www.youtube.com/watch?v=VIDEO_ID"

# Download entire playlist's audio
yt-dlp -f "ba[abr<50]/worstaudio" \
       --extract-audio \
       --audio-format mp3 \
       --audio-quality 32k \
       -o "%(playlist_index)s-%(title)s.%(ext)s" \
       "https://www.youtube.com/playlist?list=PLAYLIST_ID"
```

For Python integration:

```
# /// script
# requires-python = ">=3.9"
# dependencies = ["yt-dlp"]
# ///

import yt_dlp


def download_audio(url: str) -> None:
    """Download audio at speech-optimized quality."""
    ydl_opts = {
        'format': 'ba[abr<50]/worstaudio',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '32'
        }]
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])


# Example usage
download_audio('https://www.youtube.com/watch?v=VIDEO_ID')
```
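The subtitle-only command shown earlier has a Python equivalent as well. The sketch below maps the command-line flags to the corresponding `YoutubeDL` option keys. The helper names `subtitle_options` and `download_subtitles` are hypothetical, and the `lang` parameter is an addition for illustration; whether a given subtitle format is actually available depends on the site.

```python
def subtitle_options(lang: str = "en") -> dict:
    """Options for fetching auto-generated subtitles without the media file."""
    return {
        "skip_download": True,        # same as --skip-download
        "writeautomaticsub": True,    # same as --write-auto-sub
        "subtitleslangs": [lang],     # which subtitle languages to fetch
        "subtitlesformat": "srt",     # same as --sub-format "srt"
        "outtmpl": "%(title)s.%(ext)s",
    }


def download_subtitles(url: str, lang: str = "en") -> None:
    """Download only the auto-generated subtitles for `url`."""
    # Imported here so the options helper works without yt-dlp installed
    import yt_dlp

    with yt_dlp.YoutubeDL(subtitle_options(lang)) as ydl:
        ydl.download([url])
```

Keeping the options in a separate function makes them easy to inspect or merge with the audio options from `download_audio`.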

Tools:

* [ffmpeg](https://ffmpeg.org/): Required for audio extraction and conversion
* [whisper](https://github.com/openai/whisper): Can be used with yt-dlp for speech-to-text
* [gallery-dl](https://github.com/mikf/gallery-dl): Alternative for image-focused sites

Note: Always respect copyright and terms of service when downloading content.

[Whisper transcription](#/extracting-audio-and-transcripts?id=whisper-transcription)
------------------------------------------------------------------------------------

[Faster Whisper](https://github.com/SYSTRAN/faster-whisper) is a highly optimized implementation of OpenAI’s [Whisper model](https://github.com/openai/whisper), offering up to 4x faster transcription while using less memory.

You can install it via:

* `pip install faster-whisper`
* [Download Windows Standalone](https://github.com/Purfview/whisper-standalone-win/releases)

Here’s a basic usage example, using the `faster-whisper-xxl` command from the standalone build:

```
faster-whisper-xxl "video.mp4" --model medium --language en
```

Here’s my recommendation for transcribing videos. This saves the output in JSON as well as SRT format in the source directory.

```
faster-whisper-xxl --print_progress --output_dir source --batch_recursive \
                   --check_files --standard --output_format json srt \
                   --model medium --language en "$FILE"
```

* `--model`: The OpenAI Whisper model to use. You can choose from:
  + `tiny`: Fastest but least accurate
  + `base`: Good for simple audio
  + `small`: Balanced speed/accuracy
  + `medium`: Recommended default
  + `large-v3`: Most accurate but slowest
* `--output_format`: The output format to use. You can pick multiple formats from:
  + `json`: Has the most detailed information including timing, text, quality, etc.
  + `srt`: A popular subtitle format. You can use this in YouTube, for example.
  + `vtt`: A modern subtitle format.
  + `txt`: Just the text transcript
* `--output_dir`: The directory to save the output files. `source` indicates the source directory, i.e. where the input `$FILE` is
* `--language`: The language of the input file. If you omit it, Whisper auto-detects the language from the first 30 seconds of audio. Specifying it skips that step and speeds up transcription.

Run `faster-whisper-xxl --help` for more options.
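
If you installed the library with `pip install faster-whisper`, you can also transcribe from Python. The sketch below uses the library's `WhisperModel` class and renders the resulting segments as SRT. The helper names `srt_timestamp`, `to_srt`, and `transcribe` are hypothetical, and the SRT rendering is a simplified illustration, not the formatter the command-line tool uses.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) triples as numbered SRT cues."""
    cues = []
    for i, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(cues)


def transcribe(path: str, model_size: str = "medium", language: str = "en") -> str:
    """Transcribe `path` with faster-whisper and return SRT text."""
    # Imported lazily so the helpers above work without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(path, language=language)
    return to_srt((s.start, s.end, s.text) for s in segments)
```

The model is downloaded on first use, so the first call to `transcribe` can take noticeably longer than later ones.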

[Gemini transcription](#/extracting-audio-and-transcripts?id=gemini-transcription)
----------------------------------------------------------------------------------

The [Gemini](https://gemini.google.com/) models from Google are notable in two ways:

1. They have a *huge* input context window. Gemini 2.0 Flash can accept 1M tokens, for example.
2. They can handle audio input.

This allows us to use Gemini to transcribe audio files.

LLMs are not good at transcribing audio *faithfully*: they tend to correct errors and drift from what was actually said. But they are intelligent, and that enables a few powerful workflows. Here are some examples:

1. **Transcribe into other languages**. Gemini will handle the transcription and translation in a single step.
2. **Summarize audio transcripts**. For example, convert a podcast into a tutorial, or a meeting recording into actions.
3. **Legal Proceeding Analysis**. Extract case citations, dates, and other details from a legal debate.
4. **Medical Consultation Summary**. Extract treatments, medications, details of next visit, etc. from a medical consultation.

Here’s how to use Gemini to transcribe audio files.

1. Get a [Gemini API key](https://aistudio.google.com/app/apikey) from Google AI Studio.
2. Set the `GEMINI_API_KEY` environment variable to the API key.
3. Set the `MP3_FILE` environment variable to the path of the MP3 file you want to transcribe.
4. Run this code:

   ```
   curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-002:streamGenerateContent?alt=sse" \
     -H "X-Goog-API-Key: $GEMINI_API_KEY" \
     -H "Content-Type: application/json" \
     -d "$(cat << EOF
   {
     "contents": [
       {
         "role": "user",
         "parts": [
           {
             "inline_data": {
               "mime_type": "audio/mp3",
               "data": "$(base64 --wrap=0 "$MP3_FILE")"
             }
           },
           {"text": "Transcribe this"}
         ]
       }
     ]
   }
   EOF
   )"
   ```
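
The same request can be sent from Python using only the standard library. This sketch builds the JSON body from the curl example and posts it. Two assumptions are worth flagging: it calls the non-streaming `generateContent` method instead of `streamGenerateContent`, and the helper names `build_payload` and `transcribe` are hypothetical.

```python
import base64
import json
import os
import urllib.request

# Same model as the curl example, but the non-streaming method (an assumption)
API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-flash-002:generateContent"
)


def build_payload(audio: bytes, prompt: str = "Transcribe this") -> dict:
    """Build the request body: inline base64 audio plus a text prompt."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inline_data": {
                    "mime_type": "audio/mp3",
                    "data": base64.b64encode(audio).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }]
    }


def transcribe(mp3_path: str) -> dict:
    """POST the audio to Gemini and return the parsed JSON response."""
    with open(mp3_path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "X-Goog-API-Key": os.environ["GEMINI_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Changing the `prompt` argument is how the workflows listed earlier would be expressed, for example asking for a translated transcript or a summary instead of a plain transcription.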
