primepake committed on
Commit
5d43438
·
1 Parent(s): 2202b7c

Update dac-vae before subtree push

dac-vae/README.md ADDED
@@ -0,0 +1,203 @@
+ # Descript Audio Codec - VAE Variant (.dac-vae): High-Fidelity Audio Compression with a Variational Autoencoder
+
+ This repository contains training and inference scripts for the Descript Audio Codec VAE variant (.dac-vae), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the residual vector quantizer of the RVQGAN architecture with a Variational Autoencoder bottleneck, with the aim of keeping the original's high-quality audio compression.
+
+ ## Overview
+
+ Building on the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), **DAC-VAE** adapts the architecture to use a Variational Autoencoder latent instead of Residual Vector Quantization (RVQ).
+
+ ### Key Differences from Original DAC
+
+ 👉 **DAC-VAE** compresses **24 kHz audio** (instead of 44.1 kHz) into a continuous latent representation using a VAE bottleneck
+
+ ### 🔄 Architecture Changes:
+
+ - Replaces the RVQGAN's discrete codebooks with a VAE's continuous latent space
+ - Maintains the same encoder-decoder backbone architecture from the original DAC
+ - Swaps the vector quantization layers for the VAE reparameterization trick
+ - Preserves the multi-scale discriminator design for adversarial training
+
+ ### 🎯 Inherited Features from Original DAC:
+
+ - High-fidelity neural audio compression
+ - Universal model for all audio domains (speech, environment, music, etc.)
+ - Efficient encoding and decoding
+ - State-of-the-art reconstruction quality
+
+ ## Why VAE Instead of RVQGAN?
+
+ This fork explores an alternative approach to the original DAC's discrete coding strategy:
+
+ | Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) |
+ |-----------|----------------------|---------------------|
+ | Latent Space | Discrete (VQ codes) | Continuous (Gaussian) |
+ | Sampling Rate | 44.1 kHz | 24 kHz |
+ | Quantization | Residual VQ with codebooks | VAE reparameterization |
+ | Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial |
+ | Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) |
+
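The "VAE reparameterization" and "Reconstruction + KL + Adversarial" rows of the table can be illustrated with a small sketch. It is not taken from this repository: it uses plain Python lists instead of tensors, assumes a diagonal Gaussian posterior with a standard normal prior, and the function names and loss weights (`kl_weight`, `adv_weight`) are hypothetical values chosen only for illustration.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def generator_loss(recon, kl, adv, kl_weight=1e-2, adv_weight=1.0):
    """Weighted sum of the three terms; the default weights are illustrative only."""
    return recon + kl_weight * kl + adv_weight * adv


rng = random.Random(0)
mu, logvar = [0.3, -1.2, 0.0], [-0.5, 0.1, 0.0]  # hypothetical encoder outputs
z = reparameterize(mu, logvar, rng)              # continuous latent passed to the decoder
kl = kl_to_standard_normal(mu, logvar)
print(len(z), round(kl, 4), generator_loss(recon=0.8, kl=kl, adv=0.1))
```

Because the noise is sampled separately and then scaled and shifted, gradients can flow through `mu` and `logvar`. A larger `kl_weight` pulls the latent closer to the prior, which is one reading of the "KL-controlled" compression entry in the table.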
+ ## Installation
+
+ ```bash
+ # Clone this repository
+ git clone https://github.com/primepake/dac-vae.git
+ cd dac-vae
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ ### Inference
+
+ ```bash
+ python3 inference.py \
+     --checkpoint checkpoint.pt \
+     --config configs/configx2.yml \
+     --mode encode_decode \
+     --input test.wav \
+     --output reconstruction.wav
+ ```
+
+ ### Training
+
+ ```bash
+ # Single GPU training
+ python3 train.py --run_id factorx2
+
+ # Multi-GPU training (4 GPUs)
+ torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2
+ ```
+
+ ### Python API
+
+ ```python
+ import torch
+ import torchaudio
+ from dac_vae import DACVAE
+
+ # Load model
+ model = DACVAE.load_from_checkpoint("checkpoint.pt")
+ model.eval()
+
+ # Load audio and resample to 24 kHz if needed
+ audio, sr = torchaudio.load("input.wav")
+ if sr != 24000:
+     audio = torchaudio.functional.resample(audio, sr, 24000)
+
+ # Encode to latent
+ with torch.no_grad():
+     latent = model.encode(audio)
+
+ # Decode back to audio
+ with torch.no_grad():
+     reconstructed = model.decode(latent)
+
+ # Save output
+ torchaudio.save("output.wav", reconstructed, 24000)
+ ```
+
+ ## Model Architecture
+
+ DAC-VAE preserves most of the original DAC architecture with key modifications:
+
+ - **Encoder**: Same convolutional architecture as original DAC
+ - **Latent Layer**: VAE reparameterization (replaces residual vector quantization)
+ - **Decoder**: Identical transposed convolution architecture
+ - **Discriminator**: Same multi-scale discriminator for perceptual quality
+
+ ### Configuration
+
+ The model can be configured through YAML files in the `configs/` directory:
+
+ - `configx2.yml`: Default 24 kHz configuration with 2x downsampling factor
+ - Adjust latent dimensions, KL weight, and other hyperparameters as needed
+
+ ## Training Details
+
+ ### Dataset Preparation
+
+ Prepare your audio dataset with the following structure:
+ ```
+ dataset/
+ ├── train/
+ │   ├── audio1.wav
+ │   ├── audio2.wav
+ │   └── ...
+ └── val/
+     ├── audio1.wav
+     ├── audio2.wav
+     └── ...
+ ```
+
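A file list for a layout like this can be generated with a few lines of standard-library Python. This is a minimal sketch, not part of the repository: the helper name `build_file_list` is hypothetical, and the output format (one path per line, relative to the dataset root) is an assumption about what a `--file_list` argument such as the one in `extract.sh` expects.

```python
import tempfile
from pathlib import Path


def build_file_list(root, split, out_name="files.txt"):
    """Write the .wav files under root/split to root/out_name, one relative path per line."""
    root = Path(root)
    wavs = sorted(p.relative_to(root).as_posix() for p in (root / split).rglob("*.wav"))
    (root / out_name).write_text("\n".join(wavs) + ("\n" if wavs else ""))
    return wavs


# Demo on a throwaway directory that mimics the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("train/audio1.wav", "train/audio2.wav", "val/audio1.wav"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    print(build_file_list(tmp, "train"))
```

Sorting the paths keeps the list stable between runs, which makes it easier to split the same list across several GPUs or to resume an interrupted job.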
+ ### Training Command
+
+ ```bash
+ torchrun --nnodes=1 --nproc_per_node=4 train.py \
+     --run_id my_experiment \
+     --config configs/configx2.yml \
+     --dataset_path /path/to/dataset \
+     --num_epochs 200 \
+     --batch_size 32
+ ```
+
+ ## Evaluation
+
+ Evaluate model performance using:
+
+ ```bash
+ python3 evaluate.py \
+     --checkpoint checkpoint.pt \
+     --test_dir /path/to/test/audio \
+     --metrics pesq stoi
+ ```
+
+ ## Pretrained Models
+
+ | Model | Sample Rate | Config | Download |
+ |-------|-------------|--------|----------|
+ | dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](#) |
+ | dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](#) |
+
+
+ ## Citation
+
+ If you use DAC-VAE, please cite both this work and the original DAC paper:
+
+ ```bibtex
+ @misc{dacvae2024,
+   title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec},
+   author={primepake},
+   year={2024},
+   url={https://github.com/primepake/dac-vae}
+ }
+
+ @article{kumar2023high,
+   title={High-Fidelity Audio Compression with Improved RVQGAN},
+   author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan},
+   journal={arXiv preprint arXiv:2306.06546},
+   year={2023}
+ }
+ ```
+
+ ## License
+
+ This project is released under the same license as the original Descript Audio Codec. See the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgments
+
+ This work is built directly on top of the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible.
+
+ ## Related Links
+
+ - [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec)
+ - [Original DAC Paper](https://arxiv.org/abs/2306.06546)
+ - [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd)
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## Contact
+
+ For questions and feedback, please open an issue in this repository.
dac-vae/extract.sh CHANGED
@@ -1,10 +1,10 @@
  python extract_dac_latents.py \
      --root_path /data/dataset \
-     --file_list files.txt \
+     --file_list /data/learnable/speech/files.txt \
      --output_dir /data/dataset/metadata \
-     --checkpoint ./checkpoint.pt \
-     --config ./config.yml \
-     --num_gpus 1 \
+     --checkpoint ./ckpts/300k_20250829_044827/checkpoint.pt \
+     --config ./configs/configx2.yml \
+     --num_gpus 4 \
      --num_decode_samples 10
 
dac-vae/extract_dac_latents.py CHANGED
@@ -168,9 +168,9 @@ def extract_latents_gpu(rank, world_size, args, audio_files):
      result = process_single_audio(audio_path, model, sample_rate, device)

      if result['success']:
-         # Create output path: a/b/c/d.wav -> a/b/c/d_latent.pt
+         # Create output path: a/b/c/d.wav -> a/b/c/d_latent2x.pt
          base_path = os.path.splitext(audio_path)[0]  # Remove extension
-         output_path = f"{base_path}_latent.pt"
+         output_path = f"{base_path}_latent2x.pt"

          # Create directory if it doesn't exist
          os.makedirs(os.path.dirname(output_path), exist_ok=True)
@@ -405,7 +405,13 @@ def main():
      filtered_files = []
      for audio_path in audio_files:
          base_path = os.path.splitext(audio_path)[0]
-         latent_path = f"{base_path}_latent.pt"
+         latent_path = f"{base_path}_latent2x.pt"
+
+         old_latent_path = f"{base_path}_latent.pt"
+         if os.path.exists(old_latent_path):
+             os.remove(old_latent_path)
+             print(f"Removed old latent file: {old_latent_path}")
+
          if not os.path.exists(latent_path):
              filtered_files.append(audio_path)
      print(f"Skipping {len(audio_files) - len(filtered_files)} existing files")
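The path mapping and resume check in this diff can be tried in isolation with a small sketch. The helper names `latent_path_for` and `pending_files` are hypothetical and do not appear in `extract_dac_latents.py`. Unlike the committed code, this version only reads the filesystem and does not delete legacy `_latent.pt` files.

```python
import os
import tempfile


def latent_path_for(audio_path, suffix="_latent2x.pt"):
    """Map a/b/c/d.wav -> a/b/c/d_latent2x.pt by swapping the extension for the suffix."""
    return os.path.splitext(audio_path)[0] + suffix


def pending_files(audio_files, suffix="_latent2x.pt"):
    """Keep only the audio files whose latent file does not exist yet."""
    return [p for p in audio_files if not os.path.exists(latent_path_for(p, suffix))]


with tempfile.TemporaryDirectory() as tmp:
    done = os.path.join(tmp, "a.wav")
    todo = os.path.join(tmp, "b.wav")
    open(latent_path_for(done), "wb").close()  # pretend a.wav was already processed
    print([os.path.basename(p) for p in pending_files([done, todo])])  # prints ['b.wav']
```

Keeping the filter free of side effects makes it safe to run as a dry check before a long extraction job. Removing the old `_latent.pt` files, as the diff does, could then be a separate, explicit step.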
dac-vae/requirements.txt ADDED
@@ -0,0 +1,44 @@
+ --extra-index-url https://download.pytorch.org/whl/cu121
+ --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ # https://github.com/microsoft/onnxruntime/issues/21684
+ conformer==0.3.2
+ deepspeed==0.15.1; sys_platform == 'linux'
+ diffusers==0.29.0
+ fastapi==0.115.6
+ fastapi-cli==0.0.4
+ gdown==5.1.0
+ gradio==5.4.0
+ grpcio==1.57.0
+ grpcio-tools==1.57.0
+ hydra-core==1.3.2
+ HyperPyYAML==1.2.2
+ inflect==7.3.1
+ librosa==0.10.2
+ lightning==2.2.4
+ matplotlib==3.7.5
+ modelscope==1.20.0
+ networkx==3.1
+ omegaconf==2.3.0
+ onnx==1.16.0
+ onnxruntime-gpu==1.18.0; sys_platform == 'linux'
+ onnxruntime==1.18.0; sys_platform == 'darwin' or sys_platform == 'win32'
+ openai-whisper==20231117
+ protobuf==4.25
+ pyarrow==18.1.0
+ pydantic==2.7.0
+ pyworld==0.3.4
+ rich==13.7.1
+ soundfile==0.12.1
+ tensorboard==2.14.0
+ tensorrt-cu12==10.0.1; sys_platform == 'linux'
+ tensorrt-cu12-bindings==10.0.1; sys_platform == 'linux'
+ tensorrt-cu12-libs==10.0.1; sys_platform == 'linux'
+ torch==2.3.1
+ torchaudio==2.3.1
+ transformers==4.40.1
+ uvicorn==0.30.0
+ wetext==0.0.4
+ wget==3.2
+ flatten_dict
+ julius
+ importlib_resources
+ randomname