Spaces:

mnhatdaous
/

learnable-speech


primepake committed on Aug 30

Commit

5d43438

1 Parent(s): 2202b7c

Update dac-vae before subtree push

Files changed (4)

dac-vae/README.md +203 -0
dac-vae/extract.sh +4 -4
dac-vae/extract_dac_latents.py +9 -3
dac-vae/requirements.txt +44 -0

dac-vae/README.md ADDED

	@@ -0,0 +1,203 @@

+# Descript Audio Codec - VAE Variant (.dac-vae): High-Fidelity Audio Compression with Variational Autoencoder
+This repository contains training and inference scripts for the Descript Audio Codec VAE variant (.dac-vae), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the residual vector quantizer of the RVQGAN architecture with a Variational Autoencoder (VAE) bottleneck while keeping the encoder-decoder backbone and the adversarial training setup.
+## Overview
+Building on the foundation of the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), **DAC-VAE** adapts the architecture to use Variational Autoencoder principles instead of Residual Vector Quantization (RVQ).
+### Key Differences from Original DAC
+👉 **DAC-VAE** compresses **24 kHz audio** (instead of 44.1 kHz) using a continuous latent representation through VAE architecture
+### 🔄 Architecture Changes:
+- Replaces the RVQGAN's discrete codebook with a VAE's continuous latent space
+- Maintains the same encoder-decoder backbone architecture from the original DAC
+- Swaps vector quantization layers for the VAE reparameterization trick
+- Preserves the multi-scale discriminator design for adversarial training
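A minimal sketch of the reparameterization trick named in the list above, written in plain Python so that it runs without dependencies. The function name `reparameterize` and the example values are illustrative assumptions; the repository's actual latent layer is not part of this commit and may differ.

```python
import math
import random


def reparameterize(mean, log_var, rng=random):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1), element-wise.

    `mean` and `log_var` stand in for the two outputs of a VAE encoder
    (hypothetical names); sigma is recovered as exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


# Hypothetical encoder outputs for one 4-dimensional latent frame.
mean = [0.2, -1.0, 0.0, 3.5]
log_var = [-2.0, -2.0, 0.0, -30.0]

rng = random.Random(0)  # fixed seed so the sample is repeatable
z = reparameterize(mean, log_var, rng)
print(z)
```

In a tensor framework the same expression keeps `mean` and `log_var` differentiable, because all randomness is isolated in the noise term.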
+### 🎯 Inherited Features from Original DAC:
+- High-fidelity neural audio compression
+- Universal model for all audio domains (speech, environment, music, etc.)
+- Efficient encoding and decoding
+- State-of-the-art reconstruction quality
+## Why VAE Instead of RVQGAN?
+This fork explores an alternative approach to the original DAC's discrete coding strategy:
+| Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) |
+|-----------|----------------------|---------------------|
+| Latent Space | Discrete (VQ codes) | Continuous (Gaussian) |
+| Sampling Rate | 44.1 kHz | 24 kHz |
+| Quantization | Residual VQ with codebooks | VAE reparameterization |
+| Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial |
+| Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) |
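The "Reconstruction + KL + Adversarial" objective in the table can be illustrated with the closed-form KL divergence between a diagonal Gaussian and a standard normal. This is a sketch under assumptions: the function names and the weights `kl_weight` and `adv_weight` are hypothetical, and the repository's real loss terms and weighting live in its training code and configs, which this commit does not show.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mean, log_var)
    )


def vae_generator_loss(recon_loss, kl, adv_loss, kl_weight=1e-2, adv_weight=1.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return recon_loss + kl_weight * kl + adv_weight * adv_loss


# A posterior equal to the prior costs nothing.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean away from zero is penalized.
print(kl_to_standard_normal([1.0], [0.0]))
```

A larger KL weight pulls the posterior toward the prior and limits how much information the latent can carry, which is one way to read the "KL-controlled" entry in the Compression row.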
+## Installation
+```bash
+# Clone this repository
+git clone https://github.com/primepake/dac-vae.git
+cd dac-vae
+# Install dependencies
+pip install -r requirements.txt
+```
+## Usage
+### Inference
+```bash
+python3 inference.py \
+    --checkpoint checkpoint.pt \
+    --config configs/configx2.yml \
+    --mode encode_decode \
+    --input test.wav \
+    --output reconstruction.wav
+```
+### Training
+```bash
+# Single GPU training
+python3 train.py --run_id factorx2
+# Multi-GPU training (4 GPUs)
+torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2
+```
+### Python API
+```python
+import torch
+import torchaudio
+from dac_vae import DACVAE
+# Load model
+model = DACVAE.load_from_checkpoint("checkpoint.pt")
+model.eval()
+# Load audio and resample to 24 kHz if needed
+audio, sr = torchaudio.load("input.wav")
+if sr != 24000:
+    audio = torchaudio.functional.resample(audio, sr, 24000)
+# Encode to latent
+with torch.no_grad():
+    latent = model.encode(audio)
+# Decode back to audio
+with torch.no_grad():
+    reconstructed = model.decode(latent)
+# Save output
+torchaudio.save("output.wav", reconstructed, 24000)
+```
+## Model Architecture
+DAC-VAE preserves most of the original DAC architecture with key modifications:
+- **Encoder**: Same convolutional architecture as original DAC
+- **Latent Layer**: VAE reparameterization (replaces residual vector quantization)
+- **Decoder**: Identical transposed convolution architecture
+- **Discriminator**: Same multi-scale discriminator for perceptual quality
+### Configuration
+The model can be configured through YAML files in the `configs/` directory:
+- `configx2.yml`: Default 24 kHz configuration with 2x downsampling factor
+- Adjust latent dimensions, KL weight, and other hyperparameters as needed
+## Training Details
+### Dataset Preparation
+Prepare your audio dataset with the following structure:
+```
+dataset/
+├── train/
+│   ├── audio1.wav
+│   ├── audio2.wav
+│   └── ...
+└── val/
+    ├── audio1.wav
+    ├── audio2.wav
+    └── ...
+```
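Before launching a long run it can help to confirm that the directory matches the layout above. The helper below is a hypothetical sketch, not part of the repository: `list_split_files` is an invented name, and it assumes `.wav` files sit directly inside `train/` and `val/` as the tree shows. How `train.py` actually discovers files is not documented in this commit.

```python
import tempfile
from pathlib import Path


def list_split_files(dataset_root, splits=("train", "val"), pattern="*.wav"):
    """Return {split: sorted audio paths}; raise if a split is missing or empty."""
    root = Path(dataset_root)
    found = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        files = sorted(split_dir.glob(pattern))
        if not files:
            raise ValueError(f"no files matching {pattern} in {split_dir}")
        found[split] = files
    return found


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"train": ["audio1.wav", "audio2.wav"], "val": ["audio1.wav"]}
    for split, names in layout.items():
        (Path(tmp) / split).mkdir()
        for name in names:
            (Path(tmp) / split / name).touch()
    counts = {split: len(files) for split, files in list_split_files(tmp).items()}

print(counts)
```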
+### Training Command
+```bash
+torchrun --nnodes=1 --nproc_per_node=4 train.py \
+    --run_id my_experiment \
+    --config configs/configx2.yml \
+    --dataset_path /path/to/dataset \
+    --num_epochs 200 \
+    --batch_size 32
+```
+## Evaluation
+Evaluate model performance using:
+```bash
+python3 evaluate.py \
+    --checkpoint checkpoint.pt \
+    --test_dir /path/to/test/audio \
+    --metrics pesq stoi
+```
+## Pretrained Models
+| Model | Sample Rate | Config | Download |
+|-------|-------------|---------|----------|
+| dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](#) |
+| dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](#) |
+## Citation
+If you use DAC-VAE, please cite both this work and the original DAC paper:
+```bibtex
+@misc{dacvae2024,
+  title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec},
+  author={primepake},
+  year={2024},
+  url={https://github.com/primepake/dac-vae}
+}
+@article{kumar2023high,
+  title={High-Fidelity Audio Compression with Improved RVQGAN},
+  author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan},
+  journal={arXiv preprint arXiv:2306.06546},
+  year={2023}
+}
+```
+## License
+This project maintains the same license as the original Descript Audio Codec. See the [LICENSE](LICENSE) file for details.
+## Acknowledgments
+This work is built directly on top of the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible.
+## Related Links
+- [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec)
+- [Original DAC Paper](https://arxiv.org/abs/2306.06546)
+- [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd)
+## Contributing
+Contributions are welcome! Please feel free to submit a Pull Request.
+## Contact
+For questions and feedback, please open an issue in this repository.

dac-vae/extract.sh CHANGED

@@ -1,10 +1,10 @@
 python extract_dac_latents.py \
     --root_path /data/dataset \
-    --file_list files.txt \
+    --file_list /data/learnable/speech/files.txt \
     --output_dir /data/dataset/metadata \
-    --checkpoint ./checkpoint.pt \
-    --config ./config.yml \
-    --num_gpus 1 \
+    --checkpoint ./ckpts/300k_20250829_044827/checkpoint.pt \
+    --config ./configs/configx2.yml \
+    --num_gpus 4 \
     --num_decode_samples 10

dac-vae/extract_dac_latents.py CHANGED

@@ -168,9 +168,9 @@ def extract_latents_gpu(rank, world_size, args, audio_files):
         result = process_single_audio(audio_path, model, sample_rate, device)
         if result['success']:
-            # Create output path: a/b/c/d.wav -> a/b/c/d_latent.pt
+            # Create output path: a/b/c/d.wav -> a/b/c/d_latent2x.pt
             base_path = os.path.splitext(audio_path)[0]  # Remove extension
-            output_path = f"{base_path}_latent.pt"
+            output_path = f"{base_path}_latent2x.pt"
             # Create directory if it doesn't exist
             os.makedirs(os.path.dirname(output_path), exist_ok=True)
@@ -405,7 +405,13 @@ def main():
         filtered_files = []
         for audio_path in audio_files:
             base_path = os.path.splitext(audio_path)[0]
-            latent_path = f"{base_path}_latent.pt"
+            latent_path = f"{base_path}_latent2x.pt"
+            old_latent_path = f"{base_path}_latent.pt"
+            if os.path.exists(old_latent_path):
+                os.remove(old_latent_path)
+                print(f"Removed old latent file: {old_latent_path}")
             if not os.path.exists(latent_path):
                 filtered_files.append(audio_path)
         print(f"Skipping {len(audio_files) - len(filtered_files)} existing files")
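The second hunk deletes old `_latent.pt` files inside the loop that decides which files to skip, so a resume check also mutates the dataset directory. One possible alternative, sketched below, derives both paths in a single helper and only reports stale files, leaving deletion to the caller. The names `latent_path_for` and `plan_extraction` are hypothetical and do not exist in `extract_dac_latents.py`.

```python
import os

LATENT_SUFFIX = "_latent2x.pt"    # suffix written after this commit
OLD_LATENT_SUFFIX = "_latent.pt"  # suffix written before this commit


def latent_path_for(audio_path, suffix=LATENT_SUFFIX):
    """Map a/b/c/d.wav to a/b/c/d_latent2x.pt (or another suffix)."""
    return os.path.splitext(audio_path)[0] + suffix


def plan_extraction(audio_files, exists=os.path.exists):
    """Split work into audio files still to process and stale old-format latents.

    Nothing is deleted here; `exists` is injectable so the logic can be
    checked without touching the filesystem.
    """
    pending, stale = [], []
    for audio_path in audio_files:
        if not exists(latent_path_for(audio_path)):
            pending.append(audio_path)
        old_path = latent_path_for(audio_path, OLD_LATENT_SUFFIX)
        if exists(old_path):
            stale.append(old_path)
    return pending, stale


# Dry run against a fake filesystem: x already has a new latent, y only an old one.
existing = {"a/x_latent2x.pt", "a/y_latent.pt"}
pending, stale = plan_extraction(["a/x.wav", "a/y.wav"], exists=existing.__contains__)
print(pending, stale)
```

Keeping the suffix in one constant also avoids the two hunks drifting apart if the naming changes again.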

dac-vae/requirements.txt ADDED

	@@ -0,0 +1,44 @@

+--extra-index-url https://download.pytorch.org/whl/cu121
+--extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ # https://github.com/microsoft/onnxruntime/issues/21684
+conformer==0.3.2
+deepspeed==0.15.1; sys_platform == 'linux'
+diffusers==0.29.0
+fastapi==0.115.6
+fastapi-cli==0.0.4
+gdown==5.1.0
+gradio==5.4.0
+grpcio==1.57.0
+grpcio-tools==1.57.0
+hydra-core==1.3.2
+HyperPyYAML==1.2.2
+inflect==7.3.1
+librosa==0.10.2
+lightning==2.2.4
+matplotlib==3.7.5
+modelscope==1.20.0
+networkx==3.1
+omegaconf==2.3.0
+onnx==1.16.0
+onnxruntime-gpu==1.18.0; sys_platform == 'linux'
+onnxruntime==1.18.0; sys_platform == 'darwin' or sys_platform == 'win32'
+openai-whisper==20231117
+protobuf==4.25
+pyarrow==18.1.0
+pydantic==2.7.0
+pyworld==0.3.4
+rich==13.7.1
+soundfile==0.12.1
+tensorboard==2.14.0
+tensorrt-cu12==10.0.1; sys_platform == 'linux'
+tensorrt-cu12-bindings==10.0.1; sys_platform == 'linux'
+tensorrt-cu12-libs==10.0.1; sys_platform == 'linux'
+torch==2.3.1
+torchaudio==2.3.1
+transformers==4.40.1
+uvicorn==0.30.0
+wetext==0.0.4
+wget==3.2
+flatten_dict
+julius
+importlib_resources
+randomname