ZJUCQR committed · Commit e2bca25 · 1 Parent(s): e46f0d2

Add hf_AC audio generation demo


- Implement Gradio interface for video-to-audio generation
- Add Chinese language support
- Include hf_AC model integration
- Add requirements.txt and packages.txt for HF Space deployment
- Add example prompts and usage tips
- Include deployment documentation

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
Files changed (50)
  1. .gitignore +87 -0
  2. DEPLOYMENT.md +103 -0
  3. README.md +43 -8
  4. app.py +348 -58
  5. hf_AC/.gitignore +224 -0
  6. hf_AC/README.md +25 -0
  7. hf_AC/config/__init__.py +0 -0
  8. hf_AC/config/base_config.yaml +62 -0
  9. hf_AC/config/data/base.yaml +69 -0
  10. hf_AC/config/data/base2.yaml +69 -0
  11. hf_AC/config/eval_config.yaml +17 -0
  12. hf_AC/config/eval_data/base.yaml +22 -0
  13. hf_AC/config/hydra/job_logging/custom-eval.yaml +32 -0
  14. hf_AC/config/hydra/job_logging/custom-no-rank.yaml +32 -0
  15. hf_AC/config/hydra/job_logging/custom-simplest.yaml +26 -0
  16. hf_AC/config/hydra/job_logging/custom.yaml +33 -0
  17. hf_AC/config/train_config.yaml +41 -0
  18. hf_AC/config/train_config_2node.yaml +41 -0
  19. hf_AC/config/train_config_2node2.yaml +41 -0
  20. hf_AC/inf.py +181 -0
  21. hf_AC/mmaudio/__init__.py +0 -0
  22. hf_AC/mmaudio/data/__init__.py +0 -0
  23. hf_AC/mmaudio/data/av_utils.py +162 -0
  24. hf_AC/mmaudio/data/data_setup.py +177 -0
  25. hf_AC/mmaudio/data/eval/__init__.py +0 -0
  26. hf_AC/mmaudio/data/eval/audiocaps.py +39 -0
  27. hf_AC/mmaudio/data/eval/moviegen.py +131 -0
  28. hf_AC/mmaudio/data/eval/video_dataset.py +231 -0
  29. hf_AC/mmaudio/data/extracted_audio.py +97 -0
  30. hf_AC/mmaudio/data/extracted_vgg.py +109 -0
  31. hf_AC/mmaudio/data/extraction/__init__.py +0 -0
  32. hf_AC/mmaudio/data/extraction/vgg_sound.py +208 -0
  33. hf_AC/mmaudio/data/extraction/wav_dataset.py +135 -0
  34. hf_AC/mmaudio/data/mm_dataset.py +45 -0
  35. hf_AC/mmaudio/data/utils.py +148 -0
  36. hf_AC/mmaudio/eval_utils.py +249 -0
  37. hf_AC/mmaudio/ext/__init__.py +1 -0
  38. hf_AC/mmaudio/ext/autoencoder/__init__.py +1 -0
  39. hf_AC/mmaudio/ext/autoencoder/autoencoder.py +52 -0
  40. hf_AC/mmaudio/ext/autoencoder/edm2_utils.py +168 -0
  41. hf_AC/mmaudio/ext/autoencoder/vae.py +369 -0
  42. hf_AC/mmaudio/ext/autoencoder/vae_modules.py +117 -0
  43. hf_AC/mmaudio/ext/bigvgan/LICENSE +21 -0
  44. hf_AC/mmaudio/ext/bigvgan/__init__.py +1 -0
  45. hf_AC/mmaudio/ext/bigvgan/activations.py +120 -0
  46. hf_AC/mmaudio/ext/bigvgan/alias_free_torch/__init__.py +6 -0
  47. hf_AC/mmaudio/ext/bigvgan/alias_free_torch/act.py +28 -0
  48. hf_AC/mmaudio/ext/bigvgan/alias_free_torch/filter.py +95 -0
  49. hf_AC/mmaudio/ext/bigvgan/alias_free_torch/resample.py +49 -0
  50. hf_AC/mmaudio/ext/bigvgan/bigvgan.py +32 -0
.gitignore ADDED
@@ -0,0 +1,87 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+ MANIFEST
23
+
24
+ # PyTorch
25
+ *.pth
26
+ *.pt
27
+ *.ckpt
28
+ weights/
29
+ checkpoints/
30
+
31
+ # Gradio
32
+ gradio_cached_examples/
33
+ flagged/
34
+
35
+ # Temporary files
36
+ *.tmp
37
+ *.temp
38
+ /tmp/
39
+ temp/
40
+
41
+ # Logs
42
+ *.log
43
+ logs/
44
+
45
+ # Environment
46
+ .env
47
+ .venv
48
+ env/
49
+ venv/
50
+ ENV/
51
+ env.bak/
52
+ venv.bak/
53
+
54
+ # IDE
55
+ .vscode/
56
+ .idea/
57
+ *.swp
58
+ *.swo
59
+ *~
60
+
61
+ # OS
62
+ .DS_Store
63
+ .DS_Store?
64
+ ._*
65
+ .Spotlight-V100
66
+ .Trashes
67
+ ehthumbs.db
68
+ Thumbs.db
69
+
70
+ # Model files (too large for git)
71
+ *.safetensors
72
+ model.pth
73
+ *.bin
74
+
75
+ # Audio/Video files
76
+ *.wav
77
+ *.mp3
78
+ *.mp4
79
+ *.avi
80
+ *.mov
81
+
82
+ # Jupyter
83
+ .ipynb_checkpoints/
84
+
85
+ # Cache
86
+ .cache/
87
+ *.cache
DEPLOYMENT.md ADDED
@@ -0,0 +1,103 @@
1
+ # 🚀 Hugging Face Space 部署指南
2
+
3
+ ## 📁 文件结构
4
+
5
+ 确保你的HF Space包含以下文件:
6
+
7
+ ```
8
+ Acfoley/
9
+ ├── README.md # Space配置和说明
10
+ ├── app.py # 主应用文件
11
+ ├── requirements.txt # Python依赖
12
+ ├── packages.txt # 系统依赖
13
+ ├── hf_AC/ # hf_AC模型代码
14
+ └── .gitignore # Git忽略文件
15
+ ```
16
+
17
+ ## 🔧 部署步骤
18
+
19
+ ### 1. 上传代码到HF Space
20
+
21
+ 将所有文件上传到你的Hugging Face Space仓库:
22
+
23
+ ```bash
24
+ git add .
25
+ git commit -m "Add hf_AC audio generation demo"
26
+ git push
27
+ ```
28
+
29
+ ### 2. 模型权重下载
30
+
31
+ 模型会自动从以下位置下载:
32
+ - 主模型: `https://huggingface.co/FF2416/AC-Foley/resolve/main/model.pth`
33
+ - 其他组件会根据需要自动下载
34
+
35
+ ### 3. 环境配置
36
+
37
+ HF Space会自动:
38
+ - 安装`requirements.txt`中的Python包
39
+ - 安装`packages.txt`中的系统依赖
40
+ - 运行`app.py`启动Gradio界面
41
+
42
+ ## 📋 README.md 配置
43
+
44
+ 确保README.md顶部包含正确的YAML配置:
45
+
46
+ ```yaml
47
+ ---
48
+ title: hf_AC Audio Foley Generator
49
+ emoji: 🎵
50
+ colorFrom: blue
51
+ colorTo: green
52
+ sdk: gradio
53
+ sdk_version: 5.42.0
54
+ app_file: app.py
55
+ pinned: false
56
+ license: mit
57
+ ---
58
+ ```
59
+
60
+ ## 🔍 故障排除
61
+
62
+ ### 常见问题
63
+
64
+ 1. **模型下载失败**
65
+ - 检查网络连接
66
+ - 确认模型URL可访问
67
+
68
+ 2. **依赖安装失败**
69
+ - 检查`requirements.txt`格式
70
+ - 确认包版本兼容性
71
+
72
+ 3. **内存不足**
73
+ - HF Space免费版有内存限制
74
+ - 考虑优化模型或升级到付费版
75
+
76
+ ### 调试方法
77
+
78
+ 1. 查看Space日志
79
+ 2. 运行`test_setup.py`验证环境
80
+ 3. 检查模型文件是否正确下载
81
+
82
+ ## 🎯 使用说明
83
+
84
+ 部署成功后,用户可以:
85
+
86
+ 1. 上传MP4视频文件
87
+ 2. 输入音频描述文字
88
+ 3. 调整生成参数
89
+ 4. 点击生成按钮
90
+ 5. 下载生成的音频
91
+
92
+ ## 📊 性能优化
93
+
94
+ - 首次运行需要下载模型(约几GB)
95
+ - 生成时间取决于视频长度和硬件
96
+ - 建议视频时长控制在15秒以内
97
+
98
+ ## 🔗 相关链接
99
+
100
+ - [hf_AC GitHub](https://github.com/ff2416/hf_AC)
101
+ - [模型权重](https://huggingface.co/FF2416/AC-Foley)
102
+ - [Gradio文档](https://gradio.app/docs/)
103
+ - [HF Spaces文档](https://huggingface.co/docs/hub/spaces)
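The DEPLOYMENT.md above notes that the main checkpoint is fetched from `FF2416/AC-Foley` the first time the Space runs. As a minimal sketch (not the Space's actual download path, which goes through `ModelConfig.download_if_needed()`), the same file can be pre-fetched with `huggingface_hub`; the `weights/` target directory is an assumption.

```python
# Hypothetical pre-fetch of the checkpoint referenced in DEPLOYMENT.md.
# The app itself relies on ModelConfig.download_if_needed(); this is only a sketch.
from pathlib import Path
from huggingface_hub import hf_hub_download

def prefetch_weights(local_dir: str = "weights") -> Path:
    # Downloads model.pth from FF2416/AC-Foley; huggingface_hub caches the file,
    # so repeated calls are cheap.
    path = hf_hub_download(repo_id="FF2416/AC-Foley",
                           filename="model.pth",
                           local_dir=local_dir)
    return Path(path)

if __name__ == "__main__":
    print(f"Checkpoint available at: {prefetch_weights()}")
```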
README.md CHANGED
@@ -1,15 +1,50 @@
1
  ---
2
- title: Acfoley
3
- emoji: 💬
4
- colorFrom: yellow
5
- colorTo: purple
6
  sdk: gradio
7
  sdk_version: 5.42.0
8
  app_file: app.py
9
  pinned: false
10
- hf_oauth: true
11
- hf_oauth_scopes:
12
- - inference-api
13
  ---
14
 
15
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
1
  ---
2
+ title: hf_AC Audio Foley Generator
3
+ emoji: 🎵
4
+ colorFrom: blue
5
+ colorTo: green
6
  sdk: gradio
7
  sdk_version: 5.42.0
8
  app_file: app.py
9
  pinned: false
10
+ license: mit
 
 
11
  ---
12
 
13
+ # 🎵 hf_AC Audio Foley Generator
14
+
15
+ A Gradio demo for generating synchronized audio from videos using the hf_AC (Audio-Conditioned Foley) model. This application allows you to upload a video and generate matching audio content based on text descriptions.
16
+
17
+ ## Features
18
+
19
+ - **Video-to-Audio Generation**: Upload a video and generate synchronized audio
20
+ - **Text-Guided Generation**: Use text prompts to describe the desired audio
21
+ - **Customizable Parameters**: Adjust duration, CFG strength, and other generation parameters
22
+ - **Real-time Processing**: Generate audio in real-time with GPU acceleration
23
+
24
+ ## How to Use
25
+
26
+ 1. **Load Model**: The model will automatically load when you start the app
27
+ 2. **Upload Video**: Choose a video file (MP4 format recommended)
28
+ 3. **Describe Audio**: Write a text description of the audio you want to generate
29
+ 4. **Generate**: Click the generate button and wait for the audio to be created
30
+ 5. **Download**: Listen to and download the generated audio
31
+
32
+ ## Example Prompts
33
+
34
+ - "Crackling fireplace with gentle flames"
35
+ - "Ocean waves crashing on rocky shore"
36
+ - "Busy city street with car horns and chatter"
37
+ - "Forest ambience with bird songs and rustling leaves"
38
+ - "Keyboard typing in a quiet office"
39
+
40
+ ## Model Information
41
+
42
+ This demo uses the hf_AC model, which is designed for audio-visual synchronization and generation. The model can generate high-quality audio that matches the visual content and text descriptions.
43
+
44
+ ## Technical Details
45
+
46
+ - **Framework**: PyTorch, Gradio
47
+ - **Model**: hf_AC (Audio-Conditioned Foley)
48
+ - **Audio Format**: WAV, 44.1kHz
49
+ - **Video Support**: MP4, various resolutions
50
+ - **Processing**: GPU-accelerated when available
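The Technical Details above state that results come back as 44.1 kHz WAV files. A quick way to confirm that for a downloaded clip (the file name below is only a placeholder) is to inspect it with torchaudio:

```python
# Inspect a generated clip; "generated_audio.wav" is a placeholder file name.
import torchaudio

info = torchaudio.info("generated_audio.wav")
print(f"sample rate: {info.sample_rate} Hz")                     # expected: 44100
print(f"channels:    {info.num_channels}")
print(f"duration:    {info.num_frames / info.sample_rate:.2f} s")
```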
app.py CHANGED
@@ -1,70 +1,360 @@
1
  import gradio as gr
2
- from huggingface_hub import InferenceClient
3
 
 
 
 
 
 
4
 
5
- def respond(
6
- message,
7
- history: list[dict[str, str]],
8
- system_message,
9
- max_tokens,
10
- temperature,
11
- top_p,
12
- hf_token: gr.OAuthToken,
13
- ):
14
- """
15
- For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
16
- """
17
- client = InferenceClient(token=hf_token.token, model="openai/gpt-oss-20b")
18
 
19
- messages = [{"role": "system", "content": system_message}]
 
20
 
21
- messages.extend(history)
22
-
23
- messages.append({"role": "user", "content": message})
24
-
25
- response = ""
26
-
27
- for message in client.chat_completion(
28
- messages,
29
- max_tokens=max_tokens,
30
- stream=True,
31
- temperature=temperature,
32
- top_p=top_p,
33
- ):
34
- choices = message.choices
35
- token = ""
36
- if len(choices) and choices[0].delta.content:
37
- token = choices[0].delta.content
38
 
39
- response += token
40
- yield response
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
42
 
43
- """
44
- For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
45
- """
46
- chatbot = gr.ChatInterface(
47
- respond,
48
- type="messages",
49
- additional_inputs=[
50
- gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
51
- gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
52
- gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
53
- gr.Slider(
54
- minimum=0.1,
55
- maximum=1.0,
56
- value=0.95,
57
- step=0.05,
58
- label="Top-p (nucleus sampling)",
59
- ),
60
- ],
61
- )
62
 
63
- with gr.Blocks() as demo:
64
- with gr.Sidebar():
65
- gr.LoginButton()
66
- chatbot.render()
67
68
 
69
  if __name__ == "__main__":
70
- demo.launch()
 
 
1
  import gradio as gr
2
+ import torch
3
+ import torchaudio
4
+ import logging
5
+ import tempfile
6
+ import os
7
+ import sys
8
+ from pathlib import Path
9
+ import numpy as np
10
+ from typing import Optional, Tuple
11
+ import time
12
+ import traceback
13
 
14
+ # Add hf_AC to path
15
+ current_dir = Path(__file__).parent
16
+ hf_ac_path = current_dir / "hf_AC"
17
+ if hf_ac_path.exists():
18
+ sys.path.insert(0, str(hf_ac_path))
19
 
20
+ # Configuration for HF Space
21
+ EXAMPLE_PROMPTS = [
22
+ "Crackling fireplace with gentle flames",
23
+ "Ocean waves crashing on rocky shore",
24
+ "Forest ambience with bird songs",
25
+ "Keyboard typing sounds",
26
+ "Footsteps on wooden floor",
27
+ "Rain on metal roof"
28
+ ]
 
 
 
 
29
 
30
+ USAGE_TIPS = """
31
+ ### 💡 使用技巧
32
 
33
+ 1. **视频质量**: 使用清晰、光线良好的视频
34
+ 2. **提示词**: 具体描述想要的音频类型
35
+ 3. **时长**: 建议1-30秒效果最佳
36
+ 4. **CFG强度**: 数值越高越贴合提示词,但可能降低质量
37
+ """
38
 
39
+ # Import hf_AC modules with error handling
40
+ try:
41
+ from hf_AC.mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video,
42
+ setup_eval_logging)
43
+ from hf_AC.mmaudio.model.flow_matching import FlowMatching
44
+ from hf_AC.mmaudio.model.networks import MMAudio, get_my_mmaudio
45
+ from hf_AC.mmaudio.model.utils.features_utils import FeaturesUtils
46
+
47
+ # Setup logging
48
+ setup_eval_logging()
49
+ log = logging.getLogger()
50
+ HF_AC_AVAILABLE = True
51
+ except ImportError as e:
52
+ print(f"Warning: hf_AC modules not available: {e}")
53
+ log = logging.getLogger()
54
+ HF_AC_AVAILABLE = False
55
 
56
+ class AudioFoleyModel:
57
+ def __init__(self):
58
+ self.device = 'cpu'
59
+ if torch.cuda.is_available():
60
+ self.device = 'cuda'
61
+ elif torch.backends.mps.is_available():
62
+ self.device = 'mps'
63
+
64
+ self.dtype = torch.bfloat16
65
+ self.model = None
66
+ self.net = None
67
+ self.fm = None
68
+ self.feature_utils = None
69
+
70
+ def load_model(self, variant='large_44k', model_path=None):
71
+ """Load the hf_AC model"""
72
+ try:
73
+ if not HF_AC_AVAILABLE:
74
+ return "❌ hf_AC modules not available. Please install the hf_AC package."
75
+
76
+ if variant not in all_model_cfg:
77
+ available_variants = list(all_model_cfg.keys()) if all_model_cfg else []
78
+ return f"❌ Unknown model variant: {variant}. Available: {available_variants}"
79
+
80
+ log.info(f"Loading model variant: {variant}")
81
+ self.model: ModelConfig = all_model_cfg[variant]
82
+
83
+ # Download model components if needed
84
+ try:
85
+ self.model.download_if_needed()
86
+ except Exception as e:
87
+ log.warning(f"Could not download model components: {e}")
88
+
89
+ # Set custom model path if provided
90
+ if model_path and os.path.exists(model_path):
91
+ self.model.model_path = Path(model_path)
92
+ log.info(f"Using custom model path: {model_path}")
93
+
94
+ # Load network
95
+ self.net: MMAudio = get_my_mmaudio(self.model.model_name).to(self.device, self.dtype).eval()
96
+
97
+ # Load weights
98
+ if hasattr(self.model, 'model_path') and self.model.model_path and Path(self.model.model_path).exists():
99
+ try:
100
+ weights = torch.load(self.model.model_path, map_location=self.device, weights_only=True)
101
+ self.net.load_weights(weights['weights'])
102
+ log.info(f'✅ Loaded weights from {self.model.model_path}')
103
+ except Exception as e:
104
+ log.error(f"Failed to load weights: {e}")
105
+ return f"❌ Failed to load model weights: {e}"
106
+ else:
107
+ log.warning('⚠️ No model weights found, using default initialization')
108
+ return "⚠️ Model loaded but no weights found. Download model.pth from HuggingFace."
109
+
110
+ # Initialize flow matching
111
+ self.fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=25)
112
+
113
+ # Initialize feature utils
114
+ try:
115
+ self.feature_utils = FeaturesUtils(
116
+ tod_vae_ckpt=self.model.vae_path,
117
+ synchformer_ckpt=self.model.synchformer_ckpt,
118
+ enable_conditions=True,
119
+ mode=self.model.mode,
120
+ bigvgan_vocoder_ckpt=self.model.bigvgan_16k_path,
121
+ need_vae_encoder=True
122
+ )
123
+ self.feature_utils = self.feature_utils.to(self.device, self.dtype).eval()
124
+ except Exception as e:
125
+ log.error(f"Failed to initialize feature utils: {e}")
126
+ return f"❌ Failed to initialize feature utilities: {e}"
127
+
128
+ return "✅ Model loaded successfully!"
129
+
130
+ except Exception as e:
131
+ error_msg = f"❌ Error loading model: {str(e)}\n{traceback.format_exc()}"
132
+ log.error(error_msg)
133
+ return error_msg
134
+
135
+ def generate_audio(self, video_file, prompt: str, negative_prompt: str = "",
136
+ duration: float = 8.0, cfg_strength: float = 4.5,
137
+ seed: int = 42) -> Tuple[Optional[str], str]:
138
+ """Generate audio from video and text prompt"""
139
+ try:
140
+ # Validation checks
141
+ if not HF_AC_AVAILABLE:
142
+ return None, "❌ hf_AC modules not available."
143
+
144
+ if self.net is None or self.feature_utils is None:
145
+ return None, "❌ Model not loaded. Please load the model first."
146
+
147
+ if video_file is None:
148
+ return None, "❌ Please upload a video file."
149
+
150
+ if not prompt.strip():
151
+ return None, "❌ Please provide a text prompt describing the desired audio."
152
+
153
+ log.info(f'🎬 Processing video: {video_file}')
154
+ log.info(f'📝 Prompt: "{prompt}"')
155
+
156
+ # Load and process video
157
+ try:
158
+ video_path = Path(video_file)
159
+ if not video_path.exists():
160
+ return None, f"❌ Video file not found: {video_file}"
161
+
162
+ video_info = load_video(video_path, duration)
163
+ clip_frames = video_info.clip_frames
164
+ sync_frames = video_info.sync_frames
165
+ duration_sec = video_info.duration_sec
166
+
167
+ log.info(f'📹 Video loaded: {duration_sec:.2f}s duration')
168
+
169
+ except Exception as e:
170
+ return None, f"❌ Failed to load video: {str(e)}"
171
+
172
+ # Prepare frames
173
+ clip_frames = clip_frames.unsqueeze(0) if clip_frames is not None else None
174
+ sync_frames = sync_frames.unsqueeze(0)
175
+
176
+ # Update model sequence configuration
177
+ try:
178
+ self.model.seq_cfg.duration = duration_sec
179
+ self.model.seq_cfg.audio_num_sample = 89088 # Default for 44kHz
180
+ self.net.update_seq_lengths(
181
+ self.model.seq_cfg.latent_seq_len,
182
+ self.model.seq_cfg.clip_seq_len,
183
+ self.model.seq_cfg.sync_seq_len,
184
+ self.model.seq_cfg.audio_seq_len
185
+ )
186
+ except Exception as e:
187
+ return None, f"❌ Failed to configure model: {str(e)}"
188
+
189
+ # Generate audio
190
+ try:
191
+ log.info('🎵 Generating audio...')
192
+ start_time = time.time()
193
+
194
+ with torch.inference_mode():
195
+ audios = generate(
196
+ clip_frames,
197
+ sync_frames,
198
+ [prompt],
199
+ None, # No reference audio
200
+ negative_text=[negative_prompt] if negative_prompt.strip() else None,
201
+ feature_utils=self.feature_utils,
202
+ net=self.net,
203
+ fm=self.fm,
204
+ rng=torch.Generator(device=self.device).manual_seed(seed),
205
+ cfg_strength=cfg_strength
206
+ )
207
+
208
+ generation_time = time.time() - start_time
209
+ log.info(f'⏱️ Generation completed in {generation_time:.2f}s')
210
+
211
+ except Exception as e:
212
+ return None, f"❌ Audio generation failed: {str(e)}"
213
+
214
+ # Save generated audio
215
+ try:
216
+ audio = audios.float().cpu()[0]
217
+
218
+ # Create output filename with timestamp
219
+ timestamp = int(time.time())
220
+ output_filename = f"generated_audio_{timestamp}.wav"
221
+ permanent_path = f"/tmp/{output_filename}"
222
+
223
+ # Save audio file
224
+ torchaudio.save(permanent_path, audio, self.model.seq_cfg.sampling_rate)
225
+
226
+ # Verify file was created
227
+ if not os.path.exists(permanent_path):
228
+ return None, "❌ Failed to save audio file"
229
+
230
+ file_size = os.path.getsize(permanent_path) / 1024 # KB
231
+ success_msg = f"✅ Audio generated successfully!\n"
232
+ success_msg += f"📊 Duration: {duration_sec:.2f}s | "
233
+ success_msg += f"Size: {file_size:.1f}KB | "
234
+ success_msg += f"Time: {generation_time:.2f}s"
235
+
236
+ return permanent_path, success_msg
237
+
238
+ except Exception as e:
239
+ return None, f"❌ Failed to save audio: {str(e)}"
240
+
241
+ except Exception as e:
242
+ error_msg = f"❌ Unexpected error: {str(e)}\n{traceback.format_exc()}"
243
+ log.error(error_msg)
244
+ return None, error_msg
245
 
246
+ # Initialize model
247
+ audio_model = AudioFoleyModel()
248
 
249
+ def generate_audio_interface(video_file, prompt, duration, cfg_strength):
250
+ """Interface function for generating audio"""
251
+ # Use fixed seed for consistency in HF Space
252
+ seed = 42
253
+ negative_prompt = "" # Simplified interface
254
+
255
+ audio_path, message = audio_model.generate_audio(
256
+ video_file, prompt, negative_prompt, duration, cfg_strength, seed
257
+ )
258
+ return audio_path, message
259
 
260
+ # Create Gradio interface
261
+ with gr.Blocks(title="hf_AC Audio Foley Generator", theme=gr.themes.Soft()) as demo:
262
+ gr.Markdown("""
263
+ # 🎵 hf_AC Audio Foley Generator
264
+
265
+ 基于AI的视频音频生成工具。上传视频并提供文本描述,模型将生成匹配的音频内容。
266
+
267
+ **注意**: 首次使用时模型需要下载,请耐心等待。
268
+ """)
269
+
270
+ # Model status display
271
+ model_status = gr.Textbox(
272
+ label="模型状态",
273
+ value="正在初始化模型...",
274
+ interactive=False
275
+ )
276
+
277
+ with gr.Row():
278
+ with gr.Column():
279
+ video_input = gr.Video(
280
+ label="上传视频",
281
+ format="mp4"
282
+ )
283
+
284
+ prompt_input = gr.Textbox(
285
+ label="音频描述",
286
+ placeholder="描述你想要生成的音频 (例如: '脚步声', '鸟叫声', '汽车引擎声')",
287
+ lines=3
288
+ )
289
+
290
+ with gr.Row():
291
+ duration_slider = gr.Slider(
292
+ minimum=1.0,
293
+ maximum=15.0,
294
+ value=8.0,
295
+ step=0.5,
296
+ label="时长 (秒)"
297
+ )
298
+
299
+ cfg_strength_slider = gr.Slider(
300
+ minimum=1.0,
301
+ maximum=8.0,
302
+ value=4.5,
303
+ step=0.1,
304
+ label="CFG强度"
305
+ )
306
+
307
+ with gr.Column():
308
+ # Example prompts
309
+ gr.Markdown("### 🎯 示例提示词")
310
+ example_buttons = []
311
+ for prompt in EXAMPLE_PROMPTS[:6]:
312
+ btn = gr.Button(prompt, size="sm")
313
+ example_buttons.append(btn)
314
+ btn.click(
315
+ fn=lambda p=prompt: p,
316
+ outputs=prompt_input
317
+ )
318
+
319
+ generate_btn = gr.Button("🎵 生成音频", variant="primary", size="lg")
320
+
321
+ audio_output = gr.Audio(
322
+ label="生成的音频",
323
+ type="filepath"
324
+ )
325
+
326
+ generation_status = gr.Textbox(label="生成状态", interactive=False)
327
+
328
+ generate_btn.click(
329
+ fn=generate_audio_interface,
330
+ inputs=[
331
+ video_input, prompt_input, duration_slider, cfg_strength_slider
332
+ ],
333
+ outputs=[audio_output, generation_status]
334
+ )
335
+
336
+ with gr.Accordion("💡 使用说明", open=False):
337
+ gr.Markdown(USAGE_TIPS)
338
+
339
+ gr.Markdown("""
340
+ ### 🎬 更多示例提示词
341
+
342
+ - "壁炉中燃烧的柴火声"
343
+ - "海浪拍打岩石的声音"
344
+ - "繁忙街道上的���车和人声"
345
+ - "森林中的鸟叫和树叶声"
346
+ - "安静办公室里的键盘敲击声"
347
+ - "厨房里炒菜和切菜的声音"
348
+ - "雨滴打在金属屋顶上"
349
+ - "木地板上轻柔的脚步声"
350
+ """)
351
+
352
+ # Auto-load model on startup
353
+ demo.load(
354
+ fn=lambda: audio_model.load_model(),
355
+ outputs=[model_status]
356
+ )
357
 
358
  if __name__ == "__main__":
359
+ # HF Space will handle the server configuration
360
+ demo.launch()
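app.py above exposes the pipeline through the module-level `AudioFoleyModel` instance, so it can also be exercised without the Gradio UI. A minimal headless sketch, assuming the file is importable as `app` and that a local clip exists at the hypothetical path `example.mp4`:

```python
# Headless smoke test for the wrapper defined in app.py.
# "example.mp4" and the prompt are placeholders, not files shipped with the Space.
from app import audio_model

status = audio_model.load_model()   # downloads/loads weights and returns a status string
print(status)

audio_path, message = audio_model.generate_audio(
    video_file="example.mp4",
    prompt="Footsteps on wooden floor",
    duration=8.0,
    cfg_strength=4.5,
    seed=42,
)
print(message)
print("saved to:", audio_path)
```

Note that importing `app` builds the Gradio Blocks but does not start the server, since `demo.launch()` is guarded by `__main__`.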
hf_AC/.gitignore ADDED
@@ -0,0 +1,224 @@
1
+ run_*.sh
2
+ log/
3
+ saves
4
+ saves/
5
+ weights/
6
+ weights
7
+ output/
8
+ output
9
+ pretrained/
10
+ workspace
11
+ workspace/
12
+ ext_weights/
13
+ ext_weights
14
+ .checkpoints/
15
+ .vscode/
16
+ training/example_output/
17
+
18
+ # Byte-compiled / optimized / DLL files
19
+ __pycache__/
20
+ *.py[codz]
21
+ *$py.class
22
+
23
+ # C extensions
24
+ *.so
25
+
26
+ # Distribution / packaging
27
+ .Python
28
+ build/
29
+ develop-eggs/
30
+ dist/
31
+ downloads/
32
+ eggs/
33
+ .eggs/
34
+ lib/
35
+ lib64/
36
+ parts/
37
+ sdist/
38
+ var/
39
+ wheels/
40
+ share/python-wheels/
41
+ *.egg-info/
42
+ .installed.cfg
43
+ *.egg
44
+ MANIFEST
45
+
46
+ # PyInstaller
47
+ # Usually these files are written by a python script from a template
48
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
49
+ *.manifest
50
+ *.spec
51
+
52
+ # Installer logs
53
+ pip-log.txt
54
+ pip-delete-this-directory.txt
55
+
56
+ # Unit test / coverage reports
57
+ htmlcov/
58
+ .tox/
59
+ .nox/
60
+ .coverage
61
+ .coverage.*
62
+ .cache
63
+ nosetests.xml
64
+ coverage.xml
65
+ *.cover
66
+ *.py.cover
67
+ .hypothesis/
68
+ .pytest_cache/
69
+ cover/
70
+
71
+ # Translations
72
+ *.mo
73
+ *.pot
74
+
75
+ # Django stuff:
76
+ *.log
77
+ local_settings.py
78
+ db.sqlite3
79
+ db.sqlite3-journal
80
+
81
+ # Flask stuff:
82
+ instance/
83
+ .webassets-cache
84
+
85
+ # Scrapy stuff:
86
+ .scrapy
87
+
88
+ # Sphinx documentation
89
+ docs/_build/
90
+
91
+ # PyBuilder
92
+ .pybuilder/
93
+ target/
94
+
95
+ # Jupyter Notebook
96
+ .ipynb_checkpoints
97
+
98
+ # IPython
99
+ profile_default/
100
+ ipython_config.py
101
+
102
+ # pyenv
103
+ # For a library or package, you might want to ignore these files since the code is
104
+ # intended to run in multiple environments; otherwise, check them in:
105
+ # .python-version
106
+
107
+ # pipenv
108
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
109
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
110
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
111
+ # install all needed dependencies.
112
+ #Pipfile.lock
113
+
114
+ # UV
115
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
116
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
117
+ # commonly ignored for libraries.
118
+ #uv.lock
119
+
120
+ # poetry
121
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
122
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
123
+ # commonly ignored for libraries.
124
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
125
+ #poetry.lock
126
+ #poetry.toml
127
+
128
+ # pdm
129
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
130
+ # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
131
+ # https://pdm-project.org/en/latest/usage/project/#working-with-version-control
132
+ #pdm.lock
133
+ #pdm.toml
134
+ .pdm-python
135
+ .pdm-build/
136
+
137
+ # pixi
138
+ # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
139
+ #pixi.lock
140
+ # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
141
+ # in the .venv directory. It is recommended not to include this directory in version control.
142
+ .pixi
143
+
144
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
145
+ __pypackages__/
146
+
147
+ # Celery stuff
148
+ celerybeat-schedule
149
+ celerybeat.pid
150
+
151
+ # SageMath parsed files
152
+ *.sage.py
153
+
154
+ # Environments
155
+ .env
156
+ .envrc
157
+ .venv
158
+ env/
159
+ venv/
160
+ ENV/
161
+ env.bak/
162
+ venv.bak/
163
+
164
+ # Spyder project settings
165
+ .spyderproject
166
+ .spyproject
167
+
168
+ # Rope project settings
169
+ .ropeproject
170
+
171
+ # mkdocs documentation
172
+ /site
173
+
174
+ # mypy
175
+ .mypy_cache/
176
+ .dmypy.json
177
+ dmypy.json
178
+
179
+ # Pyre type checker
180
+ .pyre/
181
+
182
+ # pytype static type analyzer
183
+ .pytype/
184
+
185
+ # Cython debug symbols
186
+ cython_debug/
187
+
188
+ # PyCharm
189
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
190
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
191
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
192
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
193
+ #.idea/
194
+
195
+ # Abstra
196
+ # Abstra is an AI-powered process automation framework.
197
+ # Ignore directories containing user credentials, local state, and settings.
198
+ # Learn more at https://abstra.io/docs
199
+ .abstra/
200
+
201
+ # Visual Studio Code
202
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
203
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
204
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
205
+ # you could uncomment the following to ignore the entire vscode folder
206
+ # .vscode/
207
+
208
+ # Ruff stuff:
209
+ .ruff_cache/
210
+
211
+ # PyPI configuration file
212
+ .pypirc
213
+
214
+ # Cursor
215
+ # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
216
+ # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
217
+ # refer to https://docs.cursor.com/context/ignore-files
218
+ .cursorignore
219
+ .cursorindexingignore
220
+
221
+ # Marimo
222
+ marimo/_static/
223
+ marimo/_lsp/
224
+ __marimo__/
hf_AC/README.md ADDED
@@ -0,0 +1,25 @@
1
+ # hf_AC
2
+
3
+ ## Environment Setup
4
+ - Python 3.9+
5
+ - PyTorch **2.5.1+** and the matching torchvision/torchaudio builds (pick your CUDA version at https://pytorch.org/; pip install recommended)
6
+
7
+ ```bash
8
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
9
+ git clone https://github.com/ff2416/hf_AC.git
10
+ cd hf_AC
11
+ pip install -e .
12
+ ```
13
+ ## Model Installation
14
+ https://huggingface.co/FF2416/AC-Foley/blob/main/model.pth
15
+
16
+ ## Inference
17
+ ```bash
18
+ python inf.py \
19
+ --model_path <model path> \
20
+ --duration 8 \
21
+ --prompt <prompt> \
22
+ --video_dir <videos directory or video path> \
23
+ --audio_path <audio path> \
24
+ --output <output path>
25
+ ```
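The command template above leaves every path as a placeholder. A hedged driver for the same CLI, with hypothetical file names filled in (the optional reference `--audio_path` is omitted here):

```python
# Hypothetical driver for the inf.py command shown above; every path is a placeholder.
import subprocess

subprocess.run([
    "python", "inf.py",
    "--model_path", "weights/model.pth",   # the checkpoint from FF2416/AC-Foley
    "--duration", "8",
    "--prompt", "ocean waves crashing on a rocky shore",
    "--video_dir", "sample.mp4",           # inf.py also accepts a directory of .mp4 files
    "--output", "outputs/",
], check=True)
```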
hf_AC/config/__init__.py ADDED
File without changes
hf_AC/config/base_config.yaml ADDED
@@ -0,0 +1,62 @@
1
+ defaults:
2
+ - data: base
3
+ - eval_data: base
4
+ - override hydra/job_logging: custom-simplest
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: ${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ enable_email: False
13
+
14
+ model: large_16k
15
+
16
+ exp_id: default
17
+ debug: False
18
+ cudnn_benchmark: True
19
+ compile: True
20
+ amp: True
21
+ weights: null
22
+ checkpoint: null
23
+ seed: 14159265
24
+ num_workers: 10 # per-GPU
25
+ pin_memory: False # set to True if your system can handle it, i.e., have enough memory
26
+
27
+ # NOTE: This DOES NOT affect the model during inference in any way
28
+ # they are just for the dataloader to fill in the missing data in multi-modal loading
29
+ # to change the sequence length for the model, see networks.py
30
+ data_dim:
31
+ text_seq_len: 77
32
+ clip_dim: 1024
33
+ sync_dim: 768
34
+ text_dim: 1024
35
+
36
+ # ema configuration
37
+ ema:
38
+ enable: True
39
+ sigma_rels: [0.05, 0.1]
40
+ update_every: 1
41
+ checkpoint_every: 10_000
42
+ checkpoint_folder: ${hydra:run.dir}/ema_ckpts
43
+ default_output_sigma: 0.05
44
+
45
+
46
+ # sampling
47
+ sampling:
48
+ mean: 0.0
49
+ scale: 1.0
50
+ min_sigma: 0.0
51
+ method: euler
52
+ num_steps: 25
53
+
54
+ # classifier-free guidance
55
+ null_condition_probability: 0.1
56
+ cfg_strength: 4.5
57
+
58
+ # checkpoint paths to external modules
59
+ vae_16k_ckpt: ./ext_weights/v1-16.pth
60
+ vae_44k_ckpt: ./ext_weights/v1-44.pth
61
+ bigvgan_vocoder_ckpt: ./ext_weights/best_netG.pt
62
+ synchformer_ckpt: ./ext_weights/synchformer_state_dict.pth
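The `sampling` and classifier-free guidance entries above (Euler integration, 25 steps, `cfg_strength: 4.5`) describe a standard flow-matching sampler. The sketch below is a generic illustration of that loop, not the repository's `FlowMatching` class; `velocity_fn` stands in for whatever network call the real code makes.

```python
# Generic Euler sampler with classifier-free guidance, mirroring the config above.
# Illustrative only; the real logic lives in mmaudio/model/flow_matching.py.
import torch

def euler_cfg_sample(velocity_fn, x, num_steps: int = 25, cfg_strength: float = 4.5):
    """velocity_fn(x, t, conditional) -> predicted velocity (a stand-in signature)."""
    dt = 1.0 / num_steps
    t = torch.zeros(())
    for _ in range(num_steps):
        v_cond = velocity_fn(x, t, conditional=True)
        v_uncond = velocity_fn(x, t, conditional=False)
        # Guided velocity: push the update toward the conditional prediction.
        v = v_uncond + cfg_strength * (v_cond - v_uncond)
        x = x + dt * v
        t = t + dt
    return x
```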
hf_AC/config/data/base.yaml ADDED
@@ -0,0 +1,69 @@
1
+ VGGSound:
2
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
3
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-train.tsv
4
+ fps: 8
5
+ height: 384
6
+ width: 384
7
+ sample_duration_sec: 8.0
8
+
9
+ VGGSound_test:
10
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
11
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-test.tsv
12
+ fps: 8
13
+ height: 384
14
+ width: 384
15
+ sample_duration_sec: 8.0
16
+
17
+ VGGSound_val:
18
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
19
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-val.tsv
20
+ fps: 8
21
+ height: 384
22
+ width: 384
23
+ sample_duration_sec: 8.0
24
+
25
+ ExtractedVGG:
26
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-train.tsv
27
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-train
28
+
29
+ ExtractedVGG_test:
30
+ tag: test
31
+ gt_cache: ../data/eval-cache/vggsound-test
32
+ output_subdir: null
33
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-test.tsv
34
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-test
35
+
36
+ ExtractedVGG_val:
37
+ tag: val
38
+ gt_cache: /project/llmsvgen/pengjun/MMAudio_dev/training/val_cache
39
+ output_subdir: val
40
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-val.tsv
41
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg-clap/memmap/vgg-val
42
+
43
+ AudioCaps:
44
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/audiocaps_clap/memmap/audiocaps_clap.tsv
45
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/audiocaps_clap/memmap/audiocaps_clap
46
+
47
+ AudioSetSL:
48
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/audioset_clap/memmap/audioset_clap.tsv
49
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/audioset_clap/memmap/audioset_clap
50
+ # BBCSound:
51
+ # tsv: ../data/v1-16-memmap/bbcsound.tsv
52
+ # memmap_dir: ../data/v1-16-memmap/bbcsound
53
+
54
+ FreeSound:
55
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/freesound_clap/memmap/freesound_clap.tsv
56
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/freesound_clap/memmap/freesound_clap
57
+
58
+ # Clotho:
59
+ # tsv: ../data/v1-16-memmap/clotho.tsv
60
+ # memmap_dir: ../data/v1-16-memmap/clotho
61
+
62
+ # Example_video:
63
+ # tsv: ./training/example_output/memmap/vgg-example.tsv
64
+ # memmap_dir: ./training/example_output/memmap/vgg-example
65
+
66
+ # Example_audio:
67
+ # tsv: ./training/example_output/memmap/audio-example.tsv
68
+ # memmap_dir: ./training/example_output/memmap/audio-example
69
+
hf_AC/config/data/base2.yaml ADDED
@@ -0,0 +1,69 @@
1
+ VGGSound:
2
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
3
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-train.tsv
4
+ fps: 8
5
+ height: 384
6
+ width: 384
7
+ sample_duration_sec: 8.0
8
+
9
+ VGGSound_test:
10
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
11
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-test.tsv
12
+ fps: 8
13
+ height: 384
14
+ width: 384
15
+ sample_duration_sec: 8.0
16
+
17
+ VGGSound_val:
18
+ root: /project/llmsvgen/share/data_vggsound/dataset/scratch/shared/beegfs/hchen/train_data/VGGSound_final/video
19
+ subset_name: /project/llmsvgen/pengjun/MMAudio_dev/tsv/vgg-val.tsv
20
+ fps: 8
21
+ height: 384
22
+ width: 384
23
+ sample_duration_sec: 8.0
24
+
25
+ ExtractedVGG:
26
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-train.tsv
27
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-train
28
+
29
+ ExtractedVGG_test:
30
+ tag: test
31
+ gt_cache: ../data/eval-cache/vggsound-test
32
+ output_subdir: null
33
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-test.tsv
34
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-test
35
+
36
+ ExtractedVGG_val:
37
+ tag: val
38
+ gt_cache: /project/llmsvgen/pengjun/MMAudio_dev/training/val_cache
39
+ output_subdir: val
40
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-val.tsv
41
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/vgg/memmap/vgg-val
42
+
43
+ AudioCaps:
44
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/audiocaps/memmap/audiocaps.tsv
45
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/audiocaps/memmap/audiocaps
46
+
47
+ AudioSetSL:
48
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/audioset_sl/memmap/audioset_sl.tsv
49
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/audioset_sl/memmap/audioset_sl
50
+ # BBCSound:
51
+ # tsv: ../data/v1-16-memmap/bbcsound.tsv
52
+ # memmap_dir: ../data/v1-16-memmap/bbcsound
53
+
54
+ FreeSound:
55
+ tsv: /project/llmsvgen/pengjun/MMAudio_dev/training/freesound/memmap/freesound.tsv
56
+ memmap_dir: /project/llmsvgen/pengjun/MMAudio_dev/training/freesound/memmap/freesound
57
+
58
+ # Clotho:
59
+ # tsv: ../data/v1-16-memmap/clotho.tsv
60
+ # memmap_dir: ../data/v1-16-memmap/clotho
61
+
62
+ # Example_video:
63
+ # tsv: ./training/example_output/memmap/vgg-example.tsv
64
+ # memmap_dir: ./training/example_output/memmap/vgg-example
65
+
66
+ # Example_audio:
67
+ # tsv: ./training/example_output/memmap/audio-example.tsv
68
+ # memmap_dir: ./training/example_output/memmap/audio-example
69
+
hf_AC/config/eval_config.yaml ADDED
@@ -0,0 +1,17 @@
1
+ defaults:
2
+ - base_config
3
+ - override hydra/job_logging: custom-simplest
4
+ - _self_
5
+
6
+ hydra:
7
+ run:
8
+ dir: ./output/${exp_id}
9
+ output_subdir: eval-${now:%Y-%m-%d_%H-%M-%S}-hydra
10
+
11
+ exp_id: ${model}
12
+ dataset: audiocaps
13
+ duration_s: 8.0
14
+
15
+ # for inference, this is the per-GPU batch size
16
+ batch_size: 16
17
+ output_name: null
hf_AC/config/eval_data/base.yaml ADDED
@@ -0,0 +1,22 @@
1
+ # AudioCaps:
2
+ # audio_path: ../data/AudioCaps-test-audioldm-ver
3
+ # # a csv file, with a header row of 'name' and 'caption'
4
+ # # name should match the audio file name without extension
5
+ # # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_audioldm_data.csv
6
+ # csv_path: ../data/AudioCaps-test-audioldm-ver/data.csv
7
+
8
+ # AudioCaps_full:
9
+ # audio_path: ../data/AudioCaps-test-full-ver
10
+ # # a csv file, with a header row of 'name' and 'caption'
11
+ # # name should match the audio file name without extension
12
+ # # Can be downloaded here: https://github.com/hkchengrex/MMAudio/releases/download/v0.1/AudioCaps_full_data.csv
13
+ # csv_path: ../data/AudioCaps-test-full-ver/data.csv
14
+
15
+ # MovieGen:
16
+ # video_path: ../data/MovieGen/MovieGenAudioBenchSfx/video_with_audio
17
+ # jsonl_path: ../data/MovieGen/MovieGenAudioBenchSfx/metadata
18
+
19
+ VGGSound:
20
+ video_path: /project/llmsvgen/pengjun/MMAudio_dev/training/test_video
21
+ # from the officially released csv file
22
+ csv_path: /project/llmsvgen/share/data_vggsound/VGGSound/vggsound.csv
hf_AC/config/hydra/job_logging/custom-eval.yaml ADDED
@@ -0,0 +1,32 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ file:
23
+ class: logging.FileHandler
24
+ formatter: simple
25
+ # absolute file path
26
+ filename: ${hydra.runtime.output_dir}/eval-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
27
+ mode: w
28
+ root:
29
+ level: INFO
30
+ handlers: [console, file]
31
+
32
+ disable_existing_loggers: false
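The Hydra `job_logging` files above are ordinary `logging.config.dictConfig` payloads that Hydra applies when a job starts. For reference, a minimal sketch of the console portion applied by hand (assumes the `colorlog` package is installed):

```python
# Minimal dictConfig equivalent of the console handler in the logging YAML above.
import logging
import logging.config

logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "colorlog": {
            "()": "colorlog.ColoredFormatter",   # same factory syntax as the YAML
            "format": "[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "colorlog",
            "stream": "ext://sys.stdout",
        }
    },
    "root": {"level": "INFO", "handlers": ["console"]},
    "disable_existing_loggers": False,
})

logging.getLogger(__name__).info("colorized logging is configured")
```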
hf_AC/config/hydra/job_logging/custom-no-rank.yaml ADDED
@@ -0,0 +1,32 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ file:
23
+ class: logging.FileHandler
24
+ formatter: simple
25
+ # absolute file path
26
+ filename: ${hydra.runtime.output_dir}/${now:%Y-%m-%d_%H-%M-%S}-eval.log
27
+ mode: w
28
+ root:
29
+ level: INFO
30
+ handlers: [console, file]
31
+
32
+ disable_existing_loggers: false
hf_AC/config/hydra/job_logging/custom-simplest.yaml ADDED
@@ -0,0 +1,26 @@
1
+ # python logging configuration for tasks
2
+ version: 1
3
+ formatters:
4
+ simple:
5
+ format: '[%(asctime)s][%(levelname)s] - %(message)s'
6
+ datefmt: '%Y-%m-%d %H:%M:%S'
7
+ colorlog:
8
+ '()': 'colorlog.ColoredFormatter'
9
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
10
+ datefmt: '%Y-%m-%d %H:%M:%S'
11
+ log_colors:
12
+ DEBUG: purple
13
+ INFO: green
14
+ WARNING: yellow
15
+ ERROR: red
16
+ CRITICAL: red
17
+ handlers:
18
+ console:
19
+ class: logging.StreamHandler
20
+ formatter: colorlog
21
+ stream: ext://sys.stdout
22
+ root:
23
+ level: INFO
24
+ handlers: [console]
25
+
26
+ disable_existing_loggers: false
hf_AC/config/hydra/job_logging/custom.yaml ADDED
@@ -0,0 +1,33 @@
1
+ # @package hydra.job_logging
2
+ # python logging configuration for tasks
3
+ version: 1
4
+ formatters:
5
+ simple:
6
+ format: '[%(asctime)s][%(levelname)s][r${oc.env:LOCAL_RANK}] - %(message)s'
7
+ datefmt: '%Y-%m-%d %H:%M:%S'
8
+ colorlog:
9
+ '()': 'colorlog.ColoredFormatter'
10
+ format: '[%(cyan)s%(asctime)s%(reset)s][%(blue)sr${oc.env:LOCAL_RANK}%(reset)s][%(log_color)s%(levelname)s%(reset)s] - %(message)s'
11
+ datefmt: '%Y-%m-%d %H:%M:%S'
12
+ log_colors:
13
+ DEBUG: purple
14
+ INFO: green
15
+ WARNING: yellow
16
+ ERROR: red
17
+ CRITICAL: red
18
+ handlers:
19
+ console:
20
+ class: logging.StreamHandler
21
+ formatter: colorlog
22
+ stream: ext://sys.stdout
23
+ file:
24
+ class: logging.FileHandler
25
+ formatter: simple
26
+ # absolute file path
27
+ filename: ${hydra.runtime.output_dir}/train-${now:%Y-%m-%d_%H-%M-%S}-rank${oc.env:LOCAL_RANK}.log
28
+ mode: w
29
+ root:
30
+ level: INFO
31
+ handlers: [console, file]
32
+
33
+ disable_existing_loggers: false
hf_AC/config/train_config.yaml ADDED
@@ -0,0 +1,41 @@
1
+ defaults:
2
+ - base_config
3
+ - override data: base
4
+ - override hydra/job_logging: custom
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: train-${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ ema:
13
+ start: 0
14
+
15
+ mini_train: False
16
+ example_train: False
17
+ enable_grad_scaler: False
18
+ vgg_oversample_rate: 4
19
+
20
+ log_text_interval: 100
21
+ log_extra_interval: 20_000
22
+ val_interval: 10_000
23
+ eval_interval: 20_000
24
+ save_eval_interval: 40_000
25
+ save_weights_interval: 5_000
26
+ save_checkpoint_interval: 5_000
27
+ save_copy_iterations: [50000,100000,150000,200000,220000,240000,260000,280000,300000]
28
+
29
+ batch_size: 340
30
+ eval_batch_size: 32 # per-GPU
31
+
32
+ num_iterations: 300_000
33
+ learning_rate: 1.0e-4
34
+ linear_warmup_steps: 1_000
35
+
36
+ lr_schedule: step
37
+ lr_schedule_steps: [200_000, 240_000]
38
+ lr_schedule_gamma: 0.1
39
+
40
+ clip_grad_norm: 1.0
41
+ weight_decay: 1.0e-6
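The schedule fields above (`linear_warmup_steps: 1_000`, step decay at `[200_000, 240_000]` with `lr_schedule_gamma: 0.1`) describe a warmup-then-step-decay curve. A small sketch of that curve as a pure function of the iteration count; it only mirrors the config values and is not the trainer's scheduler code.

```python
# Illustrative learning-rate curve for the values in train_config.yaml (not the trainer's code).
def lr_at(iteration: int,
          base_lr: float = 1.0e-4,
          warmup_steps: int = 1_000,
          decay_steps: tuple = (200_000, 240_000),
          gamma: float = 0.1) -> float:
    if iteration < warmup_steps:
        return base_lr * iteration / warmup_steps          # linear warmup
    decay_factor = gamma ** sum(iteration >= s for s in decay_steps)
    return base_lr * decay_factor

for it in (500, 1_000, 100_000, 200_000, 250_000):
    print(it, lr_at(it))
```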
hf_AC/config/train_config_2node.yaml ADDED
@@ -0,0 +1,41 @@
1
+ defaults:
2
+ - base_config
3
+ - override data: base
4
+ - override hydra/job_logging: custom
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: train-${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ ema:
13
+ start: 0
14
+
15
+ mini_train: False
16
+ example_train: False
17
+ enable_grad_scaler: False
18
+ vgg_oversample_rate: 4
19
+
20
+ log_text_interval: 200
21
+ log_extra_interval: 20_000
22
+ val_interval: 5_000
23
+ eval_interval: 20_000
24
+ save_eval_interval: 40_000
25
+ save_weights_interval: 10_000
26
+ save_checkpoint_interval: 10_000
27
+ save_copy_iterations: [40000,60000,80000,100000,150000,200000,220000,240000,260000,280000,300000]
28
+
29
+ batch_size: 320
30
+ eval_batch_size: 32 # per-GPU
31
+
32
+ num_iterations: 220_000
33
+ learning_rate: 1.0e-4
34
+ linear_warmup_steps: 1_000
35
+
36
+ lr_schedule: step
37
+ lr_schedule_steps: [200_000, 240_000, 270_000]
38
+ lr_schedule_gamma: 0.1
39
+
40
+ clip_grad_norm: 1.0
41
+ weight_decay: 1.0e-6
hf_AC/config/train_config_2node2.yaml ADDED
@@ -0,0 +1,41 @@
1
+ defaults:
2
+ - base_config
3
+ - override data: base2
4
+ - override hydra/job_logging: custom
5
+ - _self_
6
+
7
+ hydra:
8
+ run:
9
+ dir: ./output/${exp_id}
10
+ output_subdir: train-${now:%Y-%m-%d_%H-%M-%S}-hydra
11
+
12
+ ema:
13
+ start: 0
14
+
15
+ mini_train: False
16
+ example_train: False
17
+ enable_grad_scaler: False
18
+ vgg_oversample_rate: 4
19
+
20
+ log_text_interval: 200
21
+ log_extra_interval: 20_000
22
+ val_interval: 5_000
23
+ eval_interval: 20_000
24
+ save_eval_interval: 40_000
25
+ save_weights_interval: 10_000
26
+ save_checkpoint_interval: 10_000
27
+ save_copy_iterations: [100000,150000,200000,220000,240000,260000,280000,300000]
28
+
29
+ batch_size: 320
30
+ eval_batch_size: 32 # per-GPU
31
+
32
+ num_iterations: 300_000
33
+ learning_rate: 1.0e-4
34
+ linear_warmup_steps: 1_000
35
+
36
+ lr_schedule: step
37
+ lr_schedule_steps: [200_000, 240_000, 270_000]
38
+ lr_schedule_gamma: 0.1
39
+
40
+ clip_grad_norm: 1.0
41
+ weight_decay: 1.0e-6
hf_AC/inf.py ADDED
@@ -0,0 +1,181 @@
1
+ import logging
2
+ from argparse import ArgumentParser
3
+ from pathlib import Path
4
+
5
+ import torch
6
+ import torchaudio
7
+
8
+ from mmaudio.eval_utils import (ModelConfig, all_model_cfg, generate, load_video, make_video,
9
+ setup_eval_logging)
10
+ from mmaudio.model.flow_matching import FlowMatching
11
+ from mmaudio.model.networks import MMAudio, get_my_mmaudio
12
+ from mmaudio.model.utils.features_utils import FeaturesUtils
13
+ import os
14
+ from mmaudio.ext.mel_converter import get_mel_converter
15
+ from mmaudio.ext.autoencoder import AutoEncoderModule
16
+ import time
17
+ torch.backends.cuda.matmul.allow_tf32 = True
18
+ torch.backends.cudnn.allow_tf32 = True
19
+ import tqdm
20
+ import glob
21
+ log = logging.getLogger()
22
+
23
+ class Audio:
24
+ def __init__(self, audio_path, sample_rate):
25
+ self.audio_paths = audio_path
26
+ self.sample_rate = sample_rate
27
+ self.num_timbre_sample = 89088 if sample_rate == 44100 else 32768
28
+ self.resampler = {}
29
+
30
+ def load_audio(self):
31
+ chunk_list=[]
32
+ for audio_path in self.audio_paths:
33
+ audio_chunk, sample_rate = torchaudio.load(audio_path)
34
+ audio_chunk = audio_chunk.mean(dim=0) # mono
35
+ abs_max = audio_chunk.abs().max()
36
+ audio_chunk = audio_chunk / abs_max * 0.95
37
+
38
+ # resample
39
+ if sample_rate == self.sample_rate:
40
+ audio_chunk = audio_chunk
41
+ else:
42
+ if sample_rate not in self.resampler:
43
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
44
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
45
+ sample_rate,
46
+ self.sample_rate,
47
+ lowpass_filter_width=64,
48
+ rolloff=0.9475937167399596,
49
+ resampling_method='sinc_interp_kaiser',
50
+ beta=14.769656459379492,
51
+ )
52
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
53
+ if audio_chunk.size(0) < self.num_timbre_sample:
54
+ padding_length = self.num_timbre_sample - audio_chunk.size(0)
55
+ audio_chunk = torch.cat([audio_chunk, torch.zeros(padding_length)], dim=0)
56
+ else:
57
+ audio_chunk = audio_chunk[:self.num_timbre_sample]
58
+ # audio_chunk = audio_chunk[:self.num_timbre_sample]
59
+ chunk_list.append(audio_chunk)
60
+ return chunk_list
61
+
62
+ def process_video(video_path: Path, args, model: ModelConfig, net: MMAudio, fm: FlowMatching, feature_utils: FeaturesUtils, device: str, dtype: torch.dtype, audio: torch.Tensor, i):
63
+ log.info(f'Processing video: {video_path}')
64
+ t=time.time()
65
+ audio_num_sample = 89088
66
+ if audio is not None:
67
+ audio_num_sample = audio.shape[0]
68
+ video_info = load_video(video_path, args.duration)
69
+ clip_frames = video_info.clip_frames
70
+ sync_frames = video_info.sync_frames
71
+ duration = video_info.duration_sec
72
+ if args.mask_away_clip:
73
+ clip_frames = None
74
+ else:
75
+ clip_frames = clip_frames.unsqueeze(0)
76
+ sync_frames = sync_frames.unsqueeze(0)
77
+
78
+ model.seq_cfg.duration = duration
79
+ model.seq_cfg.audio_num_sample = audio_num_sample
80
+ net.update_seq_lengths(model.seq_cfg.latent_seq_len, model.seq_cfg.clip_seq_len, model.seq_cfg.sync_seq_len, model.seq_cfg.audio_seq_len)
81
+
82
+ log.info(f'Prompt: {args.prompt}')
83
+ log.info(f'Negative prompt: {args.negative_prompt}')
84
+ audios = generate(clip_frames,
85
+ sync_frames, [args.prompt], audio,
86
+ negative_text=[args.negative_prompt],
87
+ feature_utils=feature_utils,
88
+ net=net,
89
+ fm=fm,
90
+ rng=torch.Generator(device=device).manual_seed(args.seed),
91
+ cfg_strength=args.cfg_strength)
92
+ audio = audios.float().cpu()[0]
93
+ save_path = args.output / f'{video_path.stem}{i}.wav'
94
+ torchaudio.save(save_path, audio, model.seq_cfg.sampling_rate)
95
+ log.info(f'Audio saved to {save_path}')
96
+
97
+ if not args.skip_video_composite:
98
+ video_save_path = args.output / f'{video_path.stem}{i}.mp4'
99
+ make_video(video_info, video_save_path, audio, sampling_rate=model.seq_cfg.sampling_rate)
100
+ log.info(f'Video saved to {video_save_path}')
101
+
102
+ @torch.inference_mode()
103
+ def main():
104
+ setup_eval_logging()
105
+
106
+ parser = ArgumentParser()
107
+ parser.add_argument('--variant',
108
+ type=str,
109
+ default='large_44k',)
110
+ parser.add_argument('--video_dir', type=Path, help='')
111
+ parser.add_argument('--audio_path', type=str, default='')
112
+ parser.add_argument('--prompt', type=str, help='Input prompt', default='')
113
+ parser.add_argument('--negative_prompt', type=str, help='Negative prompt', default='')
114
+ parser.add_argument('--duration', type=float, default=8.0)
115
+ parser.add_argument('--cfg_strength', type=float, default=4.5)
116
+ parser.add_argument('--num_steps', type=int, default=25)
117
+ parser.add_argument('--mask_away_clip', action='store_true')
118
+ parser.add_argument('--output', type=Path, help='Output directory', default='./')
119
+ parser.add_argument('--seed', type=int, help='Random seed', default=42)
120
+ parser.add_argument('--skip_video_composite', action='store_true')
121
+ parser.add_argument('--full_precision', action='store_true')
122
+ parser.add_argument('--model_path', type=str, default='weights/model.pth', help='Path to the model weights')
123
+
124
+ args = parser.parse_args()
125
+
126
+ if args.variant not in all_model_cfg:
127
+ raise ValueError(f'Unknown model variant: {args.variant}')
128
+ model: ModelConfig = all_model_cfg[args.variant]
129
+ model.download_if_needed()
130
+
131
+ device = 'cpu'
132
+ if torch.cuda.is_available():
133
+ device = 'cuda'
134
+ elif torch.backends.mps.is_available():
135
+ device = 'mps'
136
+ else:
137
+ log.warning('CUDA/MPS are not available, running on CPU')
138
+ dtype = torch.float32 if args.full_precision else torch.bfloat16
139
+
140
+ args.output.mkdir(parents=True, exist_ok=True)
141
+
142
+ if args.audio_path != '':
143
+ SAMPLE_RATE = 44100
144
+ audio = Audio([args.audio_path], SAMPLE_RATE)
145
+ audio_list = audio.load_audio()
146
+ else:
147
+ audio_list = None
148
+
149
+ model.model_path = Path(args.model_path)
150
+ net: MMAudio = get_my_mmaudio(model.model_name).to(device, dtype).eval()
151
+ net.load_weights(torch.load(model.model_path, map_location=device, weights_only=True)['weights'])
152
+ log.info(f'Loaded weights from {model.model_path}')
153
+
154
+ fm = FlowMatching(min_sigma=0, inference_mode='euler', num_steps=args.num_steps)
155
+ feature_utils = FeaturesUtils(tod_vae_ckpt=model.vae_path,
156
+ synchformer_ckpt=model.synchformer_ckpt,
157
+ enable_conditions=True,
158
+ mode=model.mode,
159
+ bigvgan_vocoder_ckpt=model.bigvgan_16k_path,
160
+ need_vae_encoder=True)
161
+ feature_utils = feature_utils.to(device, dtype).eval()
162
+
163
+ if args.video_dir:
164
+ video_dir: Path = args.video_dir.expanduser()
165
+ video_files = sorted(list(video_dir.glob('*.mp4')))
166
+ if os.path.isfile(args.video_dir):
167
+ video_files=[args.video_dir]
168
+ if not video_files:
169
+ log.warning(f'No video files found in {video_dir}')
170
+ else:
171
+ if audio_list is None:
172
+ audio_list = [None] * len(video_files)
173
+ if len(audio_list)==1:
174
+ audio_list = audio_list * len(video_files)
175
+ for i in range(1):
176
+ for video_path, audio in tqdm.tqdm(zip(video_files,audio_list)):
177
+ args.seed = torch.seed()
178
+ process_video(video_path, args, model, net, fm, feature_utils, device, dtype, audio, i)
179
+
180
+ if __name__ == '__main__':
181
+ main()
hf_AC/mmaudio/__init__.py ADDED
File without changes
hf_AC/mmaudio/data/__init__.py ADDED
File without changes
hf_AC/mmaudio/data/av_utils.py ADDED
@@ -0,0 +1,162 @@
1
+ from dataclasses import dataclass
2
+ from fractions import Fraction
3
+ from pathlib import Path
4
+ from typing import Optional
5
+
6
+ import av
7
+ import numpy as np
8
+ import torch
9
+ from av import AudioFrame
10
+
11
+
12
+ @dataclass
13
+ class VideoInfo:
14
+ duration_sec: float
15
+ fps: Fraction
16
+ clip_frames: torch.Tensor
17
+ sync_frames: torch.Tensor
18
+ all_frames: Optional[list[np.ndarray]]
19
+
20
+ @property
21
+ def height(self):
22
+ return self.all_frames[0].shape[0]
23
+
24
+ @property
25
+ def width(self):
26
+ return self.all_frames[0].shape[1]
27
+
28
+ @classmethod
29
+ def from_image_info(cls, image_info: 'ImageInfo', duration_sec: float,
30
+ fps: Fraction) -> 'VideoInfo':
31
+ num_frames = int(duration_sec * fps)
32
+ all_frames = [image_info.original_frame] * num_frames
33
+ return cls(duration_sec=duration_sec,
34
+ fps=fps,
35
+ clip_frames=image_info.clip_frames,
36
+ sync_frames=image_info.sync_frames,
37
+ all_frames=all_frames)
38
+
39
+
40
+ @dataclass
41
+ class ImageInfo:
42
+ clip_frames: torch.Tensor
43
+ sync_frames: torch.Tensor
44
+ original_frame: Optional[np.ndarray]
45
+
46
+ @property
47
+ def height(self):
48
+ return self.original_frame.shape[0]
49
+
50
+ @property
51
+ def width(self):
52
+ return self.original_frame.shape[1]
53
+
54
+
55
+ def read_frames(video_path: Path, list_of_fps: list[float], start_sec: float, end_sec: float,
56
+ need_all_frames: bool) -> tuple[list[np.ndarray], list[np.ndarray], Fraction]:
57
+ output_frames = [[] for _ in list_of_fps]
58
+ next_frame_time_for_each_fps = [0.0 for _ in list_of_fps]
59
+ time_delta_for_each_fps = [1 / fps for fps in list_of_fps]
60
+ all_frames = []
61
+
62
+ # container = av.open(video_path)
63
+ with av.open(video_path) as container:
64
+ stream = container.streams.video[0]
65
+ fps = stream.guessed_rate
66
+ stream.thread_type = 'AUTO'
67
+ for packet in container.demux(stream):
68
+ for frame in packet.decode():
69
+ frame_time = frame.time
70
+ if frame_time < start_sec:
71
+ continue
72
+ if frame_time > end_sec:
73
+ break
74
+
75
+ frame_np = None
76
+ if need_all_frames:
77
+ frame_np = frame.to_ndarray(format='rgb24')
78
+ all_frames.append(frame_np)
79
+
80
+ for i, _ in enumerate(list_of_fps):
81
+ this_time = frame_time
82
+ while this_time >= next_frame_time_for_each_fps[i]:
83
+ if frame_np is None:
84
+ frame_np = frame.to_ndarray(format='rgb24')
85
+
86
+ output_frames[i].append(frame_np)
87
+ next_frame_time_for_each_fps[i] += time_delta_for_each_fps[i]
88
+
89
+ output_frames = [np.stack(frames) for frames in output_frames]
90
+ return output_frames, all_frames, fps
91
+
92
+
93
+ def reencode_with_audio(video_info: VideoInfo, output_path: Path, audio: torch.Tensor,
94
+ sampling_rate: int):
95
+ container = av.open(output_path, 'w')
96
+ output_video_stream = container.add_stream('h264', video_info.fps)
97
+ output_video_stream.codec_context.bit_rate = 10 * 1e6 # 10 Mbps
98
+ output_video_stream.width = video_info.width
99
+ output_video_stream.height = video_info.height
100
+ output_video_stream.pix_fmt = 'yuv420p'
101
+
102
+ output_audio_stream = container.add_stream('aac', sampling_rate)
103
+
104
+ # encode video
105
+ for image in video_info.all_frames:
106
+ image = av.VideoFrame.from_ndarray(image)
107
+ packet = output_video_stream.encode(image)
108
+ container.mux(packet)
109
+
110
+ for packet in output_video_stream.encode():
111
+ container.mux(packet)
112
+
113
+ # convert float tensor audio to numpy array
114
+ audio_np = audio.numpy().astype(np.float32)
115
+ audio_frame = AudioFrame.from_ndarray(audio_np, format='flt', layout='mono')
116
+ audio_frame.sample_rate = sampling_rate
117
+
118
+ for packet in output_audio_stream.encode(audio_frame):
119
+ container.mux(packet)
120
+
121
+ for packet in output_audio_stream.encode():
122
+ container.mux(packet)
123
+
124
+ container.close()
125
+
126
+
127
+ def remux_with_audio(video_path: Path, audio: torch.Tensor, output_path: Path, sampling_rate: int):
128
+ """
129
+ NOTE: I don't think we can get the exact video duration right without re-encoding
130
+ so we are not using this but keeping it here for reference
131
+ """
132
+ video = av.open(video_path)
133
+ output = av.open(output_path, 'w')
134
+ input_video_stream = video.streams.video[0]
135
+ output_video_stream = output.add_stream(template=input_video_stream)
136
+ output_audio_stream = output.add_stream('aac', sampling_rate)
137
+
138
+ duration_sec = audio.shape[-1] / sampling_rate
139
+
140
+ for packet in video.demux(input_video_stream):
141
+ # We need to skip the "flushing" packets that `demux` generates.
142
+ if packet.dts is None:
143
+ continue
144
+ # We need to assign the packet to the new stream.
145
+ packet.stream = output_video_stream
146
+ output.mux(packet)
147
+
148
+ # convert float tensor audio to numpy array
149
+ audio_np = audio.numpy().astype(np.float32)
150
+ audio_frame = av.AudioFrame.from_ndarray(audio_np, format='flt', layout='mono')
151
+ audio_frame.sample_rate = sampling_rate
152
+
153
+ for packet in output_audio_stream.encode(audio_frame):
154
+ output.mux(packet)
155
+
156
+ for packet in output_audio_stream.encode():
157
+ output.mux(packet)
158
+
159
+ video.close()
160
+ output.close()
161
+
162
+ output.close()
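`read_frames` and `reencode_with_audio` above cover the read and write ends of the video pipeline. A minimal sketch of how they might be combined (the file names, the 8-second window and the silent placeholder audio are illustrative assumptions, not part of this module):

from pathlib import Path

import torch

# Decode CLIP-rate (8 fps) and Synchformer-rate (25 fps) frames from a local clip.
# 'clip.mp4' is a placeholder path.
(clip_np, sync_np), all_frames, fps = read_frames(Path('clip.mp4'),
                                                  list_of_fps=[8.0, 25.0],
                                                  start_sec=0.0, end_sec=8.0,
                                                  need_all_frames=True)

info = VideoInfo(duration_sec=8.0, fps=fps,
                 clip_frames=torch.from_numpy(clip_np),
                 sync_frames=torch.from_numpy(sync_np),
                 all_frames=all_frames)

# Mux 8 s of 16 kHz silence as a stand-in for generated audio.
reencode_with_audio(info, Path('clip_with_audio.mp4'),
                    torch.zeros(16_000 * 8), sampling_rate=16_000)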
hf_AC/mmaudio/data/data_setup.py ADDED
@@ -0,0 +1,177 @@
1
+ import logging
2
+ import random
3
+
4
+ import numpy as np
5
+ import torch
6
+ from omegaconf import DictConfig
7
+ from torch.utils.data import DataLoader, Dataset
8
+ from torch.utils.data.dataloader import default_collate
9
+ from torch.utils.data.distributed import DistributedSampler
10
+
11
+ from mmaudio.data.eval.audiocaps import AudioCapsData
12
+ from mmaudio.data.eval.video_dataset import MovieGen, VGGSound
13
+ from mmaudio.data.extracted_audio import ExtractedAudio
14
+ from mmaudio.data.extracted_vgg import ExtractedVGG
15
+ from mmaudio.data.mm_dataset import MultiModalDataset
16
+ from mmaudio.utils.dist_utils import local_rank
17
+
18
+ log = logging.getLogger()
19
+
20
+
21
+ # Re-seed randomness every time we start a worker
22
+ def worker_init_fn(worker_id: int):
23
+ worker_seed = torch.initial_seed() % (2**31) + worker_id + local_rank * 1000
24
+ np.random.seed(worker_seed)
25
+ random.seed(worker_seed)
26
+ log.debug(f'Worker {worker_id} re-seeded with seed {worker_seed} in rank {local_rank}')
27
+
28
+
29
+ def load_vgg_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
30
+ dataset = ExtractedVGG(tsv_path=data_cfg.tsv,
31
+ data_dim=cfg.data_dim,
32
+ premade_mmap_dir=data_cfg.memmap_dir)
33
+
34
+ return dataset
35
+
36
+
37
+ def load_audio_data(cfg: DictConfig, data_cfg: DictConfig) -> Dataset:
38
+ dataset = ExtractedAudio(tsv_path=data_cfg.tsv,
39
+ data_dim=cfg.data_dim,
40
+ premade_mmap_dir=data_cfg.memmap_dir)
41
+
42
+ return dataset
43
+
44
+
45
+ def setup_training_datasets(cfg: DictConfig) -> tuple[Dataset, DistributedSampler, DataLoader]:
46
+ if cfg.mini_train:
47
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
48
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
49
+ dataset = MultiModalDataset([vgg], [audiocaps])
50
+ if cfg.example_train:
51
+ video = load_vgg_data(cfg, cfg.data.Example_video)
52
+ audio = load_audio_data(cfg, cfg.data.Example_audio)
53
+ dataset = MultiModalDataset([video], [audio])
54
+ else:
55
+ # load the largest one first
56
+ freesound = load_audio_data(cfg, cfg.data.FreeSound)
57
+ vgg = load_vgg_data(cfg, cfg.data.ExtractedVGG)
58
+ audiocaps = load_audio_data(cfg, cfg.data.AudioCaps)
59
+ audioset_sl = load_audio_data(cfg, cfg.data.AudioSetSL)
60
+ # bbcsound = load_audio_data(cfg, cfg.data.BBCSound)
61
+ # clotho = load_audio_data(cfg, cfg.data.Clotho)
62
+ dataset = MultiModalDataset([vgg] * cfg.vgg_oversample_rate,
63
+ [audiocaps, audioset_sl, freesound])
64
+ # dataset = MultiModalDataset([vgg],[])
65
+
66
+ batch_size = cfg.batch_size
67
+ num_workers = cfg.num_workers
68
+ pin_memory = cfg.pin_memory
69
+ sampler, loader = construct_loader(dataset,
70
+ batch_size,
71
+ num_workers,
72
+ shuffle=True,
73
+ drop_last=True,
74
+ pin_memory=pin_memory)
75
+
76
+ return dataset, sampler, loader
77
+
78
+
79
+ def setup_test_datasets(cfg):
80
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_test)
81
+
82
+ batch_size = cfg.batch_size
83
+ num_workers = cfg.num_workers
84
+ pin_memory = cfg.pin_memory
85
+ sampler, loader = construct_loader(dataset,
86
+ batch_size,
87
+ num_workers,
88
+ shuffle=False,
89
+ drop_last=False,
90
+ pin_memory=pin_memory)
91
+
92
+ return dataset, sampler, loader
93
+
94
+
95
+ def setup_val_datasets(cfg: DictConfig) -> tuple[Dataset, DataLoader, DataLoader]:
96
+ if cfg.example_train:
97
+ dataset = load_vgg_data(cfg, cfg.data.Example_video)
98
+ else:
99
+ dataset = load_vgg_data(cfg, cfg.data.ExtractedVGG_val)
100
+
101
+ val_batch_size = cfg.batch_size
102
+ val_eval_batch_size = cfg.eval_batch_size
103
+ num_workers = cfg.num_workers
104
+ pin_memory = cfg.pin_memory
105
+ _, val_loader = construct_loader(dataset,
106
+ val_batch_size,
107
+ num_workers,
108
+ shuffle=False,
109
+ drop_last=False,
110
+ pin_memory=pin_memory)
111
+ _, eval_loader = construct_loader(dataset,
112
+ val_eval_batch_size,
113
+ num_workers,
114
+ shuffle=False,
115
+ drop_last=False,
116
+ pin_memory=pin_memory)
117
+
118
+ return dataset, val_loader, eval_loader
119
+
120
+
121
+ def setup_eval_dataset(dataset_name: str, cfg: DictConfig) -> tuple[Dataset, DataLoader]:
122
+ if dataset_name.startswith('audiocaps_full'):
123
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps_full.audio_path,
124
+ cfg.eval_data.AudioCaps_full.csv_path)
125
+ elif dataset_name.startswith('audiocaps'):
126
+ dataset = AudioCapsData(cfg.eval_data.AudioCaps.audio_path,
127
+ cfg.eval_data.AudioCaps.csv_path)
128
+ elif dataset_name.startswith('moviegen'):
129
+ dataset = MovieGen(cfg.eval_data.MovieGen.video_path,
130
+ cfg.eval_data.MovieGen.jsonl_path,
131
+ duration_sec=cfg.duration_s)
132
+ elif dataset_name.startswith('vggsound'):
133
+ dataset = VGGSound(cfg.eval_data.VGGSound.video_path,
134
+ cfg.eval_data.VGGSound.csv_path,
135
+ duration_sec=cfg.duration_s)
136
+ else:
137
+ raise ValueError(f'Invalid dataset name: {dataset_name}')
138
+
139
+ batch_size = cfg.batch_size
140
+ num_workers = cfg.num_workers
141
+ pin_memory = cfg.pin_memory
142
+ _, loader = construct_loader(dataset,
143
+ batch_size,
144
+ num_workers,
145
+ shuffle=False,
146
+ drop_last=False,
147
+ pin_memory=pin_memory,
148
+ error_avoidance=True)
149
+ return dataset, loader
150
+
151
+
152
+ def error_avoidance_collate(batch):
153
+ batch = list(filter(lambda x: x is not None, batch))
154
+ if len(batch) == 0:
155
+ return None
156
+ return default_collate(batch)
157
+
158
+
159
+ def construct_loader(dataset: Dataset,
160
+ batch_size: int,
161
+ num_workers: int,
162
+ *,
163
+ shuffle: bool = True,
164
+ drop_last: bool = True,
165
+ pin_memory: bool = False,
166
+ error_avoidance: bool = False) -> tuple[DistributedSampler, DataLoader]:
167
+ train_sampler = DistributedSampler(dataset, rank=local_rank, shuffle=shuffle)
168
+ train_loader = DataLoader(dataset,
169
+ batch_size,
170
+ sampler=train_sampler,
171
+ num_workers=num_workers,
172
+ worker_init_fn=worker_init_fn,
173
+ drop_last=drop_last,
174
+ persistent_workers=num_workers > 0,
175
+ pin_memory=pin_memory,
176
+ collate_fn=error_avoidance_collate if error_avoidance else None)
177
+ return train_sampler, train_loader
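The `error_avoidance` path exists because the evaluation video datasets return `None` for clips that fail to decode; `error_avoidance_collate` drops those entries before default collation. A small illustration (the sample dicts are stand-ins, not real dataset output):

import torch

batch = [
    {'name': 'a', 'clip_video': torch.zeros(3)},
    None,  # a sample whose video failed to decode
    {'name': 'b', 'clip_video': torch.ones(3)},
]

collated = error_avoidance_collate(batch)
assert collated['clip_video'].shape == (2, 3)  # the None entry was dropped

# If every sample failed, the whole batch collapses to None and callers must skip it.
assert error_avoidance_collate([None, None]) is None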
hf_AC/mmaudio/data/eval/__init__.py ADDED
File without changes
hf_AC/mmaudio/data/eval/audiocaps.py ADDED
@@ -0,0 +1,39 @@
1
+ import logging
2
+ import os
3
+ from collections import defaultdict
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+
11
+ log = logging.getLogger()
12
+
13
+
14
+ class AudioCapsData(Dataset):
15
+
16
+ def __init__(self, audio_path: Union[str, Path], csv_path: Union[str, Path]):
17
+ df = pd.read_csv(csv_path).to_dict(orient='records')
18
+
19
+ audio_files = sorted(os.listdir(audio_path))
20
+ audio_files = set(
21
+ [Path(f).stem for f in audio_files if f.endswith('.wav') or f.endswith('.flac')])
22
+
23
+ self.data = []
24
+ for row in df:
25
+ self.data.append({
26
+ 'name': row['name'],
27
+ 'caption': row['caption'],
28
+ })
29
+
30
+ self.audio_path = Path(audio_path)
31
+ self.csv_path = Path(csv_path)
32
+
33
+ log.info(f'Found {len(self.data)} matching audio files in {self.audio_path}')
34
+
35
+ def __getitem__(self, idx: int) -> torch.Tensor:
36
+ return self.data[idx]
37
+
38
+ def __len__(self):
39
+ return len(self.data)
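`AudioCapsData` only needs a directory of `.wav`/`.flac` files and a CSV with `name` and `caption` columns; a minimal sketch with made-up paths:

import os

import pandas as pd

os.makedirs('/tmp/audiocaps_audio', exist_ok=True)  # placeholder audio directory
pd.DataFrame([
    {'name': 'Y-abc123', 'caption': 'a dog barks twice'},
    {'name': 'Y-def456', 'caption': 'rain falls on a tin roof'},
]).to_csv('/tmp/audiocaps_demo.csv', index=False)

ds = AudioCapsData('/tmp/audiocaps_audio', '/tmp/audiocaps_demo.csv')
print(len(ds), ds[0]['caption'])  # -> 2 a dog barks twice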
hf_AC/mmaudio/data/eval/moviegen.py ADDED
@@ -0,0 +1,131 @@
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import torch
8
+ from torch.utils.data.dataset import Dataset
9
+ from torchvision.transforms import v2
10
+ from torio.io import StreamingMediaDecoder
11
+
12
+ from mmaudio.utils.dist_utils import local_rank
13
+
14
+ log = logging.getLogger()
15
+
16
+ _CLIP_SIZE = 384
17
+ _CLIP_FPS = 8.0
18
+
19
+ _SYNC_SIZE = 224
20
+ _SYNC_FPS = 25.0
21
+
22
+
23
+ class MovieGenData(Dataset):
24
+
25
+ def __init__(
26
+ self,
27
+ video_root: Union[str, Path],
28
+ sync_root: Union[str, Path],
29
+ jsonl_root: Union[str, Path],
30
+ *,
31
+ duration_sec: float = 10.0,
32
+ read_clip: bool = True,
33
+ ):
34
+ self.video_root = Path(video_root)
35
+ self.sync_root = Path(sync_root)
36
+ self.jsonl_root = Path(jsonl_root)
37
+ self.read_clip = read_clip
38
+
39
+ videos = sorted(os.listdir(self.video_root))
40
+ videos = [v[:-4] for v in videos] # remove extensions
41
+ self.captions = {}
42
+
43
+ for v in videos:
44
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
45
+ data = json.load(f)
46
+ self.captions[v] = data['audio_prompt']
47
+
48
+ if local_rank == 0:
49
+ log.info(f'{len(videos)} videos found in {video_root}')
50
+
51
+ self.duration_sec = duration_sec
52
+
53
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
54
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
55
+
56
+ self.clip_augment = v2.Compose([
57
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
58
+ v2.ToImage(),
59
+ v2.ToDtype(torch.float32, scale=True),
60
+ ])
61
+
62
+ self.sync_augment = v2.Compose([
63
+ v2.Resize((_SYNC_SIZE, _SYNC_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
64
+ v2.CenterCrop(_SYNC_SIZE),
65
+ v2.ToImage(),
66
+ v2.ToDtype(torch.float32, scale=True),
67
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
68
+ ])
69
+
70
+ self.videos = videos
71
+
72
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
73
+ video_id = self.videos[idx]
74
+ caption = self.captions[video_id]
75
+
76
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
77
+ reader.add_basic_video_stream(
78
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
79
+ frame_rate=_CLIP_FPS,
80
+ format='rgb24',
81
+ )
82
+ reader.add_basic_video_stream(
83
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
84
+ frame_rate=_SYNC_FPS,
85
+ format='rgb24',
86
+ )
87
+
88
+ reader.fill_buffer()
89
+ data_chunk = reader.pop_chunks()
90
+
91
+ clip_chunk = data_chunk[0]
92
+ sync_chunk = data_chunk[1]
93
+ if clip_chunk is None:
94
+ raise RuntimeError(f'CLIP video returned None {video_id}')
95
+ if clip_chunk.shape[0] < self.clip_expected_length:
96
+ raise RuntimeError(f'CLIP video too short {video_id}')
97
+
98
+ if sync_chunk is None:
99
+ raise RuntimeError(f'Sync video returned None {video_id}')
100
+ if sync_chunk.shape[0] < self.sync_expected_length:
101
+ raise RuntimeError(f'Sync video too short {video_id}')
102
+
103
+ # truncate the video
104
+ clip_chunk = clip_chunk[:self.clip_expected_length]
105
+ if clip_chunk.shape[0] != self.clip_expected_length:
106
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
107
+ f'expected {self.clip_expected_length}, '
108
+ f'got {clip_chunk.shape[0]}')
109
+ clip_chunk = self.clip_augment(clip_chunk)
110
+
111
+ sync_chunk = sync_chunk[:self.sync_expected_length]
112
+ if sync_chunk.shape[0] != self.sync_expected_length:
113
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
114
+ f'expected {self.sync_expected_length}, '
115
+ f'got {sync_chunk.shape[0]}')
116
+ sync_chunk = self.sync_augment(sync_chunk)
117
+
118
+ data = {
119
+ 'name': video_id,
120
+ 'caption': caption,
121
+ 'clip_video': clip_chunk,
122
+ 'sync_video': sync_chunk,
123
+ }
124
+
125
+ return data
126
+
127
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
128
+ return self.sample(idx)
129
+
130
+ def __len__(self):
131
+ return len(self.captions)
hf_AC/mmaudio/data/eval/video_dataset.py ADDED
@@ -0,0 +1,231 @@
1
+ import json
2
+ import logging
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Union
6
+
7
+ import pandas as pd
8
+ import torch
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+ import torchaudio
13
+ from mmaudio.utils.dist_utils import local_rank
14
+ import random
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VideoDataset(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ video_root: Union[str, Path],
29
+ *,
30
+ duration_sec: float = 8.0,
31
+ ):
32
+ self.video_root = Path(video_root)
33
+
34
+ self.duration_sec = duration_sec
35
+ self.sample_rate = 44100
36
+ self.resampler = {}
37
+ self.expected_audio_length = 89088
38
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
39
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
40
+
41
+ self.clip_transform = v2.Compose([
42
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
43
+ v2.ToImage(),
44
+ v2.ToDtype(torch.float32, scale=True),
45
+ ])
46
+
47
+ self.sync_transform = v2.Compose([
48
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
49
+ v2.CenterCrop(_SYNC_SIZE),
50
+ v2.ToImage(),
51
+ v2.ToDtype(torch.float32, scale=True),
52
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
53
+ ])
54
+
55
+ # to be implemented by subclasses
56
+ self.captions = {}
57
+ self.videos = sorted(list(self.captions.keys()))
58
+
59
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
60
+ video_id = self.videos[idx]
61
+ caption = self.captions[video_id]
62
+
63
+ reader = StreamingMediaDecoder(self.video_root / (video_id + '.mp4'))
64
+ reader.add_basic_video_stream(
65
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
66
+ frame_rate=_CLIP_FPS,
67
+ format='rgb24',
68
+ )
69
+ reader.add_basic_video_stream(
70
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
71
+ frame_rate=_SYNC_FPS,
72
+ format='rgb24',
73
+ )
74
+ reader.add_basic_audio_stream(frames_per_chunk=2**30, )
75
+
76
+ reader.fill_buffer()
77
+ data_chunk = reader.pop_chunks()
78
+
79
+ clip_chunk = data_chunk[0]
80
+ sync_chunk = data_chunk[1]
81
+ audio_chunk = data_chunk[2]
82
+
83
+ if clip_chunk is None:
84
+ raise RuntimeError(f'CLIP video returned None {video_id}')
85
+ if clip_chunk.shape[0] < self.clip_expected_length:
86
+ raise RuntimeError(
87
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
88
+ )
89
+
90
+ if sync_chunk is None:
91
+ raise RuntimeError(f'Sync video returned None {video_id}')
92
+ if sync_chunk.shape[0] < self.sync_expected_length:
93
+ raise RuntimeError(
94
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
95
+ )
96
+
97
+ # process audio
98
+ sample_rate = int(reader.get_out_stream_info(2).sample_rate)
99
+ audio_chunk = audio_chunk.transpose(0, 1)
100
+ audio_chunk = audio_chunk.mean(dim=0) # mono
101
+ abs_max = audio_chunk.abs().max()
102
+ audio_chunk = audio_chunk / abs_max.clamp(min=1e-6) * 0.95  # guard against silent clips
103
+
104
+ # resample
105
+ if sample_rate == self.sample_rate:
106
+ audio_chunk = audio_chunk
107
+ else:
108
+ if sample_rate not in self.resampler:
109
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
110
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
111
+ sample_rate,
112
+ self.sample_rate,
113
+ lowpass_filter_width=64,
114
+ rolloff=0.9475937167399596,
115
+ resampling_method='sinc_interp_kaiser',
116
+ beta=14.769656459379492,
117
+ )
118
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
119
+
120
+ if audio_chunk.shape[0] < self.expected_audio_length:
121
+ raise RuntimeError(f'Audio too short {video_id}')
122
+ # start_index = random.randint(0, audio_chunk.shape[0] - self.expected_audio_length)
123
+ timbre_sample = audio_chunk[audio_chunk.shape[0]-self.expected_audio_length:]
124
+
125
+ # truncate the video
126
+ clip_chunk = clip_chunk[:self.clip_expected_length]
127
+ if clip_chunk.shape[0] != self.clip_expected_length:
128
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
129
+ f'expected {self.clip_expected_length}, '
130
+ f'got {clip_chunk.shape[0]}')
131
+ clip_chunk = self.clip_transform(clip_chunk)
132
+
133
+ sync_chunk = sync_chunk[:self.sync_expected_length]
134
+ if sync_chunk.shape[0] != self.sync_expected_length:
135
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
136
+ f'expected {self.sync_expected_length}, '
137
+ f'got {sync_chunk.shape[0]}')
138
+ sync_chunk = self.sync_transform(sync_chunk)
139
+
140
+ data = {
141
+ 'name': video_id,
142
+ 'caption': caption,
143
+ 'clip_video': clip_chunk,
144
+ 'sync_video': sync_chunk,
145
+ 'audio': timbre_sample
146
+ }
147
+
148
+ return data
149
+
150
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
151
+ try:
152
+ return self.sample(idx)
153
+ except Exception as e:
154
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
155
+ return None
156
+
157
+ def __len__(self):
158
+ return len(self.captions)
159
+
160
+
161
+ class VGGSound(VideoDataset):
162
+
163
+ def __init__(
164
+ self,
165
+ video_root: Union[str, Path],
166
+ csv_path: Union[str, Path],
167
+ *,
168
+ duration_sec: float = 8.0,
169
+ ):
170
+ super().__init__(video_root, duration_sec=duration_sec)
171
+ self.video_root = Path(video_root)
172
+ self.csv_path = Path(csv_path)
173
+
174
+ videos = sorted(os.listdir(self.video_root))
175
+ if local_rank == 0:
176
+ log.info(f'{len(videos)} videos found in {video_root}')
177
+ self.captions = {}
178
+
179
+ df = pd.read_csv(csv_path, header=None, names=['id', 'sec', 'caption',
180
+ 'split']).to_dict(orient='records')
181
+
182
+ videos_no_found = []
183
+ for row in df:
184
+ if row['split'] == 'test':
185
+ start_sec = int(row['sec'])
186
+ video_id = str(row['id'])
187
+ # this is how our videos are named
188
+ video_name = f'{video_id}_{start_sec:06d}'
189
+ if video_name + '.mp4' not in videos:
190
+ videos_no_found.append(video_name)
191
+ continue
192
+
193
+ self.captions[video_name] = row['caption']
194
+
195
+ if local_rank == 0:
196
+ log.info(f'{len(videos)} videos found in {video_root}')
197
+ log.info(f'{len(self.captions)} usable videos found')
198
+ if videos_no_found:
199
+ log.info(f'{len(videos_no_found)} found in {csv_path} but not in {video_root}')
200
+ log.info(
201
+ 'A small amount is expected, as not all videos are still available on YouTube')
202
+
203
+ self.videos = sorted(list(self.captions.keys()))
204
+
205
+
206
+ class MovieGen(VideoDataset):
207
+
208
+ def __init__(
209
+ self,
210
+ video_root: Union[str, Path],
211
+ jsonl_root: Union[str, Path],
212
+ *,
213
+ duration_sec: float = 10.0,
214
+ ):
215
+ super().__init__(video_root, duration_sec=duration_sec)
216
+ self.video_root = Path(video_root)
217
+ self.jsonl_root = Path(jsonl_root)
218
+
219
+ videos = sorted(os.listdir(self.video_root))
220
+ videos = [v[:-4] for v in videos] # remove extensions
221
+ self.captions = {}
222
+
223
+ for v in videos:
224
+ with open(self.jsonl_root / (v + '.jsonl')) as f:
225
+ data = json.load(f)
226
+ self.captions[v] = data['audio_prompt']
227
+
228
+ if local_rank == 0:
229
+ log.info(f'{len(videos)} videos found in {video_root}')
230
+
231
+ self.videos = videos
hf_AC/mmaudio/data/extracted_audio.py ADDED
@@ -0,0 +1,97 @@
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedAudio(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [str(d['id']) for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.text_features = td['text_features']
38
+ rng = torch.Generator(device=self.text_features.device)
39
+ rng.manual_seed(42)
40
+ randn = torch.empty_like(td['audio_feature_mean']).normal_(generator=rng)
41
+ self.audio_features = td['audio_feature_mean'] + td['audio_feature_std'] * randn
42
+
43
+ log.info(f'Loaded {len(self)} samples from {premade_mmap_dir}.')
44
+ log.info(f'Loaded mean: {self.mean.shape}.')
45
+ log.info(f'Loaded std: {self.std.shape}.')
46
+ log.info(f'Loaded text features: {self.text_features.shape}.')
47
+ log.info(f'Loaded audio features: {self.audio_features.shape}.')
48
+
49
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
50
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
51
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
52
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
53
+
54
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
55
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
56
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
57
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
58
+
59
+ self.fake_clip_features = torch.zeros(self.data_dim['clip_seq_len'],
60
+ self.data_dim['clip_dim'])
61
+ self.fake_sync_features = torch.zeros(self.data_dim['sync_seq_len'],
62
+ self.data_dim['sync_dim'])
63
+ self.video_exist = torch.tensor(0, dtype=torch.bool)
64
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
65
+ self.audio_exist = torch.tensor(1, dtype=torch.bool)
66
+
67
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
68
+ latents = self.mean
69
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
70
+
71
+ def get_memory_mapped_tensor(self) -> TensorDict:
72
+ td = TensorDict({
73
+ 'mean': self.mean,
74
+ 'std': self.std,
75
+ 'text_features': self.text_features,
76
+ 'audio_features': self.audio_features,
77
+ })
78
+ return td
79
+
80
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
81
+ data = {
82
+ 'id': str(self.df_list[idx]['id']),
83
+ 'a_mean': self.mean[idx],
84
+ 'a_std': self.std[idx],
85
+ 'clip_features': self.fake_clip_features,
86
+ 'sync_features': self.fake_sync_features,
87
+ 'text_features': self.text_features[idx],
88
+ 'audio_features': self.audio_features[idx],
89
+ 'caption': self.df_list[idx]['caption'],
90
+ 'video_exist': self.video_exist,
91
+ 'text_exist': self.text_exist,
92
+ 'audio_exist': self.audio_exist,
93
+ }
94
+ return data
95
+
96
+ def __len__(self):
97
+ return len(self.ids)
hf_AC/mmaudio/data/extracted_vgg.py ADDED
@@ -0,0 +1,109 @@
1
+ import logging
2
+ from pathlib import Path
3
+ from typing import Union
4
+
5
+ import pandas as pd
6
+ import torch
7
+ from tensordict import TensorDict
8
+ from torch.utils.data.dataset import Dataset
9
+
10
+ from mmaudio.utils.dist_utils import local_rank
11
+
12
+ log = logging.getLogger()
13
+
14
+
15
+ class ExtractedVGG(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ tsv_path: Union[str, Path],
20
+ *,
21
+ premade_mmap_dir: Union[str, Path],
22
+ data_dim: dict[str, int],
23
+ ):
24
+ super().__init__()
25
+
26
+ self.data_dim = data_dim
27
+ self.df_list = pd.read_csv(tsv_path, sep='\t').to_dict('records')
28
+ self.ids = [d['id'] for d in self.df_list]
29
+
30
+ log.info(f'Loading precomputed mmap from {premade_mmap_dir}')
31
+ # load precomputed memory mapped tensors
32
+ premade_mmap_dir = Path(premade_mmap_dir)
33
+ td = TensorDict.load_memmap(premade_mmap_dir)
34
+ log.info(f'Loaded precomputed mmap from {premade_mmap_dir}')
35
+ self.mean = td['mean']
36
+ self.std = td['std']
37
+ self.clip_features = td['clip_features']
38
+ self.sync_features = td['sync_features']
39
+ self.text_features = td['text_features']
40
+ rng = torch.Generator(device=self.clip_features.device)
41
+ rng.manual_seed(14159265)
42
+ randn = torch.empty_like(td['audio_feature_mean']).normal_(generator=rng)
43
+ self.audio_features = td['audio_feature_mean'] + td['audio_feature_std'] * randn
44
+
45
+ if local_rank == 0:
46
+ log.info(f'Loaded {len(self)} samples.')
47
+ log.info(f'Loaded mean: {self.mean.shape}.')
48
+ log.info(f'Loaded std: {self.std.shape}.')
49
+ log.info(f'Loaded clip_features: {self.clip_features.shape}.')
50
+ log.info(f'Loaded sync_features: {self.sync_features.shape}.')
51
+ log.info(f'Loaded text_features: {self.text_features.shape}.')
52
+
53
+ assert self.mean.shape[1] == self.data_dim['latent_seq_len'], \
54
+ f'{self.mean.shape[1]} != {self.data_dim["latent_seq_len"]}'
55
+ assert self.std.shape[1] == self.data_dim['latent_seq_len'], \
56
+ f'{self.std.shape[1]} != {self.data_dim["latent_seq_len"]}'
57
+
58
+ assert self.clip_features.shape[1] == self.data_dim['clip_seq_len'], \
59
+ f'{self.clip_features.shape[1]} != {self.data_dim["clip_seq_len"]}'
60
+ assert self.sync_features.shape[1] == self.data_dim['sync_seq_len'], \
61
+ f'{self.sync_features.shape[1]} != {self.data_dim["sync_seq_len"]}'
62
+ assert self.text_features.shape[1] == self.data_dim['text_seq_len'], \
63
+ f'{self.text_features.shape[1]} != {self.data_dim["text_seq_len"]}'
64
+
65
+ assert self.clip_features.shape[-1] == self.data_dim['clip_dim'], \
66
+ f'{self.clip_features.shape[-1]} != {self.data_dim["clip_dim"]}'
67
+ assert self.sync_features.shape[-1] == self.data_dim['sync_dim'], \
68
+ f'{self.sync_features.shape[-1]} != {self.data_dim["sync_dim"]}'
69
+ assert self.text_features.shape[-1] == self.data_dim['text_dim'], \
70
+ f'{self.text_features.shape[-1]} != {self.data_dim["text_dim"]}'
71
+
72
+ self.video_exist = torch.tensor(1, dtype=torch.bool)
73
+ self.text_exist = torch.tensor(1, dtype=torch.bool)
74
+ self.audio_exist = torch.tensor(1, dtype=torch.bool)
75
+
76
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
77
+ latents = self.mean
78
+ return latents.mean(dim=(0, 1)), latents.std(dim=(0, 1))
79
+
80
+ def get_memory_mapped_tensor(self) -> TensorDict:
81
+ td = TensorDict({
82
+ 'mean': self.mean,
83
+ 'std': self.std,
84
+ 'clip_features': self.clip_features,
85
+ 'sync_features': self.sync_features,
86
+ 'text_features': self.text_features,
87
+ 'audio_features': self.audio_features,
88
+ })
89
+ return td
90
+
91
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
92
+ data = {
93
+ 'id': self.df_list[idx]['id'],
94
+ 'a_mean': self.mean[idx],
95
+ 'a_std': self.std[idx],
96
+ 'clip_features': self.clip_features[idx],
97
+ 'sync_features': self.sync_features[idx],
98
+ 'text_features': self.text_features[idx],
99
+ 'audio_features': self.audio_features[idx],
100
+ 'caption': self.df_list[idx]['label'],
101
+ 'video_exist': self.video_exist,
102
+ 'text_exist': self.text_exist,
103
+ 'audio_exist': self.audio_exist,
104
+ }
105
+
106
+ return data
107
+
108
+ def __len__(self):
109
+ return len(self.ids)
hf_AC/mmaudio/data/extraction/__init__.py ADDED
File without changes
hf_AC/mmaudio/data/extraction/vgg_sound.py ADDED
@@ -0,0 +1,208 @@
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Optional, Union
5
+
6
+ import pandas as pd
7
+ import torch
8
+ import torchaudio
9
+ from torch.utils.data.dataset import Dataset
10
+ from torchvision.transforms import v2
11
+ from torio.io import StreamingMediaDecoder
12
+
13
+ from mmaudio.utils.dist_utils import local_rank
14
+ import random
15
+ log = logging.getLogger()
16
+
17
+ _CLIP_SIZE = 384
18
+ _CLIP_FPS = 8.0
19
+
20
+ _SYNC_SIZE = 224
21
+ _SYNC_FPS = 25.0
22
+
23
+
24
+ class VGGSound(Dataset):
25
+
26
+ def __init__(
27
+ self,
28
+ root: Union[str, Path],
29
+ *,
30
+ tsv_path: Union[str, Path] = 'sets/vgg3-train.tsv',
31
+ sample_rate: int = 16_000,
32
+ duration_sec: float = 8.0,
33
+ audio_samples: Optional[int] = None,
34
+ normalize_audio: bool = False,
35
+ exclude_path: Optional[Union[str, Path]] = None,
36
+ ):
37
+ self.root = Path(root)
38
+ self.normalize_audio = normalize_audio
39
+ if audio_samples is None:
40
+ self.audio_samples = int(sample_rate * duration_sec)
41
+ else:
42
+ self.audio_samples = audio_samples
43
+ effective_duration = audio_samples / sample_rate
44
+ # make sure the duration is close enough, within 15ms
45
+ assert abs(effective_duration - duration_sec) < 0.015, \
46
+ f'audio_samples {audio_samples} does not match duration_sec {duration_sec}'
47
+
48
+ videos = sorted(os.listdir(self.root))
49
+ videos = set([Path(v).stem for v in videos]) # remove extensions
50
+ self.labels = {}
51
+ self.videos = []
52
+ missing_videos = []
53
+ excluded_videos = []
54
+ self.exclude_list = []
55
+ if exclude_path is not None:
56
+
57
+ for t in sorted(os.listdir(exclude_path)):
58
+ data = torch.load(exclude_path / t, weights_only=True)
59
+ self.exclude_list.extend(data['id'])
60
+ # read the tsv for subset information
61
+ df_list = pd.read_csv(tsv_path, sep='\t', dtype={'id': str}).to_dict('records')
62
+ for record in df_list:
63
+ id = record['id']
64
+ label = record['label']
65
+ if id in self.exclude_list:
66
+ excluded_videos.append(id)
67
+ continue
68
+ if id in videos:
69
+ self.labels[id] = label
70
+ self.videos.append(id)
71
+ else:
72
+ missing_videos.append(id)
73
+
74
+ if local_rank == 0:
75
+ log.info(f'{len(excluded_videos)} videos excluded as per exclude list')
76
+ log.info(f'{len(videos)} videos found in {root}')
77
+ log.info(f'{len(self.videos)} videos found in {tsv_path}')
78
+ log.info(f'{len(missing_videos)} videos missing in {root}')
79
+
80
+ self.sample_rate = sample_rate
81
+ self.duration_sec = duration_sec
82
+
83
+ self.expected_audio_length = audio_samples
84
+ self.clip_expected_length = int(_CLIP_FPS * self.duration_sec)
85
+ self.sync_expected_length = int(_SYNC_FPS * self.duration_sec)
86
+
87
+ self.clip_transform = v2.Compose([
88
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
89
+ v2.ToImage(),
90
+ v2.ToDtype(torch.float32, scale=True),
91
+ ])
92
+
93
+ self.sync_transform = v2.Compose([
94
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
95
+ v2.CenterCrop(_SYNC_SIZE),
96
+ v2.ToImage(),
97
+ v2.ToDtype(torch.float32, scale=True),
98
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
99
+ ])
100
+ self.num_timbre_sample = 89088 if sample_rate == 44100 else 32768
101
+ self.resampler = {}
102
+
103
+ def sample(self, idx: int) -> dict[str, torch.Tensor]:
104
+ video_id = self.videos[idx]
105
+ # if video_id in self.exclude_list:
106
+ # raise RuntimeError(f'Video {video_id} is in the exclude list')
107
+ label = self.labels[video_id]
108
+
109
+ reader = StreamingMediaDecoder(self.root / (video_id + '.mp4'))
110
+ reader.add_basic_video_stream(
111
+ frames_per_chunk=int(_CLIP_FPS * self.duration_sec),
112
+ frame_rate=_CLIP_FPS,
113
+ format='rgb24',
114
+ )
115
+ reader.add_basic_video_stream(
116
+ frames_per_chunk=int(_SYNC_FPS * self.duration_sec),
117
+ frame_rate=_SYNC_FPS,
118
+ format='rgb24',
119
+ )
120
+ reader.add_basic_audio_stream(frames_per_chunk=2**30, )
121
+
122
+ reader.fill_buffer()
123
+ data_chunk = reader.pop_chunks()
124
+
125
+ clip_chunk = data_chunk[0]
126
+ sync_chunk = data_chunk[1]
127
+ audio_chunk = data_chunk[2]
128
+
129
+ if clip_chunk is None:
130
+ raise RuntimeError(f'CLIP video returned None {video_id}')
131
+ if clip_chunk.shape[0] < self.clip_expected_length:
132
+ raise RuntimeError(
133
+ f'CLIP video too short {video_id}, expected {self.clip_expected_length}, got {clip_chunk.shape[0]}'
134
+ )
135
+
136
+ if sync_chunk is None:
137
+ raise RuntimeError(f'Sync video returned None {video_id}')
138
+ if sync_chunk.shape[0] < self.sync_expected_length:
139
+ raise RuntimeError(
140
+ f'Sync video too short {video_id}, expected {self.sync_expected_length}, got {sync_chunk.shape[0]}'
141
+ )
142
+
143
+ # process audio
144
+ sample_rate = int(reader.get_out_stream_info(2).sample_rate)
145
+ audio_chunk = audio_chunk.transpose(0, 1)
146
+ audio_chunk = audio_chunk.mean(dim=0) # mono
147
+ if self.normalize_audio:
148
+ abs_max = audio_chunk.abs().max()
149
+ audio_chunk = audio_chunk / abs_max * 0.95
150
+ if abs_max <= 1e-6:
151
+ raise RuntimeError(f'Audio is silent {video_id}')
152
+
153
+ # resample
154
+ if sample_rate == self.sample_rate:
155
+ audio_chunk = audio_chunk
156
+ else:
157
+ if sample_rate not in self.resampler:
158
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
159
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
160
+ sample_rate,
161
+ self.sample_rate,
162
+ lowpass_filter_width=64,
163
+ rolloff=0.9475937167399596,
164
+ resampling_method='sinc_interp_kaiser',
165
+ beta=14.769656459379492,
166
+ )
167
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
168
+
169
+ if audio_chunk.shape[0] < self.expected_audio_length:
170
+ raise RuntimeError(f'Audio too short {video_id}')
171
+ audio_chunk = audio_chunk[:self.expected_audio_length]
172
+ timbre_sample = audio_chunk[audio_chunk.shape[0]-self.num_timbre_sample:]
173
+
174
+ # truncate the video
175
+ clip_chunk = clip_chunk[:self.clip_expected_length]
176
+ if clip_chunk.shape[0] != self.clip_expected_length:
177
+ raise RuntimeError(f'CLIP video wrong length {video_id}, '
178
+ f'expected {self.clip_expected_length}, '
179
+ f'got {clip_chunk.shape[0]}')
180
+ clip_chunk = self.clip_transform(clip_chunk)
181
+
182
+ sync_chunk = sync_chunk[:self.sync_expected_length]
183
+ if sync_chunk.shape[0] != self.sync_expected_length:
184
+ raise RuntimeError(f'Sync video wrong length {video_id}, '
185
+ f'expected {self.sync_expected_length}, '
186
+ f'got {sync_chunk.shape[0]}')
187
+ sync_chunk = self.sync_transform(sync_chunk)
188
+
189
+ data = {
190
+ 'id': video_id,
191
+ 'caption': label,
192
+ 'audio': audio_chunk,
193
+ 'clip_video': clip_chunk,
194
+ 'sync_video': sync_chunk,
195
+ 'timbre_sample': timbre_sample
196
+ }
197
+
198
+ return data
199
+
200
+ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
201
+ try:
202
+ return self.sample(idx)
203
+ except Exception as e:
204
+ log.error(f'Error loading video {self.videos[idx]}: {e}')
205
+ return None
206
+
207
+ def __len__(self):
208
+ return len(self.labels)
hf_AC/mmaudio/data/extraction/wav_dataset.py ADDED
@@ -0,0 +1,135 @@
1
+ import logging
2
+ import os
3
+ from pathlib import Path
4
+ from typing import Union
5
+
6
+ import open_clip
7
+ import pandas as pd
8
+ import torch
9
+ import torchaudio
10
+ from torch.utils.data.dataset import Dataset
11
+ import random
12
+ log = logging.getLogger()
13
+
14
+
15
+ class WavTextClipsDataset(Dataset):
16
+
17
+ def __init__(
18
+ self,
19
+ root: Union[str, Path],
20
+ *,
21
+ captions_tsv: Union[str, Path],
22
+ clips_tsv: Union[str, Path],
23
+ sample_rate: int,
24
+ num_samples: int,
25
+ normalize_audio: bool = False,
26
+ reject_silent: bool = False,
27
+ tokenizer_id: str = 'ViT-H-14-378-quickgelu',
28
+ ):
29
+ self.root = Path(root)
30
+ self.sample_rate = sample_rate
31
+ self.num_samples = num_samples
32
+ self.normalize_audio = normalize_audio
33
+ self.reject_silent = reject_silent
34
+ self.tokenizer = open_clip.get_tokenizer(tokenizer_id)
35
+ self.num_timbre_sample = 89088 if sample_rate == 44100 else 32768
36
+
37
+ audios = sorted(os.listdir(self.root))
38
+ audios = set([
39
+ Path(audio).stem for audio in audios
40
+ if audio.endswith('.wav') or audio.endswith('.flac')
41
+ ])
42
+ self.captions = {}
43
+
44
+ # read the caption tsv
45
+ df_list = pd.read_csv(captions_tsv, sep='\t', dtype={'id': str}).to_dict('records')
46
+ for record in df_list:
47
+ id = record['id']
48
+ caption = record['caption']
49
+ self.captions[id] = caption
50
+
51
+ # read the clip tsv
52
+ df_list = pd.read_csv(clips_tsv, sep='\t', dtype={
53
+ 'id': str,
54
+ 'name': str
55
+ }).to_dict('records')
56
+ self.clips = []
57
+ for record in df_list:
58
+ record['id'] = record['id']
59
+ record['name'] = record['name']
60
+ id = record['id']
61
+ name = record['name']
62
+ if name not in self.captions:
63
+ log.warning(f'Audio {name} not found in {captions_tsv}')
64
+ continue
65
+ record['caption'] = self.captions[name]
66
+ self.clips.append(record)
67
+
68
+ log.info(f'Found {len(self.clips)} audio files in {self.root}')
69
+
70
+ self.resampler = {}
71
+
72
+ def __getitem__(self, idx: int) -> torch.Tensor:
73
+ try:
74
+ clip = self.clips[idx]
75
+ audio_name = clip['name']
76
+ audio_id = clip['id']
77
+ caption = clip['caption']
78
+ start_sample = clip['start_sample']
79
+ end_sample = clip['end_sample']
80
+
81
+ audio_path = self.root / f'{audio_name}.flac'
82
+ if not audio_path.exists():
83
+ audio_path = self.root / f'{audio_name}.wav'
84
+ assert audio_path.exists()
85
+
86
+ audio_chunk, sample_rate = torchaudio.load(audio_path)
87
+ audio_chunk = audio_chunk.mean(dim=0) # mono
88
+ abs_max = audio_chunk.abs().max()
89
+ if self.normalize_audio:
90
+ audio_chunk = audio_chunk / abs_max * 0.95
91
+
92
+ if self.reject_silent and abs_max < 1e-6:
93
+ log.warning(f'Rejecting silent audio')
94
+ return None
95
+
96
+ audio_chunk = audio_chunk[start_sample:end_sample]
97
+
98
+ # resample
99
+ if sample_rate == self.sample_rate:
100
+ audio_chunk = audio_chunk
101
+ else:
102
+ if sample_rate not in self.resampler:
103
+ # https://pytorch.org/audio/stable/tutorials/audio_resampling_tutorial.html#kaiser-best
104
+ self.resampler[sample_rate] = torchaudio.transforms.Resample(
105
+ sample_rate,
106
+ self.sample_rate,
107
+ lowpass_filter_width=64,
108
+ rolloff=0.9475937167399596,
109
+ resampling_method='sinc_interp_kaiser',
110
+ beta=14.769656459379492,
111
+ )
112
+ audio_chunk = self.resampler[sample_rate](audio_chunk)
113
+
114
+ if audio_chunk.shape[0] < self.num_samples:
115
+ raise ValueError('Audio is too short')
116
+ timbre_sample = audio_chunk[:self.num_timbre_sample]
117
+ audio_chunk = audio_chunk[audio_chunk.shape[0]-self.num_samples:]
118
+
119
+ tokens = self.tokenizer([caption])[0]
120
+
121
+ output = {
122
+ 'waveform': audio_chunk,
123
+ 'id': audio_id,
124
+ 'caption': caption,
125
+ 'tokens': tokens,
126
+ 'timbre_sample': timbre_sample,
127
+ }
128
+
129
+ return output
130
+ except Exception as e:
131
+ log.error(f'Error reading {audio_path}: {e}')
132
+ return None
133
+
134
+ def __len__(self):
135
+ return len(self.clips)
hf_AC/mmaudio/data/mm_dataset.py ADDED
@@ -0,0 +1,45 @@
1
+ import bisect
2
+
3
+ import torch
4
+ from torch.utils.data.dataset import Dataset
5
+
6
+
7
+ # modified from https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#ConcatDataset
8
+ class MultiModalDataset(Dataset):
9
+ datasets: list[Dataset]
10
+ cumulative_sizes: list[int]
11
+
12
+ @staticmethod
13
+ def cumsum(sequence):
14
+ r, s = [], 0
15
+ for e in sequence:
16
+ l = len(e)
17
+ r.append(l + s)
18
+ s += l
19
+ return r
20
+
21
+ def __init__(self, video_datasets: list[Dataset], audio_datasets: list[Dataset]):
22
+ super().__init__()
23
+ self.video_datasets = list(video_datasets)
24
+ self.audio_datasets = list(audio_datasets)
25
+ self.datasets = self.video_datasets + self.audio_datasets
26
+
27
+ self.cumulative_sizes = self.cumsum(self.datasets)
28
+
29
+ def __len__(self):
30
+ return self.cumulative_sizes[-1]
31
+
32
+ def __getitem__(self, idx):
33
+ if idx < 0:
34
+ if -idx > len(self):
35
+ raise ValueError("absolute value of index should not exceed dataset length")
36
+ idx = len(self) + idx
37
+ dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
38
+ if dataset_idx == 0:
39
+ sample_idx = idx
40
+ else:
41
+ sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
42
+ return self.datasets[dataset_idx][sample_idx]
43
+
44
+ def compute_latent_stats(self) -> tuple[torch.Tensor, torch.Tensor]:
45
+ return self.video_datasets[0].compute_latent_stats()
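`MultiModalDataset` concatenates the video and audio datasets and maps a global index back to a (dataset, sample) pair via the cumulative sizes; a worked example with assumed dataset lengths:

import bisect

# Suppose two video datasets of 100 and 50 samples and one audio dataset of 200.
cumulative_sizes = [100, 150, 350]  # as produced by MultiModalDataset.cumsum

idx = 160                                                   # global index
dataset_idx = bisect.bisect_right(cumulative_sizes, idx)    # -> 2 (the audio dataset)
sample_idx = idx - cumulative_sizes[dataset_idx - 1]        # -> 10
assert (dataset_idx, sample_idx) == (2, 10)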
hf_AC/mmaudio/data/utils.py ADDED
@@ -0,0 +1,148 @@
1
+ import logging
2
+ import os
3
+ import random
4
+ import tempfile
5
+ from pathlib import Path
6
+ from typing import Any, Optional, Union
7
+
8
+ import torch
9
+ import torch.distributed as dist
10
+ from tensordict import MemoryMappedTensor
11
+ from torch.utils.data import DataLoader
12
+ from torch.utils.data.dataset import Dataset
13
+ from tqdm import tqdm
14
+
15
+ from mmaudio.utils.dist_utils import local_rank, world_size
16
+
17
+ scratch_path = Path(os.environ['SLURM_SCRATCH'] if 'SLURM_SCRATCH' in os.environ else '/dev/shm')
18
+ shm_path = Path('/dev/shm')
19
+
20
+ log = logging.getLogger()
21
+
22
+
23
+ def reseed(seed):
24
+ random.seed(seed)
25
+ torch.manual_seed(seed)
26
+
27
+
28
+ def local_scatter_torch(obj: Optional[Any]):
29
+ if world_size == 1:
30
+ # Just one worker. Do nothing.
31
+ return obj
32
+
33
+ array = [obj] * world_size
34
+ target_array = [None]
35
+ if local_rank == 0:
36
+ dist.scatter_object_list(target_array, scatter_object_input_list=array, src=0)
37
+ else:
38
+ dist.scatter_object_list(target_array, scatter_object_input_list=None, src=0)
39
+ return target_array[0]
40
+
41
+
42
+ class ShardDataset(Dataset):
43
+
44
+ def __init__(self, root):
45
+ self.root = root
46
+ self.shards = sorted(os.listdir(root))
47
+
48
+ def __len__(self):
49
+ return len(self.shards)
50
+
51
+ def __getitem__(self, idx):
52
+ return torch.load(os.path.join(self.root, self.shards[idx]), weights_only=True)
53
+
54
+
55
+ def get_tmp_dir(in_memory: bool) -> Path:
56
+ return shm_path if in_memory else scratch_path
57
+
58
+
59
+ def load_shards_and_share(data_path: Union[str, Path], ids: list[int],
60
+ in_memory: bool) -> MemoryMappedTensor:
61
+ if local_rank == 0:
62
+ with tempfile.NamedTemporaryFile(prefix='shared-tensor-', dir=get_tmp_dir(in_memory)) as f:
63
+ log.info(f'Loading shards from {data_path} into {f.name}...')
64
+ data = load_shards(data_path, ids=ids, tmp_file_path=f.name)
65
+ data = share_tensor_to_all(data)
66
+ torch.distributed.barrier()
67
+ f.close() # why does the context manager not close the file for me?
68
+ else:
69
+ log.info('Waiting for the data to be shared with me...')
70
+ data = share_tensor_to_all(None)
71
+ torch.distributed.barrier()
72
+
73
+ return data
74
+
75
+
76
+ def load_shards(
77
+ data_path: Union[str, Path],
78
+ ids: list[int],
79
+ *,
80
+ tmp_file_path: str,
81
+ ) -> Union[torch.Tensor, dict[str, torch.Tensor]]:
82
+
83
+ id_set = set(ids)
84
+ shards = sorted(os.listdir(data_path))
85
+ log.info(f'Found {len(shards)} shards in {data_path}.')
86
+ first_shard = torch.load(os.path.join(data_path, shards[0]), weights_only=True)
87
+
88
+ log.info(f'Rank {local_rank} created file {tmp_file_path}')
89
+ first_item = next(iter(first_shard.values()))
90
+ log.info(f'First item shape: {first_item.shape}')
91
+ mm_tensor = MemoryMappedTensor.empty(shape=(len(ids), *first_item.shape),
92
+ dtype=torch.float32,
93
+ filename=tmp_file_path,
94
+ existsok=True)
95
+ total_count = 0
96
+ used_index = set()
97
+ id_indexing = {i: idx for idx, i in enumerate(ids)}
98
+ # faster with no workers; otherwise we need to set_sharing_strategy('file_system')
99
+ loader = DataLoader(ShardDataset(data_path), batch_size=1, num_workers=0)
100
+ for data in tqdm(loader, desc='Loading shards'):
101
+ for i, v in data.items():
102
+ if i not in id_set:
103
+ continue
104
+
105
+ # tensor_index = ids.index(i)
106
+ tensor_index = id_indexing[i]
107
+ if tensor_index in used_index:
108
+ raise ValueError(f'Duplicate id {i} found in {data_path}.')
109
+ used_index.add(tensor_index)
110
+ mm_tensor[tensor_index] = v
111
+ total_count += 1
112
+
113
+ assert total_count == len(ids), f'Expected {len(ids)} tensors, got {total_count}.'
114
+ log.info(f'Loaded {total_count} tensors from {data_path}.')
115
+
116
+ return mm_tensor
117
+
118
+
119
+ def share_tensor_to_all(x: Optional[MemoryMappedTensor]) -> MemoryMappedTensor:
120
+ """
121
+ x: the tensor to be shared; None if local_rank != 0
122
+ return: the shared tensor
123
+ """
124
+
125
+ # there is no need to share your stuff with anyone if you are alone; must be in memory
126
+ if world_size == 1:
127
+ return x
128
+
129
+ if local_rank == 0:
130
+ assert x is not None, 'x must not be None if local_rank == 0'
131
+ else:
132
+ assert x is None, 'x must be None if local_rank != 0'
133
+
134
+ if local_rank == 0:
135
+ filename = x.filename
136
+ meta_information = (filename, x.shape, x.dtype)
137
+ else:
138
+ meta_information = None
139
+
140
+ filename, data_shape, data_type = local_scatter_torch(meta_information)
141
+ if local_rank == 0:
142
+ data = x
143
+ else:
144
+ data = MemoryMappedTensor.from_filename(filename=filename,
145
+ dtype=data_type,
146
+ shape=data_shape)
147
+
148
+ return data
hf_AC/mmaudio/eval_utils.py ADDED
@@ -0,0 +1,249 @@
1
+ import dataclasses
2
+ import logging
3
+ from pathlib import Path
4
+ from typing import Optional
5
+
6
+ import numpy as np
7
+ import torch
8
+ from colorlog import ColoredFormatter
9
+ from PIL import Image
10
+ from torchvision.transforms import v2
11
+
12
+ from mmaudio.data.av_utils import ImageInfo, VideoInfo, read_frames, reencode_with_audio
13
+ from mmaudio.model.flow_matching import FlowMatching
14
+ from mmaudio.model.networks import MMAudio
15
+ from mmaudio.model.sequence_config import CONFIG_16K, CONFIG_44K, SequenceConfig
16
+ from mmaudio.model.utils.features_utils import FeaturesUtils
17
+ from mmaudio.utils.download_utils import download_model_if_needed
18
+
19
+ log = logging.getLogger()
20
+
21
+
22
+ @dataclasses.dataclass
23
+ class ModelConfig:
24
+ model_name: str
25
+ model_path: Path
26
+ vae_path: Path
27
+ bigvgan_16k_path: Optional[Path]
28
+ mode: str
29
+ synchformer_ckpt: Path = Path('./ext_weights/synchformer_state_dict.pth')
30
+
31
+ @property
32
+ def seq_cfg(self) -> SequenceConfig:
33
+ if self.mode == '16k':
34
+ return CONFIG_16K
35
+ elif self.mode == '44k':
36
+ return CONFIG_44K
37
+
38
+ def download_if_needed(self):
39
+ # download_model_if_needed(self.model_path)
40
+ download_model_if_needed(self.vae_path)
41
+ if self.bigvgan_16k_path is not None:
42
+ download_model_if_needed(self.bigvgan_16k_path)
43
+ download_model_if_needed(self.synchformer_ckpt)
44
+
45
+ large_44k = ModelConfig(model_name='large_44k',
46
+ model_path=Path('./weights/mmaudio_large_44k.pth'),
47
+ vae_path=Path('./ext_weights/v1-44.pth'),
48
+ bigvgan_16k_path=None,
49
+ mode='44k')
50
+
51
+ all_model_cfg: dict[str, ModelConfig] = {
52
+ 'large_44k': large_44k,
53
+ }
54
+
55
+
56
+ def generate(
57
+ clip_video: Optional[torch.Tensor],
58
+ sync_video: Optional[torch.Tensor],
59
+ text: Optional[list[str]],
60
+ audio: Optional[torch.Tensor],
61
+ *,
62
+ negative_text: Optional[list[str]] = None,
63
+ feature_utils: FeaturesUtils,
64
+ net: MMAudio,
65
+ fm: FlowMatching,
66
+ rng: torch.Generator,
67
+ cfg_strength: float,
68
+ clip_batch_size_multiplier: int = 40,
69
+ sync_batch_size_multiplier: int = 40,
70
+ image_input: bool = False,
71
+ ) -> torch.Tensor:
72
+ device = feature_utils.device
73
+ dtype = feature_utils.dtype
74
+
75
+ bs = len(text)
76
+ if clip_video is not None:
77
+ clip_video = clip_video.to(device, dtype, non_blocking=True)
78
+ clip_features = feature_utils.encode_video_with_clip(clip_video,
79
+ batch_size=bs *
80
+ clip_batch_size_multiplier)
81
+ if image_input:
82
+ clip_features = clip_features.expand(-1, net.clip_seq_len, -1)
83
+ else:
84
+ clip_features = net.get_empty_clip_sequence(bs)
85
+
86
+ if sync_video is not None and not image_input:
87
+ sync_video = sync_video.to(device, dtype, non_blocking=True)
88
+ sync_features = feature_utils.encode_video_with_sync(sync_video,
89
+ batch_size=bs *
90
+ sync_batch_size_multiplier)
91
+ else:
92
+ sync_features = net.get_empty_sync_sequence(bs)
93
+
94
+ if text is not None:
95
+ text_features = feature_utils.encode_text(text)
96
+ else:
97
+ text_features = net.get_empty_string_sequence(bs)
98
+
99
+ if negative_text is not None:
100
+ assert len(negative_text) == bs
101
+ negative_text_features = feature_utils.encode_text(negative_text)
102
+ else:
103
+ negative_text_features = net.get_empty_string_sequence(bs)
104
+
105
+ if audio is None:
106
+ audio_features = net.get_empty_audio_sequence(bs)
107
+ else:
108
+ if len(audio.shape) == 1:
109
+ audio = audio.to(device).unsqueeze(0)
110
+ audio = audio.repeat(bs, 1)
111
+ else:
112
+ audio = audio.to(device)
113
+ feature_utils_audio = feature_utils.to(device, torch.float32).eval()
114
+ dist = feature_utils_audio.encode_audio(audio)
115
+ audio_mean = dist.mean.detach().to(device).transpose(1, 2)
116
+ audio_std = dist.std.detach().to(device).transpose(1, 2)
117
+ randn = torch.empty_like(audio_mean).normal_(generator=rng)
118
+ audio_features = audio_mean + audio_std * randn
119
+ audio_features = audio_features.to(device, dtype, non_blocking=True)
120
+ feature_utils = feature_utils.to(device, dtype).eval()
121
+
122
+ x0 = torch.randn(bs,
123
+ net.latent_seq_len,
124
+ net.latent_dim,
125
+ device=device,
126
+ dtype=dtype,
127
+ generator=rng)
128
+ preprocessed_conditions = net.preprocess_conditions(clip_features, sync_features, text_features, audio_features)
129
+ empty_conditions = net.get_empty_conditions(
130
+ bs, negative_text_features=negative_text_features if negative_text is not None else None)
131
+
132
+ cfg_ode_wrapper = lambda t, x: net.ode_wrapper(t, x, preprocessed_conditions, empty_conditions,
133
+ cfg_strength)
134
+ x1 = fm.to_data(cfg_ode_wrapper, x0)
135
+ x1 = net.unnormalize(x1)
136
+ spec = feature_utils.decode(x1)
137
+ audio = feature_utils.vocode(spec)
138
+ return audio
139
+
140
+
141
+ LOGFORMAT = "[%(log_color)s%(levelname)-8s%(reset)s]: %(log_color)s%(message)s%(reset)s"
142
+
143
+
144
+ def setup_eval_logging(log_level: int = logging.INFO):
145
+ logging.root.setLevel(log_level)
146
+ formatter = ColoredFormatter(LOGFORMAT)
147
+ stream = logging.StreamHandler()
148
+ stream.setLevel(log_level)
149
+ stream.setFormatter(formatter)
150
+ log = logging.getLogger()
151
+ log.setLevel(log_level)
152
+ log.addHandler(stream)
153
+
154
+
155
+ _CLIP_SIZE = 384
156
+ _CLIP_FPS = 8.0
157
+
158
+ _SYNC_SIZE = 224
159
+ _SYNC_FPS = 25.0
160
+
161
+
162
+ def load_video(video_path: Path, duration_sec: float, load_all_frames: bool = True) -> VideoInfo:
163
+
164
+ clip_transform = v2.Compose([
165
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
166
+ v2.ToImage(),
167
+ v2.ToDtype(torch.float32, scale=True),
168
+ ])
169
+
170
+ sync_transform = v2.Compose([
171
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
172
+ v2.CenterCrop(_SYNC_SIZE),
173
+ v2.ToImage(),
174
+ v2.ToDtype(torch.float32, scale=True),
175
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
176
+ ])
177
+
178
+ output_frames, all_frames, orig_fps = read_frames(video_path,
179
+ list_of_fps=[_CLIP_FPS, _SYNC_FPS],
180
+ start_sec=0,
181
+ end_sec=duration_sec,
182
+ need_all_frames=load_all_frames)
183
+
184
+ clip_chunk, sync_chunk = output_frames
185
+ clip_chunk = torch.from_numpy(clip_chunk).permute(0, 3, 1, 2)
186
+ sync_chunk = torch.from_numpy(sync_chunk).permute(0, 3, 1, 2)
187
+
188
+ clip_frames = clip_transform(clip_chunk)
189
+ sync_frames = sync_transform(sync_chunk)
190
+
191
+ clip_length_sec = clip_frames.shape[0] / _CLIP_FPS
192
+ sync_length_sec = sync_frames.shape[0] / _SYNC_FPS
193
+
194
+ if clip_length_sec < duration_sec:
195
+ log.warning(f'Clip video is too short: {clip_length_sec:.2f} < {duration_sec:.2f}')
196
+ log.warning(f'Truncating to {clip_length_sec:.2f} sec')
197
+ duration_sec = clip_length_sec
198
+
199
+ if sync_length_sec < duration_sec:
200
+ log.warning(f'Sync video is too short: {sync_length_sec:.2f} < {duration_sec:.2f}')
201
+ log.warning(f'Truncating to {sync_length_sec:.2f} sec')
202
+ duration_sec = sync_length_sec
203
+
204
+ clip_frames = clip_frames[:int(_CLIP_FPS * duration_sec)]
205
+ sync_frames = sync_frames[:int(_SYNC_FPS * duration_sec)]
206
+
207
+ video_info = VideoInfo(
208
+ duration_sec=duration_sec,
209
+ fps=orig_fps,
210
+ clip_frames=clip_frames,
211
+ sync_frames=sync_frames,
212
+ all_frames=all_frames if load_all_frames else None,
213
+ )
214
+ return video_info
215
+
216
+
217
+ def load_image(image_path: Path) -> VideoInfo:
218
+ clip_transform = v2.Compose([
219
+ v2.Resize((_CLIP_SIZE, _CLIP_SIZE), interpolation=v2.InterpolationMode.BICUBIC),
220
+ v2.ToImage(),
221
+ v2.ToDtype(torch.float32, scale=True),
222
+ ])
223
+
224
+ sync_transform = v2.Compose([
225
+ v2.Resize(_SYNC_SIZE, interpolation=v2.InterpolationMode.BICUBIC),
226
+ v2.CenterCrop(_SYNC_SIZE),
227
+ v2.ToImage(),
228
+ v2.ToDtype(torch.float32, scale=True),
229
+ v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
230
+ ])
231
+
232
+ frame = np.array(Image.open(image_path))
233
+
234
+ clip_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
235
+ sync_chunk = torch.from_numpy(frame).unsqueeze(0).permute(0, 3, 1, 2)
236
+
237
+ clip_frames = clip_transform(clip_chunk)
238
+ sync_frames = sync_transform(sync_chunk)
239
+
240
+ video_info = ImageInfo(
241
+ clip_frames=clip_frames,
242
+ sync_frames=sync_frames,
243
+ original_frame=frame,
244
+ )
245
+ return video_info
246
+
247
+
248
+ def make_video(video_info: VideoInfo, output_path: Path, audio: torch.Tensor, sampling_rate: int):
249
+ reencode_with_audio(video_info, output_path, audio, sampling_rate)
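Editor's note: a minimal sketch of how the helpers above chain together for video-to-audio inference. It assumes net (MMAudio), feature_utils (FeaturesUtils) and fm (FlowMatching) have already been constructed elsewhere in the repository from a ModelConfig such as all_model_cfg['large_44k']; the file paths, prompt, seed, cfg_strength value, 8-second duration and 44100 Hz sampling rate are illustrative assumptions, not values taken from this diff.

    # Hedged sketch only: load a clip, synthesize audio, and mux it back into a video.
    from pathlib import Path

    import torch

    duration_sec = 8.0  # must match the sequence length the model was configured for
    video_info = load_video(Path('input.mp4'), duration_sec)
    clip_frames = video_info.clip_frames.unsqueeze(0)   # add batch dim: (1, T_clip, 3, 384, 384)
    sync_frames = video_info.sync_frames.unsqueeze(0)   # (1, T_sync, 3, 224, 224)

    rng = torch.Generator(device=feature_utils.device).manual_seed(42)
    audios = generate(clip_frames,
                      sync_frames, ['rain hitting a tin roof'],
                      audio=None,
                      negative_text=['music'],
                      feature_utils=feature_utils,
                      net=net,
                      fm=fm,
                      rng=rng,
                      cfg_strength=4.5)

    # Take the first (and only) item in the batch and write the final video.
    make_video(video_info, Path('output.mp4'), audios.float().cpu()[0], sampling_rate=44100)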
hf_AC/mmaudio/ext/__init__.py ADDED
@@ -0,0 +1 @@
1
+
hf_AC/mmaudio/ext/autoencoder/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .autoencoder import AutoEncoderModule
hf_AC/mmaudio/ext/autoencoder/autoencoder.py ADDED
@@ -0,0 +1,52 @@
1
+ from typing import Literal, Optional
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+
6
+ from mmaudio.ext.autoencoder.vae import VAE, get_my_vae
7
+ from mmaudio.ext.bigvgan import BigVGAN
8
+ from mmaudio.ext.bigvgan_v2.bigvgan import BigVGAN as BigVGANv2
9
+ from mmaudio.model.utils.distributions import DiagonalGaussianDistribution
10
+
11
+
12
+ class AutoEncoderModule(nn.Module):
13
+
14
+ def __init__(self,
15
+ *,
16
+ vae_ckpt_path,
17
+ vocoder_ckpt_path: Optional[str] = None,
18
+ mode: Literal['16k', '44k'],
19
+ need_vae_encoder: bool = True):
20
+ super().__init__()
21
+ self.vae: VAE = get_my_vae(mode).eval()
22
+ vae_state_dict = torch.load(vae_ckpt_path, weights_only=True, map_location='cpu')
23
+ self.vae.load_state_dict(vae_state_dict)
24
+ self.vae.remove_weight_norm()
25
+
26
+ if mode == '16k':
27
+ assert vocoder_ckpt_path is not None
28
+ self.vocoder = BigVGAN(vocoder_ckpt_path).eval()
29
+ elif mode == '44k':
30
+ self.vocoder = BigVGANv2.from_pretrained('nvidia/bigvgan_v2_44khz_128band_512x',
31
+ use_cuda_kernel=False)
32
+ self.vocoder.remove_weight_norm()
33
+ else:
34
+ raise ValueError(f'Unknown mode: {mode}')
35
+
36
+ for param in self.parameters():
37
+ param.requires_grad = False
38
+
39
+ if not need_vae_encoder:
40
+ del self.vae.encoder
41
+
42
+ @torch.inference_mode()
43
+ def encode(self, x: torch.Tensor) -> DiagonalGaussianDistribution:
44
+ return self.vae.encode(x)
45
+
46
+ @torch.inference_mode()
47
+ def decode(self, z: torch.Tensor) -> torch.Tensor:
48
+ return self.vae.decode(z)
49
+
50
+ @torch.inference_mode()
51
+ def vocode(self, spec: torch.Tensor) -> torch.Tensor:
52
+ return self.vocoder(spec)
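Editor's note: a small usage sketch for AutoEncoderModule as defined above. The checkpoint path is a placeholder (no weights ship with this commit); in '44k' mode the vocoder is fetched from the nvidia/bigvgan_v2_44khz_128band_512x hub repo, so no vocoder_ckpt_path is needed, while '16k' mode would require one.

    # Hedged sketch: mel -> latent -> mel -> waveform round trip.
    import torch

    from mmaudio.ext.autoencoder import AutoEncoderModule

    ae = AutoEncoderModule(vae_ckpt_path='weights/v1-44.pth',  # placeholder path
                           mode='44k')

    mel = torch.randn(1, 128, 256)   # (B, n_mels, T); the 44k VAE expects 128 mel bands
    dist = ae.encode(mel)            # DiagonalGaussianDistribution over the latent
    z = dist.mean                    # deterministic latent; dist.std is available for sampling
    mel_rec = ae.decode(z)           # back to (1, 128, 256)
    wav = ae.vocode(mel_rec)         # BigVGAN-v2 vocoder output (waveform tensor)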
hf_AC/mmaudio/ext/autoencoder/edm2_utils.py ADDED
@@ -0,0 +1,168 @@
1
+ # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2
+ #
3
+ # This work is licensed under a Creative Commons
4
+ # Attribution-NonCommercial-ShareAlike 4.0 International License.
5
+ # You should have received a copy of the license along with this
6
+ # work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/
7
+ """Improved diffusion model architecture proposed in the paper
8
+ "Analyzing and Improving the Training Dynamics of Diffusion Models"."""
9
+
10
+ import numpy as np
11
+ import torch
12
+
13
+ #----------------------------------------------------------------------------
14
+ # Variant of constant() that inherits dtype and device from the given
15
+ # reference tensor by default.
16
+
17
+ _constant_cache = dict()
18
+
19
+
20
+ def constant(value, shape=None, dtype=None, device=None, memory_format=None):
21
+ value = np.asarray(value)
22
+ if shape is not None:
23
+ shape = tuple(shape)
24
+ if dtype is None:
25
+ dtype = torch.get_default_dtype()
26
+ if device is None:
27
+ device = torch.device('cpu')
28
+ if memory_format is None:
29
+ memory_format = torch.contiguous_format
30
+
31
+ key = (value.shape, value.dtype, value.tobytes(), shape, dtype, device, memory_format)
32
+ tensor = _constant_cache.get(key, None)
33
+ if tensor is None:
34
+ tensor = torch.as_tensor(value.copy(), dtype=dtype, device=device)
35
+ if shape is not None:
36
+ tensor, _ = torch.broadcast_tensors(tensor, torch.empty(shape))
37
+ tensor = tensor.contiguous(memory_format=memory_format)
38
+ _constant_cache[key] = tensor
39
+ return tensor
40
+
41
+
42
+ def const_like(ref, value, shape=None, dtype=None, device=None, memory_format=None):
43
+ if dtype is None:
44
+ dtype = ref.dtype
45
+ if device is None:
46
+ device = ref.device
47
+ return constant(value, shape=shape, dtype=dtype, device=device, memory_format=memory_format)
48
+
49
+
50
+ #----------------------------------------------------------------------------
51
+ # Normalize given tensor to unit magnitude with respect to the given
52
+ # dimensions. Default = all dimensions except the first.
53
+
54
+
55
+ def normalize(x, dim=None, eps=1e-4):
56
+ if dim is None:
57
+ dim = list(range(1, x.ndim))
58
+ norm = torch.linalg.vector_norm(x, dim=dim, keepdim=True, dtype=torch.float32)
59
+ norm = torch.add(eps, norm, alpha=np.sqrt(norm.numel() / x.numel()))
60
+ return x / norm.to(x.dtype)
61
+
62
+
63
+ class Normalize(torch.nn.Module):
64
+
65
+ def __init__(self, dim=None, eps=1e-4):
66
+ super().__init__()
67
+ self.dim = dim
68
+ self.eps = eps
69
+
70
+ def forward(self, x):
71
+ return normalize(x, dim=self.dim, eps=self.eps)
72
+
73
+
74
+ #----------------------------------------------------------------------------
75
+ # Upsample or downsample the given tensor with the given filter,
76
+ # or keep it as is.
77
+
78
+
79
+ def resample(x, f=[1, 1], mode='keep'):
80
+ if mode == 'keep':
81
+ return x
82
+ f = np.float32(f)
83
+ assert f.ndim == 1 and len(f) % 2 == 0
84
+ pad = (len(f) - 1) // 2
85
+ f = f / f.sum()
86
+ f = np.outer(f, f)[np.newaxis, np.newaxis, :, :]
87
+ f = const_like(x, f)
88
+ c = x.shape[1]
89
+ if mode == 'down':
90
+ return torch.nn.functional.conv2d(x,
91
+ f.tile([c, 1, 1, 1]),
92
+ groups=c,
93
+ stride=2,
94
+ padding=(pad, ))
95
+ assert mode == 'up'
96
+ return torch.nn.functional.conv_transpose2d(x, (f * 4).tile([c, 1, 1, 1]),
97
+ groups=c,
98
+ stride=2,
99
+ padding=(pad, ))
100
+
101
+
102
+ #----------------------------------------------------------------------------
103
+ # Magnitude-preserving SiLU (Equation 81).
104
+
105
+
106
+ def mp_silu(x):
107
+ return torch.nn.functional.silu(x) / 0.596
108
+
109
+
110
+ class MPSiLU(torch.nn.Module):
111
+
112
+ def forward(self, x):
113
+ return mp_silu(x)
114
+
115
+
116
+ #----------------------------------------------------------------------------
117
+ # Magnitude-preserving sum (Equation 88).
118
+
119
+
120
+ def mp_sum(a, b, t=0.5):
121
+ return a.lerp(b, t) / np.sqrt((1 - t)**2 + t**2)
122
+
123
+
124
+ #----------------------------------------------------------------------------
125
+ # Magnitude-preserving concatenation (Equation 103).
126
+
127
+
128
+ def mp_cat(a, b, dim=1, t=0.5):
129
+ Na = a.shape[dim]
130
+ Nb = b.shape[dim]
131
+ C = np.sqrt((Na + Nb) / ((1 - t)**2 + t**2))
132
+ wa = C / np.sqrt(Na) * (1 - t)
133
+ wb = C / np.sqrt(Nb) * t
134
+ return torch.cat([wa * a, wb * b], dim=dim)
135
+
136
+
137
+ #----------------------------------------------------------------------------
138
+ # Magnitude-preserving convolution or fully-connected layer (Equation 47)
139
+ # with forced weight normalization (Equation 66).
140
+
141
+
142
+ class MPConv1D(torch.nn.Module):
143
+
144
+ def __init__(self, in_channels, out_channels, kernel_size):
145
+ super().__init__()
146
+ self.out_channels = out_channels
147
+ self.weight = torch.nn.Parameter(torch.randn(out_channels, in_channels, kernel_size))
148
+
149
+ self.weight_norm_removed = False
150
+
151
+ def forward(self, x, gain=1):
152
+ assert self.weight_norm_removed, 'call remove_weight_norm() before inference'
153
+
154
+ w = self.weight * gain
155
+ if w.ndim == 2:
156
+ return x @ w.t()
157
+ assert w.ndim == 3
158
+ return torch.nn.functional.conv1d(x, w, padding=(w.shape[-1] // 2, ))
159
+
160
+ def remove_weight_norm(self):
161
+ w = self.weight.to(torch.float32)
162
+ w = normalize(w) # traditional weight normalization
163
+ w = w / np.sqrt(w[0].numel())
164
+ w = w.to(self.weight.dtype)
165
+ self.weight.data.copy_(w)
166
+
167
+ self.weight_norm_removed = True
168
+ return self
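Editor's note: the magnitude-preserving layers above refuse to run until remove_weight_norm() has been called (the forward pass asserts on it), which is why AutoEncoderModule calls vae.remove_weight_norm() right after loading weights. A minimal sketch with made-up sizes:

    # Hedged sketch: MPConv1D and the magnitude-preserving helpers.
    import torch

    from mmaudio.ext.autoencoder.edm2_utils import MPConv1D, mp_sum, normalize

    conv = MPConv1D(in_channels=64, out_channels=64, kernel_size=3)
    conv.remove_weight_norm()        # bake the normalization into the weights first

    x = torch.randn(2, 64, 100)      # (B, C, T)
    y = conv(x)                      # same shape; 'same' padding of kernel_size // 2

    z = mp_sum(x, y, t=0.3)          # blend while keeping unit magnitude (Equation 88)
    print(normalize(z).std())        # roughly 1 after pixel norm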
hf_AC/mmaudio/ext/autoencoder/vae.py ADDED
@@ -0,0 +1,369 @@
1
+ import logging
2
+ from typing import Optional
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+
7
+ from mmaudio.ext.autoencoder.edm2_utils import MPConv1D
8
+ from mmaudio.ext.autoencoder.vae_modules import (AttnBlock1D, Downsample1D, ResnetBlock1D,
9
+ Upsample1D, nonlinearity)
10
+ from mmaudio.model.utils.distributions import DiagonalGaussianDistribution
11
+
12
+ log = logging.getLogger()
13
+
14
+ DATA_MEAN_80D = [
15
+ -1.6058, -1.3676, -1.2520, -1.2453, -1.2078, -1.2224, -1.2419, -1.2439, -1.2922, -1.2927,
16
+ -1.3170, -1.3543, -1.3401, -1.3836, -1.3907, -1.3912, -1.4313, -1.4152, -1.4527, -1.4728,
17
+ -1.4568, -1.5101, -1.5051, -1.5172, -1.5623, -1.5373, -1.5746, -1.5687, -1.6032, -1.6131,
18
+ -1.6081, -1.6331, -1.6489, -1.6489, -1.6700, -1.6738, -1.6953, -1.6969, -1.7048, -1.7280,
19
+ -1.7361, -1.7495, -1.7658, -1.7814, -1.7889, -1.8064, -1.8221, -1.8377, -1.8417, -1.8643,
20
+ -1.8857, -1.8929, -1.9173, -1.9379, -1.9531, -1.9673, -1.9824, -2.0042, -2.0215, -2.0436,
21
+ -2.0766, -2.1064, -2.1418, -2.1855, -2.2319, -2.2767, -2.3161, -2.3572, -2.3954, -2.4282,
22
+ -2.4659, -2.5072, -2.5552, -2.6074, -2.6584, -2.7107, -2.7634, -2.8266, -2.8981, -2.9673
23
+ ]
24
+
25
+ DATA_STD_80D = [
26
+ 1.0291, 1.0411, 1.0043, 0.9820, 0.9677, 0.9543, 0.9450, 0.9392, 0.9343, 0.9297, 0.9276, 0.9263,
27
+ 0.9242, 0.9254, 0.9232, 0.9281, 0.9263, 0.9315, 0.9274, 0.9247, 0.9277, 0.9199, 0.9188, 0.9194,
28
+ 0.9160, 0.9161, 0.9146, 0.9161, 0.9100, 0.9095, 0.9145, 0.9076, 0.9066, 0.9095, 0.9032, 0.9043,
29
+ 0.9038, 0.9011, 0.9019, 0.9010, 0.8984, 0.8983, 0.8986, 0.8961, 0.8962, 0.8978, 0.8962, 0.8973,
30
+ 0.8993, 0.8976, 0.8995, 0.9016, 0.8982, 0.8972, 0.8974, 0.8949, 0.8940, 0.8947, 0.8936, 0.8939,
31
+ 0.8951, 0.8956, 0.9017, 0.9167, 0.9436, 0.9690, 1.0003, 1.0225, 1.0381, 1.0491, 1.0545, 1.0604,
32
+ 1.0761, 1.0929, 1.1089, 1.1196, 1.1176, 1.1156, 1.1117, 1.1070
33
+ ]
34
+
35
+ DATA_MEAN_128D = [
36
+ -3.3462, -2.6723, -2.4893, -2.3143, -2.2664, -2.3317, -2.1802, -2.4006, -2.2357, -2.4597,
37
+ -2.3717, -2.4690, -2.5142, -2.4919, -2.6610, -2.5047, -2.7483, -2.5926, -2.7462, -2.7033,
38
+ -2.7386, -2.8112, -2.7502, -2.9594, -2.7473, -3.0035, -2.8891, -2.9922, -2.9856, -3.0157,
39
+ -3.1191, -2.9893, -3.1718, -3.0745, -3.1879, -3.2310, -3.1424, -3.2296, -3.2791, -3.2782,
40
+ -3.2756, -3.3134, -3.3509, -3.3750, -3.3951, -3.3698, -3.4505, -3.4509, -3.5089, -3.4647,
41
+ -3.5536, -3.5788, -3.5867, -3.6036, -3.6400, -3.6747, -3.7072, -3.7279, -3.7283, -3.7795,
42
+ -3.8259, -3.8447, -3.8663, -3.9182, -3.9605, -3.9861, -4.0105, -4.0373, -4.0762, -4.1121,
43
+ -4.1488, -4.1874, -4.2461, -4.3170, -4.3639, -4.4452, -4.5282, -4.6297, -4.7019, -4.7960,
44
+ -4.8700, -4.9507, -5.0303, -5.0866, -5.1634, -5.2342, -5.3242, -5.4053, -5.4927, -5.5712,
45
+ -5.6464, -5.7052, -5.7619, -5.8410, -5.9188, -6.0103, -6.0955, -6.1673, -6.2362, -6.3120,
46
+ -6.3926, -6.4797, -6.5565, -6.6511, -6.8130, -6.9961, -7.1275, -7.2457, -7.3576, -7.4663,
47
+ -7.6136, -7.7469, -7.8815, -8.0132, -8.1515, -8.3071, -8.4722, -8.7418, -9.3975, -9.6628,
48
+ -9.7671, -9.8863, -9.9992, -10.0860, -10.1709, -10.5418, -11.2795, -11.3861
49
+ ]
50
+
51
+ DATA_STD_128D = [
52
+ 2.3804, 2.4368, 2.3772, 2.3145, 2.2803, 2.2510, 2.2316, 2.2083, 2.1996, 2.1835, 2.1769, 2.1659,
53
+ 2.1631, 2.1618, 2.1540, 2.1606, 2.1571, 2.1567, 2.1612, 2.1579, 2.1679, 2.1683, 2.1634, 2.1557,
54
+ 2.1668, 2.1518, 2.1415, 2.1449, 2.1406, 2.1350, 2.1313, 2.1415, 2.1281, 2.1352, 2.1219, 2.1182,
55
+ 2.1327, 2.1195, 2.1137, 2.1080, 2.1179, 2.1036, 2.1087, 2.1036, 2.1015, 2.1068, 2.0975, 2.0991,
56
+ 2.0902, 2.1015, 2.0857, 2.0920, 2.0893, 2.0897, 2.0910, 2.0881, 2.0925, 2.0873, 2.0960, 2.0900,
57
+ 2.0957, 2.0958, 2.0978, 2.0936, 2.0886, 2.0905, 2.0845, 2.0855, 2.0796, 2.0840, 2.0813, 2.0817,
58
+ 2.0838, 2.0840, 2.0917, 2.1061, 2.1431, 2.1976, 2.2482, 2.3055, 2.3700, 2.4088, 2.4372, 2.4609,
59
+ 2.4731, 2.4847, 2.5072, 2.5451, 2.5772, 2.6147, 2.6529, 2.6596, 2.6645, 2.6726, 2.6803, 2.6812,
60
+ 2.6899, 2.6916, 2.6931, 2.6998, 2.7062, 2.7262, 2.7222, 2.7158, 2.7041, 2.7485, 2.7491, 2.7451,
61
+ 2.7485, 2.7233, 2.7297, 2.7233, 2.7145, 2.6958, 2.6788, 2.6439, 2.6007, 2.4786, 2.2469, 2.1877,
62
+ 2.1392, 2.0717, 2.0107, 1.9676, 1.9140, 1.7102, 0.9101, 0.7164
63
+ ]
64
+
65
+
66
+ class VAE(nn.Module):
67
+
68
+ def __init__(
69
+ self,
70
+ *,
71
+ data_dim: int,
72
+ embed_dim: int,
73
+ hidden_dim: int,
74
+ ):
75
+ super().__init__()
76
+
77
+ if data_dim == 80:
78
+ self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_80D, dtype=torch.float32))
79
+ self.data_std = nn.Buffer(torch.tensor(DATA_STD_80D, dtype=torch.float32))
80
+ elif data_dim == 128:
81
+ self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_128D, dtype=torch.float32))
82
+ self.data_std = nn.Buffer(torch.tensor(DATA_STD_128D, dtype=torch.float32))
83
+
84
+ self.data_mean = self.data_mean.view(1, -1, 1)
85
+ self.data_std = self.data_std.view(1, -1, 1)
86
+
87
+ self.encoder = Encoder1D(
88
+ dim=hidden_dim,
89
+ ch_mult=(1, 2, 4),
90
+ num_res_blocks=2,
91
+ attn_layers=[3],
92
+ down_layers=[0],
93
+ in_dim=data_dim,
94
+ embed_dim=embed_dim,
95
+ )
96
+ self.decoder = Decoder1D(
97
+ dim=hidden_dim,
98
+ ch_mult=(1, 2, 4),
99
+ num_res_blocks=2,
100
+ attn_layers=[3],
101
+ down_layers=[0],
102
+ in_dim=data_dim,
103
+ out_dim=data_dim,
104
+ embed_dim=embed_dim,
105
+ )
106
+
107
+ self.embed_dim = embed_dim
108
+ # self.quant_conv = nn.Conv1d(2 * embed_dim, 2 * embed_dim, 1)
109
+ # self.post_quant_conv = nn.Conv1d(embed_dim, embed_dim, 1)
110
+
111
+ self.initialize_weights()
112
+
113
+ def initialize_weights(self):
114
+ pass
115
+
116
+ def encode(self, x: torch.Tensor, normalize: bool = True) -> DiagonalGaussianDistribution:
117
+ if normalize:
118
+ x = self.normalize(x)
119
+ moments = self.encoder(x)
120
+ posterior = DiagonalGaussianDistribution(moments)
121
+ return posterior
122
+
123
+ def decode(self, z: torch.Tensor, unnormalize: bool = True) -> torch.Tensor:
124
+ dec = self.decoder(z)
125
+ if unnormalize:
126
+ dec = self.unnormalize(dec)
127
+ return dec
128
+
129
+ def normalize(self, x: torch.Tensor) -> torch.Tensor:
130
+ return (x - self.data_mean) / self.data_std
131
+
132
+ def unnormalize(self, x: torch.Tensor) -> torch.Tensor:
133
+ return x * self.data_std + self.data_mean
134
+
135
+ def forward(
136
+ self,
137
+ x: torch.Tensor,
138
+ sample_posterior: bool = True,
139
+ rng: Optional[torch.Generator] = None,
140
+ normalize: bool = True,
141
+ unnormalize: bool = True,
142
+ ) -> tuple[torch.Tensor, DiagonalGaussianDistribution]:
143
+
144
+ posterior = self.encode(x, normalize=normalize)
145
+ if sample_posterior:
146
+ z = posterior.sample(rng)
147
+ else:
148
+ z = posterior.mode()
149
+ dec = self.decode(z, unnormalize=unnormalize)
150
+ return dec, posterior
151
+
152
+ def load_weights(self, src_dict) -> None:
153
+ self.load_state_dict(src_dict, strict=True)
154
+
155
+ @property
156
+ def device(self) -> torch.device:
157
+ return next(self.parameters()).device
158
+
159
+ def get_last_layer(self):
160
+ return self.decoder.conv_out.weight
161
+
162
+ def remove_weight_norm(self):
163
+ for name, m in self.named_modules():
164
+ if isinstance(m, MPConv1D):
165
+ m.remove_weight_norm()
166
+ log.debug(f"Removed weight norm from {name}")
167
+ return self
168
+
169
+
170
+ class Encoder1D(nn.Module):
171
+
172
+ def __init__(self,
173
+ *,
174
+ dim: int,
175
+ ch_mult: tuple[int] = (1, 2, 4, 8),
176
+ num_res_blocks: int,
177
+ attn_layers: list[int] = [],
178
+ down_layers: list[int] = [],
179
+ resamp_with_conv: bool = True,
180
+ in_dim: int,
181
+ embed_dim: int,
182
+ double_z: bool = True,
183
+ kernel_size: int = 3,
184
+ clip_act: float = 256.0):
185
+ super().__init__()
186
+ self.dim = dim
187
+ self.num_layers = len(ch_mult)
188
+ self.num_res_blocks = num_res_blocks
189
+ self.in_channels = in_dim
190
+ self.clip_act = clip_act
191
+ self.down_layers = down_layers
192
+ self.attn_layers = attn_layers
193
+ self.conv_in = MPConv1D(in_dim, self.dim, kernel_size=kernel_size)
194
+
195
+ in_ch_mult = (1, ) + tuple(ch_mult)
196
+ self.in_ch_mult = in_ch_mult
197
+ # downsampling
198
+ self.down = nn.ModuleList()
199
+ for i_level in range(self.num_layers):
200
+ block = nn.ModuleList()
201
+ attn = nn.ModuleList()
202
+ block_in = dim * in_ch_mult[i_level]
203
+ block_out = dim * ch_mult[i_level]
204
+ for i_block in range(self.num_res_blocks):
205
+ block.append(
206
+ ResnetBlock1D(in_dim=block_in,
207
+ out_dim=block_out,
208
+ kernel_size=kernel_size,
209
+ use_norm=True))
210
+ block_in = block_out
211
+ if i_level in attn_layers:
212
+ attn.append(AttnBlock1D(block_in))
213
+ down = nn.Module()
214
+ down.block = block
215
+ down.attn = attn
216
+ if i_level in down_layers:
217
+ down.downsample = Downsample1D(block_in, resamp_with_conv)
218
+ self.down.append(down)
219
+
220
+ # middle
221
+ self.mid = nn.Module()
222
+ self.mid.block_1 = ResnetBlock1D(in_dim=block_in,
223
+ out_dim=block_in,
224
+ kernel_size=kernel_size,
225
+ use_norm=True)
226
+ self.mid.attn_1 = AttnBlock1D(block_in)
227
+ self.mid.block_2 = ResnetBlock1D(in_dim=block_in,
228
+ out_dim=block_in,
229
+ kernel_size=kernel_size,
230
+ use_norm=True)
231
+
232
+ # end
233
+ self.conv_out = MPConv1D(block_in,
234
+ 2 * embed_dim if double_z else embed_dim,
235
+ kernel_size=kernel_size)
236
+
237
+ self.learnable_gain = nn.Parameter(torch.zeros([]))
238
+
239
+ def forward(self, x):
240
+
241
+ # downsampling
242
+ hs = [self.conv_in(x)]
243
+ for i_level in range(self.num_layers):
244
+ for i_block in range(self.num_res_blocks):
245
+ h = self.down[i_level].block[i_block](hs[-1])
246
+ if len(self.down[i_level].attn) > 0:
247
+ h = self.down[i_level].attn[i_block](h)
248
+ h = h.clamp(-self.clip_act, self.clip_act)
249
+ hs.append(h)
250
+ if i_level in self.down_layers:
251
+ hs.append(self.down[i_level].downsample(hs[-1]))
252
+
253
+ # middle
254
+ h = hs[-1]
255
+ h = self.mid.block_1(h)
256
+ h = self.mid.attn_1(h)
257
+ h = self.mid.block_2(h)
258
+ h = h.clamp(-self.clip_act, self.clip_act)
259
+
260
+ # end
261
+ h = nonlinearity(h)
262
+ h = self.conv_out(h, gain=(self.learnable_gain + 1))
263
+ return h
264
+
265
+
266
+ class Decoder1D(nn.Module):
267
+
268
+ def __init__(self,
269
+ *,
270
+ dim: int,
271
+ out_dim: int,
272
+ ch_mult: tuple[int] = (1, 2, 4, 8),
273
+ num_res_blocks: int,
274
+ attn_layers: list[int] = [],
275
+ down_layers: list[int] = [],
276
+ kernel_size: int = 3,
277
+ resamp_with_conv: bool = True,
278
+ in_dim: int,
279
+ embed_dim: int,
280
+ clip_act: float = 256.0):
281
+ super().__init__()
282
+ self.ch = dim
283
+ self.num_layers = len(ch_mult)
284
+ self.num_res_blocks = num_res_blocks
285
+ self.in_channels = in_dim
286
+ self.clip_act = clip_act
287
+ self.down_layers = [i + 1 for i in down_layers] # each downlayer add one
288
+
289
+ # compute in_ch_mult, block_in and curr_res at lowest res
290
+ block_in = dim * ch_mult[self.num_layers - 1]
291
+
292
+ # z to block_in
293
+ self.conv_in = MPConv1D(embed_dim, block_in, kernel_size=kernel_size)
294
+
295
+ # middle
296
+ self.mid = nn.Module()
297
+ self.mid.block_1 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
298
+ self.mid.attn_1 = AttnBlock1D(block_in)
299
+ self.mid.block_2 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
300
+
301
+ # upsampling
302
+ self.up = nn.ModuleList()
303
+ for i_level in reversed(range(self.num_layers)):
304
+ block = nn.ModuleList()
305
+ attn = nn.ModuleList()
306
+ block_out = dim * ch_mult[i_level]
307
+ for i_block in range(self.num_res_blocks + 1):
308
+ block.append(ResnetBlock1D(in_dim=block_in, out_dim=block_out, use_norm=True))
309
+ block_in = block_out
310
+ if i_level in attn_layers:
311
+ attn.append(AttnBlock1D(block_in))
312
+ up = nn.Module()
313
+ up.block = block
314
+ up.attn = attn
315
+ if i_level in self.down_layers:
316
+ up.upsample = Upsample1D(block_in, resamp_with_conv)
317
+ self.up.insert(0, up) # prepend to get consistent order
318
+
319
+ # end
320
+ self.conv_out = MPConv1D(block_in, out_dim, kernel_size=kernel_size)
321
+ self.learnable_gain = nn.Parameter(torch.zeros([]))
322
+
323
+ def forward(self, z):
324
+ # z to block_in
325
+ h = self.conv_in(z)
326
+
327
+ # middle
328
+ h = self.mid.block_1(h)
329
+ h = self.mid.attn_1(h)
330
+ h = self.mid.block_2(h)
331
+ h = h.clamp(-self.clip_act, self.clip_act)
332
+
333
+ # upsampling
334
+ for i_level in reversed(range(self.num_layers)):
335
+ for i_block in range(self.num_res_blocks + 1):
336
+ h = self.up[i_level].block[i_block](h)
337
+ if len(self.up[i_level].attn) > 0:
338
+ h = self.up[i_level].attn[i_block](h)
339
+ h = h.clamp(-self.clip_act, self.clip_act)
340
+ if i_level in self.down_layers:
341
+ h = self.up[i_level].upsample(h)
342
+
343
+ h = nonlinearity(h)
344
+ h = self.conv_out(h, gain=(self.learnable_gain + 1))
345
+ return h
346
+
347
+
348
+ def VAE_16k(**kwargs) -> VAE:
349
+ return VAE(data_dim=80, embed_dim=20, hidden_dim=384, **kwargs)
350
+
351
+
352
+ def VAE_44k(**kwargs) -> VAE:
353
+ return VAE(data_dim=128, embed_dim=40, hidden_dim=512, **kwargs)
354
+
355
+
356
+ def get_my_vae(name: str, **kwargs) -> VAE:
357
+ if name == '16k':
358
+ return VAE_16k(**kwargs)
359
+ if name == '44k':
360
+ return VAE_44k(**kwargs)
361
+ raise ValueError(f'Unknown model: {name}')
362
+
363
+
364
+ if __name__ == '__main__':
365
+ network = get_my_vae('standard')
366
+
367
+ # print the number of parameters in terms of millions
368
+ num_params = sum(p.numel() for p in network.parameters()) / 1e6
369
+ print(f'Number of parameters: {num_params:.2f}M')
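Editor's note: a shape-level sketch of the 1-D VAE with randomly initialised weights (the reconstruction is meaningless; only the tensor plumbing is shown). remove_weight_norm() must be called before any forward pass because every MPConv1D asserts on it. Also note that the __main__ block above calls get_my_vae('standard'), which would raise ValueError since only '16k' and '44k' are registered.

    # Hedged sketch: 44k VAE shapes; 128 mel bands in, 40-dim latent at half the time resolution.
    import torch

    from mmaudio.ext.autoencoder.vae import get_my_vae

    vae = get_my_vae('44k').eval()
    vae.remove_weight_norm()              # required before running the MPConv1D layers

    mel = torch.randn(1, 128, 256)        # (B, data_dim, T); T should be even (one Downsample1D stage)
    with torch.no_grad():
        dec, posterior = vae(mel, sample_posterior=False)

    print(posterior.mean.shape)           # torch.Size([1, 40, 128])  -> embed_dim x T/2
    print(dec.shape)                      # torch.Size([1, 128, 256]) -> reconstructed mel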
hf_AC/mmaudio/ext/autoencoder/vae_modules.py ADDED
@@ -0,0 +1,117 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ from einops import rearrange
5
+
6
+ from mmaudio.ext.autoencoder.edm2_utils import (MPConv1D, mp_silu, mp_sum, normalize)
7
+
8
+
9
+ def nonlinearity(x):
10
+ # swish
11
+ return mp_silu(x)
12
+
13
+
14
+ class ResnetBlock1D(nn.Module):
15
+
16
+ def __init__(self, *, in_dim, out_dim=None, conv_shortcut=False, kernel_size=3, use_norm=True):
17
+ super().__init__()
18
+ self.in_dim = in_dim
19
+ out_dim = in_dim if out_dim is None else out_dim
20
+ self.out_dim = out_dim
21
+ self.use_conv_shortcut = conv_shortcut
22
+ self.use_norm = use_norm
23
+
24
+ self.conv1 = MPConv1D(in_dim, out_dim, kernel_size=kernel_size)
25
+ self.conv2 = MPConv1D(out_dim, out_dim, kernel_size=kernel_size)
26
+ if self.in_dim != self.out_dim:
27
+ if self.use_conv_shortcut:
28
+ self.conv_shortcut = MPConv1D(in_dim, out_dim, kernel_size=kernel_size)
29
+ else:
30
+ self.nin_shortcut = MPConv1D(in_dim, out_dim, kernel_size=1)
31
+
32
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
33
+
34
+ # pixel norm
35
+ if self.use_norm:
36
+ x = normalize(x, dim=1)
37
+
38
+ h = x
39
+ h = nonlinearity(h)
40
+ h = self.conv1(h)
41
+
42
+ h = nonlinearity(h)
43
+ h = self.conv2(h)
44
+
45
+ if self.in_dim != self.out_dim:
46
+ if self.use_conv_shortcut:
47
+ x = self.conv_shortcut(x)
48
+ else:
49
+ x = self.nin_shortcut(x)
50
+
51
+ return mp_sum(x, h, t=0.3)
52
+
53
+
54
+ class AttnBlock1D(nn.Module):
55
+
56
+ def __init__(self, in_channels, num_heads=1):
57
+ super().__init__()
58
+ self.in_channels = in_channels
59
+
60
+ self.num_heads = num_heads
61
+ self.qkv = MPConv1D(in_channels, in_channels * 3, kernel_size=1)
62
+ self.proj_out = MPConv1D(in_channels, in_channels, kernel_size=1)
63
+
64
+ def forward(self, x):
65
+ h = x
66
+ y = self.qkv(h)
67
+ y = y.reshape(y.shape[0], self.num_heads, -1, 3, y.shape[-1])
68
+ q, k, v = normalize(y, dim=2).unbind(3)
69
+
70
+ q = rearrange(q, 'b h c l -> b h l c')
71
+ k = rearrange(k, 'b h c l -> b h l c')
72
+ v = rearrange(v, 'b h c l -> b h l c')
73
+
74
+ h = F.scaled_dot_product_attention(q, k, v)
75
+ h = rearrange(h, 'b h l c -> b (h c) l')
76
+
77
+ h = self.proj_out(h)
78
+
79
+ return mp_sum(x, h, t=0.3)
80
+
81
+
82
+ class Upsample1D(nn.Module):
83
+
84
+ def __init__(self, in_channels, with_conv):
85
+ super().__init__()
86
+ self.with_conv = with_conv
87
+ if self.with_conv:
88
+ self.conv = MPConv1D(in_channels, in_channels, kernel_size=3)
89
+
90
+ def forward(self, x):
91
+ x = F.interpolate(x, scale_factor=2.0, mode='nearest-exact') # support 3D tensor(B,C,T)
92
+ if self.with_conv:
93
+ x = self.conv(x)
94
+ return x
95
+
96
+
97
+ class Downsample1D(nn.Module):
98
+
99
+ def __init__(self, in_channels, with_conv):
100
+ super().__init__()
101
+ self.with_conv = with_conv
102
+ if self.with_conv:
103
+ # no asymmetric padding in torch conv, must do it ourselves
104
+ self.conv1 = MPConv1D(in_channels, in_channels, kernel_size=1)
105
+ self.conv2 = MPConv1D(in_channels, in_channels, kernel_size=1)
106
+
107
+ def forward(self, x):
108
+
109
+ if self.with_conv:
110
+ x = self.conv1(x)
111
+
112
+ x = F.avg_pool1d(x, kernel_size=2, stride=2)
113
+
114
+ if self.with_conv:
115
+ x = self.conv2(x)
116
+
117
+ return x
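Editor's note: the building blocks above all operate on (B, C, T) tensors; only Downsample1D/Upsample1D change T. A small shape sketch (random weights, with weight norm removed so the MPConv1D layers will run):

    # Hedged sketch: vae_modules building blocks.
    import torch

    from mmaudio.ext.autoencoder.edm2_utils import MPConv1D
    from mmaudio.ext.autoencoder.vae_modules import (AttnBlock1D, Downsample1D, ResnetBlock1D,
                                                     Upsample1D)

    block = ResnetBlock1D(in_dim=64, out_dim=64)
    attn = AttnBlock1D(64)
    down = Downsample1D(64, with_conv=True)
    up = Upsample1D(64, with_conv=True)

    for module in (block, attn, down, up):
        for m in module.modules():
            if isinstance(m, MPConv1D):
                m.remove_weight_norm()

    x = torch.randn(2, 64, 128)
    h = attn(block(x))    # (2, 64, 128): residual block and attention keep T
    h = down(h)           # (2, 64, 64):  average pooling halves T
    h = up(h)             # (2, 64, 128): nearest-exact interpolation restores T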
hf_AC/mmaudio/ext/bigvgan/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2022 NVIDIA CORPORATION.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
hf_AC/mmaudio/ext/bigvgan/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .bigvgan import BigVGAN
hf_AC/mmaudio/ext/bigvgan/activations.py ADDED
@@ -0,0 +1,120 @@
1
+ # Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
2
+ # LICENSE is in incl_licenses directory.
3
+
4
+ import torch
5
+ from torch import nn, sin, pow
6
+ from torch.nn import Parameter
7
+
8
+
9
+ class Snake(nn.Module):
10
+ '''
11
+ Implementation of a sine-based periodic activation function
12
+ Shape:
13
+ - Input: (B, C, T)
14
+ - Output: (B, C, T), same shape as the input
15
+ Parameters:
16
+ - alpha - trainable parameter
17
+ References:
18
+ - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
19
+ https://arxiv.org/abs/2006.08195
20
+ Examples:
21
+ >>> a1 = snake(256)
22
+ >>> x = torch.randn(256)
23
+ >>> x = a1(x)
24
+ '''
25
+ def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
26
+ '''
27
+ Initialization.
28
+ INPUT:
29
+ - in_features: shape of the input
30
+ - alpha: trainable parameter
31
+ alpha is initialized to 1 by default, higher values = higher-frequency.
32
+ alpha will be trained along with the rest of your model.
33
+ '''
34
+ super(Snake, self).__init__()
35
+ self.in_features = in_features
36
+
37
+ # initialize alpha
38
+ self.alpha_logscale = alpha_logscale
39
+ if self.alpha_logscale: # log scale alphas initialized to zeros
40
+ self.alpha = Parameter(torch.zeros(in_features) * alpha)
41
+ else: # linear scale alphas initialized to ones
42
+ self.alpha = Parameter(torch.ones(in_features) * alpha)
43
+
44
+ self.alpha.requires_grad = alpha_trainable
45
+
46
+ self.no_div_by_zero = 0.000000001
47
+
48
+ def forward(self, x):
49
+ '''
50
+ Forward pass of the function.
51
+ Applies the function to the input elementwise.
52
+ Snake ∶= x + 1/a * sin^2 (xa)
53
+ '''
54
+ alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
55
+ if self.alpha_logscale:
56
+ alpha = torch.exp(alpha)
57
+ x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
58
+
59
+ return x
60
+
61
+
62
+ class SnakeBeta(nn.Module):
63
+ '''
64
+ A modified Snake function which uses separate parameters for the magnitude of the periodic components
65
+ Shape:
66
+ - Input: (B, C, T)
67
+ - Output: (B, C, T), same shape as the input
68
+ Parameters:
69
+ - alpha - trainable parameter that controls frequency
70
+ - beta - trainable parameter that controls magnitude
71
+ References:
72
+ - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
73
+ https://arxiv.org/abs/2006.08195
74
+ Examples:
75
+ >>> a1 = snakebeta(256)
76
+ >>> x = torch.randn(256)
77
+ >>> x = a1(x)
78
+ '''
79
+ def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
80
+ '''
81
+ Initialization.
82
+ INPUT:
83
+ - in_features: shape of the input
84
+ - alpha - trainable parameter that controls frequency
85
+ - beta - trainable parameter that controls magnitude
86
+ alpha is initialized to 1 by default, higher values = higher-frequency.
87
+ beta is initialized to 1 by default, higher values = higher-magnitude.
88
+ alpha will be trained along with the rest of your model.
89
+ '''
90
+ super(SnakeBeta, self).__init__()
91
+ self.in_features = in_features
92
+
93
+ # initialize alpha
94
+ self.alpha_logscale = alpha_logscale
95
+ if self.alpha_logscale: # log scale alphas initialized to zeros
96
+ self.alpha = Parameter(torch.zeros(in_features) * alpha)
97
+ self.beta = Parameter(torch.zeros(in_features) * alpha)
98
+ else: # linear scale alphas initialized to ones
99
+ self.alpha = Parameter(torch.ones(in_features) * alpha)
100
+ self.beta = Parameter(torch.ones(in_features) * alpha)
101
+
102
+ self.alpha.requires_grad = alpha_trainable
103
+ self.beta.requires_grad = alpha_trainable
104
+
105
+ self.no_div_by_zero = 0.000000001
106
+
107
+ def forward(self, x):
108
+ '''
109
+ Forward pass of the function.
110
+ Applies the function to the input elementwise.
111
+ SnakeBeta ∶= x + 1/b * sin^2 (xa)
112
+ '''
113
+ alpha = self.alpha.unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
114
+ beta = self.beta.unsqueeze(0).unsqueeze(-1)
115
+ if self.alpha_logscale:
116
+ alpha = torch.exp(alpha)
117
+ beta = torch.exp(beta)
118
+ x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
119
+
120
+ return x
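Editor's note: Snake and SnakeBeta are element-wise activations over (B, C, T) tensors; in_features must match the channel dimension because alpha (and beta) are per-channel parameters. A quick sketch:

    # Hedged sketch: both activations preserve the input shape.
    import torch

    from mmaudio.ext.bigvgan.activations import Snake, SnakeBeta

    x = torch.randn(2, 256, 100)                                  # (B, C, T)

    snake = Snake(in_features=256)                                # one alpha per channel
    snake_beta = SnakeBeta(in_features=256, alpha_logscale=True)  # log-scale alpha/beta variant

    print(snake(x).shape)        # torch.Size([2, 256, 100])
    print(snake_beta(x).shape)   # torch.Size([2, 256, 100])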
hf_AC/mmaudio/ext/bigvgan/alias_free_torch/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ # Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
2
+ # LICENSE is in incl_licenses directory.
3
+
4
+ from .filter import *
5
+ from .resample import *
6
+ from .act import *
hf_AC/mmaudio/ext/bigvgan/alias_free_torch/act.py ADDED
@@ -0,0 +1,28 @@
1
+ # Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
2
+ # LICENSE is in incl_licenses directory.
3
+
4
+ import torch.nn as nn
5
+ from .resample import UpSample1d, DownSample1d
6
+
7
+
8
+ class Activation1d(nn.Module):
9
+ def __init__(self,
10
+ activation,
11
+ up_ratio: int = 2,
12
+ down_ratio: int = 2,
13
+ up_kernel_size: int = 12,
14
+ down_kernel_size: int = 12):
15
+ super().__init__()
16
+ self.up_ratio = up_ratio
17
+ self.down_ratio = down_ratio
18
+ self.act = activation
19
+ self.upsample = UpSample1d(up_ratio, up_kernel_size)
20
+ self.downsample = DownSample1d(down_ratio, down_kernel_size)
21
+
22
+ # x: [B,C,T]
23
+ def forward(self, x):
24
+ x = self.upsample(x)
25
+ x = self.act(x)
26
+ x = self.downsample(x)
27
+
28
+ return x
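Editor's note: Activation1d implements the anti-aliased activation trick: upsample 2x, apply the nonlinearity, then low-pass and downsample 2x, so the output keeps the input length. A sketch combining it with SnakeBeta from activations.py:

    # Hedged sketch: anti-aliased activation wrapper.
    import torch

    from mmaudio.ext.bigvgan.activations import SnakeBeta
    from mmaudio.ext.bigvgan.alias_free_torch.act import Activation1d

    act = Activation1d(activation=SnakeBeta(in_features=128))

    x = torch.randn(1, 128, 400)   # (B, C, T)
    y = act(x)                     # upsample 2x -> SnakeBeta -> downsample 2x
    print(y.shape)                 # torch.Size([1, 128, 400]); length is preserved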
hf_AC/mmaudio/ext/bigvgan/alias_free_torch/filter.py ADDED
@@ -0,0 +1,95 @@
1
+ # Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
2
+ # LICENSE is in incl_licenses directory.
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+ import math
8
+
9
+ if 'sinc' in dir(torch):
10
+ sinc = torch.sinc
11
+ else:
12
+ # This code is adapted from adefossez's julius.core.sinc under the MIT License
13
+ # https://adefossez.github.io/julius/julius/core.html
14
+ # LICENSE is in incl_licenses directory.
15
+ def sinc(x: torch.Tensor):
16
+ """
17
+ Implementation of sinc, i.e. sin(pi * x) / (pi * x)
18
+ __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
19
+ """
20
+ return torch.where(x == 0,
21
+ torch.tensor(1., device=x.device, dtype=x.dtype),
22
+ torch.sin(math.pi * x) / math.pi / x)
23
+
24
+
25
+ # This code is adapted from adefossez's julius.lowpass.LowPassFilters under the MIT License
26
+ # https://adefossez.github.io/julius/julius/lowpass.html
27
+ # LICENSE is in incl_licenses directory.
28
+ def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filter [1,1,kernel_size]
29
+ even = (kernel_size % 2 == 0)
30
+ half_size = kernel_size // 2
31
+
32
+ #For kaiser window
33
+ delta_f = 4 * half_width
34
+ A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
35
+ if A > 50.:
36
+ beta = 0.1102 * (A - 8.7)
37
+ elif A >= 21.:
38
+ beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
39
+ else:
40
+ beta = 0.
41
+ window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
42
+
43
+ # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
44
+ if even:
45
+ time = (torch.arange(-half_size, half_size) + 0.5)
46
+ else:
47
+ time = torch.arange(kernel_size) - half_size
48
+ if cutoff == 0:
49
+ filter_ = torch.zeros_like(time)
50
+ else:
51
+ filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
52
+ # Normalize filter to have sum = 1, otherwise we will have a small leakage
53
+ # of the constant component in the input signal.
54
+ filter_ /= filter_.sum()
55
+ filter = filter_.view(1, 1, kernel_size)
56
+
57
+ return filter
58
+
59
+
60
+ class LowPassFilter1d(nn.Module):
61
+ def __init__(self,
62
+ cutoff=0.5,
63
+ half_width=0.6,
64
+ stride: int = 1,
65
+ padding: bool = True,
66
+ padding_mode: str = 'replicate',
67
+ kernel_size: int = 12):
68
+ # kernel_size should be even number for stylegan3 setup,
69
+ # in this implementation, odd number is also possible.
70
+ super().__init__()
71
+ if cutoff < -0.:
72
+ raise ValueError("Minimum cutoff must be larger than zero.")
73
+ if cutoff > 0.5:
74
+ raise ValueError("A cutoff above 0.5 does not make sense.")
75
+ self.kernel_size = kernel_size
76
+ self.even = (kernel_size % 2 == 0)
77
+ self.pad_left = kernel_size // 2 - int(self.even)
78
+ self.pad_right = kernel_size // 2
79
+ self.stride = stride
80
+ self.padding = padding
81
+ self.padding_mode = padding_mode
82
+ filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
83
+ self.register_buffer("filter", filter)
84
+
85
+ #input [B, C, T]
86
+ def forward(self, x):
87
+ _, C, _ = x.shape
88
+
89
+ if self.padding:
90
+ x = F.pad(x, (self.pad_left, self.pad_right),
91
+ mode=self.padding_mode)
92
+ out = F.conv1d(x, self.filter.expand(C, -1, -1),
93
+ stride=self.stride, groups=C)
94
+
95
+ return out
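Editor's note: LowPassFilter1d is a fixed (non-trainable) Kaiser-windowed sinc filter; cutoff is a fraction of the sampling rate, so 0.5 corresponds to Nyquist. With stride=1 and padding enabled it preserves the input length:

    # Hedged sketch: low-pass filtering a (B, C, T) signal.
    import torch

    from mmaudio.ext.bigvgan.alias_free_torch.filter import LowPassFilter1d, kaiser_sinc_filter1d

    lp = LowPassFilter1d(cutoff=0.25, half_width=0.05, stride=1, kernel_size=12)

    x = torch.randn(1, 4, 1000)
    y = lp(x)
    print(y.shape)                          # torch.Size([1, 4, 1000])

    # The FIR taps are normalized to sum to 1, so the DC component passes through unchanged.
    taps = kaiser_sinc_filter1d(cutoff=0.25, half_width=0.05, kernel_size=12)
    print(taps.shape, float(taps.sum()))    # torch.Size([1, 1, 12]) ~1.0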
hf_AC/mmaudio/ext/bigvgan/alias_free_torch/resample.py ADDED
@@ -0,0 +1,49 @@
1
+ # Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
2
+ # LICENSE is in incl_licenses directory.
3
+
4
+ import torch.nn as nn
5
+ from torch.nn import functional as F
6
+ from .filter import LowPassFilter1d
7
+ from .filter import kaiser_sinc_filter1d
8
+
9
+
10
+ class UpSample1d(nn.Module):
11
+ def __init__(self, ratio=2, kernel_size=None):
12
+ super().__init__()
13
+ self.ratio = ratio
14
+ self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
15
+ self.stride = ratio
16
+ self.pad = self.kernel_size // ratio - 1
17
+ self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
18
+ self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
19
+ filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
20
+ half_width=0.6 / ratio,
21
+ kernel_size=self.kernel_size)
22
+ self.register_buffer("filter", filter)
23
+
24
+ # x: [B, C, T]
25
+ def forward(self, x):
26
+ _, C, _ = x.shape
27
+
28
+ x = F.pad(x, (self.pad, self.pad), mode='replicate')
29
+ x = self.ratio * F.conv_transpose1d(
30
+ x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
31
+ x = x[..., self.pad_left:-self.pad_right]
32
+
33
+ return x
34
+
35
+
36
+ class DownSample1d(nn.Module):
37
+ def __init__(self, ratio=2, kernel_size=None):
38
+ super().__init__()
39
+ self.ratio = ratio
40
+ self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
41
+ self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
42
+ half_width=0.6 / ratio,
43
+ stride=ratio,
44
+ kernel_size=self.kernel_size)
45
+
46
+ def forward(self, x):
47
+ xx = self.lowpass(x)
48
+
49
+ return xx
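Editor's note: UpSample1d and DownSample1d are the fixed 2x resamplers used inside Activation1d. They only change the time axis; a ratio-2 round trip returns a tensor of the original length:

    # Hedged sketch: 2x up/down resampling round trip.
    import torch

    from mmaudio.ext.bigvgan.alias_free_torch.resample import DownSample1d, UpSample1d

    up = UpSample1d(ratio=2)
    down = DownSample1d(ratio=2)

    x = torch.randn(1, 8, 256)    # (B, C, T)
    hi = up(x)                    # torch.Size([1, 8, 512])
    lo = down(hi)                 # torch.Size([1, 8, 256])
    print(hi.shape, lo.shape)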
hf_AC/mmaudio/ext/bigvgan/bigvgan.py ADDED
@@ -0,0 +1,32 @@
1
+ from pathlib import Path
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ from omegaconf import OmegaConf
6
+
7
+ from mmaudio.ext.bigvgan.models import BigVGANVocoder
8
+
9
+ _bigvgan_vocoder_path = Path(__file__).parent / 'bigvgan_vocoder.yml'
10
+
11
+
12
+ class BigVGAN(nn.Module):
13
+
14
+ def __init__(self, ckpt_path, config_path=_bigvgan_vocoder_path):
15
+ super().__init__()
16
+ vocoder_cfg = OmegaConf.load(config_path)
17
+ self.vocoder = BigVGANVocoder(vocoder_cfg).eval()
18
+ vocoder_ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)['generator']
19
+ self.vocoder.load_state_dict(vocoder_ckpt)
20
+
21
+ self.weight_norm_removed = False
22
+ self.remove_weight_norm()
23
+
24
+ @torch.inference_mode()
25
+ def forward(self, x):
26
+ assert self.weight_norm_removed, 'call remove_weight_norm() before inference'
27
+ return self.vocoder(x)
28
+
29
+ def remove_weight_norm(self):
30
+ self.vocoder.remove_weight_norm()
31
+ self.weight_norm_removed = True
32
+ return self
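Editor's note: this wrapper is the 16 kHz vocoder path; the 44k path instead loads BigVGANv2 from the hub (see autoencoder.py above). A heavily hedged sketch: the checkpoint path is a placeholder and the 80-band mel shape is an assumption based on the 16k VAE's data_dim, not on files shipped in this commit.

    # Hedged sketch: mel spectrogram -> waveform with the bundled BigVGAN wrapper.
    import torch

    from mmaudio.ext.bigvgan import BigVGAN

    vocoder = BigVGAN('weights/bigvgan_16k_generator.pt')   # placeholder checkpoint path

    mel = torch.randn(1, 80, 512)   # (B, n_mels, T)
    wav = vocoder(mel)              # waveform tensor (exact layout comes from BigVGANVocoder)
    print(wav.shape)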