# DeepFrame v0.2: Qwen3-VL-8B with Audio Encoder
DeepFrame is a multimodal video understanding model based on Qwen3-VL-8B-Instruct, extended with audio processing capabilities from Video-SALMONN-2+.
## ⚠️ Important: This is a Full Model (Not a LoRA Adapter)
v0.2 is a fully merged model, not a LoRA adapter. Load it directly:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the full model directly
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("sdioteam/deepframe")
```
Do **not** use `PeftModel.from_pretrained()`: it will fail because the repository contains no LoRA adapter weights.
## Architecture
- Base: Qwen3-VL-8B-Instruct
- Vision: Qwen3VLVisionModel with DeepStack features
  - `visual.merger.norm` (LayerNorm)
  - `visual.merger.linear_fc1` / `visual.merger.linear_fc2`
  - `visual.deepstack_merger_list` (multi-level feature fusion)
- Audio: WhisperEncoder + QFormer
  - `audio.conv1` / `audio.conv2` (Whisper frontend)
  - `audio.layers.*` (Whisper encoder blocks)
  - `audio.qformer.*` (BERT-based projector)
  - `audio.q_tokens` (learnable queries)
  - `audio.audio_proj` (projection to LLM hidden size)
- LLM: Qwen3-8B language model
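You can verify this layout by walking the module tree after loading. A minimal sketch, assuming the submodule names above map directly to attributes on the loaded model (the exact paths are defined by the remote code and may differ):

```python
# List top-level submodules; expect the vision tower, audio encoder, and LLM backbone
for name, _ in model.named_children():
    print(name)

# Spot-check the audio components described above (attribute paths are assumptions)
audio = model.audio
print(audio.q_tokens.shape)   # learnable query tokens fed to the QFormer
print(audio.audio_proj)       # projection into the LLM hidden size
```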
## Model Type
```json
{
  "model_type": "qwen3_vl_audio",
  "architectures": ["VideoSALMONN2_Qwen3"]
}
```
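Because `qwen3_vl_audio` is a custom model type that does not ship with transformers, `trust_remote_code=True` is required. You can confirm what the repository declares before downloading any weights:

```python
from transformers import AutoConfig

# Fetch only the config to inspect the custom model type and architecture class
config = AutoConfig.from_pretrained("sdioteam/deepframe", trust_remote_code=True)
print(config.model_type)     # "qwen3_vl_audio"
print(config.architectures)  # ["VideoSALMONN2_Qwen3"]
```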
## Versions
| Version | Architecture | Type | Notes |
|---|---|---|---|
| v0.1 | Qwen2.5-VL-7B | LoRA adapter | Deprecated |
| v0.2 | Qwen3-VL-8B | Full model | Current - Fixed audio encoder |
## Key Differences from v0.1
- Qwen3-VL (not Qwen2.5-VL) - Includes DeepStack multi-level vision features
  - `visual.merger.norm` instead of `visual.merger.ln_q` (architectural change)
- Full model weights - No LoRA adapter needed
- Fixed audio encoder - Properly initialized `q_tokens` and `audio_proj`
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="v0.2",  # Or "main"
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("sdioteam/deepframe")
```
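With the model and processor loaded, generation can follow the usual chat-template flow. A minimal sketch, assuming the processor follows the Qwen-VL chat-template convention for video input; the message schema, the video path key, and how audio is extracted from the clip are assumptions, so check the repository's remote code for the exact API:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4"},  # hypothetical local clip
            {"type": "text", "text": "Describe this clip, including what is said in the audio."},
        ],
    }
]

# Render the chat template and preprocess the video in one step
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens before decoding the answer
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```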
## Training
- Stage 1: Audio alignment on a LibriSpeech + YouTube hybrid dataset (~700K samples)
- Stage 2: Audio-visual SFT on a mixed dataset
## License
Apache 2.0
## Citation
```bibtex
@misc{deepframe2024,
  title={DeepFrame: Multimodal Video Understanding with Audio},
  year={2024}
}
```