DeepFrame v0.2: Qwen3-VL-8B with Audio Encoder

DeepFrame is a multimodal video understanding model based on Qwen3-VL-8B-Instruct, extended with audio processing capabilities from Video-SALMONN-2+.

⚠️ Important: This is a Full Model (Not a LoRA Adapter)

v0.2 is a fully merged model, not a LoRA adapter. Load it directly:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the full merged model directly (no PEFT wrapper)
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    trust_remote_code=True,   # required for the custom VideoSALMONN2_Qwen3 classes
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("sdioteam/deepframe")

Do NOT use PeftModel.from_pretrained(); it will fail because the repository contains no LoRA adapter weights.

Architecture

  • Base: Qwen3-VL-8B-Instruct
  • Vision: Qwen3VLVisionModel with DeepStack features
    • visual.merger.norm (LayerNorm)
    • visual.merger.linear_fc1/fc2
    • visual.deepstack_merger_list (multi-level feature fusion)
  • Audio: WhisperEncoder + QFormer (sketched below, after this list)
    • audio.conv1/conv2 (Whisper frontend)
    • audio.layers.* (Whisper encoder blocks)
    • audio.qformer.* (BERT-based projector)
    • audio.q_tokens (learnable queries)
    • audio.audio_proj (projection to LLM hidden size)
  • LLM: Qwen3-8B language model
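
For intuition, here is a minimal sketch of how such an audio branch is typically wired: Whisper-style encoder features are compressed by a small set of learnable query tokens via cross-attention, then projected to the LLM hidden size. All dimensions, and the single cross-attention layer standing in for the BERT-based QFormer, are illustrative assumptions, not the model's actual implementation.

import torch
import torch.nn as nn

class AudioBranchSketch(nn.Module):
    """Illustrative audio path: Whisper features -> query tokens -> LLM space.
    Layer counts and dimensions are assumptions for this sketch only."""
    def __init__(self, whisper_dim=1280, num_queries=32, llm_dim=4096):
        super().__init__()
        # Learnable query tokens (analogous to audio.q_tokens)
        self.q_tokens = nn.Parameter(torch.randn(1, num_queries, whisper_dim))
        # Cross-attention standing in for the BERT-based audio.qformer.*
        self.qformer = nn.MultiheadAttention(whisper_dim, num_heads=8, batch_first=True)
        # Projection to the LLM hidden size (analogous to audio.audio_proj)
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)

    def forward(self, whisper_feats):  # (B, T, whisper_dim) from the Whisper encoder
        queries = self.q_tokens.expand(whisper_feats.size(0), -1, -1)
        # Queries attend over the full audio feature sequence
        fused, _ = self.qformer(queries, whisper_feats, whisper_feats)
        # Fixed-length audio tokens ready to splice into the LLM input sequence
        return self.audio_proj(fused)  # (B, num_queries, llm_dim)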

Model Type

{
  "model_type": "qwen3_vl_audio",
  "architectures": ["VideoSALMONN2_Qwen3"]
}
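
Because the model_type is custom, transformers needs trust_remote_code=True to resolve these classes. A quick way to confirm you are pulling the expected architecture before loading any weights (a minimal check, assuming the config fields shown above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("sdioteam/deepframe", trust_remote_code=True)
assert config.model_type == "qwen3_vl_audio"
print(config.architectures)  # expected: ["VideoSALMONN2_Qwen3"]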

Versions

Version  Architecture   Type          Notes
v0.1     Qwen2.5-VL-7B  LoRA adapter  Deprecated
v0.2     Qwen3-VL-8B    Full model    Current; fixed audio encoder

Key Differences from v0.1

  • Qwen3-VL (not Qwen2.5-VL) - Includes DeepStack multi-level vision features
  • visual.merger.norm instead of visual.merger.ln_q (architectural change; see the key-inspection sketch after this list)
  • Full model weights - No LoRA adapter needed
  • Fixed audio encoder - Properly initialized q_tokens and audio_proj
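
To verify the renamed modules in the checkpoint itself, you can list tensor names from the sharded-checkpoint index without downloading any weights. This is a minimal sketch assuming the repository uses the standard model.safetensors.index.json layout:

import json
from huggingface_hub import hf_hub_download

# The index file maps every tensor name to its shard; no weights are downloaded.
index_path = hf_hub_download("sdioteam/deepframe", "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

merger_keys = [k for k in weight_map if "visual.merger" in k]
print(merger_keys)  # expect visual.merger.norm.*, not visual.merger.ln_q.*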

Usage

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="v0.2",  # Or "main"
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("sdioteam/deepframe")
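
The exact multimodal prompt format depends on the repository's custom processor. As a smoke test, here is a minimal text-only generation sketch, assuming the processor follows the standard Qwen-family chat-template API:

# Chat-template API is an assumption based on standard Qwen-family processors.
messages = [{"role": "user", "content": "What makes audio useful for video QA?"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])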

Training

  • Stage 1: Audio alignment on a LibriSpeech + YouTube hybrid dataset (~700K samples)
  • Stage 2: Audio-visual SFT on a mixed dataset

License

Apache 2.0

Citation

@misc{deepframe2024,
  title={DeepFrame: Multimodal Video Understanding with Audio},
  year={2024}
}