DeepFrame v0.2: Qwen3-VL-8B with Audio Encoder

DeepFrame is a multimodal video understanding model based on Qwen3-VL-8B-Instruct, extended with audio processing capabilities from Video-SALMONN-2+.

⚠️ Important: This is a Full Model (Not a LoRA Adapter)

v0.2 is a fully merged model, not a LoRA adapter. Load it directly:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the full merged model directly (no PEFT wrapper)
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    trust_remote_code=True,   # required for the custom VideoSALMONN2_Qwen3 classes
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("sdioteam/deepframe")

Do NOT use PeftModel.from_pretrained(); it will fail because the repository contains no LoRA adapter weights.

Architecture

  • Base: Qwen3-VL-8B-Instruct
  • Vision: Qwen3VLVisionModel with DeepStack features
    • visual.merger.norm (LayerNorm)
    • visual.merger.linear_fc1/fc2
    • visual.deepstack_merger_list (multi-level feature fusion)
  • Audio: WhisperEncoder + QFormer (sketched below, after this list)
    • audio.conv1/conv2 (Whisper frontend)
    • audio.layers.* (Whisper encoder blocks)
    • audio.qformer.* (BERT-based projector)
    • audio.q_tokens (learnable queries)
    • audio.audio_proj (projection to LLM hidden size)
  • LLM: Qwen3-8B language model
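
For intuition, here is a minimal sketch of how such an audio branch is typically wired: Whisper-style encoder features are compressed by a small set of learnable query tokens via cross-attention, then projected to the LLM hidden size. All dimensions, and the single cross-attention layer standing in for the BERT-based QFormer, are illustrative assumptions, not the model's actual implementation.

import torch
import torch.nn as nn

class AudioBranchSketch(nn.Module):
    """Illustrative audio path: Whisper features -> query tokens -> LLM space.
    Layer counts and dimensions are assumptions for this sketch only."""
    def __init__(self, whisper_dim=1280, num_queries=32, llm_dim=4096):
        super().__init__()
        # Learnable query tokens (analogous to audio.q_tokens)
        self.q_tokens = nn.Parameter(torch.randn(1, num_queries, whisper_dim))
        # Cross-attention standing in for the BERT-based audio.qformer.*
        self.qformer = nn.MultiheadAttention(whisper_dim, num_heads=8, batch_first=True)
        # Projection to the LLM hidden size (analogous to audio.audio_proj)
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)

    def forward(self, whisper_feats):  # (B, T, whisper_dim) from the Whisper encoder
        queries = self.q_tokens.expand(whisper_feats.size(0), -1, -1)
        # Queries attend over the full audio feature sequence
        fused, _ = self.qformer(queries, whisper_feats, whisper_feats)
        # Fixed-length audio tokens ready to splice into the LLM input sequence
        return self.audio_proj(fused)  # (B, num_queries, llm_dim)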

Model Type

{
  "model_type": "qwen3_vl_audio",
  "architectures": ["VideoSALMONN2_Qwen3"]
}
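
Because the model_type is custom, transformers needs trust_remote_code=True to resolve these classes. A quick way to confirm you are pulling the expected architecture before loading any weights (a minimal check, assuming the config fields shown above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("sdioteam/deepframe", trust_remote_code=True)
assert config.model_type == "qwen3_vl_audio"
print(config.architectures)  # expected: ["VideoSALMONN2_Qwen3"]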

Versions

Version  Architecture   Type          Notes
v0.1     Qwen2.5-VL-7B  LoRA adapter  Deprecated
v0.2     Qwen3-VL-8B    Full model    Current; fixed audio encoder

Key Differences from v0.1

  • Qwen3-VL (not Qwen2.5-VL) - Includes DeepStack multi-level vision features
  • visual.merger.norm instead of visual.merger.ln_q (architectural change; see the key-inspection sketch after this list)
  • Full model weights - No LoRA adapter needed
  • Fixed audio encoder - Properly initialized q_tokens and audio_proj
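
To verify the renamed modules in the checkpoint itself, you can list tensor names from the sharded-checkpoint index without downloading any weights. This is a minimal sketch assuming the repository uses the standard model.safetensors.index.json layout:

import json
from huggingface_hub import hf_hub_download

# The index file maps every tensor name to its shard; no weights are downloaded.
index_path = hf_hub_download("sdioteam/deepframe", "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

merger_keys = [k for k in weight_map if "visual.merger" in k]
print(merger_keys)  # expect visual.merger.norm.*, not visual.merger.ln_q.*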

Usage

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="v0.2",  # Or "main"
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("sdioteam/deepframe")
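
The exact multimodal prompt format depends on the repository's custom processor. As a smoke test, here is a minimal text-only generation sketch, assuming the processor follows the standard Qwen-family chat-template API:

# Chat-template API is an assumption based on standard Qwen-family processors.
messages = [{"role": "user", "content": "What makes audio useful for video QA?"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])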

Training

  • Stage 1: Audio alignment on a LibriSpeech + YouTube hybrid dataset (~700K samples)
  • Stage 2: Audio-visual SFT on a mixed dataset

License

Apache 2.0

Citation

@misc{deepframe2024,
  title={DeepFrame: Multimodal Video Understanding with Audio},
  year={2024}
}