Moondream 3 (Preview) 4-Bit
Moondream 3 (Preview) 4-Bit is the INT4-quantized version of Moondream3-Preview. Quantization reduces the model size from ~18 GB to ~6 GB (a ~66% reduction), allowing it to run in environments with less than 12 GB of VRAM while largely maintaining quality.
This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.
Features
- 66% smaller: ~6 GB vs. ~18 GB for the original
- Lower memory: runs on 7 GB of VRAM (vs. 20 GB for FP16)
- Same capabilities: retains the original Moondream3 skills & API
- Minimal quality loss: ~2-5% degradation on benchmarks
- HuggingFace compatible: load with AutoModelForCausalLM.from_pretrained() (see Quick Start below)
VRAM & Time Savings
| Configuration | Model Size | VRAM usage | s/query* |
|---|---|---|---|
| FP16 (original) | 18.5 GB | 19,594 MiB | 4.19 |
| INT4 (this one) | 6.18 GB | 7,332 MiB | 2.65 |
| Reduction | 66% | 62% | 37% |
(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)
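If you want to sanity-check the memory numbers on your own GPU, here is a minimal sketch that reuses the Quick Start loading code from further below; note that PyTorch's allocator statistics won't exactly match the nvidia-smi MiB figures in the table above.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Same loading code as in the Quick Start section below
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()
moondream.query(image=Image.open("photo.jpg"), question="What's in this image?")
# Peak bytes allocated by the CUDA caching allocator since process start
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")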
Evaluation Results
| Test | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|---|---|---|---|---|
| vision-ai-checkup | 156 s | 42.8% | 223 s | 47.2% |
| CountBenchQA | 22.9 min | 91.2% | 36.6 min | 93.2% |
Architecture
Quantized Components (INT4):
- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder
Preserved in FP16:
- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head
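To verify this split on a loaded model, a rough sketch is to group tensors by dtype (packed INT4 weights typically show up as integer tensors, preserved components as bf16/fp16). This assumes the model has been loaded as in the Quick Start below; exact tensor names depend on the repo's modeling code, so treat it as illustrative only.
from collections import Counter
# Tally element counts per dtype: bf16/fp16 entries correspond to preserved
# components, while packed INT4 weights usually appear as integer tensors
# (e.g. uint8/int32) plus floating-point scales.
counts = Counter()
for _, t in list(moondream.named_parameters()) + list(moondream.named_buffers()):
    counts[str(t.dtype)] += t.numel()
for dtype, numel in sorted(counts.items()):
    print(f"{dtype:>16}: {numel / 1e6:10.1f}M elements")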
Slow First-Time Compile and Inference
A note on first-time compilation time: due to the MoE architecture and the nature of INT4 quantization, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it's correctly configured. I'll remove this note once I find a faster solution (contributions always welcome, of course!), if that turns out to be possible; until then, caches are your friend :)
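As a starting point for that cache configuration, the generic Inductor cache settings below usually suffice. This is standard PyTorch, not anything specific to this repo, and cache-related settings have changed across PyTorch releases, so double-check the documentation for your installed version.
import os
# Point the Inductor/torch.compile cache at a persistent directory so
# recompilations on the same machine can reuse earlier artifacts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/path/to/persistent/inductor-cache")
# Enable the FX graph cache (already the default on recent PyTorch versions).
os.environ.setdefault("TORCHINDUCTOR_FX_GRAPH_CACHE", "1")
# Import torch only after the environment variables are set.
import torch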
Quick Start (HuggingFace Style)
The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
"alecccdd/moondream3-preview-4bit",
trust_remote_code=True,
dtype=torch.bfloat16,
device_map={"": "cuda"},
)
moondream.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
Alternative: Manual Loading
If you prefer more control, you can load the model directly:
import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights
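# Assumption: config.py, moondream.py, and weights.py are the modeling files
# shipped with this repo; run this script from the downloaded repo root so
# load_weights("./", ...) can find the checkpoint files.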
# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
Skills
The API for all skills remains identical to the original moondream3-preview model.
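For convenience, here is roughly what the other skills look like, assuming the same method names and return dictionaries as the upstream moondream3-preview card (caption, detect, point) and the moondream/image objects from the Quick Start; if anything differs, the upstream documentation is authoritative.
# Captioning ("short" or "normal" length, as in the upstream model)
print(moondream.caption(image, length="short")["caption"])
# Open-vocabulary detection: bounding boxes for the named object
print(moondream.detect(image, "face")["objects"])
# Pointing: (x, y) coordinates for the named object
print(moondream.point(image, "person")["points"])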
License
This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.
Original Copyright (c) M87 Labs, Inc.
Quantization and conversion code: Copyright (c) 2025 Alicius Schröder