Moondream 3 (Preview) 4-Bit
Moondream 3 (Preview) 4-Bit is the INT4-quantized version of Moondream3-Preview. Quantization reduces the model size from ~18 GB to ~6 GB (a ~66% reduction), allowing it to run in environments with less than 12 GB of VRAM while largely maintaining quality.
This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.
Features
- 66% smaller: ~6 GB vs. ~18 GB for the original
- Lower memory: runs on 7 GB of VRAM (vs. 20 GB for FP16)
- Same capabilities: retains the original Moondream3 skills & API
- Minimal quality loss: ~2-5% degradation on benchmarks
- HuggingFace compatible: load with AutoModelForCausalLM.from_pretrained() (see Quick Start below)
VRAM & Time Savings
| Configuration | Model Size | VRAM usage | s/query* |
|---|---|---|---|
| FP16 (original) | 18.5 GB | 19,594 MiB | 4.19 |
| INT4 (this one) | 6.18 GB | 7,332 MiB | 2.65 |
| Reduction | 66% | 62% | 37% |
(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)
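If you want to sanity-check the memory numbers on your own GPU, here is a minimal sketch that reuses the Quick Start loading code from further below; note that PyTorch's allocator statistics won't exactly match the nvidia-smi MiB figures in the table above.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Same loading code as in the Quick Start section below
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()
moondream.query(image=Image.open("photo.jpg"), question="What's in this image?")
# Peak bytes allocated by the CUDA caching allocator since process start
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")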
Evaluation Results
| Test | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|---|---|---|---|---|
| vision-ai-checkup | 156 s | 42.8% | 223 s | 47.2% |
| CountBenchQA | 22.9 min | 91.2% | 36.6 min | 93.2% |
Architecture
Quantized Components (INT4):
- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder
Preserved in FP16:
- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head
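To verify this split on a loaded model, a rough sketch is to group tensors by dtype (packed INT4 weights typically show up as integer tensors, preserved components as bf16/fp16). This assumes the model has been loaded as in the Quick Start below; exact tensor names depend on the repo's modeling code, so treat it as illustrative only.
from collections import Counter
# Tally element counts per dtype: bf16/fp16 entries correspond to preserved
# components, while packed INT4 weights usually appear as integer tensors
# (e.g. uint8/int32) plus floating-point scales.
counts = Counter()
for _, t in list(moondream.named_parameters()) + list(moondream.named_buffers()):
    counts[str(t.dtype)] += t.numel()
for dtype, numel in sorted(counts.items()):
    print(f"{dtype:>16}: {numel / 1e6:10.1f}M elements")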
Slow First-Time Compile and Inference
A note on first-time compilation time: due to the MoE architecture and the nature of INT4 quantization, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it's correctly configured. I'll remove this note once I find a faster solution (contributions always welcome, of course!), if that turns out to be possible; until then, caches are your friend :)
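As a starting point for that cache configuration, the generic Inductor cache settings below usually suffice. This is standard PyTorch, not anything specific to this repo, and cache-related settings have changed across PyTorch releases, so double-check the documentation for your installed version.
import os
# Point the Inductor/torch.compile cache at a persistent directory so
# recompilations on the same machine can reuse earlier artifacts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/path/to/persistent/inductor-cache")
# Enable the FX graph cache (already the default on recent PyTorch versions).
os.environ.setdefault("TORCHINDUCTOR_FX_GRAPH_CACHE", "1")
# Import torch only after the environment variables are set.
import torch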
Quick Start (HuggingFace Style)
The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
"alecccdd/moondream3-preview-4bit",
trust_remote_code=True,
dtype=torch.bfloat16,
device_map={"": "cuda"},
)
moondream.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
Alternative: Manual Loading
If you prefer more control, you can load the model directly:
import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights
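# Assumption: config.py, moondream.py, and weights.py are the modeling files
# shipped with this repo; run this script from the downloaded repo root so
# load_weights("./", ...) can find the checkpoint files.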
# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
Skills
The API for all skills remains identical to the original moondream3-preview model.
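For convenience, here is roughly what the other skills look like, assuming the same method names and return dictionaries as the upstream moondream3-preview card (caption, detect, point) and the moondream/image objects from the Quick Start; if anything differs, the upstream documentation is authoritative.
# Captioning ("short" or "normal" length, as in the upstream model)
print(moondream.caption(image, length="short")["caption"])
# Open-vocabulary detection: bounding boxes for the named object
print(moondream.detect(image, "face")["objects"])
# Pointing: (x, y) coordinates for the named object
print(moondream.point(image, "person")["points"])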
License
This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.
Original Copyright (c) M87 Labs, Inc.
Quantization and conversion code: Copyright (c) 2025 Alicius Schröder