# SigLIP2 NaViT Vision Encoder (with Google Pretrained Weights)
This is a SigLIP2 NaViT (Native Resolution Vision Transformer) vision encoder. The encoder weights are initialized from Google's pretrained SigLIP2 checkpoint, while the merger layer is randomly initialized.
## Model Details
- Model Type: Vision Encoder
- Architecture: SigLIP2 with Native Resolution ViT
- Base Checkpoint: Google SigLIP2
- Precision: FP16 (float16) for reduced storage
- Hidden Size: 768
- Number of Layers: 12
- Number of Attention Heads: 12
- Patch Size: 16
- Spatial Merge Size: 2
- Output Hidden Size: 896 (see the merger shape sketch after this list)
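These numbers imply that the merger concatenates each 2×2 patch neighborhood (768 × 2² = 3072 features) and projects it to the 896-dimensional output. The following is a minimal sketch of that shape arithmetic only, assuming a Qwen2-VL-style two-layer MLP merger; the actual module in this checkpoint may be structured differently.

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    # Hypothetical illustration of the shape math implied by the numbers above
    # (hidden 768, spatial merge 2, output 896); not this repo's actual module.
    def __init__(self, hidden_size=768, merge_size=2, out_hidden_size=896):
        super().__init__()
        in_features = hidden_size * merge_size**2  # 768 * 4 = 3072
        self.mlp = nn.Sequential(
            nn.Linear(in_features, in_features),
            nn.GELU(),
            nn.Linear(in_features, out_hidden_size),
        )

    def forward(self, patch_features):
        # patch_features: [num_patches, hidden_size], ordered so that every
        # four consecutive tokens form one 2x2 spatial neighborhood.
        grouped = patch_features.reshape(-1, self.mlp[0].in_features)
        return self.mlp(grouped)  # [num_patches / 4, out_hidden_size]

tokens = torch.randn(1024, 768)            # e.g. a 32 x 32 patch grid
print(PatchMergerSketch()(tokens).shape)   # torch.Size([256, 896])
```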
## Initialization
- Vision Encoder: Initialized from Google's SigLIP2 pretrained checkpoint
- Vision Merger: Randomly initialized (ready for fine-tuning; see the freezing sketch below)
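Because only the merger is new, a natural first fine-tuning step is to freeze the pretrained encoder and train the merger alone. Below is a minimal sketch, assuming the merger's parameters can be identified by a `"merger"` substring in their names; check `model.named_parameters()` for this repo's actual naming.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "wtzhang-nlp/siglip2-navit-google", trust_remote_code=True
)

# Freeze everything, then unfreeze only parameters whose names contain
# "merger". The substring is an assumption about this repo's naming;
# verify against model.named_parameters() before training.
for name, param in model.named_parameters():
    param.requires_grad = "merger" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```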
## Usage
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("wtzhang-nlp/siglip2-navit-google", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("wtzhang-nlp/siglip2-navit-google", trust_remote_code=True)

# Load and process image
image = Image.open("path/to/image.jpg")
inputs = processor(images=image, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output shape: {outputs.last_hidden_state.shape}")
# Expected: [batch_size, num_patches, 896]
```
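The checkpoint is stored in float16 (see Model Details). By default, `from_pretrained` loads weights in float32; passing `torch_dtype` keeps them in half precision, which is one way to reduce memory use for inference:

```python
import torch
from transformers import AutoModel

# Load the float16 weights without upcasting to float32; move to GPU for
# half-precision inference (fp16 on CPU can be slow or unsupported).
model = AutoModel.from_pretrained(
    "wtzhang-nlp/siglip2-navit-google",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda")
```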
## License
Apache 2.0