# SigLIP2 NaViT Vision Encoder (with Google Pretrained Weights)
This is a SigLIP2 NaViT (Native Resolution Vision Transformer) vision encoder. The encoder weights are initialized from Google's pretrained SigLIP2 checkpoint, while the merger layer is randomly initialized.
## Model Details
- Model Type: Vision Encoder
- Architecture: SigLIP2 with Native Resolution ViT
- Base Checkpoint: Google SigLIP2
- Precision: FP16 (float16) for reduced storage
- Hidden Size: 768
- Number of Layers: 12
- Number of Attention Heads: 12
- Patch Size: 16
- Spatial Merge Size: 2
- Output Hidden Size: 896 (see the merger shape sketch after this list)
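These numbers imply that the merger concatenates each 2×2 patch neighborhood (768 × 2² = 3072 features) and projects it to the 896-dimensional output. The following is a minimal sketch of that shape arithmetic only, assuming a Qwen2-VL-style two-layer MLP merger; the actual module in this checkpoint may be structured differently.

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    # Hypothetical illustration of the shape math implied by the numbers above
    # (hidden 768, spatial merge 2, output 896); not this repo's actual module.
    def __init__(self, hidden_size=768, merge_size=2, out_hidden_size=896):
        super().__init__()
        in_features = hidden_size * merge_size**2  # 768 * 4 = 3072
        self.mlp = nn.Sequential(
            nn.Linear(in_features, in_features),
            nn.GELU(),
            nn.Linear(in_features, out_hidden_size),
        )

    def forward(self, patch_features):
        # patch_features: [num_patches, hidden_size], ordered so that every
        # four consecutive tokens form one 2x2 spatial neighborhood.
        grouped = patch_features.reshape(-1, self.mlp[0].in_features)
        return self.mlp(grouped)  # [num_patches / 4, out_hidden_size]

tokens = torch.randn(1024, 768)            # e.g. a 32 x 32 patch grid
print(PatchMergerSketch()(tokens).shape)   # torch.Size([256, 896])
```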
## Initialization
- Vision Encoder: Initialized from Google's SigLIP2 pretrained checkpoint
- Vision Merger: Randomly initialized (ready for fine-tuning; see the freezing sketch below)
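Because only the merger is new, a natural first fine-tuning step is to freeze the pretrained encoder and train the merger alone. Below is a minimal sketch, assuming the merger's parameters can be identified by a `"merger"` substring in their names; check `model.named_parameters()` for this repo's actual naming.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "wtzhang-nlp/siglip2-navit-google", trust_remote_code=True
)

# Freeze everything, then unfreeze only parameters whose names contain
# "merger". The substring is an assumption about this repo's naming;
# verify against model.named_parameters() before training.
for name, param in model.named_parameters():
    param.requires_grad = "merger" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```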
## Usage
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("wtzhang-nlp/siglip2-navit-google", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("wtzhang-nlp/siglip2-navit-google", trust_remote_code=True)

# Load and process image
image = Image.open("path/to/image.jpg")
inputs = processor(images=image, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output shape: {outputs.last_hidden_state.shape}")
# Expected: [batch_size, num_patches, 896]
```
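The checkpoint is stored in float16 (see Model Details). By default, `from_pretrained` loads weights in float32; passing `torch_dtype` keeps them in half precision, which is one way to reduce memory use for inference:

```python
import torch
from transformers import AutoModel

# Load the float16 weights without upcasting to float32; move to GPU for
# half-precision inference (fp16 on CPU can be slow or unsupported).
model = AutoModel.from_pretrained(
    "wtzhang-nlp/siglip2-navit-google",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda")
```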
## License
Apache 2.0