Jina AI: Your Search Foundation, Supercharged!

jina-vlm: Small Multilingual Vision Language Model

Blog | API | AWS | Azure | GCP | Arxiv

jina-vlm is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Training data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning high- and moderate-resource languages.

jina-vlm architecture

Built on Qwen3-1.7B-Base with SigLIP2-So400M, it processes images via overlapping tiling with attention-based token pooling that reduces visual tokens by 4x while preserving spatial information. The model achieves the highest average score (72.3) across eight VQA benchmarks while leading on multilingual multimodal understanding (MMMB: 78.8, Multilingual MMBench: 74.3).

| Model | Params | VQA Avg | MMMB | MM-Bench | RealWorldQA |
|----------------|--------|---------|------|----------|-------------|
| jina-vlm       | 2.4B   | 72.3    | 78.8 | 74.3     | 68.2        |
| Qwen2-VL-2B    | 2.2B   | 66.4    | 71.3 | 69.4     | 62.9        |
| Qwen3-VL-2B    | 2.2B   | 71.6    | 75.0 | 72.3     | 63.9        |
| InternVL3-2B   | 2.2B   | 69.2    | 73.6 | 71.9     | 64.3        |
| InternVL3.5-2B | 2.2B   | 71.6    | 74.6 | 70.9     | 62.0        |
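
To make the connector idea above concrete, here is a toy sketch of 4x attention pooling over patch embeddings. It is not the actual jina-vlm connector (the class name, grouping scheme, and dimensions are assumptions for illustration, and the overlapping-tiling step is omitted); see the technical report for the real design.

import torch
import torch.nn as nn

class AttentionPool2x2(nn.Module):
    # Toy 4x token reduction: one learned query attends over each 2x2 patch group.
    # Illustrative only; not the actual jina-vlm connector implementation.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, height, width, dim) grid of vision-encoder patch embeddings
        b, h, w, d = patches.shape
        # regroup into non-overlapping 2x2 windows -> (b * h/2 * w/2, 4, dim)
        groups = (
            patches.view(b, h // 2, 2, w // 2, 2, d)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(-1, 4, d)
        )
        q = self.query.expand(groups.size(0), -1, -1)
        pooled, _ = self.attn(q, groups, groups)        # (b * h/2 * w/2, 1, dim)
        return pooled.view(b, (h // 2) * (w // 2), d)   # 4x fewer visual tokens

pool = AttentionPool2x2(dim=1152)                # 1152: assumed SigLIP2-So400M hidden size
vis_tokens = pool(torch.randn(1, 32, 32, 1152))  # 1024 patch tokens -> 256 pooled tokens
print(vis_tokens.shape)                          # torch.Size([1, 256, 1152])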

Via Jina API

We provide an OpenAI-compatible API at https://api-beta-vlm.jina.ai. All requests require a Jina API key in the Authorization header; you can get your API key at jina.ai.

Images can be supplied either as an HTTP(S) URL or as a base64 data URI:

| Format | Example |
|------------------|----------------------------------------|
| HTTP/HTTPS URL   | https://example.com/image.jpg          |
| Base64 data URI  | data:image/jpeg;base64,/9j/4AAQ...     |

Image from URL

curl https://api-beta-vlm.jina.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d '{
    "model": "jina-vlm",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'

Local image (base64)

curl https://api-beta-vlm.jina.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d '{
    "model": "jina-vlm",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$(base64 -i image.jpg)'"}}
      ]
    }]
  }'
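
Note: base64 -i image.jpg is the macOS invocation; on Linux with GNU coreutils, use base64 -w 0 image.jpg instead so the output is not line-wrapped.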

Text-only query

curl https://api-beta-vlm.jina.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d '{
    "model": "jina-vlm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Streaming response

Add "stream": true to receive tokens as they're generated:

curl https://api-beta-vlm.jina.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d '{
    "model": "jina-vlm",
    "stream": true,
    "messages": [{"role": "user", "content": "Write a haiku about coding"}]
  }'
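
Python (OpenAI client)

Since the API is OpenAI-compatible, you can also call it with the official openai Python client by pointing base_url at the endpoint. A minimal sketch (assumes pip install openai):

import os
from openai import OpenAI

# Point the standard OpenAI client at the jina-vlm endpoint
client = OpenAI(
    base_url='https://api-beta-vlm.jina.ai/v1',
    api_key=os.environ['JINA_API_KEY'],
)

response = client.chat.completions.create(
    model='jina-vlm',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/photo.jpg'}},
        ],
    }],
)
print(response.choices[0].message.content)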

When the service is cold starting, you'll receive:

{
  "error": {
    "message": "Model is loading, please retry in 30-60 seconds. Cold start takes ~30s after the service scales up.",
    "code": 503
  }
}

Simply retry your request after waiting.
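
If you prefer to handle the retry in code, a simple loop is enough. An illustrative sketch using the requests library:

import os
import time

import requests

url = 'https://api-beta-vlm.jina.ai/v1/chat/completions'
headers = {
    'Content-Type': 'application/json',
    'Authorization': f"Bearer {os.environ['JINA_API_KEY']}",
}
payload = {
    'model': 'jina-vlm',
    'messages': [{'role': 'user', 'content': 'What is the capital of France?'}],
}

# Retry while the service reports a cold start (HTTP 503)
for attempt in range(5):
    resp = requests.post(url, headers=headers, json=payload, timeout=120)
    if resp.status_code != 503:
        break
    time.sleep(30)  # cold start takes ~30s after the service scales up

print(resp.json())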

Local Installation

Install the dependencies with uv:

uv sync

For CUDA users with FlashAttention2 support:

uv sync --extra flash-attn

Using the CLI

You can chat with jina-vlm directly from the command line using the infer.py CLI:

# Single image
python infer.py -i image.jpg -p "What's in this image?"

# Streaming output
python infer.py -i image.jpg -p "Describe this image" --stream

# Multiple images
python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"

# Text-only
python infer.py -p "What is the capital of France?"

Options:

  • -m, --model: Model path. Auto-detects local repo (if config.json exists) or falls back to jinaai/jina-vlm from HuggingFace.
  • -i, --image: Image path, URL, or glob pattern (can specify multiple times).
  • -p, --prompt: Text prompt (can specify multiple times).
  • --max-crops: Maximum crops (default: 12).
  • --max-tokens: Maximum output tokens (default: 1024).
  • --max-pixels: Max pixels per image; larger images are resized while preserving the aspect ratio.
  • --stream: Enable streaming output.

Example:

python infer.py -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
* Conversation 1/1
β”œβ”€β”€ πŸ–ΌοΈImages: ['the_persistence_of_memory.jpg']
β”œβ”€β”€ πŸ“œPrompt: Describe this picture
└── 🧠Response: This image is a surreal painting
by Salvador DalΓ­, titled "The Persistence of
Memory." It features a dreamlike landscape with
a variety of melting clocks and other objects.
The central focus is a melting clock with a blue
face and yellow hands, which is hanging from a
branch...

Token usage: 1753 tokens (4.3%)
Generated in 8.68s | 20.04 tok/s

Using Transformers

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Load the processor and model from the Hugging Face Hub
# (trust_remote_code is required for the custom architecture)
processor = AutoProcessor.from_pretrained(
    'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'jinaai/jina-vlm',
    device_map='auto',
    trust_remote_code=True
)

image = 'https://picsum.photos/800/600'
conversation = [
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'image': image},
            {'type': 'text', 'text': 'Describe this image'},
        ],
    }
]

# Build the chat prompt from the conversation and preprocess text and image together
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output = model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    return_dict_in_generate=True,
    use_model_defaults=True,
)

# Decode only the newly generated tokens (everything after the input prompt)
response = processor.tokenizer.decode(
    output.sequences[0][inputs['input_ids'].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Multi-image inference

images = ['https://picsum.photos/id/1/800/600', 'https://picsum.photos/id/2/800/600']
conversation = [
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'image': images[0]},
            {'type': 'image', 'image': images[1]},
            {'type': 'text', 'text': 'What is the difference between these images?'},
        ],
    }
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output = model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    return_dict_in_generate=True,
    use_model_defaults=True,
)
response = processor.tokenizer.decode(
    output.sequences[0][inputs['input_ids'].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Text-only inference

conversation = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Explain quantum computing in simple terms'},
        ],
    }
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output = model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    return_dict_in_generate=True,
    use_model_defaults=True,
)
response = processor.tokenizer.decode(
    output.sequences[0][inputs['input_ids'].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Batch inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

processor = AutoProcessor.from_pretrained(
    'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'jinaai/jina-vlm',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',  # requires the flash-attn package (uv sync --extra flash-attn)
    trust_remote_code=True
)

images = [
    'https://picsum.photos/id/22/800/600',
    'https://picsum.photos/id/49/800/600'
]
conversations = [
    [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': images[0]},
                {'type': 'text', 'text': 'What is the man doing in this image?'},
            ],
        }
    ],
    [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': images[1]},
                {'type': 'text', 'text': 'What country\'s flag is in this image?'},
            ],
        }
    ],
]

texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output = model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    return_dict_in_generate=True,
    use_model_defaults=True,
)

for idx in range(len(output.sequences)):
    gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
    response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
    print(f"Response {idx+1}: {response}")
Batch inference with mixed examples
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

processor = AutoProcessor.from_pretrained(
    'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'jinaai/jina-vlm',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',
    trust_remote_code=True
)

images = [
    ['https://picsum.photos/id/22/800/600'],
    ['https://picsum.photos/id/49/800/600'],
    ['https://picsum.photos/id/0/800/600', 'https://picsum.photos/id/2/800/600'],
    [],
]
conversations = [
    [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': images[0][0]},
                {'type': 'text', 'text': 'What is the man doing in this image?'},
            ],
        }
    ],
    [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': images[1][0]},
                {'type': 'text', 'text': 'What country\'s flag is in this image?'},
            ],
        }
    ],
    [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': images[2][0]},
                {'type': 'image', 'image': images[2][1]},
                {'type': 'text', 'text': 'What is the difference between these two images?'},
            ],
        }
    ],
    [
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe the concept of polymorphism in Computer Science'},
            ],
        }
    ],
]

texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output = model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    return_dict_in_generate=True,
    use_model_defaults=True,
)

for idx in range(len(output.sequences)):
    gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
    response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
    print(f"Response {idx+1}: {response}")

Evaluation

Multilingual Understanding

| Model | MMMB (ar) | MMMB (cn) | MMMB (en) | MMMB avg | MMBench avg | Overall |
|----------------|------|------|------|------|------|------|
| jina-vlm       | 76.9 | 80.0 | 82.0 | 78.8 | 74.3 | 59.6 |
| Qwen2-VL-2B    | 68.3 | 74.2 | 78.3 | 71.3 | 69.4 | 53.8 |
| Qwen3-VL-2B    | 72.7 | 75.7 | 80.7 | 75.0 | 72.3 | 58.2 |
| InternVL3-2B   | 68.6 | 78.3 | 81.9 | 73.6 | 71.9 | 57.4 |
| InternVL3.5-2B | 68.5 | 77.7 | 80.2 | 74.6 | 70.9 | 58.0 |

General VQA Tasks

| Model | AI2D | ChartQA | TextVQA | DocVQA | InfoVQA | OCRBench | SEED-2+ | CharXiv | Avg |
|----------------|------|------|------|------|------|-----|------|-----------|------|
| jina-vlm       | 82.0 | 81.9 | 83.2 | 90.6 | 71.6 | 778 | 67.2 | 32.3/63.5 | 72.3 |
| Qwen2-VL-2B    | 74.7 | 73.5 | 79.7 | 89.2 | 64.0 | 809 | 62.4 | 23.3/55.0 | 66.4 |
| Qwen3-VL-2B    | 76.9 | 77.2 | 79.5 | 92.3 | 71.9 | 858 | 67.3 | 28.8/62.3 | 71.6 |
| InternVL3-2B   | 78.6 | 80.2 | 77.0 | 87.4 | 67.1 | 835 | 64.6 | 28.3/54.7 | 69.2 |
| InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5 | 69.3 | 836 | 68.0 | 31.6/65.0 | 71.6 |

Text-Only Performance

| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
|------------|------|------|------|------|------|
| jina-vlm   | 56.1 | 30.3 | 71.3 | 77.3 | 59.4 |
| Qwen3-1.7B | 62.6 | 46.4 | 75.3 | 73.4 | 59.0 |

Citation

If you find jina-vlm useful in your research, please cite our technical report:

@misc{koukounas2025jinavlm,
    title={Jina-VLM: Small Multilingual Vision Language Model},
    author={Andreas Koukounas and Georgios Mastrapas and Florian HΓΆnicke and Sedigheh Eslami and Guillaume Roncari and Scott Martens and Han Xiao},
    year={2025},
    eprint={2512.04032},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2512.04032},
}

License

jina-vlm is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.
