---
library_name: transformers
pipeline_tag: image-text-to-text
license: cc-by-nc-4.0
tags:
  - multimodal
  - multilingual
  - vllm
  - vlm
  - mllm
language:
  - en
  - multilingual
inference: false
---



By Jina AI

TODO: Update title when ready

Jina VLM v1: Lightweight Vision Language Alignment

GGUF | Blog | Technical Report

A small 🔍 yet mighty 🔥 multimodal 👁️ and multilingual 🌐 vision-language model 🧠

Overview

TODO: Update overview when ready

We introduce jina-vlm-v1, a compact vision-language model with a focus on downstream embedding performance, computational efficiency, text-only performance, and multilingual support. We explore the alignment of an encoder-only vision model with a decoder-only language model, with an emphasis on representation learning in a resource-constrained setting. Our approach employs a straightforward two-stage training strategy with fully unlocked model weights. Images are converted into fixed-size crops via overlapped cropping to enable high-resolution and any-resolution understanding. The crops are then split into patches and embedded into visual features by the vision encoder. The visual features are pooled, projected, and injected into a small language model as visual tokens. We openly release jina-vlm-v1 to facilitate further research in this domain.
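
The overlapped cropping step can be pictured with the sketch below. It is illustrative only: the 384-pixel crop size and the greedy grid reduction are assumptions rather than the model's exact preprocessing, while the crop budget of 12 mirrors the CLI's default `--max-crops`.

```python
import math

from PIL import Image


def overlapped_crops(image: Image.Image, crop_size: int = 384, max_crops: int = 12):
    """Cover an image with fixed-size crops whose strides shrink so neighbours overlap."""
    w, h = image.size
    # how many crops per axis are needed to cover the image at full resolution
    nx = max(1, math.ceil(w / crop_size))
    ny = max(1, math.ceil(h / crop_size))
    # stay within the crop budget by greedily dropping grid columns/rows
    while nx * ny > max_crops:
        if nx >= ny and nx > 1:
            nx -= 1
        elif ny > 1:
            ny -= 1
        else:
            break
    # stride <= crop_size, so adjacent crops overlap instead of leaving gaps
    sx = (w - crop_size) / (nx - 1) if nx > 1 else 0
    sy = (h - crop_size) / (ny - 1) if ny > 1 else 0
    crops = []
    for j in range(ny):
        for i in range(nx):
            left, top = round(i * sx), round(j * sy)
            crops.append(image.crop((left, top, left + crop_size, top + crop_size)))
    return crops
```

Each crop is then patchified by the vision encoder, and the resulting features are pooled and projected into visual tokens for the language model.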

Model Info

Summary of features:

| Feature | Jina VLM v1 |
|---|---|
| Type | VLM - Vision Language Model |
| Modalities | Text, Images |
| Base Text Decoder | Qwen3-1.7B-Base |
| Base Vision Encoder | SigLIP2 So400M |
| Parameters | 2.4B |
| Max Sequence Length | 32768 |
| Single-Vector Dimension | 2048 |
| Attention Mechanisms | FlashAttention2, SDPA, Eager |

TODO: Add ArXiv link when ready

Check out the jina-vlm-v1 technical report for more details on model architecture, training, and evaluation.

Evaluation

General VQA Tasks

| Model Name | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-2 Plus | CharXiv (RQ/DQ) | Overall |
|---|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 82.0 | 81.9 | 83.2 | 90.6 | 71.6 | 778 | 67.2 | 32.3 / 63.5 | 72.3 |
| Qwen2-VL-2B | 74.7 | 73.5 | 79.7 | 89.2* | 64.0* | 809 | 62.4 | 23.3 / 55.0* | 66.4 |
| Qwen3-VL-2B | 76.9 | 77.2 | 79.5 | 92.3* | 71.9* | 858 | 67.3* | 28.8 / 62.3 | 71.6 |
| InternVL3-2B | 78.6 | 80.2 | 77.0 | 87.4* | 67.1* | 835 | 64.6 | 28.3 / 54.7 | 69.2 |
| InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5* | 69.3* | 836 | 68.0 | 31.6 / 65.0 | 71.6 |

Comparison of general visual question answering performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench which uses a 0-1000 scale, normalized to 0-100 for Overall calculation.

Multimodal Comprehension and Real-World Understanding

| Model | MME (sum) | MMB v1.1 (EN) | MMStar | Overall (MM) | RealWorldQA | MME-RW (EN) | R-Bench (dis) | Overall (RW) |
|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 1965.8 | 75.8 | 56.2 | 67.4 | 68.2 | 50.7 | 66.7 | 61.9 |
| Qwen2-VL-2B | 1872.0 | 72.2 | 48.0 | 62.4 | 62.9 | 38.7* | 63.2 | 55.0* |
| Qwen3-VL-2B | 2000.8* | 77.8 | 58.3 | 69.2 | 63.9 | 57.9* | 67.3* | 63.0 |
| InternVL3-2B | 2221.2 | 78.6 | 60.7 | 72.9 | 64.3 | 53.8 | 67.5 | 61.9 |
| InternVL3.5-2B | 2123.3 | 76.6 | 62.7 | 71.7 | 62.0 | 49.7 | 62.4 | 58.0 |

Comparison of generic multimodal understanding and real-world understanding performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except MME which uses a 0-2800 scale, normalized to 0-100 for Overall calculation.

Multi-Image Reasoning and Hallucination

| Model | BLINK (val) | MuirBench | MMT (val) | Overall (MI) | HallBench (avg) | POPE (avg) | Overall (Hall) |
|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 50.1 | 34.7 | 57.2 | 47.3 | 39.1 | 90.3 | 64.7 |
| Qwen2-VL-2B | 44.4 | 25.5* | 55.1 | 41.7 | 41.7 | 87.9* | 64.8 |
| Qwen3-VL-2B | 53.8 | 47.4 | 60.0* | 53.7 | 44.5 | 88.9* | 66.7 |
| InternVL3-2B | 50.3 | 38.8 | 59.5 | 49.5 | 42.5 | 89.6 | 66.1 |
| InternVL3.5-2B | 51.3 | 44.0 | 58.5 | 51.3 | 48.6 | 87.2 | 67.9 |

Comparison of multi-image and hallucination performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).

Multimodal Reasoning and Mathematics

| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 45.6 | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | 33.1 |
| Qwen2-VL-2B | 41.1 | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | 25.3 |
| Qwen3-VL-2B | 53.4 | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | 38.7 |
| InternVL3-2B | 48.6 | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | 35.3 |
| InternVL3.5-2B | 59.0 | 71.8 / 61.5† | 42.8 / 26.5† | 53.4 / 35.3† | 48.5 / 19.1† | 47.7 / 41.4† | 50.7 |

Comparison of multimodal reasoning and mathematical problem-solving performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. † indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).

Text-Only Performance

| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
|---|---|---|---|---|---|
| jina-vlm-v1 | 56.1 | 30.3 | 69.6 | 76.0 | 59.4 |
| Qwen3-1.7B | 62.6 | – | 75.3 | – | 59.0 |

Comparison of text-only benchmarks. Results are collected using our evaluation code. All scores represent accuracy (%).

Multimodal Multilingual Understanding

| Model Name | MMMB ar | MMMB cn | MMMB en | MMMB pt | MMMB ru | MMMB tr | MMMB avg | MMBench ar | MMBench cn | MMBench en | MMBench pt | MMBench ru | MMBench tr | MMBench avg | MTVQA | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 76.9 | 80.0 | 82.0 | 79.2 | 79.2 | 75.5 | 78.8 | 70.0 | 75.9 | 78.8 | 74.7 | 75.3 | 71.1 | 74.3 | 25.6 | 59.6 |
| Qwen2-VL-2B | 68.3 | 74.2 | 78.3 | 72.6 | 72.8 | 61.8 | 71.3 | 66.7 | 67.0 | 71.1 | 72.1 | 69.9 | 69.3 | 69.4 | 20.6 | 53.8 |
| Qwen3-VL-2B | 72.7* | 75.7* | 80.7* | 75.0* | 75.9* | 68.5* | 75.0* | 66.2* | 75.7* | 77.8* | 71.4* | 75.9* | 67.0* | 72.3* | 27.3* | 58.2 |
| InternVL3-2B | 68.6 | 78.3 | 81.9 | 75.4 | 74.6 | 62.9 | 73.6 | 66.4 | 77.8 | 81.3 | 75.9 | 70.7 | 59.5 | 71.9 | 26.7 | 57.4 |
| InternVL3.5-2B | 68.5 | 77.7 | 80.2 | 75.9 | 76.3 | 69.1 | 74.6 | 63.7 | 75.9 | 78.4 | 73.7 | 71.4 | 62.0 | 70.9 | 28.5 | 58.0 |

Comparison of multilingual multimodal understanding performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).

Embedding Performance

| Task / Metric | Qwen3-VL-2B | Qwen2.5-VL-3B | InternVL3.5-2B | Qwen2-VL-2B | jina-vlm-v1 |
|---|---|---|---|---|---|
| Flickr30kT2I Retrieval (NDCG@10) | 86.9 | 83.8 | 84.6 | 85.8 | 86.0 |
| JinaVDR DocVQA Retrieval (NDCG@5) | 83.1 | 81.1 | 78.2 | 73.6 | 76.9 |
| JinaVDR InfoVQA Retrieval (NDCG@5) | 88.1 | 87.6 | 87.3 | 88.3 | 84.9 |
| Nano DBPedia Retrieval (NDCG@10) | 52.4 | 53.3 | 51.1 | 51.7 | 54.0 |
| Nano FEVER Retrieval (NDCG@10) | 78.3 | 83.2 | 72.8 | 75.1 | 76.3 |
| Nano FiQA2018 Retrieval (NDCG@10) | 40.4 | 45.0 | 40.3 | 45.7 | 35.8 |
| Nano HotpotQA Retrieval (NDCG@10) | 69.5 | 72.1 | 65.5 | 69.9 | 70.1 |
| Nano MS MARCO Retrieval (NDCG@10) | 48.0 | 48.7 | 49.5 | 47.5 | 45.7 |
| Nano NFCorpus Retrieval (NDCG@10) | 31.7 | 34.4 | 34.0 | 30.7 | 32.9 |
| Nano NQ Retrieval (NDCG@10) | 49.3 | 51.4 | 48.2 | 48.8 | 48.2 |
| Nano SCIDOCS Retrieval (NDCG@10) | 41.6 | 40.7 | 39.1 | 39.0 | 38.7 |
| Nano SciFact Retrieval (NDCG@10) | 73.0 | 78.0 | 73.2 | 70.6 | 77.2 |
| STS12 (Spearman) | 67.3 | 65.1 | 67.4 | 68.3 | 69.3 |
| SciFact (NDCG@10) | 69.7 | 71.2 | 68.2 | 66.0 | 68.5 |
| Vidore ArXivQA Retrieval (NDCG@5) | 74.4 | 80.2 | 74.8 | 75.8 | 74.4 |
| Average | 63.6 | 65.1 | 62.3 | 62.5 | 62.6 |

Single-vector embedding performance after pair training. Higher is better. Averages are macro-averages across all tasks.

Usage

Requirements

The following Python packages are required:

  • torch>=2.9.0
  • torchvision>=0.24.0
  • transformers>=4.57.0
  • pillow>=12.0.0
  • einops>=0.8.1

Optional but recommended packages:

  • flash-attention: recommended for improved inference speed and efficiency, but not required; a small helper for choosing an attention implementation accordingly is sketched below.
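
The following hypothetical helper (not part of this repository) picks an attention implementation depending on whether flash-attn is installed, falling back to SDPA otherwise; "eager" is also supported by the model.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Use FlashAttention2 when flash-attn is installed, otherwise fall back to SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Pass the result as attn_implementation=... to from_pretrained
# (see the Transformers example further below).
```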

Using the CLI

You can directly chat with jina-vlm-v1 using the test_jvlm.py CLI.

Options:

  • -m, --model: Model path (default: '.'). Set this to 'jinaai/jina-vlm-v1' if you are running this script outside this repo.
  • -i, --image: Image path, URL, or glob pattern (can specify multiple times, default: []).
  • -p, --prompt: Text prompt (can specify multiple times, default: 'Describe the image for me in 100 words' or 'Describe the images for me in 100 words' if multiple images are provided).
  • --max-crops: Maximum crops (default: 12).
  • --max-tokens: Maximum output tokens (default: 1024).
  • --max-pixels: Maximum pixels per image; larger images are downscaled with the aspect ratio preserved (default: None). See the sketch after this options list.
  • --stream: Enable streaming (default: False).
  • --image-labels: Enable ordinal text labels after each image (default: False -> no image labels for multi-image).
  • --prompt-first: Place prompt before images instead of after (default: False -> prompt after images).
  • --map: Map mode - apply single prompt to multiple images OR multiple prompts to single image (default: False -> no mapping).
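
As a reference for the --max-pixels option above, the behavior roughly corresponds to the sketch below. This assumes simple area-based downscaling; the CLI's exact resizing code may differ.

```python
import math

from PIL import Image


def resize_to_max_pixels(image: Image.Image, max_pixels: int | None) -> Image.Image:
    """Downscale an image so its area does not exceed max_pixels, preserving aspect ratio."""
    if max_pixels is None:
        return image
    w, h = image.size
    if w * h <= max_pixels:
        return image
    scale = math.sqrt(max_pixels / (w * h))
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return image.resize(new_size, Image.Resampling.LANCZOS)
```
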
```bash
# Single image
python test_jvlm.py -i photo.jpg -p "What's in this image?"

# Single image with streaming
python test_jvlm.py -i photo.jpg -p "What's in this image?" --stream

# Remote image URL
python test_jvlm.py -i https://example.com/image.jpg -p "Describe this image"

# Multiple images (local and remote)
python test_jvlm.py -i img1.jpg -i https://example.com/img2.jpg -i img3.jpg -p "Compare these images"

# Text only input
python test_jvlm.py -p "How many planets are in our solar system?"

# Glob pattern support (quote patterns to prevent shell expansion)
python test_jvlm.py -i "*.jpg" -p "Describe these images"
python test_jvlm.py -i "photos/*.png" -i "images/*.jpg" -p "What do you see in these images?"

# Custom max crops, max pixels and max output tokens
# Reducing max crops and max pixels speeds up inference and lowers mem consumption on large images
python test_jvlm.py -i photo.jpg -p "Describe this picture in detail" --max-crops 8 --max-pixels 500000 --max-tokens 2048

# Prompt position control
python test_jvlm.py -i photo.jpg -p "What's in this image?" --prompt-first

# Map mode: apply one prompt to multiple images
python test_jvlm.py --map -i "*.jpg" -p "What is this?"

# Map mode: apply multiple prompts to one image
python test_jvlm.py --map -i photo_of_a_dog.jpg -p "What breed?" -p "What color?" -p "Happy or sad?"

# Batch inference
# When an equal number of images and prompts (>1) is provided, we assume it is batched inference
# Generation will run in a batch if streaming is disabled, otherwise sequentially
python test_jvlm.py -i photo1.jpg -p "What is shown in this image?" -i photo2.jpg -p "Describe this image"

# Similarly for no images and multiple prompts
python test_jvlm.py -p "What is a neural network?" -p "Describe the concept of polymorphism in Computer Science"
```

Example input:

```bash
python test_jvlm.py -m jinaai/jina-vlm-v1 -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
```

Example output:

```
* Conversation 1/1
├── 🖼️Images: ['assets/the_persistence_of_memory.jpg']
├── 📜Prompt: Describe this picture
├── 💬Chat: User: <|image|>Describe this picture Assistant:
└── 🧠Response: This image is a surrealistic painting by Salvador Dalí, titled "The Persistence of Memory." The painting is characterized by its dreamlike and distorted elements, which are hallmarks of Dalí's style. The central focus of the painting is a melting clock, which is a key symbol in the artwork. The clock is depicted in a state of fluidity, with its hands and numbers melting and flowing as if it is made of wax.

In the foreground, there is a wooden table with a branch extending from it. The branch holds a second clock, which is also melting and dripping. To the left of the table, there is a small, round, orange object that appears to be a pocket watch or a small container.

The background of the painting features a landscape with a calm sea and a rocky cliff. The sky is painted in shades of blue and yellow, suggesting either a sunrise or sunset. The overall color palette of the painting is muted, with earthy tones dominating the scene.

The painting is a prime example of Dalí's use of surrealism, which involves the depiction of bizarre and dreamlike scenes. The melting clocks and distorted forms are typical of Dalí's work, which often explores themes of time, memory, and the subconscious mind. The painting is a testament to Dalí's innovative and imaginative approach to art.
Token usage report:
Input Context Window Layout (max: 40960 tokens):
├── Total: 1753 tokens (4.3%)
├── Image 1 → 1744 tokens (4.3%)
└── Text: 9 tokens (0.0%)

Generated 1 responses in 33.078s
0.03 res/s 8.16 tok/s
Done ✅
```

Using Transformers 🤗

The snippet below adapts the standard transformers generation workflow to jina-vlm-v1. Treat it as a minimal sketch: the Auto classes, the trust_remote_code flag, and the chat-message layout are assumptions, and the bundled test_jvlm.py CLI remains the reference implementation.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "jinaai/jina-vlm-v1"

# Default: load the model on the available device(s).
# NOTE: the Auto class and trust_remote_code flag are assumptions for a model
# shipping custom code; adjust if loading fails.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Enabling flash_attention_2 is recommended for better acceleration and memory
# saving, especially in multi-image scenarios:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
#     trust_remote_code=True,
# )

# Default processor
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("assets/the_persistence_of_memory.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Inference: generate the output and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Further usage examples:

  • Batch inference
  • Multi-image inference
  • Text-only inference
  • Mixed-batch inference
  • Feature extraction (see the sketch below)
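
For feature extraction, here is a generic, assumption-laden sketch (masked mean pooling over the final hidden states of a text-only input; the repository's own feature-extraction example is authoritative). It yields one vector per input, matching the 2048-dimensional single-vector embedding listed in Model Info.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def embed_text(model, processor, text: str) -> torch.Tensor:
    """Sketch: masked mean pooling of the last hidden states into a single vector."""
    inputs = processor(text=[text], return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]                       # (1, seq_len, 2048)
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return F.normalize(pooled, dim=-1)                       # unit-norm embedding


# Usage with the model and processor loaded in the Transformers example above:
# query_vec = embed_text(model, processor, "What is a neural network?")
```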

Using vLLM

Coming soon!

License

The model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.

Contact

Join our Discord community and chat with other community members about ideas.

Citation

TODO: Add citation when ready

If you find jina-vlm-v1 useful in your research, please cite the following paper:

TBD