---
library_name: transformers
pipeline_tag: image-text-to-text
license: cc-by-nc-4.0
tags:
  - multimodal
  - multilingual
  - vllm
  - vlm
  - mllm
language:
  - en
  - multilingual
inference: false
---



By Jina AI

TODO: Update title when ready

Jina VLM v1: Lightweight Vision Language Alignment

GGUF | Blog | Technical Report

A small 🔍 yet mighty 🔥 multimodal 👁️ and multilingual 🌐 vision-language model 🧠

Overview

TODO: Update overview when ready

We introduce jina-vlm-v1, a compact vision-language model with a focus on downstream embedding performance, computational efficiency, text-only performance, and multilingual support. We explore the alignment of an encoder-only vision model with a decoder-only language model, with an emphasis on representation learning in a resource-constrained setting. Our approach employs a straightforward two-stage training strategy with fully unlocked model weights. Images are converted into fixed-size crops via overlapped cropping to enable high-resolution and any-resolution understanding. The crops are then split into patches and embedded into visual features by the vision encoder. The visual features are pooled, projected, and injected into a small language model as visual tokens. We openly release jina-vlm-v1 to facilitate further research in this domain.
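
The overlapped cropping step can be pictured with the sketch below. It is illustrative only: the 384-pixel crop size and the greedy grid reduction are assumptions rather than the model's exact preprocessing, while the crop budget of 12 mirrors the CLI's default `--max-crops`.

```python
import math

from PIL import Image


def overlapped_crops(image: Image.Image, crop_size: int = 384, max_crops: int = 12):
    """Cover an image with fixed-size crops whose strides shrink so neighbours overlap."""
    w, h = image.size
    # how many crops per axis are needed to cover the image at full resolution
    nx = max(1, math.ceil(w / crop_size))
    ny = max(1, math.ceil(h / crop_size))
    # stay within the crop budget by greedily dropping grid columns/rows
    while nx * ny > max_crops:
        if nx >= ny and nx > 1:
            nx -= 1
        elif ny > 1:
            ny -= 1
        else:
            break
    # stride <= crop_size, so adjacent crops overlap instead of leaving gaps
    sx = (w - crop_size) / (nx - 1) if nx > 1 else 0
    sy = (h - crop_size) / (ny - 1) if ny > 1 else 0
    crops = []
    for j in range(ny):
        for i in range(nx):
            left, top = round(i * sx), round(j * sy)
            crops.append(image.crop((left, top, left + crop_size, top + crop_size)))
    return crops
```

Each crop is then patchified by the vision encoder, and the resulting features are pooled and projected into visual tokens for the language model.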

Model Info

Summary of features:

| Feature | Jina VLM v1 |
|---|---|
| Type | VLM - Vision Language Model |
| Modalities | Text, Images |
| Base Text Decoder | Qwen3-1.7B-Base |
| Base Vision Encoder | SigLIP2 So400M |
| Parameters | 2.4B |
| Max Sequence Length | 32768 |
| Single-Vector Dimension | 2048 |
| Attention Mechanisms | FlashAttention2, SDPA, Eager |

TODO: Add ArXiv link when ready

Check out the jina-vlm-v1 technical report for more details on model architecture, training, and evaluation.

Evaluation

General VQA Tasks

| Model Name | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-2 Plus | CharXiv (RQ/DQ) | Overall |
|---|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 82.0 | 81.9 | 83.2 | 90.6 | 71.6 | 778 | 67.2 | 32.3 / 63.5 | 72.3 |
| Qwen2-VL-2B | 74.7 | 73.5 | 79.7 | 89.2* | 64.0* | 809 | 62.4 | 23.3 / 55.0* | 66.4 |
| Qwen3-VL-2B | 76.9 | 77.2 | 79.5 | 92.3* | 71.9* | 858 | 67.3* | 28.8 / 62.3 | 71.6 |
| InternVL3-2B | 78.6 | 80.2 | 77.0 | 87.4* | 67.1* | 835 | 64.6 | 28.3 / 54.7 | 69.2 |
| InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5* | 69.3* | 836 | 68.0 | 31.6 / 65.0 | 71.6 |

Comparison of general visual question answering performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench which uses a 0-1000 scale, normalized to 0-100 for Overall calculation.

Multimodal Comprehension and Real-World Understanding

| Model | MME (sum) | MMB v1.1 (EN) | MMStar | Overall (MM) | RealWorldQA | MME-RW (EN) | R-Bench (dis) | Overall (RW) |
|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 1965.8 | 75.8 | 56.2 | 67.4 | 68.2 | 50.7 | 66.7 | 61.9 |
| Qwen2-VL-2B | 1872.0 | 72.2 | 48.0 | 62.4 | 62.9 | 38.7* | 63.2 | 55.0* |
| Qwen3-VL-2B | 2000.8* | 77.8 | 58.3 | 69.2 | 63.9 | 57.9* | 67.3* | 63.0 |
| InternVL3-2B | 2221.2 | 78.6 | 60.7 | 72.9 | 64.3 | 53.8 | 67.5 | 61.9 |
| InternVL3.5-2B | 2123.3 | 76.6 | 62.7 | 71.7 | 62.0 | 49.7 | 62.4 | 58.0 |

Comparison of generic multimodal understanding and real-world understanding performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except MME which uses a 0-2800 scale, normalized to 0-100 for Overall calculation.

Multi-Image Reasoning and Hallucination

| Model | BLINK (val) | MuirBench | MMT (val) | Overall (MI) | HallBench (avg) | POPE (avg) | Overall (Hall) |
|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 50.1 | 34.7 | 57.2 | 47.3 | 39.1 | 90.3 | 64.7 |
| Qwen2-VL-2B | 44.4 | 25.5* | 55.1 | 41.7 | 41.7 | 87.9* | 64.8 |
| Qwen3-VL-2B | 53.8 | 47.4 | 60.0* | 53.7 | 44.5 | 88.9* | 66.7 |
| InternVL3-2B | 50.3 | 38.8 | 59.5 | 49.5 | 42.5 | 89.6 | 66.1 |
| InternVL3.5-2B | 51.3 | 44.0 | 58.5 | 51.3 | 48.6 | 87.2 | 67.9 |

Comparison of multi-image and hallucination performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).

Multimodal Reasoning and Mathematics

| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 45.6 | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | 33.1 |
| Qwen2-VL-2B | 41.1 | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | 25.3 |
| Qwen3-VL-2B | 53.4 | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | 38.7 |
| InternVL3-2B | 48.6 | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | 35.3 |
| InternVL3.5-2B | 59.0 | 71.8 / 61.5† | 42.8 / 26.5† | 53.4 / 35.3† | 48.5 / 19.1† | 47.7 / 41.4† | 50.7 |

Comparison of multimodal reasoning and mathematical problem-solving performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. † indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).

Text-Only Performance

| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
|---|---|---|---|---|---|
| jina-vlm-v1 | 56.1 | 30.3 | 69.6 | 76.0 | 59.4 |
| Qwen3-1.7B | 62.6 | – | 75.3 | – | 59.0 |

Comparison of text-only benchmarks. Results are collected using our evaluation code. All scores represent accuracy (%).

Multimodal Multilingual Understanding

| Model Name | MMMB ar | MMMB cn | MMMB en | MMMB pt | MMMB ru | MMMB tr | MMMB avg | MMBench ar | MMBench cn | MMBench en | MMBench pt | MMBench ru | MMBench tr | MMBench avg | MTVQA | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-vlm-v1 | 76.9 | 80.0 | 82.0 | 79.2 | 79.2 | 75.5 | 78.8 | 70.0 | 75.9 | 78.8 | 74.7 | 75.3 | 71.1 | 74.3 | 25.6 | 59.6 |
| Qwen2-VL-2B | 68.3 | 74.2 | 78.3 | 72.6 | 72.8 | 61.8 | 71.3 | 66.7 | 67.0 | 71.1 | 72.1 | 69.9 | 69.3 | 69.4 | 20.6 | 53.8 |
| Qwen3-VL-2B | 72.7* | 75.7* | 80.7* | 75.0* | 75.9* | 68.5* | 75.0* | 66.2* | 75.7* | 77.8* | 71.4* | 75.9* | 67.0* | 72.3* | 27.3* | 58.2 |
| InternVL3-2B | 68.6 | 78.3 | 81.9 | 75.4 | 74.6 | 62.9 | 73.6 | 66.4 | 77.8 | 81.3 | 75.9 | 70.7 | 59.5 | 71.9 | 26.7 | 57.4 |
| InternVL3.5-2B | 68.5 | 77.7 | 80.2 | 75.9 | 76.3 | 69.1 | 74.6 | 63.7 | 75.9 | 78.4 | 73.7 | 71.4 | 62.0 | 70.9 | 28.5 | 58.0 |

Comparison of multilingual multimodal understanding performance. Other model results are from their respective papers, except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).

Embedding Performance

| Task / Metric | Qwen3-VL-2B | Qwen2.5-VL-3B | InternVL3.5-2B | Qwen2-VL-2B | jina-vlm-v1 |
|---|---|---|---|---|---|
| Flickr30kT2I Retrieval (NDCG@10) | 86.9 | 83.8 | 84.6 | 85.8 | 86.0 |
| JinaVDR DocVQA Retrieval (NDCG@5) | 83.1 | 81.1 | 78.2 | 73.6 | 76.9 |
| JinaVDR InfoVQA Retrieval (NDCG@5) | 88.1 | 87.6 | 87.3 | 88.3 | 84.9 |
| Nano DBPedia Retrieval (NDCG@10) | 52.4 | 53.3 | 51.1 | 51.7 | 54.0 |
| Nano FEVER Retrieval (NDCG@10) | 78.3 | 83.2 | 72.8 | 75.1 | 76.3 |
| Nano FiQA2018 Retrieval (NDCG@10) | 40.4 | 45.0 | 40.3 | 45.7 | 35.8 |
| Nano HotpotQA Retrieval (NDCG@10) | 69.5 | 72.1 | 65.5 | 69.9 | 70.1 |
| Nano MS MARCO Retrieval (NDCG@10) | 48.0 | 48.7 | 49.5 | 47.5 | 45.7 |
| Nano NFCorpus Retrieval (NDCG@10) | 31.7 | 34.4 | 34.0 | 30.7 | 32.9 |
| Nano NQ Retrieval (NDCG@10) | 49.3 | 51.4 | 48.2 | 48.8 | 48.2 |
| Nano SCIDOCS Retrieval (NDCG@10) | 41.6 | 40.7 | 39.1 | 39.0 | 38.7 |
| Nano SciFact Retrieval (NDCG@10) | 73.0 | 78.0 | 73.2 | 70.6 | 77.2 |
| STS12 (Spearman) | 67.3 | 65.1 | 67.4 | 68.3 | 69.3 |
| SciFact (NDCG@10) | 69.7 | 71.2 | 68.2 | 66.0 | 68.5 |
| Vidore ArXivQA Retrieval (NDCG@5) | 74.4 | 80.2 | 74.8 | 75.8 | 74.4 |
| Average | 63.6 | 65.1 | 62.3 | 62.5 | 62.6 |

Single-vector embedding performance after pair training. Higher is better. Averages are macro-averages across all tasks.

Usage

Requirements

The following Python packages are required:

  • torch>=2.9.0
  • torchvision>=0.24.0
  • transformers>=4.57.0
  • pillow>=12.0.0
  • einops>=0.8.1

Optional but recommended packages:

  • flash-attention: recommended for improved inference speed and efficiency, but not required; a small helper for choosing an attention implementation accordingly is sketched below.
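
The following hypothetical helper (not part of this repository) picks an attention implementation depending on whether flash-attn is installed, falling back to SDPA otherwise; "eager" is also supported by the model.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Use FlashAttention2 when flash-attn is installed, otherwise fall back to SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Pass the result as attn_implementation=... to from_pretrained
# (see the Transformers example further below).
```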

Using the CLI

You can directly chat with jina-vlm-v1 using the test_jvlm.py CLI.

Options:

  • -m, --model: Model path (default: '.'). Set this to 'jinaai/jina-vlm-v1' if you are running this script outside this repo.
  • -i, --image: Image path, URL, or glob pattern (can specify multiple times, default: []).
  • -p, --prompt: Text prompt (can specify multiple times, default: 'Describe the image for me in 100 words' or 'Describe the images for me in 100 words' if multiple images are provided).
  • --max-crops: Maximum crops (default: 12).
  • --max-tokens: Maximum output tokens (default: 1024).
  • --max-pixels: Maximum pixels per image; larger images are downscaled with the aspect ratio preserved (default: None). See the sketch after this options list.
  • --stream: Enable streaming (default: False).
  • --image-labels: Enable ordinal text labels after each image (default: False -> no image labels for multi-image).
  • --prompt-first: Place prompt before images instead of after (default: False -> prompt after images).
  • --map: Map mode - apply single prompt to multiple images OR multiple prompts to single image (default: False -> no mapping).
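
As a reference for the --max-pixels option above, the behavior roughly corresponds to the sketch below. This assumes simple area-based downscaling; the CLI's exact resizing code may differ.

```python
import math

from PIL import Image


def resize_to_max_pixels(image: Image.Image, max_pixels: int | None) -> Image.Image:
    """Downscale an image so its area does not exceed max_pixels, preserving aspect ratio."""
    if max_pixels is None:
        return image
    w, h = image.size
    if w * h <= max_pixels:
        return image
    scale = math.sqrt(max_pixels / (w * h))
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return image.resize(new_size, Image.Resampling.LANCZOS)
```
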
```bash
# Single image
python test_jvlm.py -i photo.jpg -p "What's in this image?"

# Single image with streaming
python test_jvlm.py -i photo.jpg -p "What's in this image?" --stream

# Remote image URL
python test_jvlm.py -i https://example.com/image.jpg -p "Describe this image"

# Multiple images (local and remote)
python test_jvlm.py -i img1.jpg -i https://example.com/img2.jpg -i img3.jpg -p "Compare these images"

# Text only input
python test_jvlm.py -p "How many planets are in our solar system?"

# Glob pattern support (quote patterns to prevent shell expansion)
python test_jvlm.py -i "*.jpg" -p "Describe these images"
python test_jvlm.py -i "photos/*.png" -i "images/*.jpg" -p "What do you see in these images?"

# Custom max crops, max pixels and max output tokens
# Reducing max crops and max pixels speeds up inference and lowers mem consumption on large images
python test_jvlm.py -i photo.jpg -p "Describe this picture in detail" --max-crops 8 --max-pixels 500000 --max-tokens 2048

# Prompt position control
python test_jvlm.py -i photo.jpg -p "What's in this image?" --prompt-first

# Map mode: apply one prompt to multiple images
python test_jvlm.py --map -i "*.jpg" -p "What is this?"

# Map mode: apply multiple prompts to one image
python test_jvlm.py --map -i photo_of_a_dog.jpg -p "What breed?" -p "What color?" -p "Happy or sad?"

# Batch inference
# When an equal number of images and prompts (>1) is provided, we assume it is batched inference
# Generation will run in a batch if streaming is disabled, otherwise sequentially
python test_jvlm.py -i photo1.jpg -p "What is shown in this image?" -i photo2.jpg -p "Describe this image"

# Similarly for no images and multiple prompts
python test_jvlm.py -p "What is a neural network?" -p "Describe the concept of polymorphism in Computer Science"
```

Example input:

```bash
python test_jvlm.py -m jinaai/jina-vlm-v1 -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
```

Example output:

```
* Conversation 1/1
├── 🖼️Images: ['assets/the_persistence_of_memory.jpg']
├── 📜Prompt: Describe this picture
├── 💬Chat: User: <|image|>Describe this picture Assistant:
└── 🧠Response: This image is a surrealistic painting by Salvador Dalí, titled "The Persistence of Memory." The painting is characterized by its dreamlike and distorted elements, which are hallmarks of Dalí's style. The central focus of the painting is a melting clock, which is a key symbol in the artwork. The clock is depicted in a state of fluidity, with its hands and numbers melting and flowing as if it is made of wax.

In the foreground, there is a wooden table with a branch extending from it. The branch holds a second clock, which is also melting and dripping. To the left of the table, there is a small, round, orange object that appears to be a pocket watch or a small container.

The background of the painting features a landscape with a calm sea and a rocky cliff. The sky is painted in shades of blue and yellow, suggesting either a sunrise or sunset. The overall color palette of the painting is muted, with earthy tones dominating the scene.

The painting is a prime example of Dalí's use of surrealism, which involves the depiction of bizarre and dreamlike scenes. The melting clocks and distorted forms are typical of Dalí's work, which often explores themes of time, memory, and the subconscious mind. The painting is a testament to Dalí's innovative and imaginative approach to art.
Token usage report:
Input Context Window Layout (max: 40960 tokens):
├── Total: 1753 tokens (4.3%)
├── Image 1 → 1744 tokens (4.3%)
└── Text: 9 tokens (0.0%)

Generated 1 responses in 33.078s
0.03 res/s 8.16 tok/s
Done ✅
```

Using Transformers 🤗

The snippet below adapts the standard transformers generation workflow to jina-vlm-v1. Treat it as a minimal sketch: the Auto classes, the trust_remote_code flag, and the chat-message layout are assumptions, and the bundled test_jvlm.py CLI remains the reference implementation.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "jinaai/jina-vlm-v1"

# Default: load the model on the available device(s).
# NOTE: the Auto class and trust_remote_code flag are assumptions for a model
# shipping custom code; adjust if loading fails.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Enabling flash_attention_2 is recommended for better acceleration and memory
# saving, especially in multi-image scenarios:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
#     trust_remote_code=True,
# )

# Default processor
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("assets/the_persistence_of_memory.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Inference: generate the output and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Further usage examples:

  • Batch inference
  • Multi-image inference
  • Text-only inference
  • Mixed-batch inference
  • Feature extraction (see the sketch below)
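
For feature extraction, here is a generic, assumption-laden sketch (masked mean pooling over the final hidden states of a text-only input; the repository's own feature-extraction example is authoritative). It yields one vector per input, matching the 2048-dimensional single-vector embedding listed in Model Info.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def embed_text(model, processor, text: str) -> torch.Tensor:
    """Sketch: masked mean pooling of the last hidden states into a single vector."""
    inputs = processor(text=[text], return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]                       # (1, seq_len, 2048)
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return F.normalize(pooled, dim=-1)                       # unit-norm embedding


# Usage with the model and processor loaded in the Transformers example above:
# query_vec = embed_text(model, processor, "What is a neural network?")
```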

Using vLLM

Coming soon!

License

The model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.

Contact

Join our Discord community and chat with other community members about ideas.

Citation

TODO: Add citation when ready

If you find jina-vlm-v1 useful in your research, please cite the following paper:

TBD