.gitignore DELETED
@@ -1 +0,0 @@
1
- .DS_Store
 
 
README.md CHANGED
@@ -5,530 +5,379 @@ license: cc-by-nc-4.0
5
  tags:
6
  - multimodal
7
  - multilingual
 
8
  - vlm
9
- - vision-language
10
- - qwen3
11
- - siglip2
12
  language:
13
  - en
14
- - zh
15
- - ar
16
- - pt
17
- - ru
18
- - tr
19
- - de
20
- - es
21
- - fr
22
- - it
23
- - ja
24
- - ko
25
- - vi
26
- - th
27
- - id
28
- - hi
29
- - bn
30
- - nl
31
- - pl
32
- - sv
33
- - fi
34
- - da
35
- - "no"
36
- - cs
37
- - el
38
- - he
39
- - uk
40
- - ro
41
- - hu
42
  - multilingual
43
- base_model:
44
- - Qwen/Qwen3-1.7B-Base
45
- - google/siglip2-so400m-patch14-384
46
  inference: false
47
  ---
 
48
 
49
  <p align="center">
50
- <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
51
  </p>
52
 
53
- # jina-vlm: Small Multilingual Vision Language Model
 
 
54
 
55
- [Blog](https://jina.ai/news/jina-vlm-small-multilingual-vision-language-model/) | API | AWS | Azure | GCP | [Arxiv](https://arxiv.org/abs/2512.04032)
56
 
57
- `jina-vlm` is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Training data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning high- and moderate-resource languages.
58
 
59
- ![jina-vlm architecture](./assets/jvlm_architecture.png)
60
 
61
- Built on [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) with [SigLIP2-So400M](https://huggingface.co/google/siglip2-so400m-patch14-384), it processes images via overlapping tiling with attention-based token pooling that reduces visual tokens by 4x while preserving spatial information. The model achieves the highest average score (72.3) across eight VQA benchmarks while leading on multilingual multimodal understanding (MMMB: 78.8, Multilingual MMBench: 74.3).
62
 
63
- | Model | Params | VQA Avg | MMMB | MM-Bench | RealWorld QA |
64
- |-------|--------|---------|------|----------|--------------|
65
- | **jina-vlm** | 2.4B | **72.3** | **78.8** | **74.3** | **68.2** |
66
- | Qwen2-VL-2B | 2.2B | 66.4 | 71.3 | 69.4 | 62.9 |
67
- | Qwen3-VL-2B | 2.2B | 71.6 | 75.0 | 72.3 | 63.9 |
68
- | InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 64.3 |
69
- | InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 62.0 |
70
 
 
71
 
72
- ## Via Jina API
73
 
74
- We provide an OpenAI-compatible API at `https://api-beta-vlm.jina.ai`. All requests require a Jina API key in the Authorization header, get your API key at [jina.ai](https://jina.ai).
75
 
76
 
77
- ### Image from URL
78
 
79
- | Format | Example |
80
- |--------|---------|
81
- | HTTP/HTTPS URL | `https://example.com/image.jpg` |
82
- | Base64 data URI | `data:image/jpeg;base64,/9j/4AAQ...` |
83
 
84
- ```bash
85
- curl https://api-beta-vlm.jina.ai/v1/chat/completions \
86
- -H "Content-Type: application/json" \
87
- -H "Authorization: Bearer $JINA_API_KEY" \
88
- -d '{
89
- "model": "jina-vlm",
90
- "messages": [{
91
- "role": "user",
92
- "content": [
93
- {"type": "text", "text": "Describe this image"},
94
- {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
95
- ]
96
- }]
97
- }'
98
- ```
99
 
100
 
101
- ### Local image (base64)
102
 
103
- ```bash
104
- curl https://api-beta-vlm.jina.ai/v1/chat/completions \
105
- -H "Content-Type: application/json" \
106
- -H "Authorization: Bearer $JINA_API_KEY" \
107
- -d '{
108
- "model": "jina-vlm",
109
- "messages": [{
110
- "role": "user",
111
- "content": [
112
- {"type": "text", "text": "What is in this image?"},
113
- {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$(base64 -i image.jpg)'"}}
114
- ]
115
- }]
116
- }'
117
- ```
118
 
 
119
 
120
- ### Text-only query
 
 
 
 
 
 
 
 
 
121
 
122
- ```bash
123
- curl https://api-beta-vlm.jina.ai/v1/chat/completions \
124
- -H "Content-Type: application/json" \
125
- -H "Authorization: Bearer $JINA_API_KEY" \
126
- -d '{
127
- "model": "jina-vlm",
128
- "messages": [{"role": "user", "content": "What is the capital of France?"}]
129
- }'
130
- ```
131
 
132
- ### Streaming response
133
 
134
- Add `"stream": true` to receive tokens as they're generated:
135
 
136
- ```bash
137
- curl https://api-beta-vlm.jina.ai/v1/chat/completions \
138
- -H "Content-Type: application/json" \
139
- -H "Authorization: Bearer $JINA_API_KEY" \
140
- -d '{
141
- "model": "jina-vlm",
142
- "stream": true,
143
- "messages": [{"role": "user", "content": "Write a haiku about coding"}]
144
- }'
145
- ```
146
 
147
- When the service is cold starting, you'll receive:
148
 
149
- ```json
150
- {
151
- "error": {
152
- "message": "Model is loading, please retry in 30-60 seconds. Cold start takes ~30s after the service scales up.",
153
- "code": 503
154
- }
155
- }
156
- ```
157
 
158
- Simply retry your request after waiting.
159
 
 
160
 
161
- ## Local Installation
 
 
 
 
 
 
162
 
163
- ```bash
164
- uv sync
165
- ```
166
 
167
- For CUDA users with FlashAttention2 support:
168
- ```bash
169
- uv sync --extra flash-attn
170
- ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
 
172
  ### Using the CLI
173
 
174
- You can directly chat with `jina-vlm` using the `infer.py` CLI:
 
 
 
 
 
 
 
 
 
 
 
 
175
 
176
  ```bash
177
  # Single image
178
- python infer.py -i image.jpg -p "What's in this image?"
179
 
180
- # Streaming output
181
- python infer.py -i image.jpg -p "Describe this image" --stream
182
 
183
- # Multiple images
184
- python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"
185
 
186
- # Text-only
187
- python infer.py -p "What is the capital of France?"
188
- ```
189
 
190
- **Options:**
191
- - `-m, --model`: Model path. Auto-detects local repo (if `config.json` exists) or falls back to `jinaai/jina-vlm` from HuggingFace.
192
- - `-i, --image`: Image path, URL, or glob pattern (can specify multiple times).
193
- - `-p, --prompt`: Text prompt (can specify multiple times).
194
- - `--max-crops`: Maximum crops (default: 12).
195
- - `--max-tokens`: Maximum output tokens (default: 1024).
196
- - `--max-pixels`: Max pixels per image, larger images are resized preserving aspect ratio.
197
- - `--stream`: Enable streaming output.
198
 
199
- **Example:**
 
 
200
 
201
- ```bash
202
- python infer.py -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
203
- ```
204
 
205
- <table>
206
- <tr>
207
- <td width="40%"><b>Input</b></td>
208
- <td width="60%"><b>Output</b></td>
209
- </tr>
210
- <tr>
211
- <td><img src="./assets/the_persistence_of_memory.jpg" width="100%"></td>
212
- <td>
213
 
214
- ```
215
- * Conversation 1/1
216
- ├── 🖼️Images: ['the_persistence_of_memory.jpg']
217
- ├── 📜Prompt: Describe this picture
218
- └── 🧠Response: This image is a surreal painting
219
- by Salvador Dalí, titled "The Persistence of
220
- Memory." It features a dreamlike landscape with
221
- a variety of melting clocks and other objects.
222
- The central focus is a melting clock with a blue
223
- face and yellow hands, which is hanging from a
224
- branch...
225
-
226
- Token usage: 1753 tokens (4.3%)
227
- Generated in 8.68s | 20.04 tok/s
228
- ```
229
 
230
- </td>
231
- </tr>
232
- </table>
233
 
234
- ### Using Transformers
 
 
 
235
 
236
- ```python
237
- import torch
238
- from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
239
 
240
- processor = AutoProcessor.from_pretrained(
241
- 'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
242
- )
243
- model = AutoModelForCausalLM.from_pretrained(
244
- 'jinaai/jina-vlm',
245
- device_map='auto',
246
- trust_remote_code=True
247
- )
248
 
249
- image = 'https://picsum.photos/800/600'
250
- conversation = [
251
- {
252
- 'role': 'user',
253
- 'content': [
254
- {'type': 'image', 'image': image},
255
- {'type': 'text', 'text': 'Describe this image'},
256
- ],
257
- }
258
- ]
259
 
260
- text = processor.apply_chat_template(conversation, add_generation_prompt=True)
261
- inputs = processor(text=[text], images=[image], padding='longest', return_tensors='pt')
262
- inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
263
 
264
- output = model.generate(
265
- **inputs,
266
- generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
267
- return_dict_in_generate=True,
268
- use_model_defaults=True,
269
- )
270
 
271
- response = processor.tokenizer.decode(
272
- output.sequences[0][inputs['input_ids'].shape[-1]:],
273
- skip_special_tokens=True
274
- )
275
- print(response)
 
 
 
 
 
276
  ```
277
 
278
- <details>
279
- <summary>Multi-image inference</summary>
280
 
281
  ```python
282
- images = ['https://picsum.photos/id/1/800/600', 'https://picsum.photos/id/2/800/600']
283
- conversation = [
284
- {
285
- 'role': 'user',
286
- 'content': [
287
- {'type': 'image', 'image': images[0]},
288
- {'type': 'image', 'image': images[1]},
289
- {'type': 'text', 'text': 'What is the difference between these images?'},
290
- ],
291
- }
292
- ]
293
- text = processor.apply_chat_template(conversation, add_generation_prompt=True)
294
- inputs = processor(text=[text], images=images, padding='longest', return_tensors='pt')
295
- inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
296
-
297
- output = model.generate(
298
- **inputs,
299
- generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
300
- return_dict_in_generate=True,
301
- use_model_defaults=True,
302
- )
303
- response = processor.tokenizer.decode(
304
- output.sequences[0][inputs['input_ids'].shape[-1]:],
305
- skip_special_tokens=True
306
  )
307
- print(response)
308
- ```
309
 
310
- </details>
 
 
 
 
 
 
311
 
312
- <details>
313
- <summary>Text-only inference</summary>
314
 
315
- ```python
316
- conversation = [
 
 
 
 
 
317
  {
318
- 'role': 'user',
319
- 'content': [
320
- {'type': 'text', 'text': 'Explain quantum computing in simple terms'},
 
 
 
 
321
  ],
322
  }
323
  ]
324
- text = processor.apply_chat_template(conversation, add_generation_prompt=True)
325
- inputs = processor(text=[text], padding='longest', return_tensors='pt')
326
- inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
327
-
328
- output = model.generate(
329
- **inputs,
330
- generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
331
- return_dict_in_generate=True,
332
- use_model_defaults=True,
333
- )
334
- response = processor.tokenizer.decode(
335
- output.sequences[0][inputs['input_ids'].shape[-1]:],
336
- skip_special_tokens=True
337
- )
338
- print(response)
339
- ```
340
-
341
- </details>
342
 
343
- <details>
344
- <summary>Batch inference</summary>
345
-
346
- ```python
347
- import torch
348
- from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
349
-
350
- processor = AutoProcessor.from_pretrained(
351
- 'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
352
  )
353
- model = AutoModelForCausalLM.from_pretrained(
354
- 'jinaai/jina-vlm',
355
- device_map='auto',
356
- torch_dtype=torch.bfloat16,
357
- attn_implementation='flash_attention_2',
358
- trust_remote_code=True
 
359
  )
 
360
 
361
- images = [
362
- 'https://picsum.photos/id/22/800/600',
363
- 'https://picsum.photos/id/49/800/600'
364
- ]
365
- conversations = [
366
- [
367
- {
368
- 'role': 'user',
369
- 'content': [
370
- {'type': 'image', 'image': images[0]},
371
- {'type': 'text', 'text': 'What is the man doing in this image?'},
372
- ],
373
- }
374
- ],
375
- [
376
- {
377
- 'role': 'user',
378
- 'content': [
379
- {'type': 'image', 'image': images[1]},
380
- {'type': 'text', 'text': 'What country\'s flag is in this image?'},
381
- ],
382
- }
383
- ],
384
  ]
385
-
386
- texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
387
- inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
388
- inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
389
-
390
- output = model.generate(
391
- **inputs,
392
- generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
393
- return_dict_in_generate=True,
394
- use_model_defaults=True,
395
  )
396
-
397
- for idx in range(len(output.sequences)):
398
- gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
399
- response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
400
- print(f"Response {idx+1}: {response}")
401
  ```
402
 
 
 
403
  </details>
404
 
405
  <details>
406
- <summary>Batch inference with mixed examples</summary>
407
-
408
- ```python
409
- import torch
410
- from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
411
-
412
- processor = AutoProcessor.from_pretrained(
413
- 'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
414
- )
415
- model = AutoModelForCausalLM.from_pretrained(
416
- 'jinaai/jina-vlm',
417
- device_map='auto',
418
- torch_dtype=torch.bfloat16,
419
- attn_implementation='flash_attention_2',
420
- trust_remote_code=True
421
- )
422
 
423
- images = [
424
- ['https://picsum.photos/id/22/800/600'],
425
- ['https://picsum.photos/id/49/800/600'],
426
- ['https://picsum.photos/id/0/800/600', 'https://picsum.photos/id/2/800/600'],
427
- [],
428
- ]
429
- conversations = [
430
- [
431
- {
432
- 'role': 'user',
433
- 'content': [
434
- {'type': 'image', 'image': images[0][0]},
435
- {'type': 'text', 'text': 'What is the man doing in this image?'},
436
- ],
437
- }
438
- ],
439
- [
440
- {
441
- 'role': 'user',
442
- 'content': [
443
- {'type': 'image', 'image': images[1][0]},
444
- {'type': 'text', 'text': 'What country\'s flag is in this image?'},
445
- ],
446
- }
447
- ],
448
- [
449
- {
450
- 'role': 'user',
451
- 'content': [
452
- {'type': 'image', 'image': images[2][0]},
453
- {'type': 'image', 'image': images[2][1]},
454
- {'type': 'text', 'text': 'What is the difference between these two images?'},
455
- ],
456
- }
457
- ],
458
- [
459
- {
460
- 'role': 'user',
461
- 'content': [
462
- {'type': 'text', 'text': 'Describe the concept of polymorphism in Computer Science'},
463
- ],
464
- }
465
- ],
466
- ]
467
 
468
- texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
469
- inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
470
- inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
471
 
472
- output = model.generate(
473
- **inputs,
474
- generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
475
- return_dict_in_generate=True,
476
- use_model_defaults=True,
477
- )
478
 
479
- for idx in range(len(output.sequences)):
480
- gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
481
- response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
482
- print(f"Response {idx+1}: {response}")
483
- ```
484
 
485
- </details>
486
 
487
- ## Evaluation
488
 
489
- ### Multilingual Understanding
490
 
491
- | Model | MMMB ar | MMMB cn | MMMB en | MMMB avg | MMBench avg | Overall |
492
- |-------|---------|---------|---------|----------|-------------|---------|
493
- | **jina-vlm** | **76.9** | **80.0** | **82.0** | **78.8** | **74.3** | **59.6** |
494
- | Qwen2-VL-2B | 68.3 | 74.2 | 78.3 | 71.3 | 69.4 | 53.8 |
495
- | Qwen3-VL-2B | 72.7 | 75.7 | 80.7 | 75.0 | 72.3 | 58.2 |
496
- | InternVL3-2B | 68.6 | 78.3 | 81.9 | 73.6 | 71.9 | 57.4 |
497
- | InternVL3.5-2B | 68.5 | 77.7 | 80.2 | 74.6 | 70.9 | 58.0 |
498
 
499
- ### General VQA Tasks
500
 
501
- | Model | AI2D | ChartQA | TextVQA | DocVQA | InfoVQA | OCRBench | SEED-2+ | CharXiv | Avg |
502
- |-------|------|---------|---------|--------|---------|----------|---------|---------|-----|
503
- | **jina-vlm** | **82.0** | **81.9** | **83.2** | 90.6 | 71.6 | 778 | 67.2 | **32.3**/63.5 | **72.3** |
504
- | Qwen2-VL-2B | 74.7 | 73.5 | 79.7 | 89.2 | 64.0 | 809 | 62.4 | 23.3/55.0 | 66.4 |
505
- | Qwen3-VL-2B | 76.9 | 77.2 | 79.5 | **92.3** | **71.9** | **858** | 67.3 | 28.8/62.3 | 71.6 |
506
- | InternVL3-2B | 78.6 | 80.2 | 77.0 | 87.4 | 67.1 | 835 | 64.6 | 28.3/54.7 | 69.2 |
507
- | InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5 | 69.3 | 836 | **68.0** | 31.6/**65.0** | 71.6 |
508
 
509
- ### Text-Only Performance
510
 
511
- | Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
512
- |-------|------|----------|--------|-------|-----------|
513
- | **jina-vlm** | 56.1 | **30.3** | 71.3 | **77.3** | **59.4** |
514
- | Qwen3-1.7B | **62.6** | 46.4 | **75.3** | 73.4 | 59.0 |
515
 
516
  ## Citation
517
 
518
- If you find `jina-vlm` useful in your research, please cite our [technical report](https://arxiv.org/abs/2512.04032):
519
-
520
- ```bibtex
521
- @misc{koukounas2025jinavlm,
522
- title={Jina-VLM: Small Multilingual Vision Language Model},
523
- author={Andreas Koukounas and Georgios Mastrapas and Florian Hönicke and Sedigheh Eslami and Guillaume Roncari and Scott Martens and Han Xiao},
524
- year={2025},
525
- eprint={2512.04032},
526
- archivePrefix={arXiv},
527
- primaryClass={cs.CL},
528
- url={https://arxiv.org/abs/2512.04032},
529
- }
530
- ```
531
 
532
- ## License
533
-
534
- `jina-vlm` is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).
 
 
5
  tags:
6
  - multimodal
7
  - multilingual
8
+ - vllm
9
  - vlm
10
+ - mllm
 
 
11
  language:
12
  - en
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
  - multilingual
 
 
 
14
  inference: false
15
  ---
16
+ <br><br>
17
 
18
  <p align="center">
19
+ <img src="https://raw.githubusercontent.com/jina-ai/.github/refs/heads/main/profile/1.png">
20
  </p>
21
 
22
+ <p align="center">
23
+ <b>By <a href="https://jina.ai/"><b>Jina AI</b></a></b>
24
+ </p>
25
 
26
+ TODO: Update title when ready
27
 
28
+ # Jina VLM v1: Lightweight Vision Language Alignment
29
 
30
+ [GGUF]() | [Blog]() | [Technical Report]()
31
 
32
+ A small 🔍
33
 
34
+ Yet Mighty 🔥
 
 
 
 
 
 
35
 
36
+ Multimodal 👁️
37
 
38
+ and Multilingual 🌐
39
 
40
+ Vision-Language Model 🧠
41
 
42
 
43
+ ## Overview
44
 
45
+ TODO: Update overview when ready
 
 
 
46
 
47
+ We introduce `jina-vlm-v1`, a compact vision-language model with a focus on downstream embedding performance, computational efficiency, text-only performance, and multilingual support. We explore the alignment of an encoder-only vision model with a decoder-only language model, with an emphasis on representation learning in a resource-constrained setting. Our approach employs a straightforward two-stage training strategy with fully unlocked model weights. Images are converted into fixed-size crops via overlapped cropping to enable high-resolution and any-resolution understanding. The crops are then split into patches and embedded into visual features by the vision encoder. The visual features are pooled, projected, and injected into a small language model as visual tokens. We openly release `jina-vlm-v1` to facilitate further research in this domain.
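+
+ As a minimal, shape-only sketch of the crop → patch → pool step described above (all sizes are illustrative; the real connector additionally handles padding, projection, and the global thumbnail crop):
+
+ ```python
+ import torch
+ from einops import rearrange
+
+ # Illustrative shapes: 5 crops, 24x24 patch features per crop, 1152-dim features
+ feats = torch.randn(1, 5, 24 * 24, 1152)              # (batch, crops, patches, dim)
+ feats = rearrange(feats, 'b c (h w) d -> b c h w d', h=24, w=24)
+
+ # Group every 2x2 window of neighbouring patches; the connector attends over the
+ # four patches in each window with their mean as the query, producing one visual
+ # token per window, i.e. 4x fewer visual tokens
+ windows = rearrange(feats, 'b c (h dh) (w dw) d -> (b c h w) (dh dw) d', dh=2, dw=2)
+ query = windows.mean(-2, keepdim=True)                # attention-pooling query
+ print(windows.shape, query.shape)                     # (720, 4, 1152), (720, 1, 1152)
+ ```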
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
 
50
+ ## Model Info
51
 
52
+ <p align="center">
53
+ <img src="./assets/jvlm_architecture.png">
54
+ </p>
 
 
 
 
 
 
 
 
 
 
 
 
55
 
56
+ Summary of features:
57
 
58
+ | Feature | Jina VLM v1 |
59
+ |-------------------------|----------------------------------------------------------------------------|
60
+ | Type | VLM - Vision Language Model |
61
+ | Modalities | Texts, Images |
62
+ | Base Text Decoder | [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) |
63
+ | Base Vision Encoder | [SigLIP2 So400M](https://huggingface.co/google/siglip2-so400m-patch14-384) |
64
+ | Parameters | 2.4B |
65
+ | Max Sequence Length | 32768 |
66
+ | Single-Vector Dimension | 2048 |
67
+ | Attention Mechanisms | FlashAttention2, SDPA, Eager |
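+
+ A quick way to inspect these settings locally (a sketch; assumes Hub access, and `trust_remote_code=True` pulls the `JinaVLMConfig` class shipped in this repository):
+
+ ```python
+ from transformers import AutoConfig
+
+ # The composite config exposes the vision and text sub-configs
+ cfg = AutoConfig.from_pretrained('jinaai/jina-vlm-v1', trust_remote_code=True)
+ print(cfg.model_type)            # 'jvlm'
+ print(list(cfg.sub_configs))     # ['vision_config', 'text_config']
+ ```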
68
 
69
+ TODO: Add ArXiv link when ready
 
 
 
 
 
 
 
 
70
 
71
+ Check out our [technical report on `jina-vlm-v1`]() for more details on the model architecture, training, and evaluation.
72
 
 
73
 
74
+ ## Evaluation
 
 
 
 
 
 
 
 
 
75
 
76
+ ### General VQA Tasks
77
 
78
+ | Model Name | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCR Bench | SEED-2 Plus | CharXiv (RQ/DQ) | Overall |
79
+ |:--------------------------------------------------------------------|:--------:|:------------------:|:-------------:|:------------:|:-------------:|:---------:|:-----------:|:---------------:|:--------:|
80
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | **82.0** | **81.9** | **83.2** | 90.6 | 71.6 | 778 | 67.2 | **32.3** / 63.5 | **72.3** |
81
+ | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | 74.7 | 73.5 | 79.7 | 89.2* | 64.0* | 809 | 62.4 | 23.3 / 55.0* | 66.4 |
82
+ | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | 76.9 | 77.2 | 79.5 | **92.3*** | **71.9*** | **858** | 67.3* | 28.8 / 62.3 | 71.6 |
83
+ | [`InternVL3-2B`](https://huggingface.co/OpenGVLab/InternVL3-2B) | 78.6 | 80.2 | 77.0 | 87.4* | 67.1* | 835 | 64.6 | 28.3 / 54.7 | 69.2 |
84
+ | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | 78.8 | 80.7 | 76.5 | 88.5* | 69.3* | 836 | **68.0** | 31.6 / **65.0** | 71.6 |
 
85
 
86
+ Comparison of general visual question answering performance. Other model results are from their respective papers, except those marked with * which are computed using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). All scores represent accuracy (%) except OCRBench which uses a 0-1000 scale, normalized to 0-100 for Overall calculation.
87
 
88
+ ### Multimodal Comprehension and Real-World Understanding
89
 
90
+ | Model | MME (sum) | MMB v1.1 (EN) | MMStar | Overall (MM) | RealWorld QA | MME-RW (EN) | R-Bench (dis) | Overall (RW) |
91
+ |:--------------------------------------------------------------------|:---------:|:-------------:|:------:|:------------:|:------------:|:-----------:|:-------------:|:------------:|
92
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | 1965.8 | 75.8 | 56.2 | 67.4 | **68.2** | 50.7 | 66.7 | 61.9 |
93
+ | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | 1872.0 | 72.2 | 48.0 | 62.4 | 62.9 | 38.7* | 63.2 | 55.0* |
94
+ | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | 2000.8* | 77.8 | 58.3 | 69.2 | 63.9 | **57.9*** | 67.3* | **63.0** |
95
+ | [`InternVL3-2B`](https://huggingface.co/OpenGVLab/InternVL3-2B) | **2221.2** | **78.6** | 60.7 | **72.9** | 64.3 | 53.8 | **67.5** | 61.9 |
96
+ | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | 2123.3 | 76.6 | **62.7** | 71.7 | 62.0 | 49.7 | 62.4 | 58.0 |
97
 
98
+ Comparison of generic multimodal understanding and real-world understanding performance. Other model results are from their respective papers, except those marked with * which are computed using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). All scores represent accuracy (%) except MME which uses a 0-2800 scale, normalized to 0-100 for Overall calculation.
 
 
99
 
100
+ ### Multi-Image Reasoning and Hallucination
101
+
102
+ | Model | BLINK (val) | Muir Bench | MMT (val) | Overall (MI) | HallBench (avg) | POPE (avg) | Overall (Hall) |
103
+ |:--------------------------------------------------------------------|:-----------:|:----------:|:---------:|:------------:|:---------------:|:----------:|:--------------:|
104
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | 50.1 | 34.7 | 57.2 | 47.3 | 39.1 | **90.3** | 64.7 |
105
+ | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | 44.4 | 25.5* | 55.1 | 41.7 | 41.7 | 87.9* | 64.8 |
106
+ | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | **53.8** | **47.4** | **60.0*** | **53.7** | 44.5 | 88.9* | 66.7 |
107
+ | [`InternVL3-2B`](https://huggingface.co/OpenGVLab/InternVL3-2B) | 50.3 | 38.8 | 59.5 | 49.5 | 42.5 | 89.6 | 66.1 |
108
+ | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | 51.3 | 44.0 | 58.5 | 51.3 | **48.6** | 87.2 | **67.9** |
109
+
110
+ Comparison of multi-image and hallucination performance. Other model results are from their respective papers, except those marked with * which are computed using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). All scores represent accuracy (%).
111
+
112
+ ### Multimodal Reasoning and Mathematics
113
+
114
+ | Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
115
+ |:--------------------------------------------------------------------|:--------:|:-------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:--------:|
116
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | 45.6 | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | 33.1 |
117
+ | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | 41.1 | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | 25.3 |
118
+ | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | 53.4 | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | 38.7 |
119
+ | [`InternVL3-2B`](https://huggingface.co/OpenGVLab/InternVL3-2B) | 48.6 | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | 35.3 |
120
+ | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | **59.0** | **71.8** / 61.5† | **42.8** / 26.5† | **53.4** / 35.3† | **48.5** / 19.1† | **47.7** / 41.4† | **50.7** |
121
+
122
+ Comparison of multimodal reasoning and mathematical problem-solving performance. Other model results are from their respective papers, except those marked with * which are computed using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). † indicates scores for [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) without thinking mode, evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). All scores represent accuracy (%).
123
+
124
+ ### Text-Only Performance
125
+
126
+ | Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
127
+ |:-----------------------------------------------------------|:--------:|:--------:|:--------:|:--------:|:---------:|
128
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | 56.1 | **30.3** | 69.6 | **76.0** | **59.4** |
129
+ | [`Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B)     | **62.6** | –        | **75.3** | –        | 59.0      |
130
+
131
+ Comparison of performance on text-only benchmarks. Results are collected using our evaluation code. All scores represent accuracy (%).
132
+
133
+ ### Multimodal Multilingual Understanding
134
+
135
+ | Model Name | MMMB ar | MMMB cn | MMMB en | MMMB pt | MMMB ru | MMMB tr | MMMB avg | MMBench ar | MMBench cn | MMBench en | MMBench pt | MMBench ru | MMBench tr | MMBench avg | MTVQA | Overall |
136
+ |:--------------------------------------------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:-----------:|:--------:|:--------:|
137
+ | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) | **76.9** | **80.0** | **82.0** | **79.2** | **79.2** | **75.5** | **78.8** | **70.0** | 75.9 | 78.8 | 74.7 | 75.3 | **71.1** | **74.3** | 25.6 | **59.6** |
138
+ | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | 68.3 | 74.2 | 78.3 | 72.6 | 72.8 | 61.8 | 71.3 | 66.7 | 67.0 | 71.1 | 72.1 | 69.9 | 69.3 | 69.4 | 20.6 | 53.8 |
139
+ | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | 72.7* | 75.7* | 80.7* | 75.0* | 75.9* | 68.5* | 75.0* | 66.2* | 75.7* | 77.8* | 71.4* | **75.9*** | 67.0* | 72.3* | 27.3* | 58.2 |
140
+ | [`InternVL3-2B`](https://huggingface.co/OpenGVLab/InternVL3-2B) | 68.6 | 78.3 | 81.9 | 75.4 | 74.6 | 62.9 | 73.6 | 66.4 | **77.8** | **81.3** | **75.9** | 70.7 | 59.5 | 71.9 | 26.7 | 57.4 |
141
+ | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | 68.5 | 77.7 | 80.2 | 75.9 | 76.3 | 69.1 | 74.6 | 63.7 | 75.9 | 78.4 | 73.7 | 71.4 | 62.0 | 70.9 | **28.5** | 58.0 |
142
+
143
+ Comparison of multilingual multimodal understanding performance. Other model results are from their respective papers, except those marked with * which are computed using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). All scores represent accuracy (%).
144
+
145
+ ### Embedding Performance
146
+
147
+ | Task / Metric | [`Qwen3-VL-2B`](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | [`Qwen2.5-VL-3B`](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) | [`InternVL3.5-2B`](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | [`Qwen2-VL-2B`](https://huggingface.co/Qwen/Qwen2-VL-2B) | [`jina-vlm-v1`](https://huggingface.co/jinaai/jina-vlm-v1) |
148
+ |:-------------------------------------------|:-----------------------------------------------------------------:|:---------------------------------------------------------------------:|:-------------------------------------------------------------------:|:--------------------------------------------------------:|:----------------------------------------------------------:|
149
+ | Flickr30kT2I Retrieval (NDCG@10) | **86.9** | 83.8 | 84.6 | 85.8 | 86.0 |
150
+ | JinaVDR DocVQA Retrieval (NDCG@5) | **83.1** | 81.1 | 78.2 | 73.6 | 76.9 |
151
+ | JinaVDR InfoVQA Retrieval (NDCG@5) | 88.1 | 87.6 | 87.3 | **88.3** | 84.9 |
152
+ | Nano DBPedia Retrieval (NDCG@10) | 52.4 | 53.3 | 51.1 | 51.7 | **54.0** |
153
+ | Nano FEVER Retrieval (NDCG@10) | 78.3 | **83.2** | 72.8 | 75.1 | 76.3 |
154
+ | Nano FiQA2018 Retrieval (NDCG@10) | 40.4 | 45.0 | 40.3 | **45.7** | 35.8 |
155
+ | Nano HotpotQA Retrieval (NDCG@10) | 69.5 | **72.1** | 65.5 | 69.9 | 70.1 |
156
+ | Nano MS MARCO Retrieval (NDCG@10) | 48.0 | 48.7 | **49.5** | 47.5 | 45.7 |
157
+ | Nano NFCorpus Retrieval (NDCG@10) | 31.7 | **34.4** | 34.0 | 30.7 | 32.9 |
158
+ | Nano NQ Retrieval (NDCG@10) | 49.3 | **51.4** | 48.2 | 48.8 | 48.2 |
159
+ | Nano SCIDOCS Retrieval (NDCG@10) | **41.6** | 40.7 | 39.1 | 39.0 | 38.7 |
160
+ | Nano SciFact Retrieval (NDCG@10) | 73.0 | **78.0** | 73.2 | 70.6 | 77.2 |
161
+ | STS12 (Spearman) | 67.3 | 65.1 | 67.4 | 68.3 | **69.3** |
162
+ | SciFact (NDCG@10) | 69.7 | **71.2** | 68.2 | 66.0 | 68.5 |
163
+ | Vidore ArXivQA Retrieval (NDCG@5) | 74.4 | **80.2** | 74.8 | 75.8 | 74.4 |
164
+ | **Average** | 63.6 | **65.1** | 62.3 | 62.5 | 62.6 |
165
+
166
+ Single-vector embedding performance after pair training. Higher is better. Averages are macro-averages across all tasks.
167
+
168
+
169
+ ## Usage
170
+
171
+ ### Requirements
172
+
173
+ The following Python packages are required:
174
+
175
+ - `torch>=2.9.0`
176
+ - `torchvision>=0.24.0`
177
+ - `transformers>=4.57.0`
178
+ - `pillow>=12.0.0`
179
+ - `einops>=0.8.1`
180
+
181
+ Optional but recommended packages:
182
+
183
+ - **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for improved inference speed and efficiency, but not mandatory.
184
 
185
  ### Using the CLI
186
 
187
+ You can directly chat with `jina-vlm-v1` using the `test_jvlm.py` CLI.
188
+
189
+ **Options:**
190
+ - `-m, --model`: Model path (default: `'.'`). Set this to `'jinaai/jina-vlm-v1'` if you are running this script outside this repo.
191
+ - `-i, --image`: Image path, URL, or glob pattern (can specify multiple times, default: `[]`).
192
+ - `-p, --prompt`: Text prompt (can specify multiple times, default: `'Describe the image for me in 100 words'` or `'Describe the images for me in 100 words'` if multiple images are provided).
193
+ - `--max-crops`: Maximum crops (default: `12`).
194
+ - `--max-tokens`: Maximum output tokens (default: `1024`).
195
+ - `--max-pixels`: Max pixels per image; larger images are resized while preserving the aspect ratio (default: `None`).
196
+ - `--stream`: Enable streaming output (default: `False`).
197
+ - `--image-labels`: Insert ordinal text labels after each image (default: `False` -> no image labels in multi-image prompts).
198
+ - `--prompt-first`: Place the prompt before the images instead of after them (default: `False` -> prompt after images).
199
+ - `--map`: Map mode; apply a single prompt to multiple images, or multiple prompts to a single image (default: `False` -> no mapping).
200
 
201
  ```bash
202
  # Single image
203
+ python test_jvlm.py -i photo.jpg -p "What's in this image?"
204
 
205
+ # Single image with streaming
206
+ python test_jvlm.py -i photo.jpg -p "What's in this image?" --stream
207
 
208
+ # Remote image URL
209
+ python test_jvlm.py -i https://example.com/image.jpg -p "Describe this image"
210
 
211
+ # Multiple images (local and remote)
212
+ python test_jvlm.py -i img1.jpg -i https://example.com/img2.jpg -i img3.jpg -p "Compare these images"
 
213
 
214
+ # Text only input
215
+ python test_jvlm.py -p "How many planets are in our solar system?"
 
 
 
 
 
 
216
 
217
+ # Glob pattern support (quote patterns to prevent shell expansion)
218
+ python test_jvlm.py -i "*.jpg" -p "Describe these images"
219
+ python test_jvlm.py -i "photos/*.png" -i "images/*.jpg" -p "What do you see in these images?"
220
 
221
+ # Custom max crops, max pixels and max output tokens
222
+ # Reducing max crops and max pixels speeds up inference and lowers memory consumption on large images
223
+ python test_jvlm.py -i photo.jpg -p "Describe this picture in detail" --max-crops 8 --max-pixels 500000 --max-tokens 2048
224
 
225
+ # Prompt position control
226
+ python test_jvlm.py -i photo.jpg -p "What's in this image?" --prompt-first
 
 
 
 
 
 
227
 
228
+ # Map mode: apply one prompt to multiple images
229
+ python test_jvlm.py --map -i "*.jpg" -p "What is this?"
 
 
 
 
 
 
 
 
 
 
 
 
 
230
 
231
+ # Map mode: apply multiple prompts to one image
232
+ python test_jvlm.py --map -i photo_of_a_dog.jpg -p "What breed?" -p "What color?" -p "Happy or sad?"
 
233
 
234
+ # Batch inference
235
+ # When an equal number of images and prompts (>1) is provided, we assume it is batched inference
236
+ # Generation will run in a batch if streaming is disabled, otherwise sequentially
237
+ python test_jvlm.py -i photo1.jpg -p "What is shown in this image?" -i photo2.jpg -p "Describe this image"
238
 
239
+ # Similarly for no images and multiple prompts
240
+ python test_jvlm.py -p "What is a neural network?" -p "Describe the concept of polymorphism in Computer Science"
241
+ ```
242
 
243
+ Example input:
244
+ ```bash
245
+ python test_jvlm.py -m jinaai/jina-vlm-v1 -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
246
+ ```
247
+ <p align="center">
248
+ <img src="./assets/the_persistence_of_memory.jpg">
249
+ </p>
 
250
 
251
+ Example output:
252
+ ```
253
+ * Conversation 1/1
254
+ ├── 🖼️Images: ['assets/the_persistence_of_memory.jpg']
255
+ ├── 📜Prompt: Describe this picture
256
+ ├── 💬Chat: User: <|image|>Describe this picture Assistant:
257
+ └── 🧠Response: This image is a surrealistic painting by Salvador Dalí, titled "The Persistence of Memory." The painting is characterized by its dreamlike and distorted elements, which are hallmarks of Dalí's style. The central focus of the painting is a melting clock, which is a key symbol in the artwork. The clock is depicted in a state of fluidity, with its hands and numbers melting and flowing as if it is made of wax.
 
 
 
258
 
259
+ In the foreground, there is a wooden table with a branch extending from it. The branch holds a second clock, which is also melting and dripping. To the left of the table, there is a small, round, orange object that appears to be a pocket watch or a small container.
 
 
260
 
261
+ The background of the painting features a landscape with a calm sea and a rocky cliff. The sky is painted in shades of blue and yellow, suggesting either a sunrise or sunset. The overall color palette of the painting is muted, with earthy tones dominating the scene.
 
 
 
 
 
262
 
263
+ The painting is a prime example of Dalí's use of surrealism, which involves the depiction of bizarre and dreamlike scenes. The melting clocks and distorted forms are typical of Dalí's work, which often explores themes of time, memory, and the subconscious mind. The painting is a testament to Dalí's innovative and imaginative approach to art.
264
+ Token usage report:
265
+ Input Context Window Layout (max: 40960 tokens):
266
+ ├── Total: 1753 tokens (4.3%)
267
+ ├── Image 1 → 1744 tokens (4.3%)
268
+ └── Text: 9 tokens (0.0%)
269
+
270
+ Generated 1 responses in 33.078s
271
+ 0.03 res/s 8.16 tok/s
272
+ Done ✅
273
  ```
274
 
275
+ ### Using Transformers 🤗
 
276
 
277
+ Load `jina-vlm-v1` with `trust_remote_code=True` (the repository ships its own modeling code) and run single-image chat:
+
  ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
+
+ # Load the processor and the model; device_map='auto' places the weights on the available device(s)
+ processor = AutoProcessor.from_pretrained(
+     'jinaai/jina-vlm-v1', use_fast=False, trust_remote_code=True
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     'jinaai/jina-vlm-v1',
+     device_map='auto',
+     trust_remote_code=True,
+ )
+
+ # Optionally pass torch_dtype=torch.bfloat16 and attn_implementation='flash_attention_2'
+ # to from_pretrained for faster, more memory-efficient inference if flash-attention is installed
+
+ image = 'https://picsum.photos/800/600'
+ conversation = [
+     {
+         'role': 'user',
+         'content': [
+             {'type': 'image', 'image': image},
+             {'type': 'text', 'text': 'Describe this image'},
+         ],
+     }
+ ]
+
+ # Build the chat prompt and preprocess the image
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], padding='longest', return_tensors='pt')
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+
+ # Greedy generation
+ output = model.generate(
+     **inputs,
+     generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
+     return_dict_in_generate=True,
+     use_model_defaults=True,
+ )
+
+ # Decode only the newly generated tokens
+ response = processor.tokenizer.decode(
+     output.sequences[0][inputs['input_ids'].shape[-1]:],
+     skip_special_tokens=True,
+ )
+ print(response)
  ```
340
 
341
+ <details>
342
+ <summary>Batch inference</summary>
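+
+ A sketch adapted from the previous revision of this model card; it reuses the `processor` and `model` loaded above and batches two single-image conversations (image URLs and prompts are illustrative):
+
+ ```python
+ images = [
+     'https://picsum.photos/id/22/800/600',
+     'https://picsum.photos/id/49/800/600',
+ ]
+ conversations = [
+     [{'role': 'user', 'content': [
+         {'type': 'image', 'image': images[0]},
+         {'type': 'text', 'text': 'What is the man doing in this image?'},
+     ]}],
+     [{'role': 'user', 'content': [
+         {'type': 'image', 'image': images[1]},
+         {'type': 'text', 'text': "What country's flag is in this image?"},
+     ]}],
+ ]
+
+ # One chat template per conversation, processed and generated as a single batch
+ texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
+ inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+ output = model.generate(
+     **inputs,
+     generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
+     return_dict_in_generate=True,
+     use_model_defaults=True,
+ )
+ for idx in range(len(output.sequences)):
+     gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
+     print(processor.tokenizer.decode(gen_ids, skip_special_tokens=True))
+ ```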
343
  </details>
344
 
345
  <details>
346
+ <summary>Multi-image inference</summary>
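+
+ A sketch adapted from the previous revision of this model card; it reuses the `processor` and `model` loaded above and passes two images in one conversation (image URLs are illustrative):
+
+ ```python
+ images = ['https://picsum.photos/id/1/800/600', 'https://picsum.photos/id/2/800/600']
+ conversation = [
+     {
+         'role': 'user',
+         'content': [
+             {'type': 'image', 'image': images[0]},
+             {'type': 'image', 'image': images[1]},
+             {'type': 'text', 'text': 'What is the difference between these images?'},
+         ],
+     }
+ ]
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True)
+ inputs = processor(text=[text], images=images, padding='longest', return_tensors='pt')
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+ output = model.generate(
+     **inputs,
+     generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
+     return_dict_in_generate=True,
+     use_model_defaults=True,
+ )
+ response = processor.tokenizer.decode(
+     output.sequences[0][inputs['input_ids'].shape[-1]:],
+     skip_special_tokens=True,
+ )
+ print(response)
+ ```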
347
+ </details>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
348
 
349
+ <details>
350
+ <summary>Text-only inference</summary>
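+
+ A sketch adapted from the previous revision of this model card; it reuses the `processor` and `model` loaded above and simply omits the `images` argument:
+
+ ```python
+ conversation = [
+     {
+         'role': 'user',
+         'content': [
+             {'type': 'text', 'text': 'Explain quantum computing in simple terms'},
+         ],
+     }
+ ]
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True)
+ inputs = processor(text=[text], padding='longest', return_tensors='pt')
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+ output = model.generate(
+     **inputs,
+     generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
+     return_dict_in_generate=True,
+     use_model_defaults=True,
+ )
+ response = processor.tokenizer.decode(
+     output.sequences[0][inputs['input_ids'].shape[-1]:],
+     skip_special_tokens=True,
+ )
+ print(response)
+ ```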
351
+ </details>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
352
 
353
+ <details>
354
+ <summary>Mixed-batch inference</summary>
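+
+ A sketch adapted from the previous revision of this model card; it reuses the `processor` and `model` loaded above and mixes single-image, multi-image, and text-only conversations in one batch. Note that `images` is a list of per-conversation image lists, with an empty list for the text-only entry (URLs and prompts are illustrative):
+
+ ```python
+ images = [
+     ['https://picsum.photos/id/22/800/600'],
+     ['https://picsum.photos/id/0/800/600', 'https://picsum.photos/id/2/800/600'],
+     [],
+ ]
+ conversations = [
+     [{'role': 'user', 'content': [
+         {'type': 'image', 'image': images[0][0]},
+         {'type': 'text', 'text': 'What is the man doing in this image?'},
+     ]}],
+     [{'role': 'user', 'content': [
+         {'type': 'image', 'image': images[1][0]},
+         {'type': 'image', 'image': images[1][1]},
+         {'type': 'text', 'text': 'What is the difference between these two images?'},
+     ]}],
+     [{'role': 'user', 'content': [
+         {'type': 'text', 'text': 'Describe the concept of polymorphism in Computer Science'},
+     ]}],
+ ]
+ texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
+ inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
+ inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+ output = model.generate(
+     **inputs,
+     generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
+     return_dict_in_generate=True,
+     use_model_defaults=True,
+ )
+ for idx in range(len(output.sequences)):
+     gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
+     print(processor.tokenizer.decode(gen_ids, skip_special_tokens=True))
+ ```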
355
+ </details>
356
 
357
+ <details>
358
+ <summary>Feature extraction</summary>
359
+ </details>
 
 
 
360
 
361
+ ### Using vLLM
 
 
 
 
362
 
363
+ Coming soon!
364
 
 
365
 
366
+ ## License
367
 
368
+ The model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).
 
 
 
 
 
 
369
 
 
370
 
371
+ ## Contact
 
 
 
 
 
 
372
 
373
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
374
 
 
 
 
 
375
 
376
  ## Citation
377
 
378
+ TODO: Add citation when ready
 
 
 
 
 
 
 
 
 
 
 
 
379
 
380
+ If you find `jina-vlm-v1` useful in your research, please cite the following paper:
381
+ ```
382
+ TBD
383
+ ```
assets/jvlm_architecture.png CHANGED

Git LFS Details

  • SHA256: 1d33806662487fa930aae7ffd1335833156a73758436b38b4abc3aff62691e66
  • Pointer size: 131 Bytes
  • Size of remote file: 654 kB

Git LFS Details

  • SHA256: 8941f6788e95e12904ac301bff2f37089a1b2421e2c44c4cffa1743a62a3915e
  • Pointer size: 131 Bytes
  • Size of remote file: 248 kB
blocks_jvlm.py CHANGED
@@ -11,7 +11,6 @@ import torch
11
  import torch.backends.cuda
12
  import torch.nn as nn
13
  import torch.nn.functional as f
14
- from torch.nn.attention import SDPBackend, sdpa_kernel
15
  from transformers import PretrainedConfig
16
  from transformers.activations import ACT2FN
17
  from transformers.cache_utils import Cache
@@ -325,11 +324,10 @@ modeling_rope_utils.py
325
 
326
 
327
  def inv_freq_to_device(rope_forward):
328
- """Sometimes the inv_freq is calculated on the wrong device, or ends up in lower
329
- precision than float32.
330
-
331
- This wrapper ensures that inv_freq is always on the right device and in float32
332
- precision.
333
  """
334
 
335
  @wraps(rope_forward)
@@ -355,6 +353,7 @@ class RotaryEmbedding(nn.Module):
355
  theta: float,
356
  head_dim: int,
357
  hidden_size: int,
 
358
  partial_rotary_factor: float,
359
  device: Optional[torch.device] = None,
360
  scaling: Optional[Dict[str, Any]] = None,
@@ -367,6 +366,7 @@ class RotaryEmbedding(nn.Module):
367
  setattr(self.config, 'rope_theta', theta)
368
  setattr(self.config, 'partial_rotary_factor', partial_rotary_factor)
369
  setattr(self.config, 'head_dim', head_dim)
 
370
  setattr(self.config, 'hidden_size', hidden_size)
371
  setattr(self.config, 'rope_scaling', scaling or {})
372
 
@@ -377,7 +377,9 @@ class RotaryEmbedding(nn.Module):
377
  self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
378
  device = device or torch.device('cuda' if torch.cuda.is_available() else 'cpu')
379
  seqlen = config.max_position_embeddings or config.max_sequence_length
380
- invfreq, self.attention_scaling = self.rope_init_fn(self.config, device, seqlen)
 
 
381
  self.rope_init_device = device
382
  self.register_buffer('inv_freq', invfreq, persistent=False)
383
  self.original_inv_freq = self.inv_freq
@@ -615,9 +617,11 @@ def _create_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
615
  def _ensure_finite(
616
  x: torch.Tensor, check_neg_inf: bool = True, check_pos_inf: bool = False
617
  ):
618
- """Modify ``x`` in place to replace ``float("-inf")`` with the minimum value of the
 
619
  dtype when ``check_neg_inf`` is ``True`` and replace ``float("inf")`` with the
620
- maximum value of the dtype when ``check_pos_inf`` is ``True``"""
 
621
  if check_neg_inf:
622
  x.masked_fill_(x == float('-inf'), torch.finfo(x.dtype).min)
623
  if check_pos_inf:
@@ -637,12 +641,14 @@ def resolve_causal_mask(
637
  # shape: (batch_size, 1, 1, seq_len)
638
  if len(attention_mask.shape) == 2:
639
  attention_mask = attention_mask[:, : past_length + seq_len]
640
- attention_mask = attention_mask.to(dtype=torch.float).view(batch_size, -1)[
641
- :, None, None, :
642
- ]
643
  else:
644
  attention_mask = attention_mask.unsqueeze(1).to(dtype=torch.float)
645
- attention_mask = (1.0 - attention_mask) * torch.finfo(attention_mask.dtype).min
 
 
646
 
647
  # Merge attention mask with causal mask (attention bias)
648
  # NOTE: We need to initialize the attn bias in order for attn to
@@ -654,7 +660,9 @@ def resolve_causal_mask(
654
  or past_key_values is not None
655
  ):
656
  if causal_mask is None:
657
- causal_mask = _create_causal_mask(past_length + seq_len, device)
 
 
658
  elif causal_mask.dtype in (torch.int8, torch.bool):
659
  causal_mask = causal_mask.to(dtype=torch.float)
660
  causal_mask.masked_fill_(
@@ -737,9 +745,7 @@ def rotate_half(x: torch.Tensor):
737
 
738
 
739
  def apply_rotary_positional_embeddings(
740
- x: torch.Tensor,
741
- cos: torch.Tensor,
742
- sin: torch.Tensor,
743
  ) -> torch.Tensor:
744
  return (x * cos + rotate_half(x) * sin).to(x.dtype)
745
 
@@ -884,6 +890,7 @@ class MHSDPA(nn.Module):
884
  attn_mask: Optional[torch.Tensor] = None,
885
  is_causal: Optional[bool] = None,
886
  ) -> Tuple[Callable, Optional[torch.Tensor], Optional[bool]]:
 
887
  if 'flash' in attn_implementation and self.fp32_attn:
888
  raise ValueError('Flash attention does not support fp32 attention')
889
  if self.sliding_window != -1 and 'flash' not in attn_implementation:
@@ -1064,7 +1071,9 @@ class FFN(nn.Module):
1064
  if self.gated_activation:
1065
  intermediate_size = 2 * self.intermediate_size
1066
 
1067
- self.up = nn.Linear(self.hidden_size, intermediate_size, bias=self.use_bias)
 
 
1068
  self.down = nn.Linear(
1069
  self.intermediate_size, self.output_size, bias=self.use_bias
1070
  )
@@ -1236,14 +1245,6 @@ class VisionLanguageConnector(GradientCheckpointingLayer):
1236
  assert config.attn_pooling_config is not None
1237
  if config.pooling_type == ImagePooling2DType.attention_2wide:
1238
  pooling_input_size *= 2
1239
-
1240
- # Flash Attention can cause Inf grads in the attention pooling layer
1241
- # because of very large batch sizes. Setting this to sdpa does not cost us
1242
- # much since sequence lengths in the case of attention pooling are very
1243
- # small
1244
- attn_implementation = attn_implementation or 'eager'
1245
- if attn_implementation.startswith('flash'):
1246
- attn_implementation = 'sdpa'
1247
  self.pooling = MHSDPA(
1248
  config.attn_pooling_config,
1249
  hidden_size=pooling_input_size,
@@ -1289,12 +1290,10 @@ class VisionLanguageConnector(GradientCheckpointingLayer):
1289
  image_features: torch.Tensor,
1290
  image_masks: Optional[torch.Tensor] = None,
1291
  attn_implementation: Optional[str] = None,
1292
- **kwargs: Unpack[FlashAttentionKwargs],
1293
  ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
1294
  # image_features:
1295
  # (batch_size, num_crops(=num_image), num_patch, nximage_emb_dim)
1296
  bs, ncrops = image_features.shape[:2]
1297
- ogtype = image_features.dtype
1298
 
1299
  if self.padding_embed_type is not None:
1300
  assert image_masks is not None
@@ -1323,7 +1322,6 @@ class VisionLanguageConnector(GradientCheckpointingLayer):
1323
  partial_pad, -1
1324
  )
1325
 
1326
- image_features = image_features.to(dtype=ogtype)
1327
  image_features = self.feature_dropout(image_features)
1328
  image_features = image_features.reshape((bs, ncrops) + self.n_patches + (-1,))
1329
  pad_h = self.n_patches[0] % self.pooling_h
@@ -1345,31 +1343,11 @@ class VisionLanguageConnector(GradientCheckpointingLayer):
1345
  dh=self.pooling_h,
1346
  dw=self.pooling_w,
1347
  )
1348
- image_features = image_features.contiguous()
1349
  if self.pooling_type == ImagePooling2DType.attention_meanq:
1350
  query = image_features.mean(-2, keepdim=True)
1351
- # Flash Attention can cause Inf grads in the attention pooling layer
1352
- # because of very large batch sizes. Setting this to sdpa does not cost
1353
- # us much since sequence lengths in the case of attention pooling are
1354
- # very small
1355
- attn_implementation = attn_implementation or 'eager'
1356
- if attn_implementation.startswith('flash'):
1357
- attn_implementation = 'sdpa'
1358
- if attn_implementation == 'sdpa':
1359
- with sdpa_kernel(backends=[SDPBackend.MATH]):
1360
- image_features, _ = self.pooling(
1361
- xq=query,
1362
- xk=image_features,
1363
- attn_implementation='sdpa',
1364
- **kwargs,
1365
- )
1366
- else:
1367
- image_features, _ = self.pooling(
1368
- xq=query,
1369
- xk=image_features,
1370
- attn_implementation=attn_implementation,
1371
- **kwargs,
1372
- )
1373
  elif self.pooling_type not in {
1374
  ImagePooling2DType.none,
1375
  ImagePooling2DType.stack,
 
11
  import torch.backends.cuda
12
  import torch.nn as nn
13
  import torch.nn.functional as f
 
14
  from transformers import PretrainedConfig
15
  from transformers.activations import ACT2FN
16
  from transformers.cache_utils import Cache
 
324
 
325
 
326
  def inv_freq_to_device(rope_forward):
327
+ """
328
+ Sometimes the inv_freq is calculated on the wrong device, or ends up in lower
329
+ precision than float32. This wrapper ensures that inv_freq is always on the right
330
+ device and in float32 precision.
 
331
  """
332
 
333
  @wraps(rope_forward)
 
353
  theta: float,
354
  head_dim: int,
355
  hidden_size: int,
356
+ n_heads: int,
357
  partial_rotary_factor: float,
358
  device: Optional[torch.device] = None,
359
  scaling: Optional[Dict[str, Any]] = None,
 
366
  setattr(self.config, 'rope_theta', theta)
367
  setattr(self.config, 'partial_rotary_factor', partial_rotary_factor)
368
  setattr(self.config, 'head_dim', head_dim)
369
+ setattr(self.config, 'num_attention_heads', n_heads)
370
  setattr(self.config, 'hidden_size', hidden_size)
371
  setattr(self.config, 'rope_scaling', scaling or {})
372
 
 
377
  self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
378
  device = device or torch.device('cuda' if torch.cuda.is_available() else 'cpu')
379
  seqlen = config.max_position_embeddings or config.max_sequence_length
380
+ invfreq, self.attention_scaling = self.rope_init_fn(
381
+ self.config, device, seqlen
382
+ )
383
  self.rope_init_device = device
384
  self.register_buffer('inv_freq', invfreq, persistent=False)
385
  self.original_inv_freq = self.inv_freq
 
617
  def _ensure_finite(
618
  x: torch.Tensor, check_neg_inf: bool = True, check_pos_inf: bool = False
619
  ):
620
+ """
621
+ Modify ``x`` in place to replace ``float("-inf")`` with the minimum value of the
622
  dtype when ``check_neg_inf`` is ``True`` and replace ``float("inf")`` with the
623
+ maximum value of the dtype when ``check_pos_inf`` is ``True``
624
+ """
625
  if check_neg_inf:
626
  x.masked_fill_(x == float('-inf'), torch.finfo(x.dtype).min)
627
  if check_pos_inf:
 
641
  # shape: (batch_size, 1, 1, seq_len)
642
  if len(attention_mask.shape) == 2:
643
  attention_mask = attention_mask[:, : past_length + seq_len]
644
+ attention_mask = attention_mask.to(dtype=torch.float).view(
645
+ batch_size, -1
646
+ )[:, None, None, :]
647
  else:
648
  attention_mask = attention_mask.unsqueeze(1).to(dtype=torch.float)
649
+ attention_mask = (1.0 - attention_mask) * torch.finfo(
650
+ attention_mask.dtype
651
+ ).min
652
 
653
  # Merge attention mask with causal mask (attention bias)
654
  # NOTE: We need to initialize the attn bias in order for attn to
 
660
  or past_key_values is not None
661
  ):
662
  if causal_mask is None:
663
+ causal_mask = _create_causal_mask(
664
+ past_length + seq_len, device
665
+ )
666
  elif causal_mask.dtype in (torch.int8, torch.bool):
667
  causal_mask = causal_mask.to(dtype=torch.float)
668
  causal_mask.masked_fill_(
 
745
 
746
 
747
  def apply_rotary_positional_embeddings(
748
+ x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
 
 
749
  ) -> torch.Tensor:
750
  return (x * cos + rotate_half(x) * sin).to(x.dtype)
751
 
 
890
  attn_mask: Optional[torch.Tensor] = None,
891
  is_causal: Optional[bool] = None,
892
  ) -> Tuple[Callable, Optional[torch.Tensor], Optional[bool]]:
893
+
894
  if 'flash' in attn_implementation and self.fp32_attn:
895
  raise ValueError('Flash attention does not support fp32 attention')
896
  if self.sliding_window != -1 and 'flash' not in attn_implementation:
 
1071
  if self.gated_activation:
1072
  intermediate_size = 2 * self.intermediate_size
1073
 
1074
+ self.up = nn.Linear(
1075
+ self.hidden_size, intermediate_size, bias=self.use_bias
1076
+ )
1077
  self.down = nn.Linear(
1078
  self.intermediate_size, self.output_size, bias=self.use_bias
1079
  )
 
1245
  assert config.attn_pooling_config is not None
1246
  if config.pooling_type == ImagePooling2DType.attention_2wide:
1247
  pooling_input_size *= 2
 
 
 
 
 
 
 
 
1248
  self.pooling = MHSDPA(
1249
  config.attn_pooling_config,
1250
  hidden_size=pooling_input_size,
 
1290
  image_features: torch.Tensor,
1291
  image_masks: Optional[torch.Tensor] = None,
1292
  attn_implementation: Optional[str] = None,
 
1293
  ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
1294
  # image_features:
1295
  # (batch_size, num_crops(=num_image), num_patch, nximage_emb_dim)
1296
  bs, ncrops = image_features.shape[:2]
 
1297
 
1298
  if self.padding_embed_type is not None:
1299
  assert image_masks is not None
 
1322
  partial_pad, -1
1323
  )
1324
 
 
1325
  image_features = self.feature_dropout(image_features)
1326
  image_features = image_features.reshape((bs, ncrops) + self.n_patches + (-1,))
1327
  pad_h = self.n_patches[0] % self.pooling_h
 
1343
  dh=self.pooling_h,
1344
  dw=self.pooling_w,
1345
  )
 
1346
  if self.pooling_type == ImagePooling2DType.attention_meanq:
1347
  query = image_features.mean(-2, keepdim=True)
1348
+ image_features, _ = self.pooling(
1349
+ xq=query, xk=image_features, attn_implementation=attn_implementation
1350
+ )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1351
  elif self.pooling_type not in {
1352
  ImagePooling2DType.none,
1353
  ImagePooling2DType.stack,
config.json CHANGED
@@ -4,7 +4,6 @@
4
  ],
5
  "auto_map": {
6
  "AutoConfig": "configuration_jvlm.JinaVLMConfig",
7
- "AutoModel": "modeling_jvlm.JinaVLM",
8
  "AutoModelForCausalLM": "modeling_jvlm.JinaVLMForConditionalGeneration"
9
  },
10
  "bos_token_id": 151643,
@@ -215,4 +214,4 @@
215
  "spatial_merge_size": 2
216
  }
217
  }
218
- }
 
4
  ],
5
  "auto_map": {
6
  "AutoConfig": "configuration_jvlm.JinaVLMConfig",
 
7
  "AutoModelForCausalLM": "modeling_jvlm.JinaVLMForConditionalGeneration"
8
  },
9
  "bos_token_id": 151643,
 
214
  "spatial_merge_size": 2
215
  }
216
  }
217
+ }
configuration_jvlm.py CHANGED
@@ -530,11 +530,6 @@ class JinaVLMTextConfig(PretrainedConfigWithDataclasses):
530
  self.rope_theta = rope_theta
531
  self.rope_scaling = rope_scaling
532
 
533
- # Needed for vLLM
534
- @property
535
- def num_attention_heads(self) -> int:
536
- return self.block_config.attn_config.n_heads
537
-
538
 
539
  class JinaVLMConfig(PretrainedConfig):
540
  """JinaVLM configuration.
@@ -550,8 +545,7 @@ class JinaVLMConfig(PretrainedConfig):
550
 
551
  model_type = 'jvlm'
552
  sub_configs = {
553
- 'vision_config': JinaVLMVisionConfig,
554
- 'text_config': JinaVLMTextConfig,
555
  }
556
 
557
  def __init__(
 
530
  self.rope_theta = rope_theta
531
  self.rope_scaling = rope_scaling
532
 
 
 
 
 
 
533
 
534
  class JinaVLMConfig(PretrainedConfig):
535
  """JinaVLM configuration.
 
545
 
546
  model_type = 'jvlm'
547
  sub_configs = {
548
+ 'vision_config': JinaVLMVisionConfig, 'text_config': JinaVLMTextConfig
 
549
  }
550
 
551
  def __init__(
image_processing_jvlm.py CHANGED
@@ -437,17 +437,6 @@ class JinaVLMImageProcessor(BaseImageProcessor):
437
 
438
  """ Base cropping via resizing """
439
 
440
- def base_get_n_image_patches(
441
- self,
442
- height: int,
443
- width: int,
444
- max_crops: int,
445
- ) -> int:
446
- raise NotImplementedError(
447
- 'Function `get_n_image_patches` is not implemented for cropping method '
448
- f'{CroppingMethod.RESIZE}'
449
- )
450
-
451
  def base_resize_cropping(self, image: np.ndarray):
452
  resized, mask = self.resize_image(image, list(self.base_input_size))
453
  resized = self.normalize_image(resized)
@@ -508,117 +497,6 @@ class JinaVLMImageProcessor(BaseImageProcessor):
508
 
509
  return candidate_tilings[ix]
510
 
511
- @staticmethod
512
- def _molmo_get_patches_from_tiling(
513
- num_tiles,
514
- pooling_size,
515
- crop_patches,
516
- crop_window_patches,
517
- left_margin,
518
- right_margin,
519
- ) -> np.int32:
520
- if num_tiles > 1:
521
- left_crop_window_patches = (
522
- (crop_window_patches + left_margin + pooling_size - 1)
523
- // pooling_size
524
- * pooling_size
525
- )
526
- middle_crop_window_patches = (
527
- (crop_window_patches + pooling_size - 1) // pooling_size * pooling_size
528
- )
529
- right_crop_window_patches = (
530
- (crop_window_patches + right_margin + pooling_size - 1)
531
- // pooling_size
532
- * pooling_size
533
- )
534
- return (
535
- left_crop_window_patches
536
- + (num_tiles - 2) * middle_crop_window_patches
537
- + right_crop_window_patches
538
- )
539
- else:
540
- single_crop_window_patches = (
541
- (crop_patches + pooling_size - 1) // pooling_size * pooling_size
542
- )
543
- return single_crop_window_patches
544
-
545
- def molmo_get_n_image_patches(
546
- self,
547
- height: int,
548
- width: int,
549
- max_crops: int,
550
- ) -> int:
551
- # Discard this many patches from the (left/top, right/bottom) of crops
552
- left_margin, right_margin = self.overlap_margins
553
- # Required for compatibility with image pooling
554
- assert left_margin % self.pooling_w == 0 and right_margin % self.pooling_w == 0
555
- assert left_margin % self.pooling_h == 0 and right_margin % self.pooling_h == 0
556
- # pixels removed per dim
557
- total_margin_pixels = self.patch_size * (right_margin + left_margin)
558
- # patches per crop dim
559
- crop_patches = self.base_input_size[0] // self.patch_size
560
-
561
- # usable patches
562
- crop_window_patches = crop_patches - (right_margin + left_margin)
563
- crop_window_size = crop_window_patches * self.patch_size
564
-
565
- # We assume hxw pooling, but can allow padding the right/bottom with extra
566
- # patches if the number of patches per side is not divisible by h/w
567
- assert (
568
- crop_patches + self.pooling_h - 1
569
- ) // self.pooling_h == self.token_length_h
570
- assert (
571
- crop_patches + self.pooling_w - 1
572
- ) // self.pooling_w == self.token_length_w
573
-
574
- # Decide how to tile the image, to account for the overlap margins we
575
- # compute the tiling as if we had an image without the margins and were
576
- # using a crop size without the margins
577
- tiling = self._molmo_select_tiling(
578
- height - total_margin_pixels,
579
- width - total_margin_pixels,
580
- crop_window_size,
581
- max_crops,
582
- )
583
-
584
- # Now build the output tokens
585
- h = self._molmo_get_patches_from_tiling(
586
- tiling[0],
587
- self.pooling_h,
588
- crop_patches,
589
- crop_window_patches,
590
- left_margin,
591
- right_margin,
592
- )
593
- w = self._molmo_get_patches_from_tiling(
594
- tiling[1],
595
- self.pooling_w,
596
- crop_patches,
597
- crop_window_patches,
598
- left_margin,
599
- right_margin,
600
- )
601
- # for each row of patches, add a patch token per patch
602
- n_tokens = w.item() // self.pooling_w
603
- if self.use_column_tokens:
604
- # after each row, one column token is added
605
- n_tokens += 1
606
- # replicate each row of patch tokens by number of rows, i.e.
607
- # proportional to image height
608
- n_tokens *= h.item() // self.pooling_h
609
- # add start and end image tokens
610
- n_tokens += 2
611
-
612
- # Global image goes first, so the order of patches in previous crops gets
613
- # increased
614
- n_thumbnail_tokens = self.token_length_w
615
- if self.use_column_tokens:
616
- n_thumbnail_tokens += 1
617
- n_thumbnail_tokens *= self.token_length_h
618
- n_thumbnail_tokens += 2
619
-
620
- return n_tokens + n_thumbnail_tokens
621
-
622
  def molmo_overlap_and_resize_cropping(self, image: np.ndarray):
623
  # Discard this many patches from the (left/top, right/bottom) of crops
624
  left_margin, right_margin = self.overlap_margins
@@ -747,23 +625,37 @@ class JinaVLMImageProcessor(BaseImageProcessor):
747
  # new order into sparse structure of `patch_ordering` to fix it
748
  patch_ordering[valid] = patch_ordering_rh[patch_ordering_rh >= 0]
749
 
750
  # Now build the output tokens
751
- h = self._molmo_get_patches_from_tiling(
752
- tiling[0],
753
- self.pooling_h,
754
- crop_patches,
755
- crop_window_patches,
756
- left_margin,
757
- right_margin,
758
- )
759
- w = self._molmo_get_patches_from_tiling(
760
- tiling[1],
761
- self.pooling_w,
762
- crop_patches,
763
- crop_window_patches,
764
- left_margin,
765
- right_margin,
766
- )
767
  # for each row of patches, add a patch token per patch
768
  per_row = np.full((w // self.pooling_w,), self.patch_token_id, dtype=np.int32)
769
  if self.use_column_tokens:
@@ -918,14 +810,6 @@ class JinaVLMImageProcessor(BaseImageProcessor):
918
 
919
  return slices, image_masks, patch_ordering_arr, best_grid
920
 
921
- def minicpm_get_n_image_patches(
922
- self, height: int, width: int, max_crops: int, with_thumbnail: bool = False
923
- ) -> int:
924
- raise NotImplementedError(
925
- 'Function `get_n_image_patches` is not implemented for cropping method '
926
- f'{CroppingMethod.ADAPTIVE_SLICING}'
927
- )
928
-
929
  def minicpm_adaptive_slicing(self, image: np.ndarray, with_thumbnail: bool = True):
930
  scale_resolution = self.base_input_size[0]
931
  refine_image, image_mask, best_grid = self._minicpm_refine_image_for_slicing(
@@ -1062,12 +946,23 @@ class JinaVLMImageProcessor(BaseImageProcessor):
1062
  self.start_token_id = start_token_id
1063
  self.end_token_id = end_token_id
1064
 
1065
- def _resolve_images_kwargs(
1066
- self, **kwargs: Unpack[JinaVLMImagesKwargs]
1067
- ) -> JinaVLMImagesKwargs:
1068
- max_crops = self.max_crops
1069
  if 'max_crops' in kwargs and kwargs['max_crops'] is not None:
1070
  max_crops = kwargs['max_crops']
 
1071
 
1072
  min_pixels = self.min_pixels
1073
  if 'min_pixels' in kwargs and kwargs['min_pixels'] is not None:
@@ -1089,93 +984,14 @@ class JinaVLMImageProcessor(BaseImageProcessor):
1089
  size = {'shortest_edge': min_pixels, 'longest_edge': max_pixels}
1090
  else:
1091
  size = {**self.size}
1092
- min_pixels = size['shortest_edge']
1093
- max_pixels = size['longest_edge']
1094
  do_resize = self.do_resize
1095
  if 'do_resize' in kwargs and kwargs['do_resize'] is not None:
1096
  do_resize = kwargs['do_resize']
 
1097
  do_convert_rgb = self.do_convert_rgb
1098
  if 'do_convert_rgb' in kwargs and kwargs['do_convert_rgb'] is not None:
1099
  do_convert_rgb = kwargs['do_convert_rgb']
1100
- input_data_format = None
1101
- if 'input_data_format' in kwargs:
1102
- input_data_format = kwargs['input_data_format']
1103
-
1104
- return JinaVLMImagesKwargs(
1105
- do_convert_rgb=do_convert_rgb,
1106
- do_resize=do_resize,
1107
- min_pixels=min_pixels,
1108
- max_pixels=max_pixels,
1109
- size=size,
1110
- max_crops=max_crops,
1111
- input_data_format=input_data_format,
1112
- )
1113
-
1114
- def get_n_image_patches(
1115
- self,
1116
- height: int,
1117
- width: int,
1118
- **kwargs: Unpack[JinaVLMImagesKwargs],
1119
- ) -> int:
1120
- """A utility that returns number of image patches for a given image size.
1121
-
1122
- Args:
1123
- height (`int`):
1124
- Height of the input image.
1125
- width (`int`):
1126
- Width of the input image.
1127
- **kwargs (`dict`, *optional*)
1128
- Any kwargs to override defaults of the image processor.
1129
- Returns:
1130
- `int`: Number of image patches
1131
- """
1132
- if self.cropping_method != CroppingMethod.OVERLAP_AND_RESIZE:
1133
- raise NotImplementedError(
1134
- 'Function is only implemented for cropping method '
1135
- f'{CroppingMethod.OVERLAP_AND_RESIZE}'
1136
- )
1137
- kwargs = self._resolve_images_kwargs(**kwargs)
1138
- do_resize = kwargs['do_resize']
1139
- size = kwargs['size']
1140
- max_crops = kwargs['max_crops']
1141
- if do_resize:
1142
- height, width = smart_resize(
1143
- height,
1144
- width,
1145
- factor=self.patch_size,
1146
- min_pixels=size['shortest_edge'],
1147
- max_pixels=size['longest_edge'],
1148
- )
1149
-
1150
- if self.cropping_method == CroppingMethod.RESIZE:
1151
- return self.base_get_n_image_patches(height, width, max_crops)
1152
- elif self.cropping_method == CroppingMethod.OVERLAP_AND_RESIZE:
1153
- return self.molmo_get_n_image_patches(height, width, max_crops)
1154
- elif self.cropping_method == CroppingMethod.ADAPTIVE_SLICING:
1155
- return self.minicpm_get_n_image_patches(height, width, max_crops)
1156
- return self.minicpm_get_n_image_patches(
1157
- height, width, max_crops, with_thumbnail=True
1158
- )
1159
-
1160
- def preprocess(
1161
- self,
1162
- images: ImageInput,
1163
- **kwargs: Unpack[JinaVLMImagesKwargs],
1164
- ) -> Dict[str, List[np.ndarray]]:
1165
- """Preprocess an image or batch of images."""
1166
- if images is None or len(images) == 0:
1167
- return {
1168
- 'image_crops': [],
1169
- 'image_tokens': [],
1170
- 'image_input_idx': [],
1171
- 'image_padding_mask': [],
1172
- }
1173
- kwargs = self._resolve_images_kwargs(**kwargs)
1174
- do_convert_rgb = kwargs['do_convert_rgb']
1175
- do_resize = kwargs['do_resize']
1176
- input_data_format = kwargs['input_data_format']
1177
- size = kwargs['size']
1178
- self.max_crops = kwargs['max_crops']
1179
 
1180
  # noinspection PyTypeChecker
1181
  images = self.fetch_images(images)
@@ -1185,11 +1001,16 @@ class JinaVLMImageProcessor(BaseImageProcessor):
1185
  'Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray '
1186
  'or torch.Tensor'
1187
  )
 
1188
  if do_convert_rgb:
1189
  images = [convert_to_rgb(image) for image in images]
1190
 
1191
  # All transformations expect numpy arrays
1192
  images = [to_numpy_array(image) for image in images]
 
1193
  if input_data_format is None:
1194
  # We assume that all images have the same channel dimension format.
1195
  input_data_format = infer_channel_dimension_format(images[0])
 
437
 
438
  """ Base cropping via resizing """
439
 
440
  def base_resize_cropping(self, image: np.ndarray):
441
  resized, mask = self.resize_image(image, list(self.base_input_size))
442
  resized = self.normalize_image(resized)
 
497
 
498
  return candidate_tilings[ix]
499
 
500
  def molmo_overlap_and_resize_cropping(self, image: np.ndarray):
501
  # Discard this many patches from the (left/top, right/bottom) of crops
502
  left_margin, right_margin = self.overlap_margins
 
625
  # new order into sparse structure of `patch_ordering` to fix it
626
  patch_ordering[valid] = patch_ordering_rh[patch_ordering_rh >= 0]
627
 
628
+ def get_num_patches(num_tiles, pooling_size) -> int:
629
+ if num_tiles > 1:
630
+ left_crop_window_patches = (
631
+ (crop_window_patches + left_margin + pooling_size - 1)
632
+ // pooling_size
633
+ * pooling_size
634
+ )
635
+ middle_crop_window_patches = (
636
+ (crop_window_patches + pooling_size - 1)
637
+ // pooling_size
638
+ * pooling_size
639
+ )
640
+ right_crop_window_patches = (
641
+ (crop_window_patches + right_margin + pooling_size - 1)
642
+ // pooling_size
643
+ * pooling_size
644
+ )
645
+ return (
646
+ left_crop_window_patches
647
+ + (num_tiles - 2) * middle_crop_window_patches
648
+ + right_crop_window_patches
649
+ )
650
+ else:
651
+ single_crop_window_patches = (
652
+ (crop_patches + pooling_size - 1) // pooling_size * pooling_size
653
+ )
654
+ return single_crop_window_patches
655
+
656
  # Now build the output tokens
657
+ h = get_num_patches(tiling[0], self.pooling_h)
658
+ w = get_num_patches(tiling[1], self.pooling_w)
659
  # for each row of patches, add a patch token per patch
660
  per_row = np.full((w // self.pooling_w,), self.patch_token_id, dtype=np.int32)
661
  if self.use_column_tokens:
 
810
 
811
  return slices, image_masks, patch_ordering_arr, best_grid
812
 
813
  def minicpm_adaptive_slicing(self, image: np.ndarray, with_thumbnail: bool = True):
814
  scale_resolution = self.base_input_size[0]
815
  refine_image, image_mask, best_grid = self._minicpm_refine_image_for_slicing(
 
946
  self.start_token_id = start_token_id
947
  self.end_token_id = end_token_id
948
 
949
+ def preprocess(
950
+ self,
951
+ images: ImageInput,
952
+ **kwargs: Unpack[JinaVLMImagesKwargs],
953
+ ) -> Dict[str, List[np.ndarray]]:
954
+ """Preprocess an image or batch of images."""
955
+ if images is None or len(images) == 0:
956
+ return {
957
+ 'image_crops': [],
958
+ 'image_tokens': [],
959
+ 'image_input_idx': [],
960
+ 'image_padding_mask': [],
961
+ }
962
+
963
  if 'max_crops' in kwargs and kwargs['max_crops'] is not None:
964
  max_crops = kwargs['max_crops']
965
+ self.max_crops = max_crops
966
 
967
  min_pixels = self.min_pixels
968
  if 'min_pixels' in kwargs and kwargs['min_pixels'] is not None:
 
984
  size = {'shortest_edge': min_pixels, 'longest_edge': max_pixels}
985
  else:
986
  size = {**self.size}
987
+
 
988
  do_resize = self.do_resize
989
  if 'do_resize' in kwargs and kwargs['do_resize'] is not None:
990
  do_resize = kwargs['do_resize']
991
+
992
  do_convert_rgb = self.do_convert_rgb
993
  if 'do_convert_rgb' in kwargs and kwargs['do_convert_rgb'] is not None:
994
  do_convert_rgb = kwargs['do_convert_rgb']
 
995
 
996
  # noinspection PyTypeChecker
997
  images = self.fetch_images(images)
 
1001
  'Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray '
1002
  'or torch.Tensor'
1003
  )
1004
+
1005
  if do_convert_rgb:
1006
  images = [convert_to_rgb(image) for image in images]
1007
 
1008
  # All transformations expect numpy arrays
1009
  images = [to_numpy_array(image) for image in images]
1010
+
1011
+ input_data_format = None
1012
+ if 'input_data_format' in kwargs:
1013
+ input_data_format = kwargs['input_data_format']
1014
  if input_data_format is None:
1015
  # We assume that all images have the same channel dimension format.
1016
  input_data_format = infer_channel_dimension_format(images[0])
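
The image-processor refactor above folds the Molmo-style patch counting into a local `get_num_patches` helper inside `molmo_overlap_and_resize_cropping` and removes the standalone `get_n_image_patches` utilities. The core of that counting is rounding each crop window plus its overlap margin up to a multiple of the pooling size; a small worked example with made-up numbers:

```python
crop_window_patches, left_margin, pooling_size = 23, 2, 2   # illustrative values only
rounded = (crop_window_patches + left_margin + pooling_size - 1) // pooling_size * pooling_size
print(rounded)  # 26: the 25 patches (23 usable + 2 margin) round up to the next multiple of 2
```
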
modeling_jvlm.py CHANGED
@@ -27,13 +27,14 @@ from .blocks_jvlm import (
27
  TransformerBlock,
28
  VisionLanguageConnector,
29
  build_layer_norm,
30
- resolve_causal_mask,
31
  )
32
  from .configuration_jvlm import JinaVLMConfig, JinaVLMTextConfig, JinaVLMVisionConfig
33
 
34
 
35
  class JinaPreTrainedModel(PreTrainedModel):
36
  config: JinaVLMConfig
 
37
  base_model_prefix = 'model'
38
  supports_gradient_checkpointing = True
39
  _supports_flash_attn = True
@@ -50,6 +51,8 @@ class JinaPreTrainedModel(PreTrainedModel):
50
 
51
  class JinaVLMVisionModel(JinaPreTrainedModel):
52
  config: JinaVLMVisionConfig
 
 
53
 
54
  def __init__(self, config: JinaVLMVisionConfig, *args, **kwargs):
55
  super().__init__(config, *args, **kwargs)
@@ -183,11 +186,7 @@ class JinaVLMVisionModel(JinaPreTrainedModel):
183
  pos = pos_emb[None, :, :].to(x.dtype)
184
  return x + pos
185
 
186
- def get_visual_features(
187
- self,
188
- images: torch.Tensor,
189
- **kwargs: Unpack[FlashAttentionKwargs],
190
- ) -> BaseModelOutput:
191
  x, shape = self.patch_embed(images)
192
  if self.cls_embed is not None:
193
  cls = self.cls_embed.view(1, 1, -1).expand(x.shape[0], -1, -1).to(x.dtype)
@@ -202,11 +201,7 @@ class JinaVLMVisionModel(JinaPreTrainedModel):
202
  hidden_states = []
203
  attentions = []
204
  for layer in self.layers:
205
- x, attn = layer(
206
- x,
207
- attn_implementation=self.config._attn_implementation,
208
- **kwargs,
209
- )
210
  hidden_states.append(x)
211
  attentions.append(attn)
212
  x = self.post_lnorm(x)
@@ -219,15 +214,12 @@ class JinaVLMVisionModel(JinaPreTrainedModel):
219
  )
220
 
221
  def forward(
222
- self,
223
- images: torch.Tensor,
224
- image_masks: torch.Tensor,
225
- **kwargs: Unpack[FlashAttentionKwargs],
226
  ) -> BaseModelOutput:
227
  b, t, n, d = images.shape
228
  mask = ~torch.all(images.view(b * t, n, d) == -1, dim=(1, 2), keepdim=True)
229
  images = images.view(b * t, n, d)
230
- out = self.get_visual_features(images, **kwargs)
231
  image_features = out.hidden_states
232
 
233
  features = []
@@ -238,13 +230,14 @@ class JinaVLMVisionModel(JinaPreTrainedModel):
238
  features.append(feats)
239
  image_features = torch.cat(features, dim=-1)
240
  image_features = image_features * mask
241
- image_features = image_features.view(b, t, n, -1).contiguous()
 
242
  image_features = self.vl_connector(
243
  image_features,
244
  image_masks,
245
  attn_implementation=self.config._attn_implementation,
246
- **kwargs,
247
  )
 
248
  return BaseModelOutput(
249
  last_hidden_state=image_features,
250
  hidden_states=out.hidden_states,
@@ -253,7 +246,11 @@ class JinaVLMVisionModel(JinaPreTrainedModel):
253
 
254
 
255
  class JinaVLMTextModel(JinaPreTrainedModel):
 
 
256
  config: JinaVLMTextConfig
 
 
257
 
258
  def __init__(self, config: JinaVLMTextConfig, *args, **kwargs):
259
  super().__init__(config, *args, **kwargs)
@@ -300,6 +297,7 @@ class JinaVLMTextModel(JinaPreTrainedModel):
300
  theta=self.config.rope_theta,
301
  head_dim=self.config.block_config.attn_config.head_dim,
302
  hidden_size=self.config.hidden_size,
 
303
  partial_rotary_factor=self.config.partial_rotary_factor,
304
  scaling=self.config.rope_scaling,
305
  )
@@ -390,7 +388,6 @@ class JinaVLMTextModel(JinaPreTrainedModel):
390
  batch_idx = torch.arange(bs, device=x.device)
391
  batch_idx = torch.tile(batch_idx[:, None], [1, image_features.shape[1]])
392
  image_features = image_features.to(x.device)
393
- x = x.clone() # Clone x to avoid in-place operation on leaf tensor
394
  x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
395
 
396
  if not self.rope:
@@ -446,7 +443,7 @@ class JinaVLMTextModel(JinaPreTrainedModel):
446
 
447
 
448
  class JinaVLM(JinaPreTrainedModel):
449
- config: JinaVLMConfig
450
 
451
  def __init__(self, config: JinaVLMConfig):
452
  super().__init__(config)
@@ -495,7 +492,7 @@ class JinaVLM(JinaPreTrainedModel):
495
  ) -> BaseModelOutputWithPast:
496
  image_features = None
497
  if images is not None and images.shape[1] > 0:
498
- image_out = self.vision_model(images, image_masks, **kwargs)
499
  image_features = image_out.last_hidden_state
500
  return self.language_model(
501
  input_ids=input_ids,
@@ -514,10 +511,10 @@ class JinaVLM(JinaPreTrainedModel):
514
 
515
 
516
  class JinaVLMForConditionalGeneration(JinaPreTrainedModel, GenerationMixin):
517
- _tied_weights_keys = {
518
- 'lm_head.weight': 'model.language_model.embedding.embedding.weight'
519
- }
520
  accepts_loss_kwargs = False
 
521
  config: JinaVLMConfig
522
 
523
  def __init__(self, config: JinaVLMConfig):
 
27
  TransformerBlock,
28
  VisionLanguageConnector,
29
  build_layer_norm,
30
+ resolve_causal_mask
31
  )
32
  from .configuration_jvlm import JinaVLMConfig, JinaVLMTextConfig, JinaVLMVisionConfig
33
 
34
 
35
  class JinaPreTrainedModel(PreTrainedModel):
36
  config: JinaVLMConfig
37
+ config_class = JinaVLMConfig
38
  base_model_prefix = 'model'
39
  supports_gradient_checkpointing = True
40
  _supports_flash_attn = True
 
51
 
52
  class JinaVLMVisionModel(JinaPreTrainedModel):
53
  config: JinaVLMVisionConfig
54
+ config_class = JinaVLMVisionConfig
55
+ base_model_prefix = ''
56
 
57
  def __init__(self, config: JinaVLMVisionConfig, *args, **kwargs):
58
  super().__init__(config, *args, **kwargs)
 
186
  pos = pos_emb[None, :, :].to(x.dtype)
187
  return x + pos
188
 
189
+ def get_visual_features(self, images: torch.Tensor) -> BaseModelOutput:
190
  x, shape = self.patch_embed(images)
191
  if self.cls_embed is not None:
192
  cls = self.cls_embed.view(1, 1, -1).expand(x.shape[0], -1, -1).to(x.dtype)
 
201
  hidden_states = []
202
  attentions = []
203
  for layer in self.layers:
204
+ x, attn = layer(x, attn_implementation=self.config._attn_implementation)
205
  hidden_states.append(x)
206
  attentions.append(attn)
207
  x = self.post_lnorm(x)
 
214
  )
215
 
216
  def forward(
217
+ self, images: torch.Tensor, image_masks: torch.Tensor
218
  ) -> BaseModelOutput:
219
  b, t, n, d = images.shape
220
  mask = ~torch.all(images.view(b * t, n, d) == -1, dim=(1, 2), keepdim=True)
221
  images = images.view(b * t, n, d)
222
+ out = self.get_visual_features(images)
223
  image_features = out.hidden_states
224
 
225
  features = []
 
230
  features.append(feats)
231
  image_features = torch.cat(features, dim=-1)
232
  image_features = image_features * mask
233
+ image_features = image_features.view(b, t, n, -1)
234
+
235
  image_features = self.vl_connector(
236
  image_features,
237
  image_masks,
238
  attn_implementation=self.config._attn_implementation,
 
239
  )
240
+
241
  return BaseModelOutput(
242
  last_hidden_state=image_features,
243
  hidden_states=out.hidden_states,
 
246
 
247
 
248
  class JinaVLMTextModel(JinaPreTrainedModel):
249
+ """Decoder-only language model."""
250
+
251
  config: JinaVLMTextConfig
252
+ config_class = JinaVLMTextConfig
253
+ base_model_prefix = ''
254
 
255
  def __init__(self, config: JinaVLMTextConfig, *args, **kwargs):
256
  super().__init__(config, *args, **kwargs)
 
297
  theta=self.config.rope_theta,
298
  head_dim=self.config.block_config.attn_config.head_dim,
299
  hidden_size=self.config.hidden_size,
300
+ n_heads=self.config.block_config.attn_config.n_heads,
301
  partial_rotary_factor=self.config.partial_rotary_factor,
302
  scaling=self.config.rope_scaling,
303
  )
 
388
  batch_idx = torch.arange(bs, device=x.device)
389
  batch_idx = torch.tile(batch_idx[:, None], [1, image_features.shape[1]])
390
  image_features = image_features.to(x.device)
 
391
  x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
392
 
393
  if not self.rope:
 
443
 
444
 
445
  class JinaVLM(JinaPreTrainedModel):
446
+ base_model_prefix = ''
447
 
448
  def __init__(self, config: JinaVLMConfig):
449
  super().__init__(config)
 
492
  ) -> BaseModelOutputWithPast:
493
  image_features = None
494
  if images is not None and images.shape[1] > 0:
495
+ image_out = self.vision_model(images, image_masks)
496
  image_features = image_out.last_hidden_state
497
  return self.language_model(
498
  input_ids=input_ids,
 
511
 
512
 
513
  class JinaVLMForConditionalGeneration(JinaPreTrainedModel, GenerationMixin):
514
+ _checkpoint_conversion_mapping = {}
515
+ _tied_weights_keys = ['lm_head.weight']
 
516
  accepts_loss_kwargs = False
517
+ base_model_prefix = 'model'
518
  config: JinaVLMConfig
519
 
520
  def __init__(self, config: JinaVLMConfig):
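
In the revised `JinaVLMVisionModel.forward`, the flash-attention kwargs plumbing is removed and padding is detected straight from the pixel values: a crop whose values are all `-1` is treated as padding and its features are zeroed before the vision-language connector. A minimal sketch of that masking step with illustrative shapes:

```python
import torch

b, t, n, d = 1, 2, 4, 3                                     # batch, crops, patches, dim; illustrative
images = torch.randn(b, t, n, d)
images[:, 1] = -1.0                                         # pretend the second crop is pure padding
flat = images.view(b * t, n, d)
mask = ~torch.all(flat == -1, dim=(1, 2), keepdim=True)     # False only for the all-padding crop
features = torch.randn(b * t, n, d) * mask                  # its features contribute zeros downstream
```
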
processing_jvlm.py CHANGED
@@ -10,14 +10,11 @@ from transformers.image_utils import ImageInput
10
  from transformers.processing_utils import (
11
  AllKwargsForChatTemplate,
12
  CommonKwargs,
13
- MultiModalData,
14
  ProcessorMixin,
15
  Unpack,
16
  )
17
  from transformers.tokenization_utils_base import (
18
- PaddingStrategy,
19
- PreTokenizedInput,
20
- TextInput,
21
  )
22
 
23
  from .image_processing_jvlm import JinaVLMImageProcessor, JinaVLMImagesKwargs
@@ -41,8 +38,8 @@ class JinaVLMTextKwargs(TypedDict, total=False):
41
  is_split_into_words: Optional[bool]
42
 
43
 
44
- class JinaVLMProcessingKwargs(JinaVLMTextKwargs, JinaVLMImagesKwargs, CommonKwargs):
45
- return_labels: Optional[bool]
46
 
47
 
48
  class JinaVLMProcessor(ProcessorMixin):
@@ -174,8 +171,8 @@ class JinaVLMProcessor(ProcessorMixin):
174
  def _collate(
175
  self,
176
  batch: Dict[str, List[Optional[np.ndarray]]],
177
- text_max_sequence_length: Optional[int] = None,
178
- image_max_sequence_length: Optional[int] = None,
179
  padding: Union[
180
  PaddingStrategy.MAX_LENGTH, PaddingStrategy.LONGEST
181
  ] = PaddingStrategy.MAX_LENGTH,
@@ -188,10 +185,10 @@ class JinaVLMProcessor(ProcessorMixin):
188
  _padding_side = 'right'
189
  if key in self.TEXT_KEYS:
190
  _padding_side = padding_side
191
- max_len = text_max_sequence_length
192
  dtype = np.int64
193
  elif key in self.IMAGE_KEYS:
194
- max_len = image_max_sequence_length
195
  dtype = np.int64
196
  if key == 'images':
197
  dtype = np.float32
@@ -217,22 +214,22 @@ class JinaVLMProcessor(ProcessorMixin):
217
  shift = input_ids_padlens[:, np.newaxis, np.newaxis]
218
  shift = np.repeat(shift, n_image_tokens, axis=2)
219
  shift = np.repeat(shift, n_crops, axis=1)
220
- image_input_idx[image_input_idx < 0] = -text_max_sequence_length
221
  image_input_idx = image_input_idx + shift
222
  out['image_input_idx'] = image_input_idx
223
 
224
- if text_max_sequence_length is not None:
225
  image_input_idx = out.get('image_input_idx', [])
226
  n = len(image_input_idx)
227
  for i in range(n):
228
  arr = image_input_idx[i]
229
  if arr.ndim > 0 and arr.size > 0:
230
  n_image_tokens = arr.max()
231
- if n_image_tokens > text_max_sequence_length - 3:
232
  raise RuntimeError(
233
  'Image tokens truncation at sequence boundary. Max '
234
- f'sequence length ({text_max_sequence_length}) is too '
235
- 'small to fit the generated image tokens '
236
  f'({n_image_tokens}). Consider increasing the max '
237
  'sequence length or tweaking the image processing '
238
  'parameters (`max_crops`, `max_pixels`) to reduce the '
@@ -262,7 +259,6 @@ class JinaVLMProcessor(ProcessorMixin):
262
  image_tokens: List[np.ndarray],
263
  image_input_idx: List[np.ndarray],
264
  image_padding_mask: List[np.ndarray],
265
- return_labels: bool = False,
266
  add_empty_image_features: bool = False,
267
  ):
268
  """Interleave images and text tokens into multi-modal features for the model."""
@@ -286,9 +282,8 @@ class JinaVLMProcessor(ProcessorMixin):
286
  data = {
287
  'input_ids': input_ids,
288
  'position_ids': position_ids,
 
289
  }
290
- if return_labels:
291
- data['labels'] = target_tokens
292
  if add_empty_image_features:
293
  # Add size-zero image features, this can be useful to make sure all
294
  # devices get an image input when the image ViT is FSDP wrapped
@@ -372,16 +367,14 @@ class JinaVLMProcessor(ProcessorMixin):
372
  image_input_idx < 0, image_input_idx, image_input_idx + 1
373
  )
374
  position_ids = np.arange(len(input_ids), dtype=np.int64)
375
- data = {
376
  'input_ids': input_ids,
377
  'position_ids': position_ids,
378
  'images': images,
379
  'image_input_idx': image_input_idx,
380
  'image_masks': image_masks,
 
381
  }
382
- if return_labels:
383
- data['labels'] = target_tokens
384
- return data
385
 
386
  def __call__(
387
  self,
@@ -389,7 +382,7 @@ class JinaVLMProcessor(ProcessorMixin):
389
  text: Union[
390
  None, TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]
391
  ] = None,
392
- **kwargs: Unpack[JinaVLMProcessingKwargs],
393
  ) -> BatchFeature:
394
  """Main method to prepare for the model one or several sequences(s) and
395
  image(s). This method forwards the `text` and `kwargs` arguments to the
@@ -432,7 +425,6 @@ class JinaVLMProcessor(ProcessorMixin):
432
  raise ValueError('Processor requires text input.')
433
 
434
  return_tensors = kwargs.pop('return_tensors', None)
435
- return_labels = kwargs.pop('return_labels', False)
436
  padding = kwargs.pop('padding', PaddingStrategy.LONGEST)
437
  padding_side = kwargs.pop('padding_side', 'left')
438
  max_length = kwargs.pop('max_length', None)
@@ -461,7 +453,6 @@ class JinaVLMProcessor(ProcessorMixin):
461
  )
462
  token_ids = text_inputs['input_ids']
463
  batch_size = token_ids.shape[0]
464
- images = images or [[] for _ in range(batch_size)]
465
 
466
  if batch_size == 1:
467
  if isinstance(images, list):
@@ -492,11 +483,9 @@ class JinaVLMProcessor(ProcessorMixin):
492
  )
493
 
494
  outputs = defaultdict(list)
495
- n_images = []
496
  for idx in range(batch_size):
497
  _token_ids = token_ids[idx]
498
  _images = images[idx]
499
- n_images.append(len(_images))
500
  image_inputs = self.image_processor(_images, **images_kwargs)
501
  image_crops = image_inputs['image_crops']
502
  image_tokens = image_inputs['image_tokens']
@@ -509,48 +498,19 @@ class JinaVLMProcessor(ProcessorMixin):
509
  image_input_idx,
510
  image_padding_mask if image_padding_mask is not None else [],
511
  add_empty_image_features=(batch_size > 1),
512
- return_labels=return_labels,
513
  )
514
  for k, v in output.items():
515
  outputs[k].append(v)
516
 
517
  if padding != PaddingStrategy.DO_NOT_PAD:
518
- text_max_sequence_length = max_length or self.max_sequence_length
519
- max_crops = max_crops or self.max_crops
520
- max_n_images = max(n_images)
521
- image_max_sequence_length = (max_crops + 1) * max_n_images
522
  outputs = self._collate(
523
  outputs,
524
- text_max_sequence_length=text_max_sequence_length,
525
- image_max_sequence_length=image_max_sequence_length,
526
  padding=padding,
527
  padding_side=padding_side,
528
  )
529
  return BatchFeature(data=outputs, tensor_type=return_tensors)
530
 
531
- def _get_num_multimodal_tokens(
532
- self,
533
- image_sizes: Optional[List[List[int]]] = None,
534
- **kwargs: Unpack[JinaVLMImagesKwargs],
535
- ) -> MultiModalData:
536
- """Computes the number of placeholder tokens needed for multimodal inputs with
537
- the given sizes.
538
-
539
- Args:
540
- image_sizes (`list[list[int]]`, *optional*):
541
- The input sizes formatted as (height, width) per each image.
542
- Returns:
543
- `MultiModalData`: A `MultiModalData` object holding number of tokens per
544
- each of the provided input modalities, along with other useful data.
545
- """
546
- data = {}
547
- if image_sizes is not None:
548
- n_patches = [
549
- self.image_processor.get_n_image_patches(h, w, **kwargs)
550
- for h, w in image_sizes
551
- ]
552
- data.update({'num_image_tokens': n_patches, 'num_image_patches': n_patches})
553
- return MultiModalData(**data)
554
-
555
 
556
  JinaVLMProcessor.register_for_auto_class()
 
10
  from transformers.processing_utils import (
11
  AllKwargsForChatTemplate,
12
  CommonKwargs,
 
13
  ProcessorMixin,
14
  Unpack,
15
  )
16
  from transformers.tokenization_utils_base import (
17
+ PaddingStrategy, PreTokenizedInput, TextInput,
 
 
18
  )
19
 
20
  from .image_processing_jvlm import JinaVLMImageProcessor, JinaVLMImagesKwargs
 
38
  is_split_into_words: Optional[bool]
39
 
40
 
41
+ class JinaVLProcessingKwargs(JinaVLMTextKwargs, JinaVLMImagesKwargs, CommonKwargs):
42
+ pass
43
 
44
 
45
  class JinaVLMProcessor(ProcessorMixin):
 
171
  def _collate(
172
  self,
173
  batch: Dict[str, List[Optional[np.ndarray]]],
174
+ max_sequence_length: Optional[int] = None,
175
+ max_crops: Optional[int] = None,
176
  padding: Union[
177
  PaddingStrategy.MAX_LENGTH, PaddingStrategy.LONGEST
178
  ] = PaddingStrategy.MAX_LENGTH,
 
185
  _padding_side = 'right'
186
  if key in self.TEXT_KEYS:
187
  _padding_side = padding_side
188
+ max_len = max_sequence_length
189
  dtype = np.int64
190
  elif key in self.IMAGE_KEYS:
191
+ max_len = max_crops
192
  dtype = np.int64
193
  if key == 'images':
194
  dtype = np.float32
 
214
  shift = input_ids_padlens[:, np.newaxis, np.newaxis]
215
  shift = np.repeat(shift, n_image_tokens, axis=2)
216
  shift = np.repeat(shift, n_crops, axis=1)
217
+ image_input_idx[image_input_idx < 0] = -max_sequence_length
218
  image_input_idx = image_input_idx + shift
219
  out['image_input_idx'] = image_input_idx
220
 
221
+ if max_sequence_length is not None:
222
  image_input_idx = out.get('image_input_idx', [])
223
  n = len(image_input_idx)
224
  for i in range(n):
225
  arr = image_input_idx[i]
226
  if arr.ndim > 0 and arr.size > 0:
227
  n_image_tokens = arr.max()
228
+ if n_image_tokens > max_sequence_length - 3:
229
  raise RuntimeError(
230
  'Image tokens truncation at sequence boundary. Max '
231
+ f'sequence length ({max_sequence_length}) is too small '
232
+ 'to fit the generated image tokens '
233
  f'({n_image_tokens}). Consider increasing the max '
234
  'sequence length or tweaking the image processing '
235
  'parameters (`max_crops`, `max_pixels`) to reduce the '
 
259
  image_tokens: List[np.ndarray],
260
  image_input_idx: List[np.ndarray],
261
  image_padding_mask: List[np.ndarray],
 
262
  add_empty_image_features: bool = False,
263
  ):
264
  """Interleave images and text tokens into multi-modal features for the model."""
 
282
  data = {
283
  'input_ids': input_ids,
284
  'position_ids': position_ids,
285
+ 'labels': target_tokens,
286
  }
287
  if add_empty_image_features:
288
  # Add size-zero image features, this can be useful to make sure all
289
  # devices get an image input when the image ViT is FSDP wrapped
 
367
  image_input_idx < 0, image_input_idx, image_input_idx + 1
368
  )
369
  position_ids = np.arange(len(input_ids), dtype=np.int64)
370
+ return {
371
  'input_ids': input_ids,
372
  'position_ids': position_ids,
373
  'images': images,
374
  'image_input_idx': image_input_idx,
375
  'image_masks': image_masks,
376
+ 'labels': target_tokens,
377
  }
378
 
379
  def __call__(
380
  self,
 
382
  text: Union[
383
  None, TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]
384
  ] = None,
385
+ **kwargs: Unpack[JinaVLProcessingKwargs],
386
  ) -> BatchFeature:
387
  """Main method to prepare for the model one or several sequences(s) and
388
  image(s). This method forwards the `text` and `kwargs` arguments to the
 
425
  raise ValueError('Processor requires text input.')
426
 
427
  return_tensors = kwargs.pop('return_tensors', None)
 
428
  padding = kwargs.pop('padding', PaddingStrategy.LONGEST)
429
  padding_side = kwargs.pop('padding_side', 'left')
430
  max_length = kwargs.pop('max_length', None)
 
453
  )
454
  token_ids = text_inputs['input_ids']
455
  batch_size = token_ids.shape[0]
 
456
 
457
  if batch_size == 1:
458
  if isinstance(images, list):
 
483
  )
484
 
485
  outputs = defaultdict(list)
 
486
  for idx in range(batch_size):
487
  _token_ids = token_ids[idx]
488
  _images = images[idx]
 
489
  image_inputs = self.image_processor(_images, **images_kwargs)
490
  image_crops = image_inputs['image_crops']
491
  image_tokens = image_inputs['image_tokens']
 
498
  image_input_idx,
499
  image_padding_mask if image_padding_mask is not None else [],
500
  add_empty_image_features=(batch_size > 1),
 
501
  )
502
  for k, v in output.items():
503
  outputs[k].append(v)
504
 
505
  if padding != PaddingStrategy.DO_NOT_PAD:
506
  outputs = self._collate(
507
  outputs,
508
+ max_sequence_length=max_length or self.max_sequence_length,
509
+ max_crops=max_crops or self.max_crops,
510
  padding=padding,
511
  padding_side=padding_side,
512
  )
513
  return BatchFeature(data=outputs, tensor_type=return_tensors)
514
515
 
516
  JinaVLMProcessor.register_for_auto_class()
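
A subtle point in the updated `_collate` is the index shift for left-padded batches: valid positions in `image_input_idx` must move right by the padding length, while unused slots (negative indices) must remain negative, which is why they are first reset to `-max_sequence_length`. A small NumPy illustration with made-up values:

```python
import numpy as np

max_sequence_length = 16                       # illustrative
pad_len = 5                                    # left padding added to this sample's input_ids
image_input_idx = np.array([[2, 3, -1]])       # -1 marks an unused image-token slot
image_input_idx[image_input_idx < 0] = -max_sequence_length
image_input_idx = image_input_idx + pad_len    # -> [[7, 8, -11]]: real positions shift, unused stays < 0
```
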
pyproject.toml DELETED
@@ -1,18 +0,0 @@
1
- [project]
2
- name = "jina-vlm"
3
- version = "1.0.0"
4
- description = "Jina VLM v1: Lightweight Vision Language Alignment"
5
- readme = "README.md"
6
- license = "CC-BY-NC-4.0"
7
- requires-python = ">=3.10"
8
- dependencies = [
9
- "torch>=2.9.0",
10
- "torchvision>=0.24.0",
11
- "transformers>=4.57.0",
12
- "pillow>=12.0.0",
13
- "einops>=0.8.1",
14
- "accelerate>=1.0.0",
15
- ]
16
-
17
- [project.optional-dependencies]
18
- flash-attn = ["flash-attn>=2.0.0"]
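
With `pyproject.toml` deleted, the dependency pins it carried are no longer declared in the repo and have to be installed by hand. A sketch that recreates the same environment programmatically (version bounds copied from the deleted file; a plain `pip install` in a shell works equally well):

```python
import subprocess
import sys

# Version bounds copied from the deleted pyproject.toml; flash-attn stays optional.
deps = [
    "torch>=2.9.0", "torchvision>=0.24.0", "transformers>=4.57.0",
    "pillow>=12.0.0", "einops>=0.8.1", "accelerate>=1.0.0",
]
subprocess.check_call([sys.executable, "-m", "pip", "install", *deps])
```
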
 
infer.py → test_jvlm.py RENAMED
@@ -11,10 +11,7 @@ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
11
 
12
  import torch
13
  from transformers import (
14
- AutoModelForCausalLM,
15
- AutoProcessor,
16
- GenerationConfig,
17
- TextStreamer,
18
  )
19
  from transformers.utils import is_flash_attn_2_available
20
 
@@ -63,8 +60,7 @@ def _build_conversations(
63
  try:
64
  result = urlparse(_path)
65
  return result.scheme in ('http', 'https')
66
- except Exception as e:
67
- _ = str(e)
68
  return False
69
 
70
  images = images or []
@@ -87,9 +83,8 @@ def _build_conversations(
87
  images = [TEST_IMAGE]
88
  n_images = len(images)
89
  prompts = (
90
- ['Describe the image in 100 words']
91
- if n_images == 1 or map_mode
92
- else ['Describe the images in 100 words']
93
  )
94
  n_prompts = len(prompts)
95
 
@@ -124,16 +119,8 @@ def _build_conversations(
124
  allimages = []
125
  allprompts = []
126
  ordinals = [
127
- 'first',
128
- 'second',
129
- 'third',
130
- 'fourth',
131
- 'fifth',
132
- 'sixth',
133
- 'seventh',
134
- 'eighth',
135
- 'ninth',
136
- 'tenth',
137
  ]
138
  for images, prompt in examples:
139
  content = []
@@ -143,17 +130,15 @@ def _build_conversations(
143
  content.append({'type': 'text', 'text': prompt})
144
  if len(images) > 1 and image_labels:
145
  for idx, img in enumerate(images):
146
- ordinal = ordinals[idx] if idx < len(ordinals) else f'{idx + 1}th'
147
  image = images[idx]
148
  descriptor = f'url: {image}'
149
  if os.path.isfile(image):
150
  descriptor = f'filename: {os.path.basename(image)}'
151
- content.append(
152
- {
153
- 'type': 'text',
154
- 'text': f'(this is the {ordinal} image, {descriptor})',
155
- }
156
- )
157
  content.append({'type': 'image', 'image': img})
158
  else:
159
  content.extend([{'type': 'image', 'image': image} for image in images])
@@ -204,7 +189,9 @@ def _token_usage_report(
204
  tokens_per_image_list = []
205
 
206
  # Find all img_start and img_end positions in input_ids
207
- start_positions = (input_ids == image_start_id).nonzero(as_tuple=True)[0].tolist()
 
 
208
  end_positions = (input_ids == image_end_id).nonzero(as_tuple=True)[0].tolist()
209
 
210
  if len(start_positions) > 0 and len(end_positions) > 0:
@@ -224,8 +211,9 @@ def _token_usage_report(
224
  # Get the start and end indices for this image
225
  start_idx_begin = idx * n_starts_per_image
226
  end_idx_end = (idx + 1) * n_starts_per_image
227
- if start_idx_begin < len(start_positions) and end_idx_end <= len(
228
- end_positions
 
229
  ):
230
  # First start position and last end position define the image span
231
  first_start = start_positions[start_idx_begin]
@@ -245,10 +233,10 @@ def _token_usage_report(
245
 
246
  for idx in range(n_images):
247
  n_tokens = tokens_per_image_list[idx] if idx < len(tokens_per_image_list) else 0
248
- pct = n_tokens / max_sequence_length * 100
249
  report.append(f'├── Image {idx + 1} → {n_tokens} tokens ({pct:.1f}%)')
250
 
251
- text_pct = text_token_count / max_sequence_length * 100
252
  report.append(f'└── Text: {text_token_count} tokens ({text_pct:.1f}%)')
253
 
254
  return '\n'.join(report)
@@ -256,17 +244,16 @@ def _token_usage_report(
256
 
257
  def test_jvlm():
258
  parser = argparse.ArgumentParser(
259
- description='jina-vlm vision-language model inference.'
260
  )
261
- default_model = '.' if os.path.exists('./config.json') else 'jinaai/jina-vlm'
262
  parser.add_argument(
263
  '-m',
264
  '--model',
265
- default=default_model,
266
  help=(
267
- 'Model path. Auto-detects local repo (if config.json exists) or '
268
- 'falls back to "jinaai/jina-vlm" from HuggingFace.'
269
- ),
270
  )
271
  parser.add_argument(
272
  '-i',
@@ -340,7 +327,7 @@ def test_jvlm():
340
  args = parser.parse_args()
341
 
342
  print()
343
- print('Welcome to the jinaai/jina-vlm playground ✨')
344
  print('Use this script to test our model!')
345
  print('- Jina AI')
346
  print()
@@ -352,9 +339,7 @@ def test_jvlm():
352
  print(f'Using dtype: {dtype}')
353
  print('Model path: ', args.model)
354
  processor = AutoProcessor.from_pretrained(
355
- args.model,
356
- trust_remote_code=True,
357
- use_fast=False,
358
  )
359
  model = AutoModelForCausalLM.from_pretrained(
360
  args.model,
@@ -371,13 +356,13 @@ def test_jvlm():
371
  print('Done ✅')
372
  print()
373
 
374
- print("--- Let's create some conversations ...")
375
  conversations, images, prompts = _build_conversations(
376
  args.image,
377
  args.prompt,
378
  map_mode=args.map,
379
  prompt_first=args.prompt_first,
380
- image_labels=args.image_labels,
381
  )
382
  n_conversations = len(conversations)
383
  print(f'Built {n_conversations} conversations 🚀')
@@ -449,28 +434,25 @@ def test_jvlm():
449
  print(f'├── 🖼️Images: {images[idx]}')
450
  print(f'├── 📜Prompt: {prompts[idx]}')
451
  print(f'├── 💬Chat:{texts[idx]}')
452
- print('└── 🧠Response:', end='')
453
  ith_inputs = {k: v[idx].unsqueeze(0) for k, v in device_inputs.items()}
454
  with (
455
  timer,
456
  torch.no_grad(),
457
- torch.autocast(
458
- device.type, enabled=(device.type != 'mps'), dtype=dtype
459
- ),
460
  ):
461
  output = model.generate(
462
  **ith_inputs,
463
  streamer=streamer,
464
  generation_config=GenerationConfig(
465
- max_new_tokens=args.max_tokens,
466
- do_sample=False,
467
  ),
468
  return_dict_in_generate=True,
469
  use_model_defaults=True,
470
  )
471
  generation_time += timer.time
472
 
473
- out = output.sequences[0][len(input_prompts[idx].tolist()) :]
474
  generated_tokens += len(out)
475
  print('Token usage report:')
476
  print(token_usage_reports[idx])
@@ -488,8 +470,7 @@ def test_jvlm():
488
  output = model.generate(
489
  **device_inputs,
490
  generation_config=GenerationConfig(
491
- max_new_tokens=args.max_tokens,
492
- do_sample=False,
493
  ),
494
  return_dict_in_generate=True,
495
  use_model_defaults=True,
@@ -497,7 +478,7 @@ def test_jvlm():
497
  generation_time = timer.time
498
 
499
  for idx in range(n_conversations):
500
- out = output.sequences[idx][len(input_prompts[idx].tolist()) :]
501
  generated_tokens += len(out)
502
  response = processor.tokenizer.decode(out, skip_special_tokens=True)
503
  print(f'* Conversation {idx + 1}/{n_conversations}')
 
11
 
12
  import torch
13
  from transformers import (
14
+ AutoModelForCausalLM, AutoProcessor, GenerationConfig, TextStreamer
15
  )
16
  from transformers.utils import is_flash_attn_2_available
17
 
 
60
  try:
61
  result = urlparse(_path)
62
  return result.scheme in ('http', 'https')
63
+ except:
 
64
  return False
65
 
66
  images = images or []
 
83
  images = [TEST_IMAGE]
84
  n_images = len(images)
85
  prompts = (
86
+ ['Describe the image in 100 words'] if n_images == 1 or map_mode else
87
+ ['Describe the images in 100 words']
 
88
  )
89
  n_prompts = len(prompts)
90
 
 
119
  allimages = []
120
  allprompts = []
121
  ordinals = [
122
+ 'first', 'second', 'third', 'fourth', 'fifth',
123
+ 'sixth', 'seventh', 'eighth', 'ninth', 'tenth',
124
  ]
125
  for images, prompt in examples:
126
  content = []
 
130
  content.append({'type': 'text', 'text': prompt})
131
  if len(images) > 1 and image_labels:
132
  for idx, img in enumerate(images):
133
+ ordinal = ordinals[idx] if idx < len(ordinals) else f'{idx+1}th'
134
  image = images[idx]
135
  descriptor = f'url: {image}'
136
  if os.path.isfile(image):
137
  descriptor = f'filename: {os.path.basename(image)}'
138
+ content.append({
139
+ 'type': 'text',
140
+ 'text': f'(this is the {ordinal} image, {descriptor})',
141
+ })
142
  content.append({'type': 'image', 'image': img})
143
  else:
144
  content.extend([{'type': 'image', 'image': image} for image in images])
 
189
  tokens_per_image_list = []
190
 
191
  # Find all img_start and img_end positions in input_ids
192
+ start_positions = (input_ids == image_start_id).nonzero(
193
+ as_tuple=True
194
+ )[0].tolist()
195
  end_positions = (input_ids == image_end_id).nonzero(as_tuple=True)[0].tolist()
196
 
197
  if len(start_positions) > 0 and len(end_positions) > 0:
 
211
  # Get the start and end indices for this image
212
  start_idx_begin = idx * n_starts_per_image
213
  end_idx_end = (idx + 1) * n_starts_per_image
214
+ if (
215
+ start_idx_begin < len(start_positions) and
216
+ end_idx_end <= len(end_positions)
217
  ):
218
  # First start position and last end position define the image span
219
  first_start = start_positions[start_idx_begin]
 
233
 
234
  for idx in range(n_images):
235
  n_tokens = tokens_per_image_list[idx] if idx < len(tokens_per_image_list) else 0
236
+ pct = (n_tokens / max_sequence_length * 100)
237
  report.append(f'├── Image {idx + 1} → {n_tokens} tokens ({pct:.1f}%)')
238
 
239
+ text_pct = (text_token_count / max_sequence_length * 100)
240
  report.append(f'└── Text: {text_token_count} tokens ({text_pct:.1f}%)')
241
 
242
  return '\n'.join(report)
 
244
 
245
  def test_jvlm():
246
  parser = argparse.ArgumentParser(
247
+ description='jina-vlm-v1 vision-language model inference.'
248
  )
 
249
  parser.add_argument(
250
  '-m',
251
  '--model',
252
+ default='.',
253
  help=(
254
+ 'Model path (default: `"."`). Set this to `"jinaai/jina-vlm-v1"` if you '
255
+ 'are running this script outside this repo.'
256
+ )
257
  )
258
  parser.add_argument(
259
  '-i',
 
327
  args = parser.parse_args()
328
 
329
  print()
330
+ print('Welcome to the jinaai/jina-vlm-v1 playground ✨')
331
  print('Use this script to test our model!')
332
  print('- Jina AI')
333
  print()
 
339
  print(f'Using dtype: {dtype}')
340
  print('Model path: ', args.model)
341
  processor = AutoProcessor.from_pretrained(
342
+ args.model, trust_remote_code=True, use_fast=False,
343
  )
344
  model = AutoModelForCausalLM.from_pretrained(
345
  args.model,
 
356
  print('Done ✅')
357
  print()
358
 
359
+ print('--- Let\'s create some conversations ...')
360
  conversations, images, prompts = _build_conversations(
361
  args.image,
362
  args.prompt,
363
  map_mode=args.map,
364
  prompt_first=args.prompt_first,
365
+ image_labels=args.image_labels
366
  )
367
  n_conversations = len(conversations)
368
  print(f'Built {n_conversations} conversations 🚀')
 
434
  print(f'├── 🖼️Images: {images[idx]}')
435
  print(f'├── 📜Prompt: {prompts[idx]}')
436
  print(f'├── 💬Chat:{texts[idx]}')
437
+ print(f'└── 🧠Response:', end='')
438
  ith_inputs = {k: v[idx].unsqueeze(0) for k, v in device_inputs.items()}
439
  with (
440
  timer,
441
  torch.no_grad(),
442
+ torch.autocast(device.type, enabled=(device.type != 'mps'), dtype=dtype)
443
  ):
444
  output = model.generate(
445
  **ith_inputs,
446
  streamer=streamer,
447
  generation_config=GenerationConfig(
448
+ max_new_tokens=args.max_tokens, do_sample=False,
 
449
  ),
450
  return_dict_in_generate=True,
451
  use_model_defaults=True,
452
  )
453
  generation_time += timer.time
454
 
455
+ out = output.sequences[0][len(input_prompts[idx].tolist()):]
456
  generated_tokens += len(out)
457
  print('Token usage report:')
458
  print(token_usage_reports[idx])
 
470
  output = model.generate(
471
  **device_inputs,
472
  generation_config=GenerationConfig(
473
+ max_new_tokens=args.max_tokens, do_sample=False,
 
474
  ),
475
  return_dict_in_generate=True,
476
  use_model_defaults=True,
 
478
  generation_time = timer.time
479
 
480
  for idx in range(n_conversations):
481
+ out = output.sequences[idx][len(input_prompts[idx].tolist()):]
482
  generated_tokens += len(out)
483
  response = processor.tokenizer.decode(out, skip_special_tokens=True)
484
  print(f'* Conversation {idx + 1}/{n_conversations}')
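
For reference, the per-image budget lines printed by `_token_usage_report` are simple percentages of the maximum sequence length. A self-contained illustration with made-up counts that reproduces the report's format:

```python
max_sequence_length = 8192                     # illustrative budget, not a model constant
tokens_per_image = [729, 365]                  # made-up per-image token counts
text_tokens = 42
for i, n in enumerate(tokens_per_image, 1):
    print(f'├── Image {i} → {n} tokens ({n / max_sequence_length * 100:.1f}%)')
print(f'└── Text: {text_tokens} tokens ({text_tokens / max_sequence_length * 100:.1f}%)')
```
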