Update README.md

#2
by Xenova HF Staff - opened
Files changed (1)
  1. README.md +232 -3
README.md CHANGED
@@ -1,3 +1,232 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ base_model:
+ - mistralai/Ministral-3-3B-Instruct-2512
+ language:
+ - en
+ - fr
+ - es
+ - de
+ - it
+ - pt
+ - nl
+ - zh
+ - ja
+ - ko
+ - ar
+ ---
+
+ # Ministral 3 3B Instruct 2512
+ The smallest model in the Ministral 3 family, **Ministral 3 3B** is a powerful, efficient tiny language model with vision capabilities.
+
+ This model is the instruct post-trained version, fine-tuned to follow instructions, which makes it well suited to chat and instruction-based use cases.
+
+ The Ministral 3 family is designed for edge deployment and runs on a wide range of hardware. Ministral 3 3B can even be deployed locally: it fits in 16GB of VRAM in BF16, and in less than 8GB of RAM/VRAM when quantized.
+
+ We provide a no-loss FP8 version [here](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-FP8). Other formats and quantizations are available in the [Ministral 3 - Quants](https://huggingface.co/collections/mistralai/ministral-3-quants) collection.
+
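As a rough sanity check on the memory figures above, the weight footprint can be estimated from the parameter counts this README lists (a 3.4B language model plus a 0.4B vision encoder). The sketch below is back-of-the-envelope arithmetic only: it counts weights, ignores the KV cache, activations, and runtime overhead, and the 4-bit entry assumes an idealized 0.5 bytes per parameter, which real quantization formats exceed somewhat.

```python
# Rough weight-memory estimate; parameter counts are taken from this README.
PARAMS = {"language_model": 3.4e9, "vision_encoder": 0.4e9}

# Bytes per parameter. The 4-bit value is an idealized assumption
# (it ignores quantization scales and any layers kept in higher precision).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q4 (idealized)": 0.5}


def weight_gb(bytes_per_param: float) -> float:
    """Return the total size of the weights in GB (1 GB = 1e9 bytes)."""
    return sum(PARAMS.values()) * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>15}: ~{weight_gb(nbytes):.1f} GB of weights")
```

Under these assumptions the BF16 weights alone come to about 7.6 GB, so the remainder of a 16GB budget is what is left for the KV cache, activations, and runtime overhead.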
+ ## Key Features
+ Ministral 3 3B consists of two main architectural components:
+ - **3.4B Language Model**
+ - **0.4B Vision Encoder**
+
+ The Ministral 3 3B Instruct model offers the following capabilities:
+ - **Vision**: Analyzes images and provides insights based on visual content, in addition to text.
+ - **Multilingual**: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
+ - **System Prompt**: Supports system prompts and adheres to them closely.
+ - **Agentic**: Offers best-in-class agentic capabilities with native function calling and JSON output.
+ - **Edge-Optimized**: Delivers best-in-class performance at a small scale, deployable anywhere.
+ - **Apache 2.0 License**: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
+ - **Large Context Window**: Supports a 256k context window.
+
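To make the function calling and JSON output capability more concrete, here is a minimal, library-free sketch of the application side of a tool call. Everything in it is hypothetical and chosen for illustration: the JSON-Schema-style tool description, the `get_weather` function, and the hard-coded `model_output` string. The exact tool-call format Ministral 3 emits is not specified in this README.

```python
import json

# Hypothetical tool description in a JSON-Schema-like layout (an assumption).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would query a weather service."""
    return f"Sunny in {city}"


# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def dispatch(raw_call: str) -> str:
    """Parse a JSON tool call, check required arguments, and run the tool."""
    call = json.loads(raw_call)
    spec = next(t["function"] for t in TOOLS if t["function"]["name"] == call["name"])
    missing = [k for k in spec["parameters"]["required"] if k not in call["arguments"]]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return REGISTRY[call["name"]](**call["arguments"])


# Hypothetical model output, hard-coded so the sketch runs without a model.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch(model_output))
```

Validating the arguments before dispatch keeps a malformed call from reaching the tool itself, which matters when the JSON comes from a model rather than from trusted code.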
+ ### Use Cases
+ Ideal for lightweight, real-time applications on edge or low-resource devices, such as:
+ - Image captioning
+ - Text classification
+ - Efficient real-time translation
+ - Data extraction
+ - Short content generation
+ - Fine-tuning and specialization
+ - And more...
+
+ Ministral 3 3B brings advanced AI capabilities to edge and distributed environments for embedded systems.
+
+ ## Ministral 3 Family
+
+ | Model Name | Type | Precision | Link |
+ |--------------------------------|--------------------|-----------|------------------------------------------------------------------------------------------|
+ | Ministral 3 3B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Base-2512) |
+ | **Ministral 3 3B Instruct 2512** | **Instruct post-trained** | **BF16** | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) |
+ | Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-3B-Reasoning-2512) |
+ | Ministral 3 8B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Base-2512) |
+ | Ministral 3 8B Instruct 2512 | Instruct post-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512) |
+ | Ministral 3 8B Reasoning 2512 | Reasoning capable | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512) |
+ | Ministral 3 14B Base 2512 | Base pre-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Base-2512) |
+ | Ministral 3 14B Instruct 2512 | Instruct post-trained | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512) |
+ | Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | [Hugging Face](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512) |
+
+ Other formats available [here](https://huggingface.co/collections/mistralai/ministral-3-quants).
+
+ ## Benchmark Results
+
+ We compare Ministral 3 to similarly sized models. The best score in each size group is underlined.
+
+ ### Reasoning
+
+ | Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench |
+ |---------------------------|-------------|-------------|--------------|---------------|
+ | **Ministral 3 14B** | <u>0.850</u>| <u>0.898</u>| <u>0.712</u> | <u>0.646</u> |
+ | Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 |
+ | | | | | |
+ | **Ministral 3 8B** | 0.787 | <u>0.860</u>| 0.668 | <u>0.616</u> |
+ | Qwen3-VL-8B-Thinking | <u>0.798</u>| <u>0.860</u>| <u>0.671</u> | 0.580 |
+ | | | | | |
+ | **Ministral 3 3B** | <u>0.721</u>| <u>0.775</u>| 0.534 | <u>0.548</u> |
+ | Qwen3-VL-4B-Thinking | 0.697 | 0.729 | <u>0.601</u> | 0.513 |
+
+ ### Instruct
+
+ | Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench |
+ |---------------------------|-------------|------------|-------------|------------------|
+ | **Ministral 3 14B** | <u>0.551</u>| <u>68.5</u>| <u>0.904</u>| <u>8.49</u> |
+ | Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | NOT MULTIMODAL |
+ | Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 |
+ | | | | | |
+ | **Ministral 3 8B** | 0.509 | <u>66.8</u>| 0.876 | <u>8.08</u> |
+ | Qwen3-VL-8B-Instruct | <u>0.528</u>| 66.3 | <u>0.946</u>| 8.00 |
+ | | | | | |
+ | **Ministral 3 3B** | 0.305 | <u>56.8</u>| 0.830 | 7.83 |
+ | Qwen3-VL-4B-Instruct | <u>0.438</u>| <u>56.8</u>| <u>0.900</u>| <u>8.01</u> |
+ | Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 |
+ | Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 |
+
+ ### Base
+
+ | Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot |
+ |---------------------|-------------------|-----------------|----------------|-------------------|-------------|-----------------|
+ | **Ministral 3 14B** | 0.742 | <u>0.676</u> | 0.648 | 0.820 | 0.794 | 0.749 |
+ | Qwen3 14B Base | <u>0.754</u> | 0.620 | <u>0.661</u> | <u>0.837</u> | <u>0.804</u>| 0.703 |
+ | Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | <u>0.788</u> |
+ | | | | | | | |
+ | **Ministral 3 8B** | <u>0.706</u> | <u>0.626</u> | 0.591 | 0.793 | <u>0.761</u>| <u>0.681</u> |
+ | Qwen 3 8B Base | 0.700 | 0.576 | <u>0.596</u> | <u>0.794</u> | 0.760 | 0.639 |
+ | | | | | | | |
+ | **Ministral 3 3B** | 0.652 | <u>0.601</u> | 0.511 | 0.735 | 0.707 | 0.592 |
+ | Qwen 3 4B Base | <u>0.677</u> | 0.405 | <u>0.570</u> | <u>0.759</u> | <u>0.713</u>| 0.530 |
+ | Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | <u>0.640</u> |
+
+ ## Usage
+
+ ### ONNXRuntime
+
+ ```py
+ from transformers import AutoConfig, AutoProcessor
+ import onnxruntime
+ import numpy as np
+ from huggingface_hub import hf_hub_download
+
+ # 1. Load config, processor, and model
+ model_id = "mistralai/Ministral-3-3B-Instruct-2512-ONNX"
+ config = AutoConfig.from_pretrained(model_id)
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ vision_model_path = hf_hub_download(model_id, "vision_encoder_q4.onnx", subfolder="onnx") # Download vision graph
+ hf_hub_download(model_id, "vision_encoder_q4.onnx_data", subfolder="onnx") # Download vision weights
+ embed_model_path = hf_hub_download(model_id, "embed_tokens_fp16.onnx", subfolder="onnx") # Download embed_tokens graph
+ hf_hub_download(model_id, "embed_tokens_fp16.onnx_data", subfolder="onnx") # Download embed_tokens weights
+ decoder_model_path = hf_hub_download(model_id, "decoder_model_merged_q4.onnx", subfolder="onnx") # Download decoder graph
+ hf_hub_download(model_id, "decoder_model_merged_q4.onnx_data", subfolder="onnx") # Download decoder weights (1/2)
+ hf_hub_download(model_id, "decoder_model_merged_q4.onnx_data_1", subfolder="onnx") # Download decoder weights (2/2)
+
+ ## Load sessions
+ providers = ['CPUExecutionProvider']
+ vision_session = onnxruntime.InferenceSession(vision_model_path, providers=providers)
+ embed_session = onnxruntime.InferenceSession(embed_model_path, providers=providers)
+ decoder_session = onnxruntime.InferenceSession(decoder_model_path, providers=providers)
+
+ ## Set config values
+ text_config = config.text_config
+ num_key_value_heads = text_config.num_key_value_heads
+ head_dim = text_config.head_dim
+ num_hidden_layers = text_config.num_hidden_layers
+ eos_token_id = text_config.eos_token_id
+ image_token_index = config.image_token_index
+
+ # 2. Prepare inputs
+ image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "text",
+                 "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
+             },
+             {"type": "image", "url": image_url},
+         ],
+     },
+ ]
+ inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+
+ input_ids = inputs['input_ids'].numpy()
+ attention_mask = inputs['attention_mask'].numpy()
+ pixel_values = inputs['pixel_values'].numpy()
+ batch_size = input_ids.shape[0]
+ ## Start with an empty KV cache (sequence length 0) for every layer
+ past_key_values = {
+     f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
+     for layer in range(num_hidden_layers)
+     for kv in ('key', 'value')
+ }
+ position_ids = np.tile(np.arange(0, input_ids.shape[-1]), (batch_size, 1))
+
+ # 3. Generation loop
+ max_new_tokens = 1024
+ generated_tokens = np.array([[]], dtype=np.int64)
+ image_features = None
+ for i in range(max_new_tokens):
+     inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
+
+     if image_features is None:
+         ## Only compute vision features if not already computed
+         image_features = vision_session.run(None, dict(
+             pixel_values=pixel_values,
+         ))[0]
+
+         ## Merge text and vision embeddings (first step only, while input_ids still holds the prompt)
+         inputs_embeds[input_ids == image_token_index] = image_features.reshape(-1, image_features.shape[-1])
+
+     logits, *present_key_values = decoder_session.run(None, dict(
+         inputs_embeds=inputs_embeds,
+         attention_mask=attention_mask,
+         position_ids=position_ids,
+         **past_key_values,
+     ))
+
+     ## Update values for next generation loop
+     input_ids = logits[:, -1].argmax(-1, keepdims=True)
+     attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=attention_mask.dtype)], axis=-1)
+     position_ids = position_ids[:, -1:] + 1
+     for j, key in enumerate(past_key_values):
+         past_key_values[key] = present_key_values[j]
+
+     generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
+     if (input_ids == eos_token_id).all():
+         break
+
+     ## (Optional) Streaming
+     print(processor.decode(input_ids[0]), end='', flush=True)
+ print()
+
+ # 4. Output result
+ print(processor.batch_decode(generated_tokens, skip_special_tokens=True)[0])
+ ```
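
The "merge text and vision embeddings" step in the example above relies on NumPy boolean-mask assignment. The toy sketch below isolates that one operation with made-up values: `image_token_index = 10`, the hidden size of 4, and the fake `image_features` are assumptions for illustration, not the model's real configuration or outputs.

```python
import numpy as np

image_token_index = 10  # hypothetical placeholder id, not the model's real value
hidden_size = 4         # tiny hidden size so the result is easy to read

# One sequence of five tokens; two of them are image placeholders.
input_ids = np.array([[1, 10, 10, 7, 2]])
inputs_embeds = np.zeros((1, 5, hidden_size), dtype=np.float32)

# Pretend vision-encoder output: one feature row per image placeholder.
image_features = np.arange(2 * hidden_size, dtype=np.float32).reshape(1, 2, hidden_size)

# A (batch, seq) boolean mask selects whole embedding rows of inputs_embeds.
mask = input_ids == image_token_index
flat_features = image_features.reshape(-1, hidden_size)
assert mask.sum() == flat_features.shape[0], "one feature row per placeholder token"
inputs_embeds[mask] = flat_features

print(inputs_embeds[0])
```

The number of placeholder tokens has to match the number of image feature rows, which is why the sketch asserts it before assigning. On later decoding steps `input_ids` holds only the newly generated token, so the mask selects no rows and the merge is only meaningful on the first step.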
+
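The bookkeeping in the generation loop (greedy token selection, a growing attention mask, and advancing position ids) can be tried without downloading any model. The sketch below swaps the ONNX decoder for a hypothetical `stub_decoder` that simply predicts "last position + 1"; the vocabulary size, EOS id, and prompt are toy assumptions, and the KV cache is left out entirely.

```python
import numpy as np

VOCAB_SIZE, EOS_ID = 8, 0  # toy values, not the model's real configuration


def stub_decoder(input_ids, attention_mask, position_ids):
    """Hypothetical stand-in for the ONNX decoder (no KV cache).

    Returns logits of shape (batch, seq, vocab) whose argmax at the last
    position is (last position id + 1) modulo the vocabulary size.
    """
    batch, seq = input_ids.shape
    # The mask must cover all earlier tokens plus the tokens fed in this step.
    assert attention_mask.shape == (batch, int(position_ids[0, -1]) + 1)
    logits = np.zeros((batch, seq, VOCAB_SIZE), dtype=np.float32)
    logits[np.arange(batch), -1, (position_ids[:, -1] + 1) % VOCAB_SIZE] = 1.0
    return logits


input_ids = np.array([[3, 4, 5]], dtype=np.int64)  # 3-token toy prompt
attention_mask = np.ones_like(input_ids)
position_ids = np.tile(np.arange(input_ids.shape[-1]), (1, 1))
generated = np.zeros((1, 0), dtype=np.int64)

for _ in range(16):
    logits = stub_decoder(input_ids, attention_mask, position_ids)
    # Greedy decoding: keep only the most likely token at the last position.
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    # Grow the mask by one and advance the position by one for the next step.
    attention_mask = np.concatenate(
        [attention_mask, np.ones((1, 1), dtype=attention_mask.dtype)], axis=-1
    )
    position_ids = position_ids[:, -1:] + 1
    generated = np.concatenate([generated, input_ids], axis=-1)
    if (input_ids == EOS_ID).all():
        break

print(generated)
```

After the first step only one token is fed per iteration, while the attention mask keeps its full length; in the real loop the KV cache is what lets the decoder attend to the earlier tokens that are no longer passed in.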
+ ### Transformers.js
+
+ TODO
+
+ ## License
+
+ This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt).
+
+ *You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*