
This is the unquantized Llama-4-Maverick model, converted into a format that can be used for quantization. The main difference is that the experts are unrolled into separate modules so they can be picked up independently by quantizers. It has been used to produce the Meta and RedHatAI quantized models: https://huggingface.co/RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 and https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
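
To see what the unrolled layout looks like, you can list a few parameter names from the checkpoint index without downloading any weights (a minimal sketch, assuming huggingface-cli and jq are installed and the repo ships the usual sharded-safetensors index file):

# Fetch only the index file and print the first few expert parameter names;
# each expert shows up as its own module in the weight map.
huggingface-cli download nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant model.safetensors.index.json --local-dir .
jq -r '.weight_map | keys[]' model.safetensors.index.json | grep -m 10 experts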

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations were obtained with lm-evaluation-harness.

Evaluation details

OpenLLM v1

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 4096 --gpu-memory-utilization 0.8 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=4096,num_concurrent=1 \
  --tasks openllm \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

OpenLLM v2

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 16384 --gpu-memory-utilization 0.6 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=16384,num_concurrent=1 \
  --tasks leaderboard \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Long Context RULER

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=False,max_length=524288,num_concurrent=1 \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Multimodal MMMU

For multimodal evals with local-chat-completions, we need to use this lm-evaluation-harness PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/2981. Due to vLLM issues, we also need to constrain max images to 1.
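
One way to install the harness from that PR branch (a sketch, assuming a standard git + pip environment; the local branch name pr-2981 is arbitrary):

# Fetch the head of PR #2981 into a local branch and install it in editable mode
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git fetch origin pull/2981/head:pr-2981
git checkout pr-2981
pip install -e .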

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill

lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks mmmu_val \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Multimodal ChartQA

For multimodal evals with local-chat-completions, we need to use this lm-evaluation-harness PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 (see the install sketch in the MMMU section above). Due to vLLM issues, we also need to constrain max images to 1.

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks chartqa \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config