
This is the unquantized Llama-4-Maverick model, converted into a format that can be used for quantization. The main difference is that the experts are unrolled into separate modules so they can be picked up independently by quantizers. It has been used to produce the Meta and RedHatAI quantized models: https://huggingface.co/RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 and https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
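
To see what the unrolled layout looks like, you can list a few parameter names from the checkpoint index without downloading any weights (a minimal sketch, assuming huggingface-cli and jq are installed and the repo ships the usual sharded-safetensors index file):

# Fetch only the index file and print the first few expert parameter names;
# each expert shows up as its own module in the weight map.
huggingface-cli download nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant model.safetensors.index.json --local-dir .
jq -r '.weight_map | keys[]' model.safetensors.index.json | grep -m 10 experts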

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations were obtained with lm-evaluation-harness.

Evaluation details

OpenLLM v1

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 4096 --gpu-memory-utilization 0.8 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=4096,num_concurrent=1 \
  --tasks openllm \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

OpenLLM v2

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 16384 --gpu-memory-utilization 0.6 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=16384,num_concurrent=1 \
  --tasks leaderboard \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Long Context RULER

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --enable-chunked-prefill

lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=False,max_length=524288,num_concurrent=1 \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Multimodal MMMU

For multimodal evals with local-chat-completions, we need to use this lm-evaluation-harness PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/2981. Due to vLLM issues, we also need to constrain max images to 1.
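
One way to install the harness from that PR branch (a sketch, assuming a standard git + pip environment; the local branch name pr-2981 is arbitrary):

# Fetch the head of PR #2981 into a local branch and install it in editable mode
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git fetch origin pull/2981/head:pr-2981
git checkout pr-2981
pip install -e .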

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill

lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks mmmu_val \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config

Multimodal ChartQA

For multimodal evals with local-chat-completions, we need to use this lm-evaluation-harness PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 (see the install sketch in the MMMU section above). Due to vLLM issues, we also need to constrain max images to 1.

vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks chartqa \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config