This is the unquantized Llama-4-Maverick model, converted into a format that can be used for quantization. The main difference from the original checkpoint is that the experts are unrolled into separate modules so they can be picked up independently by quantizers. It has been used to produce the Meta and RedHatAI quantized models:
- https://huggingface.co/RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
- https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
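As a quick illustration of the unrolled layout, one can list a few per-expert weight names straight from the checkpoint's safetensors index (a minimal sketch; the index filename follows the standard sharded-checkpoint convention and is an assumption here):

```bash
# Fetch the weight index of the unrolled checkpoint (assumed standard filename)
wget -q https://huggingface.co/nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant/resolve/main/model.safetensors.index.json
# Each expert should appear as its own module rather than inside a fused tensor
grep -o '"[^"]*\.experts\.[0-9]*\.[^"]*"' model.safetensors.index.json | sort -u | head
```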
## Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), the long-context RULER benchmark, and the multimodal MMMU and ChartQA benchmarks. All evaluations were obtained through lm-evaluation-harness.
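The commands below assume vLLM and lm-evaluation-harness are already installed; the source does not pin versions, so a minimal (unpinned) setup might look like:

```bash
pip install vllm lm-eval
```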
### Evaluation details
#### OpenLLM v1
```bash
vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 4096 --gpu-memory-utilization 0.8 --enable-chunked-prefill
```
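The server takes a while to load a model of this size; an optional way to block until it is ready before launching the client is to poll the OpenAI-compatible models endpoint:

```bash
# Poll until the OpenAI-compatible endpoint responds, then proceed
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 10; done
```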
```bash
lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=4096,num_concurrent=1 \
  --tasks openllm \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config
```
#### OpenLLM v2
```bash
vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 16384 --gpu-memory-utilization 0.6 --enable-chunked-prefill
```
```bash
lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=True,max_length=16384,num_concurrent=1 \
  --tasks leaderboard \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config
```
#### Long Context RULER
```bash
vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --enable-chunked-prefill
```
```bash
lm_eval \
  --model local-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",tokenizer="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/completions",max_retries=3,timeout=300,tokenized_requests=True,add_bos_token=False,max_length=524288,num_concurrent=1 \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config
```
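Note that the evaluated sequence length is controlled by the `--metadata` flag; the server's `--max-model-len` only bounds what can be requested. To sweep more than one length in a single run, the list can be extended; for example, this fragment replaces the `--metadata` line above:

```bash
--metadata='{"max_seq_lengths":[65536,131072]}'
```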
#### Multimodal MMMU
For multimodal evals with `local-chat-completions`, we need to use lm-evaluation-harness PR #2981: https://github.com/EleutherAI/lm-evaluation-harness/pull/2981. Due to vLLM issues, we also need to constrain the maximum number of images per prompt to 1.
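One way to get a harness build with that PR applied is to check out the PR head directly (a sketch; the local branch name `pr-2981` is arbitrary):

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
# Fetch the PR head into a local branch and install from it
git fetch origin pull/2981/head:pr-2981
git checkout pr-2981
pip install -e .
```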
```bash
vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill
```
```bash
lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks mmmu_val \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config
```
#### Multimodal ChartQA
As with MMMU, multimodal evals with `local-chat-completions` require lm-evaluation-harness PR #2981 (https://github.com/EleutherAI/lm-evaluation-harness/pull/2981), and the maximum number of images per prompt must be constrained to 1 due to vLLM issues.
```bash
# Set before launching the server so the multimodal input cache limit takes effect
export VLLM_MM_INPUT_CACHE_GIB=8
vllm serve nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant -tp 8 -pp 2 --dtype auto --max-model-len 524288 --gpu-memory-utilization 0.9 --limit-mm-per-prompt image=1 --enable-chunked-prefill
```
```bash
lm_eval \
  --model local-chat-completions \
  --model_args model="nm-testing/Llama-4-Maverick-17B-128E-Instruct-for-quant",base_url="http://localhost:8000/v1/chat/completions",max_retries=3,timeout=300,add_bos_token=False,max_length=524288,num_concurrent=1,max_images=1 \
  --tasks chartqa \
  --apply_chat_template \
  --write_out \
  --log_samples \
  --output_path <output_path> \
  --show_config
```
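Each run writes its aggregated scores to a timestamped JSON under `<output_path>` (plus per-sample logs from `--log_samples`). A quick way to pull the metrics from the most recent run, assuming `jq` is available and the harness's usual output layout:

```bash
# Print the aggregated metrics from the newest results file (layout assumed)
jq '.results' "$(find <output_path> -name 'results_*.json' | sort | tail -n 1)"
```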