OpenMOSE/Qwen3-VL-REAP-145B-A22B-GGUF
Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-235B.
1. Model Summary
- Base model: Qwen3-VL-235B (vision–language MoE LLM)
- Variant name: Qwen3-VL-REAP-145B-A22B
- Architecture: Decoder-only Transformer + MoE MLP experts, with vision encoder + VL fusion as in Qwen3-VL
- Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
- Expert sparsity: ~40% of MoE experts pruned globally
- Active parameters: “A22B” indicates roughly 22B active parameters per token (MoE sparse activation); total parameters are reduced to about 145B
- Modality: Text + Vision (VL support kept intact)
- License: Apache 2.0
- Author / Maintainer: OpenMOSE
- Year: 2025
This is an unofficial community variant of Qwen3-VL, not affiliated with or endorsed by Alibaba or Cerebras Systems.
2. What Is REAP and What Did We Change?
REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:
- Router statistics (routing probabilities)
- Expert activation patterns on a calibration set
to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
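The scoring idea can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes you have already collected, over a calibration set, each token's routing probabilities and the L2 norm of each expert's output (names `reap_expert_scores` and `experts_to_keep` are ours).

```python
import numpy as np

def reap_expert_scores(gate_weights, expert_output_norms):
    """Score each expert by its router-weighted activation magnitude.

    gate_weights: (num_tokens, num_experts) routing probabilities from the
        calibration set (zero where an expert is not in the token's top-k).
    expert_output_norms: (num_tokens, num_experts) L2 norms of each
        expert's output per token (zero where the expert was not run).

    Returns a (num_experts,) saliency vector; low-scoring experts are
    candidates for pruning.
    """
    return (gate_weights * expert_output_norms).mean(axis=0)

def experts_to_keep(scores, keep_ratio=0.6):
    """Keep the top `keep_ratio` fraction of experts by saliency."""
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    order = np.argsort(scores)[::-1]   # highest saliency first
    return np.sort(order[:n_keep])     # preserve original expert ordering
```

With `keep_ratio=0.6` this matches the ~40% pruning rate used for this release.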
For this model:
- We applied REAP to Qwen3-VL-235B across its MoE MLP blocks.
- ~40% of experts are pruned, based on router-weighted activation statistics.
- The routing mechanism itself is not conceptually changed; we only changed which experts remain.
- We extended the original REAP implementation to support the Qwen3-VL architecture, i.e. vision encoder + VL fusion layers, so pruning can be applied without breaking VL functionality.
In short: the same REAP algorithm, adapted to the Qwen3-VL architecture, with VL functionality left intact.
3. Calibration Data
The REAP pruning statistics were computed using:
Calibration dataset: https://huggingface.co/datasets/OpenMOSE/reap-calib-mix
This dataset is mostly synthetic, generated by Qwen3-235B-Instruct on mixed prompts designed to cover:
- General instruction-following
- Reasoning and long-form text
The calibration set is not used for additional fine-tuning here; it is used to measure router/expert activations to decide which experts to prune.
4. Why 145B-A22B? (Motivation & Hardware Footprint)
By pruning ~40% of experts and keeping VL:
The model shrinks from ~235B total parameters to about 145B total parameters.
With sparse MoE activation, around 22B parameters are active per token (“A22B”).
In practice, this makes it feasible to deploy on a single 96 GB GPU:
- No CPU offload is strictly required with Q4_K_M quantization
- Lower-VRAM configurations may still need offload or more aggressive quantization
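A rough back-of-envelope check of the weight footprint, assuming an effective rate of ~4.85 bits/weight for a Q4_K_M-style quant (that rate is an assumption; KV cache and activations need additional headroom):

```python
total_params_b = 145      # total parameters, in billions
bits_per_weight = 4.85    # rough effective rate for a Q4_K_M-style quant (assumption)

# billions of params * bits / 8 bits-per-byte = gigabytes of weights
weights_gb = total_params_b * bits_per_weight / 8
print(f"~{weights_gb:.0f} GB of weights")  # ≈ 88 GB, under a 96 GB budget
```

The ~88 GB of weights leaves limited but workable headroom for KV cache on a single 96 GB GPU; the unpruned 235B model would not fit at the same quantization.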
The goal is to keep the full VL capability of Qwen3-VL-235B, while making it:
- Easier to experiment with REAP-style MoE pruning
- More practical to deploy and fine-tune on high-end but single-node hardware
5. Intended Use
Primary intended uses
Research on:
- MoE pruning and compression (especially REAP)
- Scaling behavior of pruned MoE VL models
- Trade-offs between expert sparsity and performance
Experimental deployment for:
- Vision–language assistants
- Multimodal chatbots
- Document + image understanding
Suitable tasks (examples)
- Multimodal chat (image + text → text)
- Image captioning / description
- Visual question answering
- General instruction-following and long-form text generation
Out-of-scope / high-risk uses
This model should not be used without additional safeguards for:
- Medical, legal, or financial advice
- Safety-critical decision making
- Political persuasion or targeted disinformation
- Any scenario where incorrect or biased outputs can cause real-world harm
6. Limitations & Risks
This model inherits all the limitations of Qwen3-VL-235B plus those introduced by pruning:
Hallucinations: The model can generate plausible but incorrect facts.
Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.
Distribution shift from pruning:
- Some long-tail behaviors may degrade due to pruning 40% of experts.
- Performance may be uneven across tasks, domains, or languages not well covered in the calibration set.
Multimodal edge cases:
- Complex compositional visual reasoning or extremely high-resolution images may not work reliably.
- VL behavior is preserved but not fully re-tuned after pruning.
Users should perform their own evaluation before relying on the model in any sensitive context.
7. How to Use
Please use the latest version of llama.cpp :)
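A sketch of a multimodal invocation with llama.cpp's `llama-mtmd-cli`; the GGUF filenames below are placeholders for the quant you download from this repo, and the context/offload settings are illustrative, not recommendations:

```shell
# Text + image inference via llama.cpp's multimodal CLI.
# Replace the .gguf filenames with the actual files from this repo.
./llama-mtmd-cli \
  -m Qwen3-VL-REAP-145B-A22B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3-VL-REAP-145B-A22B.gguf \
  --image photo.png \
  -p "Describe this image." \
  -ngl 99 -c 8192
```

For text-only use, `llama-cli` or `llama-server` with the same `-m` model file works without the `--mmproj` projector.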
8. Evaluation (Status)
This release focuses on making the REAP-pruned VL model available.
Quantitative benchmarks (e.g., MMBench, general QA, reasoning benchmarks) are still work in progress.
Early qualitative checks show:
- VL behavior is preserved after pruning.
- Latency and memory usage are improved compared to Qwen3-VL-235B, especially on single 96 GB GPUs.
Community contributions with detailed benchmarks are very welcome.
9. Training & Distillation Details (High-Level)
Base model: Qwen3-VL-235B
Pruning method: REAP (Router-weighted Expert Activation Pruning)
Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)
Post-processing:
- Router / gating structure retained
- Experts pruned according to REAP scoring
- No additional large-scale pretraining is performed in this release
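Mechanically, "experts pruned, router retained" means dropping the pruned experts' weights and slicing the matching rows out of the gating projection so the surviving logits stay aligned. A minimal single-layer sketch (the function name and data layout are ours, not the release tooling):

```python
import numpy as np

def prune_moe_layer(router_weight, expert_weights, keep_ids):
    """Drop pruned experts from one MoE layer.

    router_weight: (num_experts, hidden) gating projection; we slice out the
        rows of pruned experts so the softmax runs over survivors only.
    expert_weights: list of per-expert parameter dicts.
    keep_ids: sorted indices of experts to keep (from REAP scoring).
    """
    new_router = router_weight[keep_ids, :]
    new_experts = [expert_weights[i] for i in keep_ids]
    return new_router, new_experts
```

The gating mechanism itself is unchanged; only its output dimension shrinks to the surviving expert count.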
Future versions may include post-pruning fine-tuning or distillation to recover more performance.
10. Community & Contribution
Let’s grow this model together as a community.
You are encouraged to:
- Run benchmarks and publish results
- Contribute scripts for:
  - Further pruning experiments
  - Quantization (e.g., GGUF, AWQ, GPTQ)
  - Long-context or domain-specific fine-tuning
- Report issues or findings about failure modes, biases, or surprising behaviors
11. License
- Model & code (this repository): Apache License 2.0
- Use of the original Qwen3-VL-235B model, and any downstream use, must also comply with its respective license and usage terms.
12. Acknowledgements
- Qwen team for building the Qwen3-VL family of models.
- Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
- OpenMOSE community for experimentation, engineering, and calibration data generation.
2025 OpenMOSE
Simple MMLU Benchmark
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.8383|± |0.0030|
| - humanities | 2|none | |acc |↑ |0.7666|± |0.0059|
| - formal_logic | 1|none | 0|acc |↑ |0.7460|± |0.0389|
| - high_school_european_history | 1|none | 0|acc |↑ |0.8667|± |0.0265|
| - high_school_us_history | 1|none | 0|acc |↑ |0.9363|± |0.0171|
| - high_school_world_history | 1|none | 0|acc |↑ |0.9367|± |0.0158|
| - international_law | 1|none | 0|acc |↑ |0.9008|± |0.0273|
| - jurisprudence | 1|none | 0|acc |↑ |0.9167|± |0.0267|
| - logical_fallacies | 1|none | 0|acc |↑ |0.8528|± |0.0278|
| - moral_disputes | 1|none | 0|acc |↑ |0.8237|± |0.0205|
| - moral_scenarios | 1|none | 0|acc |↑ |0.7575|± |0.0143|
| - philosophy | 1|none | 0|acc |↑ |0.8424|± |0.0207|
| - prehistory | 1|none | 0|acc |↑ |0.9074|± |0.0161|
| - professional_law | 1|none | 0|acc |↑ |0.6128|± |0.0124|
| - world_religions | 1|none | 0|acc |↑ |0.8830|± |0.0246|
| - other | 2|none | |acc |↑ |0.8606|± |0.0059|
| - business_ethics | 1|none | 0|acc |↑ |0.8400|± |0.0368|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.9019|± |0.0183|
| - college_medicine | 1|none | 0|acc |↑ |0.8728|± |0.0254|
| - global_facts | 1|none | 0|acc |↑ |0.5400|± |0.0501|
| - human_aging | 1|none | 0|acc |↑ |0.8296|± |0.0252|
| - management | 1|none | 0|acc |↑ |0.9126|± |0.0280|
| - marketing | 1|none | 0|acc |↑ |0.9573|± |0.0133|
| - medical_genetics | 1|none | 0|acc |↑ |0.9200|± |0.0273|
| - miscellaneous | 1|none | 0|acc |↑ |0.9004|± |0.0107|
| - nutrition | 1|none | 0|acc |↑ |0.9183|± |0.0157|
| - professional_accounting | 1|none | 0|acc |↑ |0.7766|± |0.0248|
| - professional_medicine | 1|none | 0|acc |↑ |0.9228|± |0.0162|
| - virology | 1|none | 0|acc |↑ |0.5723|± |0.0385|
| - social sciences | 2|none | |acc |↑ |0.9097|± |0.0051|
| - econometrics | 1|none | 0|acc |↑ |0.7632|± |0.0400|
| - high_school_geography | 1|none | 0|acc |↑ |0.9394|± |0.0170|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |1.0000|± |0.0000|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.9282|± |0.0131|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.9706|± |0.0110|
| - high_school_psychology | 1|none | 0|acc |↑ |0.9706|± |0.0072|
| - human_sexuality | 1|none | 0|acc |↑ |0.9160|± |0.0243|
| - professional_psychology | 1|none | 0|acc |↑ |0.8644|± |0.0139|
| - public_relations | 1|none | 0|acc |↑ |0.7545|± |0.0412|
| - security_studies | 1|none | 0|acc |↑ |0.8408|± |0.0234|
| - sociology | 1|none | 0|acc |↑ |0.9055|± |0.0207|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.9100|± |0.0288|
| - stem | 2|none | |acc |↑ |0.8538|± |0.0061|
| - abstract_algebra | 1|none | 0|acc |↑ |0.7300|± |0.0446|
| - anatomy | 1|none | 0|acc |↑ |0.8000|± |0.0346|
| - astronomy | 1|none | 0|acc |↑ |0.9342|± |0.0202|
| - college_biology | 1|none | 0|acc |↑ |0.9583|± |0.0167|
| - college_chemistry | 1|none | 0|acc |↑ |0.6100|± |0.0490|
| - college_computer_science | 1|none | 0|acc |↑ |0.8500|± |0.0359|
| - college_mathematics | 1|none | 0|acc |↑ |0.6700|± |0.0473|
| - college_physics | 1|none | 0|acc |↑ |0.7843|± |0.0409|
| - computer_security | 1|none | 0|acc |↑ |0.8800|± |0.0327|
| - conceptual_physics | 1|none | 0|acc |↑ |0.9362|± |0.0160|
| - electrical_engineering | 1|none | 0|acc |↑ |0.8621|± |0.0287|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.9153|± |0.0143|
| - high_school_biology | 1|none | 0|acc |↑ |0.9613|± |0.0110|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.8374|± |0.0260|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.9400|± |0.0239|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.7704|± |0.0256|
| - high_school_physics | 1|none | 0|acc |↑ |0.8411|± |0.0299|
| - high_school_statistics | 1|none | 0|acc |↑ |0.8333|± |0.0254|
| - machine_learning | 1|none | 0|acc |↑ |0.7321|± |0.0420|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.8383|± |0.0030|
| - humanities | 2|none | |acc |↑ |0.7666|± |0.0059|
| - other | 2|none | |acc |↑ |0.8606|± |0.0059|
| - social sciences| 2|none | |acc |↑ |0.9097|± |0.0051|
| - stem | 2|none | |acc |↑ |0.8538|± |0.0061|
Base model: Qwen/Qwen3-VL-235B-A22B-Instruct