OpenMOSE/Qwen3-VL-REAP-145B-A22B-GGUF

Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3-VL-235B.


1. Model Summary

  • Base model: Qwen3-VL-235B (vision–language MoE LLM)
  • Variant name: Qwen3-VL-REAP-145B-A22B
  • Architecture: Decoder-only Transformer + MoE MLP experts, with vision encoder + VL fusion as in Qwen3-VL
  • Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
  • Expert sparsity: ~40% of MoE experts pruned globally
  • Active parameters: “A22B” indicates roughly 22B active parameters per token (sparse MoE activation); total parameters are reduced to about 145B
  • Modality: Text + Vision (VL support kept intact)
  • License: Apache 2.0
  • Author / Maintainer: OpenMOSE
  • Year: 2025

This is an unofficial community variant of Qwen3-VL, not affiliated with or endorsed by Alibaba or Cerebras Systems.


2. What Is REAP and What Did We Change?

REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:

  • Router statistics (routing probabilities)
  • Expert activation patterns on a calibration set

to identify under-used or redundant experts and prune them while preserving model quality as much as possible.

For this model:

  • We applied REAP to Qwen3-VL-235B across its MoE MLP blocks.
  • ~40% of experts are pruned, based on router-weighted activation statistics.
  • The routing mechanism itself is conceptually unchanged; we only change which experts remain.
  • We extended the original REAP implementation to support the Qwen3-VL architecture, i.e. vision encoder + VL fusion layers, so pruning can be applied without breaking VL functionality.

In short: the same REAP algorithm, adapted to Qwen3-VL, with VL functionality left intact.
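
As a rough illustration, the scoring idea can be sketched as follows (a minimal sketch based on the description above, not the official Cerebras implementation; the exact saliency formula and helper names are assumptions):

```python
# Minimal sketch of REAP-style expert scoring. Assumes that, on a
# calibration set, we logged for each expert the router gate weight and
# the L2 norm of the expert's output for every token routed to it.
import numpy as np

def reap_saliency(gate_weights: np.ndarray, act_norms: np.ndarray) -> float:
    """Router-weighted activation score for one expert.

    gate_weights: (n_tokens,) routing probabilities of tokens sent to this expert
    act_norms:    (n_tokens,) L2 norms of the expert MLP output for those tokens
    """
    if gate_weights.size == 0:
        return 0.0  # an expert that never fires is maximally prunable
    return float(np.mean(gate_weights * act_norms))

def select_pruned_experts(scores: dict, prune_frac: float = 0.4) -> set:
    """Rank (layer, expert) pairs by saliency and mark the lowest
    prune_frac for removal, mirroring the ~40% global pruning used here."""
    ranked = sorted(scores, key=scores.get)  # ascending saliency
    return set(ranked[: int(len(ranked) * prune_frac)])
```

Experts with a low router-weighted activation contribute little to the layer output on the calibration distribution, which is why pruning them is comparatively safe.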


3. Calibration Data

The REAP pruning statistics were computed using OpenMOSE/reap-calib-mix, a calibration corpus mostly generated by Qwen3-235B-Instruct (see Section 9).

The calibration set is not used for additional fine-tuning here; it is only used to measure router/expert activations in order to decide which experts to prune.
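
For concreteness, router/expert statistics of this kind can be gathered with forward hooks; below is a minimal sketch assuming the Qwen3-MoE module layout in Hugging Face transformers (the `layer.mlp.gate` path is an assumption, and `model` / `calib_loader` are placeholders for a loaded model and a DataLoader over the calibration set):

```python
import torch
from collections import defaultdict

def collect_router_stats(model, calib_loader):
    """Record per-layer routing probabilities over a calibration set."""
    router_probs = defaultdict(list)  # layer idx -> list of (tokens, experts) tensors

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            # The gate is assumed to emit per-expert logits; softmax -> probabilities.
            probs = torch.softmax(output.float(), dim=-1)
            router_probs[layer_idx].append(probs.detach().cpu())
        return hook

    handles = [
        layer.mlp.gate.register_forward_hook(make_hook(i))
        for i, layer in enumerate(model.model.layers)
        if hasattr(layer.mlp, "gate")  # hook MoE blocks only
    ]
    with torch.no_grad():
        for batch in calib_loader:
            model(**batch)  # forward pass only; no weight updates
    for h in handles:
        h.remove()
    return router_probs
```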


4. Why 145B-A22B? (Motivation & Hardware Footprint)

By pruning ~40% of experts and keeping VL:

  • The model shrinks from ~235B total parameters to about 145B total parameters.

  • With sparse MoE activation, around 22B parameters are active per token (“A22B”).

  • In practice, this makes it feasible to deploy on a single 96 GB GPU:

    • No CPU offload is strictly required with Q4_K_M quantization (see the back-of-envelope estimate after this list)
    • Lower-VRAM configurations may still need offload or more aggressive quantization
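
A quick back-of-envelope check of the 96 GB claim (the 4.85 bits/weight figure is a typical effective rate for Q4_K_M, not an exact value, and KV cache / activation memory is ignored):

```python
# Approximate weight memory of the Q4_K_M GGUF.
total_params = 145e9
bits_per_weight = 4.85                      # rough Q4_K_M average
weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")   # ~88 GB -> fits on a 96 GB GPU
```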

The goal is to keep the full VL capability of Qwen3-VL-235B, while making it:

  • Easier to experiment with REAP-style MoE pruning
  • More practical to deploy and fine-tune on high-end but single-node hardware

5. Intended Use

Primary intended uses

  • Research on:

    • MoE pruning and compression (especially REAP)
    • Scaling behavior of pruned MoE VL models
    • Trade-offs between expert sparsity and performance
  • Experimental deployment for:

    • Vision–language assistants
    • Multimodal chatbots
    • Document + image understanding

Suitable tasks (examples)

  • Multimodal chat (image + text → text)
  • Image captioning / description
  • Visual question answering
  • General instruction-following and long-form text generation

Out-of-scope / high-risk uses

This model should not be used without additional safeguards for:

  • Medical, legal, or financial advice
  • Safety-critical decision making
  • Political persuasion or targeted disinformation
  • Any scenario where incorrect or biased outputs can cause real-world harm

6. Limitations & Risks

This model inherits all the limitations of Qwen3-VL-235B plus those introduced by pruning:

  • Hallucinations: The model can generate plausible but incorrect facts.

  • Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.

  • Distribution shift from pruning:

    • Some long-tail behaviors may degrade after pruning ~40% of the experts.
    • Performance may be uneven across tasks, domains, or languages not well covered in the calibration set.
  • Multimodal edge cases:

    • Complex compositional visual reasoning or extremely high-resolution images may not work reliably.
    • VL behavior is preserved but not fully re-tuned after pruning.

Users should perform their own evaluation before relying on the model in any sensitive context.


7. How to Use

Please use the latest llama.cpp :) For image input you need both the GGUF weights and the multimodal projector (mmproj) file; a hypothetical invocation is sketched below.
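
The exact GGUF and mmproj filenames in this repo may differ from the ones shown here; they are illustrative only.

```bash
# Filenames are illustrative; pick the quant and mmproj files you downloaded.
llama-mtmd-cli \
  -m Qwen3-VL-REAP-145B-A22B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3-VL-REAP-145B-A22B.gguf \
  --image example.png \
  -p "Describe this image."
```

llama-server accepts the same -m / --mmproj pair if you prefer an OpenAI-compatible HTTP endpoint.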


8. Evaluation (Status)

  • This release focuses on making the REAP-pruned VL model available.

  • Quantitative benchmarks (e.g., MMBench, general QA, reasoning benchmarks) are still work in progress.

  • Early qualitative checks show:

    • VL behavior is preserved after pruning.
    • Latency and memory usage are improved compared to Qwen3-VL-235B, especially on single 96 GB GPUs.

Community contributions with detailed benchmarks are very welcome.


9. Training & Distillation Details (High-Level)

  • Base model: Qwen3-VL-235B

  • Pruning method: REAP (Router-weighted Expert Activation Pruning)

  • Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)

  • Post-processing:

    • Router / gating structure retained
    • Experts pruned according to REAP scoring (a sketch of the resulting weight surgery follows this list)
    • No additional large-scale pretraining is performed in this release
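
A minimal sketch of what “experts pruned, router retained” means mechanically (tensor names and shapes here are illustrative assumptions, not the actual checkpoint layout):

```python
import torch

def prune_moe_layer(gate_weight: torch.Tensor, experts: list, keep_ids: list):
    """Drop pruned experts and shrink the router to match.

    gate_weight: (n_experts, hidden) router matrix, one row per expert.
    experts:     list of per-expert MLP modules.
    keep_ids:    sorted indices of experts that survive REAP scoring.
    """
    new_gate = gate_weight[keep_ids, :].clone()   # keep only surviving rows
    new_experts = [experts[i] for i in keep_ids]  # discard pruned experts
    return new_gate, new_experts
```

Because the gate simply produces logits over whichever experts remain, deleting rows leaves the routing mechanism itself unchanged.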

Future versions may include post-pruning fine-tuning or distillation to recover more performance.


10. Community & Contribution

Let’s grow this model together as a community.

You are encouraged to:

  • Run benchmarks and publish results

  • Contribute scripts for:

    • Further pruning experiments
    • Quantization (e.g., GGUF, AWQ, GPTQ)
    • Long-context or domain-specific fine-tuning
  • Report issues or findings about failure modes, biases, or surprising behaviors


11. License

  • Model & code (this repository): Apache License 2.0
  • The original Qwen3-VL-235B model remains subject to its own license and usage terms, which any downstream use must also respect.

12. Acknowledgements

  • Qwen team for building the Qwen3-VL family of models.
  • Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
  • OpenMOSE community for experimentation, engineering, and calibration data generation.

2025 OpenMOSE

13. Simple MMLU Benchmark

Zero-shot MMLU accuracy of the pruned model (lm-evaluation-harness output format):

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.8383|±  |0.0030|
| - humanities                          |      2|none  |      |acc   |↑  |0.7666|±  |0.0059|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.7460|±  |0.0389|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.8667|±  |0.0265|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.9363|±  |0.0171|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.9367|±  |0.0158|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.9008|±  |0.0273|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.9167|±  |0.0267|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.8528|±  |0.0278|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.8237|±  |0.0205|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.7575|±  |0.0143|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.8424|±  |0.0207|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.9074|±  |0.0161|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.6128|±  |0.0124|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.8830|±  |0.0246|
| - other                               |      2|none  |      |acc   |↑  |0.8606|±  |0.0059|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.8400|±  |0.0368|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.9019|±  |0.0183|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.8728|±  |0.0254|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.5400|±  |0.0501|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.8296|±  |0.0252|
|  - management                         |      1|none  |     0|acc   |↑  |0.9126|±  |0.0280|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.9573|±  |0.0133|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.9200|±  |0.0273|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.9004|±  |0.0107|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.9183|±  |0.0157|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.7766|±  |0.0248|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.9228|±  |0.0162|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5723|±  |0.0385|
| - social sciences                     |      2|none  |      |acc   |↑  |0.9097|±  |0.0051|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.7632|±  |0.0400|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.9394|±  |0.0170|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |1.0000|±  |0.0000|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.9282|±  |0.0131|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.9706|±  |0.0110|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9706|±  |0.0072|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.9160|±  |0.0243|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.8644|±  |0.0139|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7545|±  |0.0412|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.8408|±  |0.0234|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.9055|±  |0.0207|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9100|±  |0.0288|
| - stem                                |      2|none  |      |acc   |↑  |0.8538|±  |0.0061|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.7300|±  |0.0446|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.8000|±  |0.0346|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.9342|±  |0.0202|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.9583|±  |0.0167|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6100|±  |0.0490|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.8500|±  |0.0359|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.6700|±  |0.0473|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.7843|±  |0.0409|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.8800|±  |0.0327|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.9362|±  |0.0160|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.8621|±  |0.0287|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.9153|±  |0.0143|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.9613|±  |0.0110|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.8374|±  |0.0260|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.9400|±  |0.0239|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.7704|±  |0.0256|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.8411|±  |0.0299|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.8333|±  |0.0254|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.7321|±  |0.0420|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.8383|±  |0.0030|
| - humanities     |      2|none  |      |acc   |↑  |0.7666|±  |0.0059|
| - other          |      2|none  |      |acc   |↑  |0.8606|±  |0.0059|
| - social sciences|      2|none  |      |acc   |↑  |0.9097|±  |0.0051|
| - stem           |      2|none  |      |acc   |↑  |0.8538|±  |0.0061|