STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
Welcome to the official repository for our paper: "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning"
Abstract
Multimodal large language models (MLLMs) play a pivotal role in the pursuit of general artificial intelligence. However, unifying multimodal understanding and generation within a single model remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. STAR decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the foundational autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. In addition, we introduce a high-capacity vector-quantization (VQ) model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
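The stacking scheme above can be pictured with a minimal PyTorch sketch. Everything here is an illustrative assumption (the class names, layer sizes, and use of nn.TransformerDecoder are not the paper's implementation); it only shows the core idea of a frozen base AR model with trainable isomorphic modules appended per task.

import torch
import torch.nn as nn

class ARBlockStack(nn.Module):
    # Hypothetical stand-in for one isomorphic autoregressive module.
    def __init__(self, dim: int = 512, depth: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Minimal example: reuse x as memory so the block is self-contained.
        return self.layers(x, memory=x)

class StackedAR(nn.Module):
    # Task-progressive stacking: freeze the base, append one module per new task.
    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False    # preserve the understanding capability
        self.stages = nn.ModuleList()  # e.g. generation, then editing

    def add_stage(self) -> None:
        self.stages.append(ARBlockStack())  # only the newest stage is trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)
        for stage in self.stages:
            h = stage(h)
        return h

model = StackedAR(base=ARBlockStack())
model.add_stage()  # unlock a new task without touching frozen weights
out = model(torch.randn(1, 16, 512))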
Model Checkpoint
Preparation
Prepare the environment
- Set up the environment:
git clone <repository-url>
cd STAR
conda create -n star python==3.11 -y
conda activate star
- Install the required packages:
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
# install required packages
pip install -r requirements.txt
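After installation, a quick sanity check confirms the environment sees a GPU (assuming the pinned requirements include PyTorch, which requirements.txt should confirm):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"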
Download Pre-trained Models
Download the necessary pre-trained models before running inference and place them at the following paths:
- STAR/checkpoints/STAR-7B.pt
- STAR/checkpoints/VQ-Model.pt
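You can sanity-check a downloaded checkpoint before wiring it into the config. A minimal sketch, assuming the .pt files are standard torch.save archives (the actual key layout is not documented here):

import torch

# Load on CPU so inspection does not require a GPU.
state = torch.load("checkpoints/STAR-7B.pt", map_location="cpu")

# A state dict (or a dict wrapping one) exposes named entries we can list.
if isinstance(state, dict):
    keys = list(state.keys())
    print(len(keys), "top-level entries")
    print(keys[:5])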
Configuration
The model configuration file star/configs/STAR_Qwen2.5-VL-7B.json contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.
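If you prefer to patch the paths programmatically, a small script can rewrite them in place. The key names below ("checkpoint_path", "vq_checkpoint") are hypothetical placeholders; substitute the field names actually used in STAR_Qwen2.5-VL-7B.json.

import json
from pathlib import Path

cfg_path = Path("star/configs/STAR_Qwen2.5-VL-7B.json")
cfg = json.loads(cfg_path.read_text())

# Hypothetical keys -- replace with the fields your config file actually defines.
cfg["checkpoint_path"] = "checkpoints/STAR-7B.pt"
cfg["vq_checkpoint"] = "checkpoints/VQ-Model.pt"

cfg_path.write_text(json.dumps(cfg, indent=2))
print("Updated", cfg_path)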
Quick Start
Demo
Run the interactive demo interface using Gradio.
python3 gradio_app.py
Inference
1. Image Understanding
For visual question answering and image understanding tasks:
python3 inference_understand.py \
--image-path "path/to/your/image.jpg" \
--question "What is in this image? Describe it in detail." \
--max-new-tokens 256 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--device "cuda:0"
Parameters:
- --image-path: Path to the input image
- --question: Question or instruction for the model
- --max-new-tokens: Maximum number of tokens to generate (default: 256)
- --model-config: Path to the model configuration file
- --checkpoint: Path to the model checkpoint
- --device: Device to run inference on
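To process many images, the same CLI can be driven from Python. This sketch only shells out to inference_understand.py with the flags documented above; the ./images folder and the fixed question are illustrative.

import subprocess
from pathlib import Path

QUESTION = "What is in this image? Describe it in detail."

# Invoke the documented understanding CLI once per image in a folder.
for image in sorted(Path("./images").glob("*.jpg")):
    subprocess.run([
        "python3", "inference_understand.py",
        "--image-path", str(image),
        "--question", QUESTION,
        "--max-new-tokens", "256",
        "--model-config", "star/configs/STAR_Qwen2.5-VL-7B.json",
        "--checkpoint", "checkpoints/STAR-7B.pt",
        "--device", "cuda:0",
    ], check=True)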
2. Text-to-Image Generation
For generating images from text prompts:
python3 inference_generation.py \
--prompt "a photo of a cute cat" \
--save-path "./outputs/a photo of a cute cat.jpg" \
--num-images 1 \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
Parameters:
- --prompt: Text prompt for image generation
- --save-path: Path to save the generated image
- --num-images: Number of images to generate (default: 1)
- --cfg: Classifier-free guidance scale (default: 1.0)
- --topk: Top-k sampling parameter (default: 1000)
- --topp: Top-p sampling parameter (default: 0.8)
- --diffusion-as-decoder: Use a diffusion model as the decoder for high-quality generation
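To make --cfg, --topk, and --topp concrete: in autoregressive image generation, classifier-free guidance typically combines conditional and unconditional logits at each decoding step, after which top-k and top-p filtering restrict the sampling pool. The single-step sketch below illustrates the generic technique, not STAR's actual decoding code.

import torch

def guided_sample(cond_logits, uncond_logits, cfg=1.1, topk=1000, topp=0.8):
    # Classifier-free guidance: move logits away from the unconditional prediction.
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)

    # Top-k: keep only the k highest-scoring tokens.
    vals, idx = torch.topk(logits, k=min(topk, logits.numel()))

    # Top-p (nucleus): keep the smallest prefix whose probability mass reaches p.
    probs = torch.softmax(vals, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    keep = cum - probs < topp           # the first token is always kept
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()

    return idx[torch.multinomial(probs, num_samples=1)]

# Toy example with a vocabulary of 8 image tokens.
print(guided_sample(torch.randn(8), torch.randn(8)).item())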
3. Image Editing
For editing images based on text instructions:
python3 inference_edit.py \
--image-path "./outputs/a photo of a cute cat.jpg" \
--instruction "change the color of cat to blue" \
--save-path "./outputs/edited_image.jpg" \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
Parameters:
- --image-path: Path to the input image to be edited
- --instruction: Text instruction describing the desired edit
- --save-path: Path to save the edited image
- --cfg: Classifier-free guidance scale for editing
- --topk: Top-k sampling parameter
- --topp: Top-p sampling parameter
- --diffusion-as-decoder: Use a diffusion model for high-quality image decoding
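The generation and editing CLIs compose naturally: generate an image, then edit it. This sketch simply chains the two documented commands; the prompt, instruction, and paths mirror the examples above.

import subprocess

COMMON = [
    "--cfg", "1.1", "--topk", "1000", "--topp", "0.8",
    "--model-config", "star/configs/STAR_Qwen2.5-VL-7B.json",
    "--checkpoint", "checkpoints/STAR-7B.pt",
    "--diffusion-as-decoder",
    "--device", "cuda:0",
]

# Step 1: text-to-image generation.
subprocess.run([
    "python3", "inference_generation.py",
    "--prompt", "a photo of a cute cat",
    "--save-path", "./outputs/a photo of a cute cat.jpg",
    "--num-images", "1",
    *COMMON,
], check=True)

# Step 2: edit the freshly generated image.
subprocess.run([
    "python3", "inference_edit.py",
    "--image-path", "./outputs/a photo of a cute cat.jpg",
    "--instruction", "change the color of the cat to blue",
    "--save-path", "./outputs/edited_image.jpg",
    *COMMON,
], check=True)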
Citation
@article{2025star,
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
journal = {arXiv preprint arXiv:2512.13752},
year = {2025}
}
License
STAR is licensed under the Apache License 2.0.