Gemma-3-4B Fine-tuned on Schema-Guided Dialog (GRPO)

This model is a fine-tuned version of unsloth/gemma-3-4b-it using GRPO (Group Relative Policy Optimization) on the Schema-Guided Dialog dataset.

Model Details

  • Base Model: Gemma-3-4B-IT
  • Training Method: GRPO with LoRA (r=32)
  • Dataset: Schema-Guided Dialog (GEM)
  • Task: Task-oriented dialog generation with structured dialog acts

Training Configuration

  • LoRA Rank: 32
  • Training Epochs: 1
  • Effective Batch Size: 4
  • Learning Rate: 5e-6
  • Optimizer: AdamW 8-bit
  • Quantization: 4-bit
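
For reference, these hyperparameters roughly map onto trl's GRPOConfig as in the sketch below. Only the values listed above come from this card; the generation count and completion length are illustrative assumptions.

# Hedged sketch: mapping the listed hyperparameters onto trl's GRPOConfig.
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    optim="adamw_8bit",            # AdamW 8-bit, as listed above
    num_train_epochs=1,
    per_device_train_batch_size=4, # effective batch size of 4
    num_generations=4,             # assumption: completions sampled per prompt
    max_completion_length=256,     # assumption
    output_dir="outputs",
)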

Reward Functions

The model was trained using three reward functions, sketched after this list:

  1. Formatting Reward: Proper use of response tags
  2. Quality Reward: Similarity to target responses and keyword overlap
  3. Coherence Reward: Fluency, capitalization, and punctuation
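
The exact scoring logic and weights are not published, so the following is only an illustrative sketch, assuming trl-style reward functions that take a list of completion strings plus dataset columns as keyword arguments and return one float per completion. The targets column name is an assumption.

import re

# Hedged sketches of the three reward signals; details are assumptions.

def formatting_reward(completions, **kwargs):
    # Reward completions that wrap the answer in <RESPONSE> tags.
    pattern = re.compile(r"<RESPONSE>.*?</RESPONSE>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def quality_reward(completions, targets, **kwargs):
    # Crude keyword-overlap proxy for similarity to the target response.
    scores = []
    for completion, target in zip(completions, targets):
        pred = set(completion.lower().split())
        ref = set(target.lower().split())
        scores.append(len(pred & ref) / max(len(ref), 1))
    return scores

def coherence_reward(completions, **kwargs):
    # Check sentence-initial capitalization and terminal punctuation.
    scores = []
    for completion in completions:
        text = completion.strip()
        score = 0.5 if text[:1].isupper() else 0.0
        if text.endswith((".", "!", "?")):
            score += 0.5
        scores.append(score)
    return scores

In trl, functions like these would be passed to GRPOTrainer via its reward_funcs argument.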

Usage

from unsloth import FastLanguageModel
import torch

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Arittro2/gemma3-4b-sgd-grpo",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(model)

# Example prompt
prompt = """You are a helpful virtual assistant. Generate an appropriate response.

<CONTEXT>
User: I need to book a restaurant for dinner tonight.
System: I can help you with that. What type of cuisine are you interested in?

Dialog acts to realize:
  - Act: REQUEST, Slot: location
</CONTEXT>

Generate a natural, helpful response between <RESPONSE> and </RESPONSE> tags."""

messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
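
Since the formatting reward trains the model to wrap its answer in <RESPONSE> tags, the reply can be pulled out of the decoded text with a simple regex (a minimal sketch):

import re

decoded = tokenizer.decode(output[0], skip_special_tokens=True)
match = re.search(r"<RESPONSE>(.*?)</RESPONSE>", decoded, re.DOTALL)
response = match.group(1).strip() if match else decoded  # fall back to raw text
print(response)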

Training Details

Trained using the Unsloth framework for efficient fine-tuning with:

  • Gradient checkpointing
  • 4-bit quantization
  • LoRA adapters on attention and MLP layers
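
For reference, the adapter setup above roughly corresponds to the following Unsloth call. r=32 and the attention/MLP target modules follow this card; lora_alpha and lora_dropout are assumptions.

# Sketch of the LoRA configuration described above (alpha/dropout assumed).
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=32,
    lora_dropout=0.0,
    use_gradient_checkpointing="unsloth",        # gradient checkpointing
)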

Citation

@misc{gemma3-sgd-grpo,
  author = {Arittro2},
  title = {Gemma-3-4B Fine-tuned on Schema-Guided Dialog with GRPO},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Arittro2/gemma3-4b-sgd-grpo}}
}

Limitations

  • Trained only on the task-oriented dialog domain
  • Best suited for service/booking conversations
  • May require prompt engineering for optimal results

License

This model inherits the Gemma license (Gemma Terms of Use) from the base model.
