arXiv:2507.08068

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Published on Jul 10, 2025

AI-generated summary

QRPO enables learning from pointwise absolute rewards using quantile rewards, achieving top performance in chat and coding evaluations while preserving the simplicity and offline applicability of DPO-like methods.

Abstract

Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations--reward model scores, AlpacaEval 2, and LeetCode--compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
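
For intuition, here is a minimal, hypothetical sketch of the kind of pointwise regression the abstract describes: raw rewards are converted to quantiles under the reference policy, which makes the partition function of the KL-regularized optimum analytically tractable, and the policy log-ratio is regressed toward that closed-form target. The squared-error form, the uniform-quantile assumption, the value of beta, and the names estimate_quantile and qrpo_loss are all illustrative assumptions, not the paper's exact formulation.

import torch

def estimate_quantile(reward, ref_rewards):
    # Map a raw reward to its quantile under the reference policy, estimated
    # from rewards of samples drawn from pi_ref for the same prompt. Using more
    # reference samples improves the estimate (the "pre-computation scaling"
    # dimension mentioned in the abstract).
    return (ref_rewards <= reward).float().mean()

def qrpo_loss(policy_logprob, ref_logprob, quantile_reward, beta=0.1):
    # Closed-form optimum of the KL-regularized RL objective:
    #   pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
    # hence beta * log(pi* / pi_ref) = r(x, y) - beta * log Z(x).
    # Assuming the quantile reward is uniform on [0, 1] under pi_ref, the
    # partition function is exact and prompt-independent:
    #   Z = E_q[exp(q / beta)] = beta * (exp(1 / beta) - 1).
    log_Z = torch.log(torch.tensor(beta)) + torch.log(torch.expm1(torch.tensor(1.0 / beta)))
    target = quantile_reward - beta * log_Z
    pred = beta * (policy_logprob - ref_logprob)
    # Pointwise squared-error regression of the policy log-ratio toward the
    # closed-form target (the paper's exact loss may differ; this is a sketch).
    return ((pred - target) ** 2).mean()

In such a setup, the quantile rewards would be precomputed offline from reference-policy samples, so training itself can stay offline and DPO-like, consistent with the abstract's description.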
