Shengyi Costa Huang
vwxyzjn
Async RLHF Paper Checkpoints
Checkpoints for "Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models" https://arxiv.org/abs/2410.18252
lm-human-preference-details
TL;DR summarization checkpoints
These checkpoints were trained for https://arxiv.org/abs/2403.17031 and are taken from https://wandb.ai/costa-huang/tldr_summarize/reports/Release--Vmlldzo3MT
- cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr
  Text Generation • Updated • 986
- cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr
  Text Classification • Updated • 384
- cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr
  Text Generation • Updated • 121
- cleanrl/EleutherAI_pythia-2.8b-deduped__reward__tldr
  Text Classification • Updated • 61
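The checkpoint names above appear to follow a pattern of base model, training stage, and task joined by double underscores. The sketch below splits a repository id along that pattern and, as a rough illustration, picks a `transformers` Auto class from the stage. The `CheckpointName`, `parse_checkpoint_name`, and `load_checkpoint` names are hypothetical helpers, not part of any library. It is also an assumption that each repository ships a tokenizer and that the reward checkpoints load as single-label sequence classifiers, matching the "Text Classification" tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointName:
    org: str         # Hub namespace, e.g. "cleanrl"
    base_model: str  # base model id, e.g. "EleutherAI/pythia-1b-deduped"
    stage: str       # training stage, e.g. "sft" or "reward"
    task: str        # task tag, e.g. "tldr"


def parse_checkpoint_name(repo_id: str) -> CheckpointName:
    """Split a repo id of the form org/Base_model__stage__task (assumed pattern)."""
    org, name = repo_id.split("/", 1)
    base, stage, task = name.split("__")
    # Assumption: the first "_" in the base part replaces the "/" of the base model id.
    base_model = base.replace("_", "/", 1)
    return CheckpointName(org, base_model, stage, task)


def load_checkpoint(repo_id: str):
    """Load tokenizer and model; requires `transformers` and network access."""
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    info = parse_checkpoint_name(repo_id)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    if info.stage == "reward":
        # Assumption: the reward model emits one scalar score per sequence.
        model = AutoModelForSequenceClassification.from_pretrained(repo_id, num_labels=1)
    else:
        model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model
```

For example, `parse_checkpoint_name("cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr")` yields the base model `EleutherAI/pythia-1b-deduped` with stage `sft` and task `tldr`. The loading step is kept in a separate function so the name parsing can be used without downloading any weights.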
RLOO / PPOv2 TL;DR summarize checkpoints