arxiv:2510.12402

Cautious Weight Decay

Published on Oct 14

· Submitted by

Kaizhao Liang on Oct 15

Google

Authors:

Lizhang Chen ,

Abstract

Cautious Weight Decay (CWD) enhances optimizer performance by applying weight decay selectively, improving accuracy and loss in large-scale models without additional tuning.

AI-generated summary

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
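The masking rule described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The function name `adamw_cwd_step`, its default hyperparameters, and the convention that the step is `x - lr * u` with the mask `u * x >= 0` are assumptions made for this example (the mask form follows the elementwise condition quoted in the community discussion of the Muon variant).

```python
import numpy as np

def adamw_cwd_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.1, cautious=True):
    """One AdamW-style step with an optional cautious weight-decay mask.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)  # update direction; the step is x - lr * u

    if cautious:
        # Assumed convention: decay only where update and parameter signs agree.
        mask = (u * x >= 0).astype(x.dtype)
    else:
        # Plain decoupled weight decay applies to every coordinate.
        mask = np.ones_like(x)

    x = x - lr * (u + weight_decay * mask * x)
    return x, m, v

# Two coordinates with the same gradient but opposite parameter signs.
x0 = np.array([1.0, -1.0])
g0 = np.array([0.5, 0.5])
zeros = np.zeros_like(x0)

x_cwd, _, _ = adamw_cwd_step(x0, g0, zeros, zeros, t=1, cautious=True)
x_std, _, _ = adamw_cwd_step(x0, g0, zeros, zeros, t=1, cautious=False)
print("cautious:", x_cwd)
print("standard:", x_std)
```

In this toy step the first coordinate is decayed under both rules, while the second coordinate (where the update pushes the parameter away from zero) is decayed only by standard decoupled decay. Under these assumptions the change amounts to the single masking line, and no extra hyperparameter appears.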

Community

KitsuVp

Oct 16

One question: are you planning to release the code for reproduction?

KitsuVp

Oct 16

Another question: I’m using the Conda optimizer (just to mention it). Do you have any idea how CWD could be added to it? In my case, when I included CWD in Conda, the results turned out worse than with normal weight decay.

jootanehorror

18 days ago

Hi, thank you for the very interesting work on Cautious Weight Decay and its integration with LION-K and MUON. I have a question specifically about the MUON + CWD variant (Algorithm 6).

In Algorithm 6, the MUON update is based on a momentum-like matrix M_t (from the stochastic gradient G_t), and then you apply the Newton–Schulz iteration to obtain O_t. As I understand it, O_t is an approximation to the matrix sign (or a spectrally normalized / orthogonalized transform of M_t). Because of this, the entries of O_t, and especially the product O_t * X_t, are not necessarily coordinate-wise aligned with the original gradient G_t or the momentum M_t. In particular, the sign of (O_t * X_t)_ij does not have to match the sign of G_t,ij or M_t,ij.

In the MUON + CWD update, the cautious weight decay mask is defined using the condition “O_t * X_t >= 0” (elementwise). This raises a few questions:
1. Conceptually, is this mask meant to approximate the more intuitive condition “gradient and parameter have the same sign”, i.e. something like a mask based on G_t and X_t? Or is it intentionally defined purely in terms of the post–Newton–Schulz update direction (O_t * X_t), regardless of the raw gradient signs?
2. Have you measured in practice how often the signs of (O_t * X_t)_ij and G_t,ij (or M_t,ij) disagree? In other words, how frequently does MUON + CWD disable or enable weight decay on coordinates where the raw gradient would suggest the opposite?
3. Do you see any theoretical or practical drawbacks in defining a “grad-aware” variant of MUON + CWD (for example, using a mask based on G_t and X_t, or a hybrid mask that combines information from G_t and O_t * X_t)? Would that conflict with your analysis for LION-K / MUON, or do you mainly view the current mask definition as a pragmatic design choice?

I am trying to understand whether, in the MUON setting, the cautious condition should be interpreted as “not opposing the actual optimizer update geometry” (given by O_t), rather than “not opposing the raw gradient direction”, and whether the potential mismatch between O_t and G_t at the coordinate level matters in practice.
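Question 2 above could be probed empirically along the following lines. This is only a sketch under stated assumptions: it uses the classical cubic Newton–Schulz iteration (which may differ from the polynomial used in Algorithm 6), random matrices stand in for the momentum/gradient and the weights, and the helper names `newton_schulz_orth` and `mask_disagreement` are hypothetical. It reports no result from the paper.

```python
import numpy as np

def newton_schulz_orth(M, steps=30):
    """Approximate the orthogonal polar factor of M (classical cubic iteration)."""
    # Frobenius normalization keeps all singular values in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def mask_disagreement(O, G, X):
    """Fraction of entries where the update-based and gradient-based masks differ."""
    mask_update = (O * X >= 0)  # mask built from the orthogonalized update
    mask_grad = (G * X >= 0)    # hypothetical "grad-aware" mask
    return float(np.mean(mask_update != mask_grad))

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))  # stand-in for the momentum or gradient matrix
X = rng.standard_normal((4, 8))  # stand-in for the weight matrix
O = newton_schulz_orth(G)

frac = mask_disagreement(O, G, X)
print(f"fraction of coordinates where the two masks disagree: {frac:.3f}")
```

On real training runs the same comparison would be made with the actual M_t, O_t, and X_t logged from the optimizer; the random inputs here only show the mechanics of the measurement.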
