Abstract
Cautious Weight Decay (CWD) enhances optimizer performance by applying weight decay selectively, improving accuracy and loss in large-scale models without additional tuning.
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
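The "one-line" change described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain decoupled update of the form x ← x − lr · (u + λ · mask · x), where `u` is whatever direction the base optimizer produces, and it uses the elementwise mask `u * x >= 0`, the convention quoted in the discussion of the Muon variant. The function name `cwd_step` and its arguments are hypothetical.

```python
def cwd_step(params, update, lr, weight_decay):
    """One decoupled optimizer step with cautious weight decay (sketch).

    params: flat list of parameter values.
    update: flat list with the base optimizer's update direction,
            i.e. the quantity that is subtracted from the parameters.
    """
    new_params = []
    for x, u in zip(params, update):
        # Cautious mask: decay only where the parameter and the update
        # direction share a sign (assumed convention: u * x >= 0).
        mask = 1.0 if u * x >= 0 else 0.0
        # Standard decoupled decay would use mask = 1.0 everywhere.
        new_params.append(x - lr * (u + weight_decay * mask * x))
    return new_params


# Only the first coordinate has matching signs, so only it is decayed.
stepped = cwd_step([1.0, -2.0, 3.0], [0.5, 0.5, -1.0], lr=0.1, weight_decay=0.1)
```

Setting the mask to 1 on every coordinate recovers ordinary decoupled weight decay, which is why no new hyperparameter appears in the sketch.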
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates (2025)
- ANO : Faster is Better in Noisy Landscape (2025)
- REG: A Regularization Optimizer for Robust Training Dynamics (2025)
- Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization (2025)
- Conda: Column-Normalized Adam for Training Large Language Models Faster (2025)
- Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control (2025)
- Muon: Training and Trade-offs with Latent Attention and MoE (2025)
Another question: I’m using the Conda optimizer (just to mention it). Do you have any idea how CWD could be added to it? In my case, when I included CWD in Conda, the results were worse than with standard weight decay.
Hi, thank you for the very interesting work on Cautious Weight Decay and its integration with LION-K and MUON. I have a question specifically about the MUON + CWD variant (Algorithm 6).
In Algorithm 6, the MUON update is based on a momentum-like matrix M_t (built from the stochastic gradient G_t), to which you then apply the Newton–Schulz iteration to obtain O_t. As I understand it, O_t is an approximation to the matrix sign of M_t (or a spectrally normalized / orthogonalized transform of M_t). Because of this, the entries of O_t, and especially of the elementwise product O_t * X_t, are not necessarily coordinate-wise aligned with the original gradient G_t or the momentum M_t. In particular, the sign of (O_t * X_t)_ij does not have to match the sign of G_t,ij or M_t,ij.
In the MUON + CWD update, the cautious weight decay mask is defined using the condition “O_t * X_t >= 0” (elementwise). This raises a few questions:
1. Conceptually, is this mask meant to approximate the more intuitive condition “gradient and parameter have the same sign”, i.e. something like a mask based on G_t and X_t? Or is it intentionally defined purely in terms of the post–Newton–Schulz update direction (O_t * X_t), regardless of the raw gradient signs?
2. Have you measured in practice how often the signs of (O_t * X_t)_ij and G_t,ij (or M_t,ij) disagree? In other words, how frequently does MUON + CWD disable or enable weight decay on coordinates where the raw gradient would suggest the opposite?
3. Do you see any theoretical or practical drawbacks in defining a “grad-aware” variant of MUON + CWD (for example, using a mask based on G_t and X_t, or a hybrid mask that combines information from G_t and O_t * X_t)? Would that conflict with your analysis for LION-K / MUON, or do you mainly view the current mask definition as a pragmatic design choice?
I am trying to understand whether, in the MUON setting, the cautious condition should be interpreted as “not opposing the actual optimizer update geometry” (given by O_t), rather than “not opposing the raw gradient direction”, and whether the potential mismatch between O_t and G_t at the coordinate level matters in practice.
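For question 2, one way to quantify the mismatch is to compare the two candidate masks entry by entry. The sketch below assumes O_t, G_t, and X_t are available as equally shaped nested lists at a given step; the helper name `mask_disagreement` is hypothetical, and the "grad-aware" mask `G * X >= 0` is the variant proposed in question 3, not something taken from the paper.

```python
def mask_disagreement(O, G, X):
    """Fraction of entries where the mask built from O differs from
    the mask built from G (both taken elementwise against X)."""
    total = 0
    differ = 0
    for o_row, g_row, x_row in zip(O, G, X):
        for o, g, x in zip(o_row, g_row, x_row):
            total += 1
            update_mask = o * x >= 0   # mask quoted above: O_t * X_t >= 0
            grad_mask = g * x >= 0     # hypothetical grad-aware mask
            if update_mask != grad_mask:
                differ += 1
    return differ / total if total else 0.0


# Toy 2x2 example: the masks differ on the two off-diagonal entries.
rate = mask_disagreement(
    O=[[1.0, -1.0], [1.0, 1.0]],
    G=[[1.0, 1.0], [-1.0, 1.0]],
    X=[[1.0, 1.0], [1.0, -1.0]],
)
```

Logging this rate per layer over training would show whether the coordinate-level mismatch between O_t and G_t is rare or systematic.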