Black-Box On-Policy Distillation of Large Language Models Paper • 2511.10643 • Published Nov 13, 2025 • 49
Fantastic Pretraining Optimizers and Where to Find Them Paper • 2509.02046 • Published Sep 2, 2025 • 13
Efficient Attention Mechanisms for Large Language Models: A Survey Paper • 2507.19595 • Published Jul 25, 2025 • 6
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation Paper • 2507.06607 • Published Jul 9, 2025 • 10
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation Paper • 2506.09991 • Published Jun 11, 2025 • 55
Multimodal Latent Language Modeling with Next-Token Diffusion Paper • 2412.08635 • Published Dec 11, 2024 • 48
Retentive Network: A Successor to Transformer for Large Language Models Paper • 2307.08621 • Published Jul 17, 2023 • 172 • 34