Cartridges: Lightweight and general-purpose long context representations via self-study Paper • 2506.06266 • Published Jun 6, 2025 • 7
Archon: An Architecture Search Framework for Inference-Time Techniques Paper • 2409.15254 • Published Sep 23, 2024 • 1
Shrinking the Generation-Verification Gap with Weak Verifiers Paper • 2506.18203 • Published Jun 22, 2025 • 2
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Paper • 2410.03859 • Published Oct 4, 2024 • 1
VideoGameBench: Can Vision-Language Models complete popular video games? Paper • 2505.18134 • Published May 23, 2025 • 6
Automated Rewards via LLM-Generated Progress Functions Paper • 2410.09187 • Published Oct 11, 2024 • 1
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks Paper • 2505.00234 • Published May 1, 2025 • 26
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts Paper • 2408.08274 • Published Aug 15, 2024 • 13
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Paper • 2306.11698 • Published Jun 20, 2023 • 12
Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT Paper • 2402.07440 • Published Feb 12, 2024 • 1
Simple linear attention language models balance the recall-throughput tradeoff Paper • 2402.18668 • Published Feb 28, 2024 • 20
Just read twice: closing the recall gap for recurrent language models Paper • 2407.05483 • Published Jul 7, 2024
LoLCATs: On Low-Rank Linearizing of Large Language Models Paper • 2410.10254 • Published Oct 14, 2024