Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions Paper • 2605.20087 • Published May 19
Jointly Reinforcing Diversity and Quality in Language Model Generations Paper • 2509.02534 • Published Sep 2, 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning Paper • 2505.02363 • Published May 5, 2025
Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models Paper • 2310.00840 • Published Oct 2, 2023