Chain-of-thought (CoT) reasoning is essential for solving complex tasks with large language models, yet outcome-centric reinforcement learning with verifiable rewards (RLVR) mainly optimizes final-answer correctness. This leaves CoT quality under-supervised and may push optimization toward shortcut or brittle reasoning strategies.
RLCER addresses this gap by explicitly rewarding how models think. It introduces self-proposed and self-evolving rubrics as natural-language supervision criteria for CoT. The core question is whether the policy itself can generate valid CoT rubrics and continuously improve them during training without human annotations.
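As a concrete illustration only, a self-proposed rubric can be pictured as a small set of natural-language criteria that the policy revises as training proceeds; the schema, field names, and `evolve_rubric` helper in this sketch are assumptions, not the paper's actual rubric format.

```python
# Hypothetical representation of a self-proposed, self-evolving CoT rubric.
from dataclasses import dataclass, field, replace
from typing import List

@dataclass(frozen=True)
class CoTRubric:
    """Natural-language criteria the policy proposes for judging its own CoT."""
    task_id: str
    criteria: List[str] = field(default_factory=list)  # e.g. "each step follows from a stated fact"
    version: int = 0  # incremented each time the policy revises the rubric

def evolve_rubric(rubric: CoTRubric, revised_criteria: List[str]) -> CoTRubric:
    """Adopt a policy-proposed revision of the criteria (no human annotation involved)."""
    return replace(rubric, criteria=revised_criteria, version=rubric.version + 1)
```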
The framework extends RLVR beyond pure outcome reward: reasoning trajectories are rewarded according to rubric satisfaction, and the quality of rubric generation is itself optimized, enabling autonomous CoT supervision.
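A minimal sketch of this reward scheme, assuming a simple additive combination of outcome correctness, the fraction of rubric criteria a trajectory satisfies, and a separate score for rubric-generation quality; the function names, `judge` interface, and `rubric_weight` coefficient are illustrative assumptions rather than RLCER's actual formulation.

```python
# Sketch of a combined reward: verifiable outcome signal plus rubric satisfaction,
# with a separate term for how good the self-proposed rubric itself is.
from typing import Callable, List

def trajectory_reward(
    answer_correct: bool,               # verifiable outcome signal (standard RLVR)
    cot: str,                           # sampled chain-of-thought text
    criteria: List[str],                # self-proposed rubric criteria for this task
    judge: Callable[[str, str], bool],  # True if the CoT satisfies a given criterion
    rubric_weight: float = 0.5,         # hypothetical trade-off coefficient
) -> float:
    """Reward a reasoning trajectory by outcome correctness and rubric satisfaction."""
    outcome = 1.0 if answer_correct else 0.0
    satisfied = sum(judge(cot, c) for c in criteria) / max(len(criteria), 1)
    return outcome + rubric_weight * satisfied

def rubric_reward(criterion_scores: List[float]) -> float:
    """Placeholder score for rubric-generation quality, e.g. how well each criterion
    discriminates good trajectories from bad ones; averaged here for simplicity."""
    return sum(criterion_scores) / max(len(criterion_scores), 1)
```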