RLCER: Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

Leheng Sheng1,2,* Wenchang Ma1,* Ruixin Hong1 Xiang Wang3 An Zhang3,† Ke Shen1,† Tat-Seng Chua2
1Seed, ByteDance
2National University of Singapore
3University of Science and Technology of China
*Equal Contribution   †Corresponding Authors

Introduction

Chain-of-thought (CoT) reasoning is essential for solving complex tasks with large language models, yet outcome-centric reinforcement learning with verifiable rewards (RLVR) mainly optimizes final-answer correctness. This leaves CoT quality under-supervised and may push optimization toward shortcut or brittle reasoning strategies.

RLCER addresses this gap by explicitly rewarding how models think. It introduces self-proposed and self-evolving rubrics as natural-language supervision criteria for CoT. The core question is whether the policy itself can generate valid CoT rubrics and continuously improve them during training without human annotations.

The framework extends RLVR beyond pure outcome reward: reasoning trajectories are rewarded by rubric satisfaction, and rubric generation itself is optimized, enabling autonomous CoT supervision.

[Figure] Key idea: self-proposed and self-evolving rubrics for CoT supervision.

Method

Two Roles in One Policy

RLCER uses a shared policy with two prompted roles: a reasoner that generates CoT and final answers, and a rubricator that proposes textual rubrics and scores for evaluating CoT quality. A frozen verifier checks whether a CoT satisfies each rubric.
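As a rough illustration, the Python sketch below shows how a single policy can be prompted into both roles, with a frozen verifier checking rubric satisfaction. The prompt wording, the generate interface, and the one-rubric-per-line output format are assumptions for illustration, not the paper's exact templates.

# A minimal sketch of the shared-policy setup with two prompted roles and a
# frozen verifier. Prompt wording, the `generate` interface, and the
# one-rubric-per-line output format are illustrative assumptions.

REASONER_PROMPT = (
    "Solve the problem step by step, then state the final answer.\n"
    "Problem: {q}"
)
RUBRICATOR_PROMPT = (
    "Propose rubrics, each with a score, for judging the quality of a "
    "chain of thought on this problem.\nProblem: {q}"
)
VERIFIER_PROMPT = (
    "Does this chain of thought satisfy the rubric? Answer yes or no.\n"
    "Rubric: {rubric}\nChain of thought: {cot}"
)

def reasoner_rollout(policy, question: str) -> str:
    # Same policy, reasoner role: produce a CoT followed by a final answer.
    return policy.generate(REASONER_PROMPT.format(q=question))

def rubricator_rollout(policy, question: str) -> list[str]:
    # Same policy, rubricator role: parse one rubric per non-empty line.
    text = policy.generate(RUBRICATOR_PROMPT.format(q=question))
    return [line.strip() for line in text.splitlines() if line.strip()]

def satisfies(frozen_verifier, cot: str, rubric: str) -> bool:
    # Frozen verifier: binary check of whether the CoT satisfies one rubric.
    reply = frozen_verifier.generate(VERIFIER_PROMPT.format(rubric=rubric, cot=cot))
    return reply.strip().lower().startswith("yes")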

Rewarding How to Think

The reasoner receives a combined reward from outcome correctness and rubric-based CoT quality. A rubric is treated as valid when its satisfaction pattern is positively correlated with answer correctness and remains discriminative across rollouts.
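A minimal sketch of this reward logic follows, assuming a Pearson correlation on binary satisfaction/correctness indicators as the validity test and an illustrative 0.5 weight on the rubric term; the paper's exact criterion and weighting may differ.

import statistics

def rubric_is_valid(satisfied: list[bool], correct: list[bool]) -> bool:
    # A rubric is treated as valid over a group of rollouts when its
    # satisfaction pattern (i) is not saturated (all satisfied or all
    # violated) and (ii) correlates positively with answer correctness.
    # Pearson correlation on 0/1 indicators is an illustrative choice.
    if len(set(satisfied)) < 2 or len(set(correct)) < 2:
        return False  # non-discriminative across rollouts: no usable signal
    corr = statistics.correlation(
        [float(s) for s in satisfied], [float(c) for c in correct]
    )
    return corr > 0.0

def reasoner_reward(correct: bool, satisfied_rubrics: list[bool],
                    valid_rubrics: list[bool], weight: float = 0.5) -> float:
    # Combined reward: outcome correctness plus the fraction of valid
    # rubrics this rollout satisfies. The 0.5 weight is an assumption.
    valid_idx = [i for i, v in enumerate(valid_rubrics) if v]
    if not valid_idx:
        return float(correct)
    rubric_score = sum(satisfied_rubrics[i] for i in valid_idx) / len(valid_idx)
    return float(correct) + weight * rubric_score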

Self-Evolving Rubric Generation

The rubricator is rewarded by the fraction of valid rubrics among all generated rubrics, plus a format reward. This drives rubric generation to evolve toward more informative and less saturated supervision criteria over time.
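Under the same assumptions, the rubricator's reward can be sketched as below; the magnitude of the format bonus is illustrative.

def rubricator_reward(valid_flags: list[bool], format_ok: bool,
                      format_bonus: float = 0.1) -> float:
    # Fraction of valid rubrics among all generated rubrics, plus a small
    # format reward. The 0.1 bonus magnitude is an assumption.
    base = sum(valid_flags) / len(valid_flags) if valid_flags else 0.0
    return base + (format_bonus if format_ok else 0.0)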

[Figure] The RLCER training loop with reasoner and rubricator roles.
[Figure] Reward computation for the reasoner and rubricator in RLCER.

Experiments

We evaluate RLCER on math and knowledge reasoning benchmarks including AIME24, AIME25, AMC23, GPQA-Diamond, and SuperGPQA subsets, using Qwen 4B/8B backbones. Models are trained on DAPO-Math-17k and evaluated with pass@1 under multi-sample decoding.
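For reference, pass@1 under multi-sample decoding is commonly estimated by averaging per-problem accuracy over the sampled completions, as in the Python sketch below; the paper's exact evaluation protocol is not reproduced here.

def pass_at_1(correct_per_problem: list[list[bool]]) -> float:
    # Common estimator of pass@1 with k sampled completions per problem:
    # average correctness within each problem, then average over problems.
    per_problem = [sum(samples) / len(samples) for samples in correct_per_problem]
    return sum(per_problem) / len(per_problem)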

Results show that self-proposed rubrics provide meaningful learning signals even without outcome rewards. When combined with outcome reward, RLCER consistently outperforms vanilla RLVR across datasets and scales, and self-evolving rubric optimization further strengthens reasoning performance.

[Figure] Performance dynamics during training with self-evolving rubrics.
[Figure] Rubric-only training on AMC23.
[Figure] Rubric-only training on AIME25.

Limitations

RLCER introduces an additional rubricator role, which increases rollout volume and training cost. In addition, the current validation focuses on RLVR settings with verifiable outcomes; transfer to non-verifiable domains remains an open direction.

Conclusion

Outcome-centric RLVR improves final-answer accuracy but often overlooks direct CoT supervision. RLCER augments RLVR with self-proposed, self-evolving rubrics to reward reasoning trajectories without human annotation. Across datasets and model sizes, this approach yields consistent gains over outcome-only RLVR, suggesting a practical path toward more autonomous reasoning improvement.

Citation

@article{sheng2026rlcer,
  title   = {Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics},
  author  = {Sheng, Leheng and Ma, Wenchang and Hong, Ruixin and Wang, Xiang and Zhang, An and Shen, Ke and Chua, Tat-Seng},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}