20260310_tmlr
The paper "Stepwise guided policy optimization: Coloring your incorrect reasoning in GRPO", coauthored with Peter Chen, Xiaopeng Li, Ziniu Li, and Xi Chen, was accepted to TMLR.