TreeRPO: Tree Relative Policy Optimization

Abstract

Large Language Models (LLMs) have shown remarkable reasoning capabilities when trained with Reinforcement Learning with Verifiable Rewards (RLVR). However, a key limitation of existing approaches is that rewards defined at the full-trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the expected reward at intermediate reasoning steps via tree sampling. Unlike prior methods that rely on a separate step-level reward model, TreeRPO estimates these rewards directly from the sampling process itself. Building on the group-relative reward training mechanism of GRPO, TreeRPO computes rewards over the step-level groups generated during tree sampling. This yields fine-grained, dense reward signals that significantly enhance both the learning process and the overall performance of LLMs. Experimental results show that TreeRPO substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, raising it from 19.0% to 35.5%. Furthermore, TreeRPO outperforms GRPO by 2.9% while reducing the average response length by 18.1%, demonstrating both its effectiveness and its efficiency. Our code will be available at https://github.com/yangzhch6/TreeRPO.
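
The abstract describes computing group-relative rewards over step-level groups produced by tree sampling. The following is a minimal sketch of that idea under stated assumptions: rollouts form a tree whose leaves carry verifiable 0/1 rewards, a node's value is estimated as the mean reward of the leaves beneath it, and siblings at each step form the normalization group. The `Node`, `expected_reward`, and `step_level_advantages` names are illustrative, not the authors' implementation.

```python
# Sketch of step-level group-relative advantages over a sampled rollout tree.
# Tree structure, value estimation, and normalization here are assumptions
# made for illustration, not the exact TreeRPO algorithm.
from dataclasses import dataclass, field
from typing import List
import statistics


@dataclass
class Node:
    """A partial reasoning trajectory; leaves carry a verifiable 0/1 reward."""
    reward: float = 0.0               # set on leaves by a rule-based verifier
    children: List["Node"] = field(default_factory=list)


def expected_reward(node: Node) -> float:
    """Estimate a node's value as the mean reward of the leaves below it."""
    if not node.children:
        return node.reward
    return statistics.mean(expected_reward(c) for c in node.children)


def step_level_advantages(parent: Node) -> List[float]:
    """GRPO-style normalization, applied to the sibling group at one step:
    each child's estimated reward is centered and scaled by group statistics."""
    values = [expected_reward(c) for c in parent.children]
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0   # guard against a zero-variance group
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    # Two candidate continuations of the same prefix, each rolled out twice
    # to leaves that a verifier scored 0 or 1.
    root = Node(children=[
        Node(children=[Node(reward=1.0), Node(reward=1.0)]),
        Node(children=[Node(reward=0.0), Node(reward=1.0)]),
    ])
    print(step_level_advantages(root))  # the first branch gets a positive advantage
```

In this toy example the first continuation's subtree has a higher estimated reward than its sibling, so it receives a positive step-level advantage; the full-trajectory reward alone would not distinguish which intermediate step was responsible.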

Publication
arXiv