Stephen Chung$^{*,1,2}$, Wenyu Du$^{*,1,3}$, Jie Fu$^4$ (*Contributed equally)

$^{1}$DualityRL, $^{2}$University of Cambridge, $^{3}$University of Hong Kong, $^{4}$Shanghai AI Lab

[👨‍💻 GitHub] [📚 PDF] [🤗 HF]

<aside> 💡

This blog presents our current project on Multi-Attempt RL for LLMs, aiming to enable LLMs to iteratively refine their responses based on past attempts. Instead of generating a single response per question like DeepSeek-R1, our proposed method allows multiple attempts, with feedback provided after incorrect responses. The multi-attempt task assigns rewards based on correctness across attempts, encouraging the model to refine its previous attempts and improve search efficiency.

Our preliminary experiments show that even a small LLM (e.g., Qwen2.5-Math-1.5B) trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving the average accuracy across five math benchmarks (AIME 2024, MATH 500, AMC 2023, Minerva Math, and OlympiadBench) from 45.6% with 1 attempt to 52.5% with 2 attempts. This suggests that the model effectively learns to leverage previous failed attempts to refine its responses. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given 2 attempts during evaluation. These results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback.

</aside>

1 - Motivation — Multi-Attempt RL Enables Learning From Failures

Recent advances in large-scale reinforcement learning (RL) post-training of large language models (LLMs) have shown promise in enhancing reasoning capabilities, leading to emergent abilities such as self-correction and self-refinement. Most existing methods rely on single-turn tasks, where the model receives a reward based on the correctness of its single response to a question. However, single-turn tasks may be inefficient due to sparse rewards, and they do not require the LLM to learn how to respond to user feedback. In this work, we propose a simple yet effective multi-turn task that enables LLMs to learn reasoning through RL.

Instead of requiring the LLM to provide a single response to a given question, we propose a multi-attempt task that allows the LLM to generate multiple responses based on feedback. Specifically, we first randomly sample $N$, the number of remaining attempts, for each question. The model initially generates a response to the question as usual. If the response is correct or no attempts remain (i.e., $N \leq 1$), the dialogue ends. However, if the response is incorrect and attempts remain (i.e., $N > 1$), we provide feedback indicating that the answer is incorrect and prompt the LLM to try again, while decrementing the remaining attempts $N$ by 1. An example dialogue from an LLM during training is shown in the Case Study.
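The following is a minimal sketch of this rollout procedure in Python. The helper names (`llm.generate`, `check_answer`) and the exact wording of the feedback message are illustrative assumptions, not the actual training code.

```python
import random

def multi_attempt_rollout(llm, question, check_answer, attempt_range=(1, 5)):
    """Sketch of one multi-attempt episode (helper names are hypothetical)."""
    # Randomly sample the attempt budget N for this question.
    n_remaining = random.randint(*attempt_range)
    dialogue = [{"role": "user", "content": question}]

    while True:
        # The model produces an attempt conditioned on the dialogue so far.
        response = llm.generate(dialogue)
        dialogue.append({"role": "assistant", "content": response})

        # Correct answer or no attempts left: the dialogue ends.
        if check_answer(response) or n_remaining <= 1:
            return dialogue

        # Incorrect and attempts remain: give feedback and decrement N.
        n_remaining -= 1
        dialogue.append({
            "role": "user",
            "content": "Your answer is incorrect. Please try again.",
        })
```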

Figure: Evaluation accuracy as a function of the number of allowed attempts during evaluation, averaged across five benchmarks: AIME 2024, MATH 500, AMC 2023, Minerva Math, and OlympiadBench. Both LLMs are based on Qwen2.5-Math-1.5B and fine-tuned via RL on a small math dataset using either the multi-attempt task or the single-turn task (baseline).

2 - (Preliminary) Experiments — Multi-Attempt Training Leads LLMs to Effectively Learn Self-Refinement

Our training pipeline is simple: we apply standard RL to the multi-attempt task on a math problem dataset, largely following how DeepSeek-R1-Zero is trained. In the multi-attempt task, a reward of +1 is given if the answer is correct in any attempt, -0.5 if the answer is incorrect but in the correct format, and -1 otherwise. We use standard Proximal Policy Optimization (PPO) as the training algorithm.
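As a minimal sketch, assuming the correctness and format checks are computed elsewhere, the reward rule above can be written as:

```python
def multi_attempt_reward(any_attempt_correct: bool, format_ok: bool) -> float:
    """Reward for one multi-attempt episode (format-check details are assumed)."""
    if any_attempt_correct:
        return 1.0    # correct answer in any attempt
    if format_ok:
        return -0.5   # incorrect answer, but in the correct format
    return -1.0       # incorrect and badly formatted
```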

We fine-tune a small pretrained model, namely Qwen2.5-Math-1.5B, on the 8K math questions provided in simpleRL-reason. We use PPO with a discount rate of $\gamma = 1$, $\lambda = 0.99$, and a small KL divergence coefficient of 0.01. The LLM is trained for 160 episodes, generating a single sample per question in each episode, for a total of $160 \times 8\text{K} = 1.28\text{M}$ training samples.
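For concreteness, these hyperparameters can be collected into a small configuration sketch; the field names below are illustrative and do not correspond to any particular training framework.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Hyperparameters reported above; names are illustrative.
    gamma: float = 1.0             # PPO discount rate
    lam: float = 0.99              # lambda
    kl_coef: float = 0.01          # KL divergence coefficient
    episodes: int = 160            # training episodes
    num_questions: int = 8_000     # math questions from simpleRL-reason
    samples_per_question: int = 1  # samples generated per question per episode

    @property
    def total_samples(self) -> int:
        # 160 episodes x 8K questions x 1 sample = 1.28M training samples
        return self.episodes * self.num_questions * self.samples_per_question
```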

Our preliminary experiments show that even a small LLM, such as a 1.5B model, can effectively learn self-refinement. As illustrated in the figure above, the evaluation accuracy of the LLM trained on the multi-attempt task improves from 45.6% to 52.5% on the math benchmarks when the number of allowed attempts increases from 1 to 2. In contrast, the same model trained on the single-turn task shows only a marginal gain, from 42.3% to 43.2%. We also observe that even under the standard 1-attempt evaluation, the multi-attempt LLM outperforms its single-turn counterpart, highlighting the benefits of multi-attempt training. We are currently scaling up the experiments to a 7B model and anticipate even greater improvements.