XINGHAN LI* - NOVEMBER 19, 2025
*Tsinghua University (work done during an internship advised by Prof. Simon Shaolei Du at the University of Washington)
<aside> 🔖 TL;DR: 1. Simple prompts are surprisingly good, but don’t work universally. 2. Variants of DFT match an offline version of RL, but are not promising in practice.
</aside>
Reinforcement Learning (RL) approaches, especially RL with Verifiable Rewards (RLVR), substantially improve reasoning performance. However, their online nature hinders efficiency: they rely on rollouts from the continuously updating model, which are costly and hard to parallelize. If cheaper offline methods could match RL, we’d have a far more scalable path to stronger models.
The motivation of this project is two-fold:
<aside> 👉
Our key question is:
If we bring these two directions together—improving offline methods and simplifying RLVR—what will the result be?
Will we end up with a simple alternative to RLVR, or is the online nature of RLVR necessary for its full potential to be realized?
</aside>
The work done in this project supports the latter. Below, we look into two types of simple offline methods: prompting and variants of DFT.
Prompting methods merely change the chat template provided to the base model during generation, and are probably the simplest offline methods we can apply. None of the prompts below involve test-time training, so they require zero additional compute and inject hardly any new expertise into the model.
To: Force the model to output “To” as the first word of the response.
π1: Use the problem π1 in [1] to perform in-context learning.
The chat template is \nQuestion: + π1 + \nAnswer: + example answer to π1 + \n\nQuestion: + input + \nAnswer:, with \n\nQuestion: set as a stop word in the response. Without this stop word configuration, the model will start memorizing other problems, causing the accuracy on Math500 to drop from 76.6% to 63.8%.
QR: Short for “Question Repetition”. Force the model to begin with Question: and repeat the input first, with the real answer after that. Motivation: π1 will significantly promote this behaviour, so we just solidify it.
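The three prompt variants above can be sketched as plain string templates. This is an illustrative reconstruction, not the project's actual code: the placeholder texts for π1 and its answer, the function names, and the generic "prefill" framing are all assumptions; only the template layout and the `\n\nQuestion:` stop sequence come from the description above.

```python
# Hypothetical sketch of the three prompting variants. The model call
# itself is omitted; these functions only build the text fed to a base model.

PI1_PROBLEM = "<problem pi_1 from [1]>"   # placeholder, not the real problem
PI1_ANSWER = "<example answer to pi_1>"   # placeholder, not the real answer

def to_prompt(question: str) -> str:
    # "To": prefill the response so the model's first word is forced to be "To".
    return question + "\nTo"

def pi1_prompt(question: str) -> str:
    # pi_1: one-shot in-context example, following the template in the post.
    return (
        "\nQuestion: " + PI1_PROBLEM
        + "\nAnswer: " + PI1_ANSWER
        + "\n\nQuestion: " + question
        + "\nAnswer:"
    )

def qr_prompt(question: str) -> str:
    # QR ("Question Repetition"): prefill "Question: <input>" so the model
    # repeats the input before producing the real answer.
    return question + "\nQuestion: " + question + "\nAnswer:"

# The pi_1 variant needs this stop sequence; otherwise the model keeps
# generating further Question/Answer pairs from memory.
STOP_SEQUENCES = ["\n\nQuestion:"]
```

In practice the stop sequence would be passed to the generation backend (e.g. a `stop` parameter in the sampling configuration), so decoding halts as soon as the model tries to start a new question.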