XINGHAN LI* - NOVEMBER 19, 2025

*Tsinghua University (work done during an internship advised by Prof. Simon Shaolei Du at the University of Washington)

<aside> 🔖 TL;DR: 1. Simple prompts are surprisingly good, but don’t work universally. 2. Variants of DFT match an offline version of RL, but are not promising in practice.

</aside>

Why this matters

Reinforcement Learning (RL) approaches, especially RL with Verifiable Reward (RLVR), substantially improve reasoning performance. However, their online nature hinders efficiency: they rely on rollouts from the continuously updating model, which are costly and hard to parallelize. If cheaper offline methods could match RL, we’d have a far more scalable path to stronger models.

Introduction

The motivation for this project is two-fold:

  1. SFT can be improved: a reweighted variant of SFT called DFT [6] incentivizes reasoning better than vanilla SFT (sketched below).
  2. RLVR can be simplified: even if the RLVR training set is reduced to a single example, most of the performance gain can still be recovered [1].
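
For concreteness, here is a minimal sketch of the DFT reweighting as we read it in [6]: the standard per-token cross-entropy is scaled by the model’s own detached probability of the target token, so the weight acts as a constant in the gradient. The function and variable names are ours, not from [6].

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, labels: torch.Tensor,
             ignore_index: int = -100) -> torch.Tensor:
    """DFT-style reweighted cross entropy (our reading of [6]).

    logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    """
    # Shift so each position predicts the next token, as in causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]

    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels != ignore_index
    safe_labels = labels.masked_fill(~mask, 0)  # avoid gathering at -100
    token_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Reweight each token's NLL by its detached probability (stop-gradient),
    # which is the change DFT makes relative to vanilla SFT.
    weight = token_logp.detach().exp()
    loss = -(weight * token_logp * mask).sum() / mask.sum().clamp(min=1)
    return loss
```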

<aside> 👉

Our key question is:

If we bring these two directions together—improving offline methods and simplifying RLVR—what will the result be?

Will we end up with a simple alternative to RLVR, or is the online nature of RLVR necessary for its full potential to be realized?

</aside>

The work done in this project supports the latter. In the sequel, we look into two types of simple offline methods: prompting and variants of DFT.

1. Prompting

Prompting methods merely change the chat template provided to the base model during generation, and are probably the simplest offline methods we can apply. None of the prompts below involve test-time training, so they require zero additional training compute and inject hardly any new expertise into the model.

Simple Prompts are Effective

To: Force the model to output “To” as the first word of the response.

π1: Use problem π1 from [1] as a one-shot in-context example.

QR: Short for “Question Repetition”. Force the model to begin with Question:, repeat the input, and then produce the actual answer.
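
To make the mechanics concrete, here is a minimal sketch of one way to force such response prefixes with a Hugging Face chat template. The model name and the generate_with_prefix helper are hypothetical placeholders, not the project’s actual setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal chat model works the same way.
MODEL = "Qwen/Qwen2.5-Math-1.5B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_prefix(question: str, prefix: str,
                         max_new_tokens: int = 512) -> str:
    """Render the chat template, then append a forced prefix so the model
    must continue from it rather than choose its own opening tokens."""
    chat = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,  # open the assistant turn
    )
    # The template already contains special tokens, so don't add them again.
    inputs = tok(chat + prefix, return_tensors="pt", add_special_tokens=False)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    continuation = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return prefix + continuation

# "To": force the first word of the response.
#   generate_with_prefix(q, "To")
# "QR": force the model to restate the question before answering.
#   generate_with_prefix(q, f"Question: {q}\n")
# "π1": instead of a response prefix, prepend the example problem and its
#   solution to the user message (one-shot in-context learning).
```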