XINGHAN LI* - NOVEMBER 19, 2025
*Tsinghua University (work done during an internship advised by Prof. Simon Shaolei Du at the University of Washington)
<aside> 🔖 TL;DR: 1. Simple prompts are surprisingly good, but don’t work universally. 2. Variants of DFT match an offline version of RL, but are not promising in practice.
</aside>
Reinforcement Learning (RL) approaches, especially RL with Verifiable Rewards (RLVR), substantially improve reasoning performance. However, their online nature hinders efficiency: they rely on rollouts from the continuously updating model, which are costly and hard to parallelize. If cheaper offline methods could match RL, we’d have a far more scalable path to stronger models.
The motivation of this project is two-fold:
<aside> 👉
Our key question is:
If we bring these two directions together—improving offline methods and simplifying RLVR—what will the result be?
Will we end up with a simple alternative to RLVR, or is the online nature of RLVR necessary for its full potential to be realized?
</aside>
The work done in this project supports the latter. Below, we look into two types of simple offline methods: prompting and variants of DFT.
Prompting methods merely change the chat template provided to the base model during generation, and are probably the simplest offline methods we can apply. None of the prompts below involve test-time training, so they require zero additional compute and inject hardly any new expertise into the model.
To: Force the model to output “To” as the first word of the response.
π1: Use the problem π1 in [1] to perform in-context learning.
The chat template is \nQuestion: + π1 + \nAnswer: + example answer to π1 + \n\nQuestion: + input + \nAnswer:, with \n\nQuestion: set as a stop word in the response. Without this stop word configuration, the model will start memorizing other problems, causing the accuracy on Math500 to drop from 76.6% to 63.8%.
QR: Short for “Question Repetition”. Force the model to begin with Question: and repeat the input first, with the real answer after that. Motivation: π1 will significantly promote this behaviour, so we just solidify it.
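The three prompt variants above can be sketched as plain string templates. This is an illustrative reconstruction, not the project's actual code: the placeholder texts for π1 and its answer, the function names, and the generic "prefill" framing are all assumptions; only the template layout and the `\n\nQuestion:` stop sequence come from the description above.

```python
# Hypothetical sketch of the three prompting variants. The model call
# itself is omitted; these functions only build the text fed to a base model.

PI1_PROBLEM = "<problem pi_1 from [1]>"   # placeholder, not the real problem
PI1_ANSWER = "<example answer to pi_1>"   # placeholder, not the real answer

def to_prompt(question: str) -> str:
    # "To": prefill the response so the model's first word is forced to be "To".
    return question + "\nTo"

def pi1_prompt(question: str) -> str:
    # pi_1: one-shot in-context example, following the template in the post.
    return (
        "\nQuestion: " + PI1_PROBLEM
        + "\nAnswer: " + PI1_ANSWER
        + "\n\nQuestion: " + question
        + "\nAnswer:"
    )

def qr_prompt(question: str) -> str:
    # QR ("Question Repetition"): prefill "Question: <input>" so the model
    # repeats the input before producing the real answer.
    return question + "\nQuestion: " + question + "\nAnswer:"

# The pi_1 variant needs this stop sequence; otherwise the model keeps
# generating further Question/Answer pairs from memory.
STOP_SEQUENCES = ["\n\nQuestion:"]
```

In practice the stop sequence would be passed to the generation backend (e.g. a `stop` parameter in the sampling configuration), so decoding halts as soon as the model tries to start a new question.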