Unsupervised Search and Verification
Author: Nimit Kalra ([email protected])
Special thanks to Leonard Tang, Allen Nie, Will Brown, Laurens van der Maaten, Jason Weston, Jon Saad-Falcon, Sachit Menon, and Shi Feng for valuable conversations, direction, and feedback. Estimated reading time: 46min.
High-quality reasoning supervision is expensive — not just in dollars, but in latency, throughput, and annotation consistency. Human-labeled traces are noisy, subjective, and particularly hard to scale for non-general domains, where instructions are idiosyncratic and objective ground truths are often elusive. As a result, many recent works attempt to bootstrap supervision directly from the pretrained model, leveraging it as a strong prior over reasoning behaviors.
This approach rests on the superficial alignment hypothesis: that pretrained LLMs already encode useful reasoning capabilities, even if they aren’t surfaced reliably. But these capabilities may only emerge in domains close to the pretraining distribution. In bespoke or misaligned domains, poor self-calibration, hallucinations, and brittle verification become serious obstacles. For example, models may “solve” math problems by relying on memorized heuristics (e.g., Qwen on MATH500) rather than by learning transferable reasoning strategies.
Yet, using the same model to generate, evaluate, and learn from its own reasoning opens the door to self-improvement and self-play — where we continuously refine the model without external supervision. This reuse isn't just convenient; it creates a closed loop of learning that can be surprisingly effective, provided the generated supervision is at least directionally correct. Recent literature shows that such approximate signals — from self-consistency to self-certainty to internal coherence — are often sufficient to drive meaningful improvements.
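As a concrete illustration of the simplest such signal, here is a minimal sketch of self-consistency: sample several reasoning traces at nonzero temperature and take a majority vote over the extracted final answers. The `generate` callable and the `Answer:` extraction pattern are assumptions for illustration, not any particular library’s API.

```python
import re
from collections import Counter
from typing import Callable

def extract_answer(trace: str) -> str | None:
    # Assumes traces end with a line like "Answer: <value>"; adapt the pattern to your format.
    match = re.search(r"Answer:\s*(.+)", trace)
    return match.group(1).strip() if match else None

def self_consistency(prompt: str, generate: Callable[[str], str], k: int = 16) -> str | None:
    """Sample k reasoning traces and majority-vote over their final answers."""
    answers = [extract_answer(generate(prompt)) for _ in range(k)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    # The most common final answer serves as an approximate (unsupervised) label.
    return Counter(answers).most_common(1)[0][0]
```

In the self-improvement setting, this majority answer can stand in for a ground-truth label when filtering rollouts for finetuning.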
But this raises key questions: how do we search for good reasoning traces, and how do we verify them, without external supervision?
Note: papers are dated by their first arXiv submission. ❗️ denotes papers that limit their results to Qwen-2.5+ models on math/coding tasks.
In Self-Taught Reasoner (STaR; Zelikman et al., March 2022), the authors iteratively bootstrap a training dataset of reasoning traces from a pretrained base model via simple rejection sampling. On the first iteration, the model is few-shot prompted with ground-truth reasoning traces, and rollouts that lead to the correct answer are aggregated. At each iteration, the base model is finetuned on the aggregated dataset, as sketched below.
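A minimal sketch of this bootstrapping loop, assuming hypothetical `sample_rollout` and `finetune` helpers (and omitting details such as the rationalization step discussed below):

```python
from typing import Callable, List, Tuple

def star_bootstrap(
    base_model,
    problems: List[str],
    gold_answers: List[str],
    sample_rollout: Callable,  # sample_rollout(model, problem) -> (trace, predicted answer); placeholder
    finetune: Callable,        # finetune(base_model, dataset) -> finetuned model; placeholder
    num_iters: int = 5,
    samples_per_problem: int = 8,
):
    """Grow a dataset of reasoning traces via rejection sampling, STaR-style."""
    model = base_model
    dataset: List[Tuple[str, str, str]] = []  # (problem, trace, answer) triples that reached the gold answer
    for _ in range(num_iters):
        for problem, gold in zip(problems, gold_answers):
            for _ in range(samples_per_problem):
                trace, pred = sample_rollout(model, problem)  # few-shot prompted on the first iteration
                if pred == gold:  # rejection sampling: keep only traces that land on the correct answer
                    dataset.append((problem, trace, pred))
                    break
        # Each iteration finetunes from the original base model on the aggregated dataset,
        # rather than continuing from the previous checkpoint.
        model = finetune(base_model, dataset)
    return model, dataset
```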
The authors show that SFT on this filtered batch amounts to a policy-gradient update on an objective that rewards final-answer correctness (sketched after the example below). However, problems above the model’s current capability (i.e., ones it cannot yet solve) require sampling from a surrogate/proposal model via rationalization, i.e., prompting with a “hint” that reveals the answer, which yields an off-policy estimate of an optimal reasoning trace. One failure mode of such degenerate traces is answer shortcutting, where the model generates irrelevant or inconsistent reasoning before simply stating the hinted answer. Such traces muddle the training pool and do not generalize during rollouts; e.g.,
A: The answer must be a place where Billy could have been waiting for his wife to arrive from France. The airport is a place where people can wait for flights. Therefore, the answer is train station (e).
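To make the policy-gradient claim above concrete, here is a rough sketch, with notation loosely following the STaR paper: the objective rewards only correct final answers, so dropping incorrect rollouts before a likelihood-gradient step is a Monte Carlo estimate of its policy gradient.

$$
J(M, X, Y) = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim p_M(\cdot \mid x_i)}\big[\mathbb{1}\{\hat{y}_i = y_i\}\big],
\qquad
\nabla J = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim p_M(\cdot \mid x_i)}\big[\mathbb{1}\{\hat{y}_i = y_i\}\, \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i)\big].
$$

Samples with incorrect final answers contribute zero to this estimator, which is why rejection-sampled SFT behaves like REINFORCE with a binary reward.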
Moreover, using final-answer correctness as our sole reward allows spurious reasoning traces to slip through, i.e., poor-quality reasoning that happens to land on the correct answer by luck. This is particularly likely in multiple-choice or evaluation/judge settings.