Unsupervised Search and Verification — Aug 2025
Author: Nimit Kalra ([email protected])
Special thanks to Leonard Tang, Allen Nie, Will Brown, Laurens van der Maaten, Jason Weston, Jon Saad-Falcon, Sachit Menon, Trung Vu, and Shi Feng for valuable conversations, direction, and feedback. Estimated reading time: 46min.
High-quality reasoning traces are noisy, subjective, and expensive to collect. A complementary approach to manual curation is bootstrapping directly from the pre/mid-trained model, leveraging it as a strong prior over reasoning behaviors.
Recent literature shows that unsupervised verification techniques, such as self-consistency or jointly-trained verifiers, can drive meaningful improvements. However, in bespoke or misaligned domains, brittle verification of poor-quality candidate reasoning becomes a serious obstacle. For example, models may “solve” math problems by relying on memorized heuristics (e.g., Qwen on MATH500) rather than by surfacing transferable reasoning strategies.
This raises key questions around how to search for and verify high-quality reasoning traces without ground-truth supervision.
We’ll revisit techniques from inference-time scaling, post-training, and self-improvement.
Note: arXiv first-submission dates are used. ❗️ denotes papers that limit their results to Qwen-2.5+ models on Math/Coding.
In Self-Taught Reasoner (STaR; Zelikman et al., March 2022), the authors iteratively bootstrap a training dataset of reasoning traces from a pretrained base model via simple rejection sampling. For the first iteration, they few-shot prompt the model with ground-truth reasoning traces and aggregate the rollouts that lead to the correct answer. At each iteration, the base model is finetuned on the aggregated dataset.
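Roughly, the bootstrapping loop looks like the following minimal sketch; the callables (`sample_rollout`, `extract_answer`, `finetune`) and the one-correct-trace-per-problem simplification are assumptions for illustration, not the paper's code.

```python
# Sketch of the STaR outer loop: rejection sampling on final-answer
# correctness, followed by SFT from the original base model.

def star_bootstrap(
    base_model,
    problems,            # iterable of (question, gold_answer) pairs
    fewshot_rationales,  # seed ground-truth reasoning traces for prompting
    sample_rollout,      # (model, fewshot, question) -> reasoning trace (str)
    extract_answer,      # trace (str) -> final answer
    finetune,            # (model, dataset) -> finetuned model
    num_iterations=5,
    samples_per_problem=8,
):
    model = base_model
    for _ in range(num_iterations):
        dataset = []
        for question, gold_answer in problems:
            for _ in range(samples_per_problem):
                # Few-shot prompt with seed rationales (needed mainly for the
                # first, not-yet-finetuned model).
                trace = sample_rollout(model, fewshot_rationales, question)
                # Rejection sampling: keep a trace only if its answer is correct.
                if extract_answer(trace) == gold_answer:
                    dataset.append((question, trace))
                    break  # keep one correct trace per problem (a simplification)
        # Finetune from the original base model on this iteration's traces,
        # rather than continuing from the previous checkpoint.
        model = finetune(base_model, dataset)
    return model
```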

The authors show that SFT on this batch approximately maximizes a policy gradient objective; however, problems above the model’s current capability (i.e., ones it cannot yet solve) require sampling from a surrogate/proposal model (rationalization, i.e., prompting with a “hint”), which yields an off-policy estimate of an optimal reasoning trace. One such degenerate failure mode may be answer shortcutting, where the model generates irrelevant or inconsistent reasoning before simply stating the hinted answer. Such traces muddle the training pool and do not generalize during rollouts; e.g.,
A: The answer must be a place where Billy could have been waiting for his wife to arrive from France. The airport is a place where people can wait for flights. Therefore, the answer is train station (e).
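Concretely (notation adapted from the STaR paper's discussion; the reward here is just an indicator on the sampled final answer):

$$
J(\theta) = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim p_\theta(\cdot \mid x_i)}\big[\mathbb{1}(\hat{y}_i = y_i)\big],
\qquad
\nabla_\theta J(\theta) = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim p_\theta(\cdot \mid x_i)}\big[\mathbb{1}(\hat{y}_i = y_i)\,\nabla_\theta \log p_\theta(\hat{r}_i, \hat{y}_i \mid x_i)\big],
$$

where $x_i$ is a question, $y_i$ its gold answer, and $(\hat{r}_i, \hat{y}_i)$ a sampled rationale-answer pair. The indicator zeroes out incorrect rollouts, so taking likelihood-gradient steps only on the filtered correct rollouts (i.e., SFT on the rejection-sampled batch) follows a sample estimate of this gradient; rationalized traces break this correspondence because they are not sampled from $p_\theta(\cdot \mid x_i)$.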
Moreover, using final-answer correctness as the sole reward allows spurious reasoning traces to slip through, i.e., poor-quality reasoning that happens to land on the correct answer by luck. This is particularly likely in multiple-choice or evaluation/judge settings.
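To make this concrete, a final-answer-only reward is just an exact-match check on whatever the trace ends with; the option format and regex below are illustrative assumptions, not the paper's setup.

```python
import re

def answer_only_reward(trace: str, gold_choice: str) -> float:
    """Return 1.0 iff the final stated multiple-choice letter matches the label.

    Any trace that lands on the right letter, whether by sound reasoning,
    memorized heuristics, or a lucky guess, receives the same full reward.
    """
    match = re.search(r"the answer is .*?\((\w)\)", trace.lower())
    return float(bool(match) and match.group(1) == gold_choice.lower())

# The shortcutted trace above contradicts its own reasoning, but still scores
# full reward if (e) happens to be the labeled choice:
trace = ("The airport is a place where people can wait for flights. "
         "Therefore, the answer is train station (e).")
print(answer_only_reward(trace, "e"))  # 1.0
```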