Unsupervised Search and Verification
Aug 2025 · Nimit Kalra ([email protected])

Special thanks to Leonard Tang, Allen Nie, Will Brown, Laurens van der Maaten, Jason Weston, Jon Saad-Falcon, Sachit Menon, Trung Vu, and Shi Feng for valuable conversations, direction, and feedback. Estimated reading time: 46 min.

Introduction

High-quality reasoning traces are noisy, subjective, and expensive to collect. A complementary approach to manual curation is bootstrapping directly from the pre/mid-trained model, leveraging it as a strong prior over reasoning behaviors.

Recent literature shows that unsupervised verification techniques, such as self-consistency or jointly-trained verifiers, can drive meaningful improvements. However, in bespoke or misaligned domains, brittle verification of poor-quality candidate reasoning becomes a serious obstacle. For example, models may “solve” math problems through memorized heuristics (e.g., Qwen on MATH500) rather than by surfacing transferable reasoning strategies.
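To ground the simplest of these, here’s a minimal self-consistency sketch (illustrative, not tied to any particular implementation): sample several reasoning traces for the same prompt, parse out each final answer, and take the majority vote. `sample_traces` and `extract_answer` are hypothetical stand-ins for a rollout call and an answer parser.

```python
from collections import Counter

def self_consistency(prompt, sample_traces, extract_answer, n=16):
    """Majority-vote over the final answers of n sampled reasoning traces.

    sample_traces(prompt, n) -> list[str]   # hypothetical rollout helper
    extract_answer(trace)    -> str | None  # hypothetical answer parser
    """
    traces = sample_traces(prompt, n)
    answered = [(t, extract_answer(t)) for t in traces]
    answered = [(t, a) for t, a in answered if a is not None]
    if not answered:
        return None, []
    majority, _ = Counter(a for _, a in answered).most_common(1)[0]
    # Traces whose final answer agrees with the vote act as the "verified" candidates.
    verified = [t for t, a in answered if a == majority]
    return majority, verified
```

Note that the vote inspects only final answers, never the reasoning itself; this is exactly where the brittleness above creeps in.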

This raises key questions around:

  1. Search: How can we efficiently surface diverse, high-quality reasoning paths from the model?
  2. Verification: What are the limitations of self-verification, especially with respect to reward hacking and self-preference bias, and how can we apply it reliably to judge reasoning traces themselves?
  3. Interplay: How should we coordinate search and verification in a feedback loop to enable stable bootstrapping in challenging or non-general domains?

We’ll revisit techniques from inference-time scaling, post-training, and self-improvement.

arXiv first submission dates are used. ❗️ denotes papers that limit their results to Qwen-2.5+ models on Math/Coding.


Case Study

In Self-Taught Reasoner (STaR; Zelikman et al., March 2022), the authors iteratively bootstrap a training dataset of reasoning traces from a pretrained base model via simple rejection sampling. The first iteration is few-shot prompted with ground-truth reasoning traces, and rollouts that lead to the correct answer are aggregated. At each iteration, the base model is finetuned on the aggregated dataset.
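A minimal sketch of this outer loop (a paraphrase, not the authors’ code): `rollout`, `answer_matches`, and `finetune` are hypothetical stand-ins for the sampling, answer-checking, and SFT machinery, and each problem is assumed to carry a ground-truth `answer` field.

```python
def star_bootstrap(base_model, problems, rollout, answer_matches, finetune,
                   n_iters=5, k_samples=8):
    """STaR-style rejection sampling over self-generated rationales.

    rollout(model, problem, k)    -> list[str]  # sampler; assumed to few-shot prompt
                                                # with ground-truth rationales initially
    answer_matches(trace, answer) -> bool       # final-answer check
    finetune(base_model, dataset) -> model      # SFT on (problem, rationale) pairs
    """
    model, dataset = base_model, []
    for _ in range(n_iters):
        for problem in problems:
            traces = rollout(model, problem, k=k_samples)
            # Rejection sampling: keep only rationales whose final answer is correct.
            dataset.extend((problem, t) for t in traces
                           if answer_matches(t, problem.answer))
        # Each iteration restarts finetuning from the original base model
        # on the aggregated pool of answer-verified rationales.
        model = finetune(base_model, dataset)
    return model, dataset
```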


The authors show that SFT on this filtered batch approximates a policy-gradient objective. However, problems above the model’s current capability (i.e., ones it cannot yet solve) require sampling from a surrogate/proposal model (rationalization; prompting with a “hint”), yielding an off-policy estimate of an optimal reasoning trace. One degenerate failure mode is answer shortcutting, where the model generates irrelevant or inconsistent reasoning before simply stating the hinted answer. Such traces muddle the training pool and do not generalize during rollouts; e.g.,

A: The answer must be a place where Billy could have been waiting for his wife to arrive from France. The airport is a place where people can wait for flights. Therefore, the answer is train station (e).
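In code, the rationalization fallback might look like the sketch below (again a paraphrase, with `rollout_with_hint` as a hypothetical sampler that includes the ground-truth answer in the prompt). The filter only checks the final answer, which is exactly why shortcut traces like the one above can slip into the training pool.

```python
def rationalize(model, problem, rollout_with_hint, answer_matches, k=8):
    """Fallback for problems the current model cannot yet solve on its own.

    rollout_with_hint(model, problem, hint, k) -> list[str]  # hypothetical sampler that
                                                             # surfaces the answer as a hint
    """
    traces = rollout_with_hint(model, problem, hint=problem.answer, k=k)
    # Only final-answer agreement is checked; a trace that ignores its own reasoning
    # and simply restates the hinted answer ("answer shortcutting") still passes.
    return [t for t in traces if answer_matches(t, problem.answer)]
```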

Moreover, using final-answer correctness as our sole reward allows spurious reasoning traces to slip through, i.e., poor-quality reasoning that just happens to land on the correct answer by luck. This is particularly likely in the multiple-choice or evaluation/judge setting.
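A back-of-the-envelope way to see the multiple-choice case (numbers illustrative): even a trace whose “reasoning” is pure guessing passes an answer-only filter with probability 1/k on a k-way question, so spurious traces accumulate linearly with the number of samples.

```python
# Expected number of spuriously "correct" traces admitted per question when the
# reasoning amounts to guessing over a k-way multiple choice item (illustrative only).
def expected_spurious(k_choices: int, n_samples: int) -> float:
    return n_samples * (1.0 / k_choices)

print(expected_spurious(k_choices=5, n_samples=8))  # -> 1.6 lucky traces per question
```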