Inference-Time Diversity in RL-Trained Lean Theorem Provers

A Diagnostic Study

Zachary Burton

MIT ’26

ICML 2026 · AI for Math Workshop

RL-trained Lean theorem provers run out of ideas as you sample more. A fixed schedule of 15 tactic skeletons closes the gap, and the effect is RL-specific.

Summary

RL-trained Lean theorem provers hit a sampling ceiling. On miniF2F-test, doubling DeepSeek-Prover-V1.5-RL’s i.i.d. sampling budget from k=32 to k=64 adds zero new solved theorems: both budgets land at 42/244. A fixed schedule of 15 common tactic openings (simp, intro, constructor, …) breaks the plateau, with a mean gain of +12.3 ± 4.2 theorems across n=3 seeds (sign positive in every seed). The ablation shows the gain is about structure, not prompt diversity: paraphrased instructions match the baseline; irrelevant Lean comments actively degrade it. The same model’s non-RL base checkpoint proves zero theorems with or without skeletons, so RL produces both the proof ability and the narrow sampling that caps it.

Results

Line plot of theorems solved at k=16, 32, 64 for A-RL (38, 42, 42) versus B-RL (55, 58, 60), showing the A-RL plateau and the B-RL gradient. — Figure 1. A-RL flatlines at 42/244 from k=32 to k=64; B-RL keeps climbing.

Scaling on miniF2F-test (244 theorems) with DeepSeek-Prover-V1.5-RL:

Condition	k=16	k=32	k=64
A-RL (i.i.d. baseline)	38/244 (15.6%)	42/244 (17.2%)	42/244 (17.2%)
B-RL (skeleton schedule)	55/244 (22.5%)	58/244 (23.8%)	60/244 (24.6%)
Absolute gap	+17	+16	+18
Relative gain	+44.7%	+38.1%	+42.9%

Paper Table 1 (single-seed headline numbers). The abstract reports the across-seed mean Δ = +12.3 ± 4.2 theorems at k=16, n=3 seeds, sign preserved in every seed.

Venn diagram of solved theorems at k=16: B-RL solves 55 (20 unique), A-RL solves 38 (3 unique), intersection 35. — Figure 2. Solved-set overlap at k=16. 20 theorems are solved only by B-RL, 3 only by A-RL, 35 by both.

Bar chart showing solved theorems at k=16 for C2 (25), A-RL (38), C1 (38), C3 (48), B-RL (55), illustrating the C2 < A-RL = C1 < C3 < B-RL gradient. — Figure 3. Structural-content gradient at k=16: irrelevant tokens hurt; pure paraphrase ties the baseline; NL goal hints recover most of the gap; tactic skeletons go furthest.

Diversity ablation at k=16 isolates structural content from prompt diversity:

Condition	Solved	Pass@16
C2 (irrelevant Lean comments)	25/244	10.2%
A-RL (i.i.d. baseline)	38/244	15.6%
C1 (instruction paraphrases)	38/244	15.6%
C3 (natural-language goal hints only)	48/244	19.7%
B-RL (tactic skeletons)	55/244	22.5%

The A→B gap at k=16 is 17 theorems. C3 (NL hint, no tactic prefix) recovers 10 of those 17 on its own; the tactic stem in B picks up the other 7. C1 (paraphrased instructions, no prefix) adds nothing over A-RL. C2 (irrelevant comments) drops solves by 13.

Takeaway: the model already has the proofs. What changes between A and B is which head of a peaked sampling distribution gets reached. A forced first tactic is one way to reach a new head; an informative goal-hint comment is another, and on its own it closes 58% of the A→B gap.

Bar chart of per-skeleton contribution to B-RL k=16: empty skeleton 3.29 of 4.54 solves per shard, intros 0.62, simp 0.17, norm_num 0.17, others below 0.1. — Figure 4. Per-skeleton contribution within B-RL k=16 (24 shards). The `(empty)` bar is the C3 condition embedded inside B-RL: no tactic prefix, cycling through the 8 hint variants. It accounts for 3.29 of 4.54 solves per shard. The other 14 skeletons split the remaining 1.25, which is what closes A-RL’s gap to B-RL.

The phenomenon is RL-specific. The non-RL base checkpoint (V1.5-Base) proves 0/244 theorems at every k, with or without skeletons. RL creates both the proof capability and the inference-time collapse that limits it.
Distributional evidence. Across 13,159 stochastic samples, the median V1.5-RL theorem receives only 2 distinct first-tactic heads; 49% of the benchmark gets a deterministic single strategic opening.
Cross-model. Running A→B on two more 7B Lean provers: RL-trained DeepSeek-Prover-V2-7B adds +3 frontier theorems no i.i.d. baseline of any tested model reaches. SFT-trained Goedel-Prover loses −10.0 ± 4.4 theorems across n=3 seeds (sign negative every seed). Skeletons help RL provers and hurt SFT ones.

Reproduce

Setup (Python 3.10 + Lean 4.9.0-rc1, the toolchain V1.5 was trained on):

# 1. Python environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Lean toolchain
curl -sSf https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh | sh -s -- -y
export PATH="$HOME/.elan/bin:$PATH"
(cd lean && lake update && lake exe cache get && lake build)

# 3. Start a local vLLM server (one A100 80GB for V1.5)
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-Prover-V1.5-RL \
  --tensor-parallel-size 1 --max-model-len 2048 \
  --gpu-memory-utilization 0.9 --enforce-eager --trust-remote-code

Run the main experiments:

# Exp 1 + Exp 3 (V1.5-RL): A-RL, B-RL at k in {16, 32, 64} + C1/C2 ablations
bash scripts/run_exp1_exp3_rl.sh --provider openai_compat

# Exp 2 (V1.5-Base): A-BASE, B-BASE at k in {16, 32, 64}
bash scripts/run_exp2_base.sh --provider openai_compat

# Cross-model: V2-7B + Kimina-Prover at k in {16, 32}, max-tokens 8192
bash scripts/run_new_models.sh

The pod-bootstrap script scripts/setup_prime_pod.sh does Python + Lean + Mathlib cache + model weights end-to-end on a fresh Prime Intellect / RunPod A100 instance. Each run writes append-only events.jsonl plus per-attempt JSON and Lean files under outputs/<exp_hash>/; exp_hash is a deterministic function of the run config, so reruns are idempotent.

Citation

@inproceedings{burton2026inference,
  title     = {Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study},
  author    = {Burton, Zachary},
  booktitle = {ICML 2026 AI for Math Workshop},
  year      = {2026},
  eprint    = {2601.16172},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url       = {https://arxiv.org/abs/2601.16172}
}