# Baseline vs Recursive: AIME 2025 Comparison

## Overview
This document compares a **Baseline (single-shot)** approach with **RecursiveThink** on the first 5 AIME 2025 problems, using Mistral `mistral-large-latest`.
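
For readers new to the setup, the difference between the two approaches can be pictured as one model call versus a loop of solve-and-review calls. The following is a generic sketch of iterative refinement with stubbed functions. The names `solve` and `review`, the stopping rule, and the way steps are counted are all hypothetical. It is not the RecursiveThink implementation, which this document does not show.

```python
from typing import Callable, Tuple

def single_shot(solve: Callable[[str], str], problem: str) -> Tuple[str, int]:
    """Baseline: one model call, counted as one step."""
    return solve(problem), 1

def recursive(
    solve: Callable[[str], str],
    review: Callable[[str, str], Tuple[bool, str]],
    problem: str,
    max_steps: int = 5,
) -> Tuple[str, int]:
    """Refine the answer until the reviewer accepts it or max_steps is reached."""
    answer = solve(problem)
    steps = 1
    while steps < max_steps:
        accepted, revised = review(problem, answer)
        if accepted:
            break  # reviewer is satisfied, stop early
        answer = revised
        steps += 1
    return answer, steps

# Stubs standing in for real model calls (illustrative values only).
def fake_solve(problem: str) -> str:
    return "69"  # a deliberately wrong first attempt

def fake_review(problem: str, answer: str) -> Tuple[bool, str]:
    # Accepts "70"; otherwise proposes "70" as the revision.
    return (answer == "70", "70")

baseline_result = single_shot(fake_solve, "toy problem")
recursive_result = recursive(fake_solve, fake_review, "toy problem")
print(baseline_result, recursive_result)
```

In this toy run the loop stops after the first revision is accepted, which is the kind of early exit that keeps the average step count low.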

---

## Final Results Summary

| Approach | Correct | Accuracy | Avg Steps | Avg Time |
|----------|---------|----------|-----------|----------|
| **Baseline (k=1)** | 5/5 | **100%** | 1 | ~24s |
| **RecursiveThink (improved)** | 5/5 | **100%** | 1.4 | ~30s |
| RecursiveThink (old prompts) | 3/5 | 60% | 4.2 | ~50s |

---

## Problem-by-Problem Breakdown

| Problem | Expected | Baseline | Recursive (new) | Recursive (old) |
|---------|----------|----------|-----------------|-----------------|
| id_01 | 70 | ✅ | ✅ 1 step | ✅ 5 steps |
| id_02 | 588 | ✅ | ✅ 1 step | ❌ failed |
| id_03 | 16 | ✅ | ✅ 1 step | ✅ 4 steps |
| id_04 | 117 | ✅ | ✅ 2 steps | ✅ 5 steps |
| id_05 | 279 | ✅ | ✅ 2 steps | ❌ failed |
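
The summary figures for the baseline and the improved recursive run can be re-derived from the table above. A minimal sketch, assuming the rows are transcribed by hand into a list (the `results` layout is illustrative and is not the format of the files under `tts/results/`):

```python
# Per-problem outcomes transcribed from the breakdown table.
# Tuple layout (an assumption for this example):
# (problem, baseline_correct, recursive_new_correct, recursive_new_steps, recursive_old_correct)
results = [
    ("id_01", True, True, 1, True),
    ("id_02", True, True, 1, False),
    ("id_03", True, True, 1, True),
    ("id_04", True, True, 2, True),
    ("id_05", True, True, 2, False),
]

def accuracy(flags):
    """Fraction of correct answers."""
    flags = list(flags)
    return sum(flags) / len(flags)

baseline_acc = accuracy(r[1] for r in results)
recursive_new_acc = accuracy(r[2] for r in results)
recursive_old_acc = accuracy(r[4] for r in results)
avg_steps_new = sum(r[3] for r in results) / len(results)

print(f"baseline accuracy:        {baseline_acc:.0%}")
print(f"recursive (new) accuracy: {recursive_new_acc:.0%}")
print(f"recursive (old) accuracy: {recursive_old_acc:.0%}")
print(f"recursive (new) avg steps: {avg_steps_new:.1f}")
```

The old-prompt average of 4.2 steps cannot be recomputed this way, because the table does not list step counts for the two failed runs.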

---

## Key Findings

### 1. Ablation-Informed Prompts Dramatically Improved Recursive
After applying lessons from the ablation study:
- Accuracy: **60% → 100%** (+40 percentage points)
- Avg steps: **4.2 → 1.4** (3x fewer steps; avg time ~50s → ~30s)
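
As a quick check of these deltas, the arithmetic can be spelled out from the summary table. The variable names are illustrative, and the times are the approximate averages reported there:

```python
old_acc, new_acc = 3 / 5, 5 / 5        # old prompts vs improved prompts
old_steps, new_steps = 4.2, 1.4        # average steps per problem
old_time_s, new_time_s = 50, 30        # approximate average seconds per problem

acc_gain_points = (new_acc - old_acc) * 100  # absolute gain in percentage points
step_ratio = old_steps / new_steps           # how many times fewer steps
time_ratio = old_time_s / new_time_s         # wall-clock speedup

print(f"accuracy gain: {acc_gain_points:.0f} percentage points")
print(f"steps: {step_ratio:.1f}x fewer")
print(f"time: {time_ratio:.2f}x faster")
```

The step count drops by a factor of about 3, while the wall-clock speedup is closer to 1.7x, so the two should not be read as the same number.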

### 2. What Improved Recursive Performance
Based on the ablation study (see `ablation_results.readme`):
- ✅ **Expert persona**: "world-class expert problem solver"
- ✅ **CoT markers**: "Let me think through this..."
- ✅ **Removed self-attribution**: "your solution" → "the solution"
- ✅ **Confidence framing**: "Be confident in your expert assessment"
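
To make the four changes concrete, here is a minimal sketch of how they might be combined into a single prompt. Only the quoted phrases come from the list above. The function name `build_review_prompt`, the surrounding template wording, and the ordering of the parts are assumptions for illustration and do not reproduce `prompt.py`.

```python
def build_review_prompt(problem: str, solution: str) -> str:
    """Assemble a hypothetical review prompt using the four ablation findings."""
    parts = [
        # Expert persona
        "You are a world-class expert problem solver.",
        "",
        f"Problem:\n{problem}",
        "",
        # Neutral attribution: refer to "the solution", not to the model's own work
        f"Here is the solution so far:\n{solution}",
        "",
        # Confidence framing
        "Be confident in your expert assessment.",
        # Chain-of-thought marker to open the reasoning
        "Let me think through this...",
    ]
    return "\n".join(parts)

prompt = build_review_prompt("Find x such that 2x = 140.", "x = 70")
print(prompt)
```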

### 3. Token Limit Was Critical for Baseline
The initial baseline (`max_tokens=1024`) scored 0% because responses were truncated.
After the limit was raised to 4096 tokens, accuracy reached 100%.
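
Truncation of this kind can be caught before scoring. The sketch below is hypothetical: it assumes each response is available as a plain dict with a `finish_reason` key, and that the value `"length"` marks a reply cut off by the token limit (a convention used by several chat-completion APIs; check the field names of your client). `call_model` and `fake_model` are stand-ins, not functions from this repository.

```python
from typing import Callable

def generate_with_budget(
    call_model: Callable[[str, int], dict],
    prompt: str,
    budgets=(1024, 4096),
) -> dict:
    """Try each token budget in turn until the reply is not cut off."""
    response = {}
    for max_tokens in budgets:
        response = call_model(prompt, max_tokens)
        if response.get("finish_reason") != "length":
            break  # the model finished on its own
    return response

def fake_model(prompt: str, max_tokens: int) -> dict:
    # Stand-in for a real API call: pretends a full answer needs 2000 tokens.
    needed = 2000
    if max_tokens < needed:
        return {"text": "partial reasoning", "finish_reason": "length",
                "max_tokens": max_tokens}
    return {"text": "full reasoning and final answer", "finish_reason": "stop",
            "max_tokens": max_tokens}

result = generate_with_budget(fake_model, "Solve the problem.")
print(result["finish_reason"], result["max_tokens"])
```

Logging the finish reason next to each answer makes a 0% score caused by truncation easy to tell apart from one caused by wrong reasoning.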

---

## When to Use Each Approach

| Use Case | Recommendation |
|----------|----------------|
| Need reasoning trace | Recursive |
| Maximum speed | Baseline |
| Token-limited API | Recursive |
| Complex multi-step | Recursive |
| Simple problems | Either |

---

## Files

- `prompt.py` — Improved prompts based on ablation
- `ablation_results.readme` — Ablation study findings
- `tts/results/aime_recursive_*` — Recursive results
- `tts/results/aime_baseline_*` — Baseline results