# Baseline vs Recursive: AIME 2025 Comparison
## Overview
This comparison runs **Baseline (single-shot)** and **RecursiveThink** on the first 5 AIME 2025 problems using Mistral `mistral-large-latest`.
---
## Final Results Summary
| Approach | Correct | Accuracy | Avg Steps | Avg Time |
|----------|---------|----------|-----------|----------|
| **Baseline (k=1)** | 5/5 | **100%** | 1 | ~24s |
| **RecursiveThink (improved)** | 5/5 | **100%** | 1.4 | ~30s |
| RecursiveThink (old prompts) | 3/5 | 60% | 4.2 | ~50s |
---
## Problem-by-Problem Breakdown
| Problem | Expected | Baseline | Recursive (new) | Recursive (old) |
|---------|----------|----------|-----------------|-----------------|
| id_01 | 70 | ✅ | ✅ 1 step | ✅ 5 steps |
| id_02 | 588 | ✅ | ✅ 1 step | ❌ failed |
| id_03 | 16 | ✅ | ✅ 1 step | ✅ 4 steps |
| id_04 | 117 | ✅ | ✅ 2 steps | ✅ 5 steps |
| id_05 | 279 | ✅ | ✅ 2 steps | ❌ failed |
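
The summary row for RecursiveThink (improved) can be reproduced from the per-problem table. The sketch below hard-codes the rows shown above; the tuple layout and the `summarize` helper are illustrative and not taken from the repository. The old-prompt run is left out because the step counts for its two failed problems are not listed.

```python
# Per-problem outcomes for RecursiveThink (improved prompts), copied from the table.
# Each row is (problem_id, correct, steps); this layout is an assumption for illustration.
recursive_new = [
    ("id_01", True, 1),
    ("id_02", True, 1),
    ("id_03", True, 1),
    ("id_04", True, 2),
    ("id_05", True, 2),
]


def summarize(rows):
    """Return correct count, accuracy, and average steps for a list of rows."""
    total = len(rows)
    correct = sum(1 for _, ok, _ in rows if ok)
    avg_steps = sum(steps for _, _, steps in rows) / total
    return {
        "correct": correct,
        "total": total,
        "accuracy": correct / total,
        "avg_steps": avg_steps,
    }


summary = summarize(recursive_new)
print(f"{summary['correct']}/{summary['total']} correct, "
      f"{summary['accuracy']:.0%} accuracy, {summary['avg_steps']:.1f} avg steps")
```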
---
## Key Findings
### 1. Ablation-Informed Prompts Dramatically Improved Recursive
After applying lessons from the ablation study:
- Accuracy: **60% → 100%** (+40 percentage points)
- Avg steps: **4.2 → 1.4** (3x fewer steps)
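
The two improvement figures can be checked with a little arithmetic. Going from 60% to 100% is a gain of 40 percentage points (about 67% in relative terms), and the 3x figure is the ratio of average step counts. The wall-clock ratio from the summary table is smaller. The variable names below are illustrative, and the times are the approximate values from the summary table.

```python
# Figures taken from the Final Results Summary table.
old_acc, new_acc = 0.60, 1.00
old_steps, new_steps = 4.2, 1.4
old_time, new_time = 50.0, 30.0  # approximate seconds per problem

pp_gain = (new_acc - old_acc) * 100             # absolute gain in percentage points
rel_gain = (new_acc - old_acc) / old_acc * 100  # relative gain in percent
step_ratio = old_steps / new_steps              # how many times fewer steps
time_ratio = old_time / new_time                # how many times faster in wall-clock time

print(f"accuracy gain: {pp_gain:.0f} pp ({rel_gain:.0f}% relative)")
print(f"steps: {step_ratio:.1f}x fewer, time: {time_ratio:.1f}x faster")
```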
### 2. What Improved Recursive Performance
Based on the ablation study (see `ablation_results.readme`):
- ✅ **Expert persona**: "world-class expert problem solver"
- ✅ **CoT markers**: "Let me think through this..."
- ✅ **Removed self-attribution**: "your solution" → "the solution"
- ✅ **Confidence framing**: "Be confident in your expert assessment"
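
As a rough illustration of how the four changes above might combine into a single review prompt, here is a minimal sketch. The contents of `prompt.py` are not reproduced in this README, so the constant names, the `build_review_prompt` function, and the exact wording around the quoted phrases are assumptions.

```python
# Hypothetical prompt pieces built from the phrases quoted in the list above.
EXPERT_PERSONA = "You are a world-class expert problem solver."
COT_MARKER = "Let me think through this..."
CONFIDENCE = "Be confident in your expert assessment."


def build_review_prompt(problem: str, solution: str) -> str:
    """Assemble a review prompt that avoids self-attribution.

    The candidate is introduced as "the solution" rather than "your solution".
    """
    parts = [
        EXPERT_PERSONA,
        f"Problem:\n{problem}",
        f"Here is the solution to review:\n{solution}",
        CONFIDENCE,
        COT_MARKER,
    ]
    return "\n\n".join(parts)


prompt = build_review_prompt("Find the sum of ...", "Candidate answer: 70")
print(prompt)
```

Ending the prompt with the CoT marker is one possible ordering; the ablation notes may place it elsewhere.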
### 3. Token Limit Was Critical for Baseline
The initial baseline run (`max_tokens=1024`) scored 0% because responses were truncated.
After raising the limit to 4096 tokens, accuracy rose to 100%.
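
A truncated response can be caught before it is scored as wrong. The sketch below assumes two things that this README does not confirm: that the API response exposes a finish reason equal to `"length"` when the token cap is hit (a common convention in chat-completion APIs), and that the model is asked to end with a line of the form `Final answer: <integer>`. Both `extract_answer` and `needs_retry` are hypothetical helpers.

```python
import re
from typing import Optional

# Assumed output convention: the response ends with "Final answer: <0-999 integer>".
ANSWER_RE = re.compile(r"Final answer:\s*(\d{1,3})\b")


def extract_answer(text: str) -> Optional[int]:
    """Return the integer after 'Final answer:', or None if it is missing."""
    match = ANSWER_RE.search(text)
    return int(match.group(1)) if match else None


def needs_retry(finish_reason: Optional[str], text: str) -> bool:
    """Flag responses that hit the token cap or never stated a final answer."""
    return finish_reason == "length" or extract_answer(text) is None


# A response cut off mid-derivation versus a complete one.
print(needs_retry("length", "We first compute the base conversion and then"))
print(needs_retry("stop", "Summing the valid bases gives 70.\nFinal answer: 70"))
```

A flagged response could then be rerun with a larger `max_tokens` value instead of being counted as incorrect.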
---
## When to Use Each Approach
| Use Case | Recommendation |
|----------|----------------|
| Need reasoning trace | Recursive |
| Maximum speed | Baseline |
| Token-limited API | Recursive |
| Complex multi-step | Recursive |
| Simple problems | Either |
---
## Files
- `prompt.py` — Improved prompts based on ablation
- `ablation_results.readme` — Ablation study findings
- `tts/results/aime_recursive_*` — Recursive results
- `tts/results/aime_baseline_*` — Baseline results