[Eval] Benchmark against NanoGPT hackers #11

## Motivation
Why is this evaluation important for HOPE/TITAN reproduction?
To see whether Nested Learning lives up to the hype around emergent learning when compared with autoregressive (AR) and diffusion training methods for basic pre-training.

## Task details
- Dataset / benchmark: https://github.com/KellerJordan/modded-nanogpt
- Metric(s): time-to-perplexity (wall-clock time needed to reach a target validation perplexity), basic LLM benchmarks
- Expected runtime / hardware: same hardware and same dataset as the modded-nanogpt baseline

## Implementation sketch
Outline scripts/flags needed (e.g., extend `scripts/eval/zeroshot.py`).
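
As a starting point for the time-to-perplexity metric, here is a minimal sketch of a tracker that could be called from any training loop after each validation pass. The class name `TimeToPerplexity`, its parameters, and the assumption that validation loss is reported as mean per-token cross-entropy in nats are all hypothetical; nothing here comes from modded-nanogpt or from this repository.

```python
import math
import time


class TimeToPerplexity:
    """Records the first elapsed time at which validation perplexity
    reaches a target value (hypothetical helper, not an existing API)."""

    def __init__(self, target_ppl, clock=time.perf_counter):
        self.target_ppl = target_ppl
        self._clock = clock            # injectable so tests can fake time
        self._start = clock()          # timing starts at construction
        self.history = []              # (elapsed_seconds, step, perplexity)
        self.time_to_target = None     # stays None until the target is hit

    def update(self, step, val_loss):
        """Log one validation result.

        Assumes val_loss is mean per-token cross-entropy in nats,
        so perplexity is exp(val_loss).
        """
        ppl = math.exp(val_loss)
        elapsed = self._clock() - self._start
        self.history.append((elapsed, step, ppl))
        # Only the first crossing counts; later updates do not overwrite it.
        if self.time_to_target is None and ppl <= self.target_ppl:
            self.time_to_target = elapsed
        return ppl
```

In a training script this would be constructed once before the first step, for example `tracker = TimeToPerplexity(target_ppl=math.exp(target_loss))`, with `tracker.update(step, val_loss)` called after every evaluation. Elapsed time here includes evaluation time; if the comparison should count training time only, a custom `clock` that pauses during evaluation would be needed.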

## Acceptance criteria
Describe what needs to be captured (JSON fields, plots, etc.).
Reports showing that heavily optimized autoregressive training cannot outperform Nested Learning under the same wall-clock time and/or FLOP budget: both target the same dataset, and Nested Learning should reach lower perplexity.
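
One possible shape for the captured JSON is sketched below. Every field name, the `RunReport` and `compare` names, and the 5% tolerance used to decide whether two runs had a matched wall-clock budget are assumptions made for illustration, not an agreed schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RunReport:
    """One training run; all field names are hypothetical."""
    method: str                        # e.g. "nested_learning" or "ar_baseline"
    dataset: str
    hardware: str
    wall_clock_s: float                # total training time in seconds
    total_flops: Optional[float]       # None if FLOPs were not measured
    final_val_ppl: float               # validation perplexity at end of run
    time_to_target_s: Optional[float]  # None if the target was never reached


def compare(candidate, baseline, rel_tol=0.05):
    """Build a JSON-serializable comparison of two runs.

    Runs count as wall-clock matched when their times differ by at most
    rel_tol (an assumed 5% by default) of the baseline time.
    """
    gap = abs(candidate.wall_clock_s - baseline.wall_clock_s)
    return {
        "candidate": asdict(candidate),
        "baseline": asdict(baseline),
        "matched_wall_clock": gap <= rel_tol * baseline.wall_clock_s,
        "candidate_lower_ppl": candidate.final_val_ppl < baseline.final_val_ppl,
    }


def dump_report(comparison, path):
    """Write the comparison dict to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(comparison, f, indent=2)
```

A FLOP-matched check could be added the same way once both runs record `total_flops`. The acceptance claim holds for a given pair of runs only when `matched_wall_clock` and `candidate_lower_ppl` are both true.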

