Commit 427cfdd

Merge pull request #88 from FrontierCS/docs/restructure-and-align-backend
docs: restructure documentation and align eval/batch backends
2 parents 1ece738 + 32f4a43 commit 427cfdd

76 files changed

Lines changed: 9147 additions & 615 deletions


.claude/CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Project Rules for Frontier-CS
+
+## Backend Selection
+
+**NEVER change the backend due to missing credentials or CI configuration issues.**
+
+- Research track: always uses SkyPilot (cloud VMs)
+- Algorithmic track: always uses Docker (local)
+
+If CI fails due to credentials/permissions, fix the credentials - do NOT change the code to use a different backend. The backend choice is intentional for each track's evaluation requirements.
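The fixed track-to-backend mapping above can be sketched as a small helper that fails loudly instead of silently falling back. This is an illustrative sketch only; the names `TRACK_BACKENDS` and `backend_for` are hypothetical and not part of the real frontier_cs codebase.

```python
# Hypothetical sketch of the mandated track-to-backend mapping.
TRACK_BACKENDS = {
    "research": "skypilot",   # cloud VMs
    "algorithmic": "docker",  # local containers
}

def backend_for(track: str) -> str:
    """Return the mandated backend; raise rather than fall back on error."""
    try:
        return TRACK_BACKENDS[track]
    except KeyError:
        raise ValueError(f"unknown track: {track!r}") from None
```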

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 ## Summary
 <!-- Brief description of changes -->
 
+> Please read [CONTRIBUTING.md](../CONTRIBUTING.md) before submitting.
 
 ## Type of Change
 - [ ] New research problem
@@ -21,4 +22,4 @@
 ## CI Validation (for new problems)
 > When adding new problems, CI will automatically validate that your reference solution achieves score > 0.
 > - Algorithmic problems: Include `reference.cpp` in your problem directory
-> - Research problems: Include `reference.py` in your problem directory
+> - Research problems: Include `reference.py` (or `reference.cpp` if `language: cpp` in config.yaml)

.github/PULL_REQUEST_TEMPLATE/research_problem.md

Lines changed: 3 additions & 3 deletions
@@ -28,7 +28,7 @@ labels: research-problem
 - [ ] `evaluate.sh` - Evaluation entry point
 - [ ] `evaluator.py` - Scoring logic (outputs 0-100 score)
 - [ ] `resources/` - Problem-specific code/data
-- [ ] `reference.py` - Reference solution **(required for CI)**
+- [ ] `reference.{py,cpp}` - Reference solution **(required for CI, extension matches `language` in config.yaml)**
 
 ### Problem Structure
 ```
@@ -38,15 +38,15 @@ research/{problem_name}/
 ├── set_up_env.sh
 ├── evaluate.sh
 ├── evaluator.py
-├── reference.py       # Required: CI will validate this achieves score > 0
+├── reference.{py,cpp} # Required: CI validates score > 0 (extension per language)
 └── resources/
     └── ...
 ```
 
 ### Testing
 - [ ] Verified `set_up_env.sh` runs successfully
 - [ ] Verified `evaluate.sh` runs and outputs a numeric score
-- [ ] **Reference solution (`reference.py`) achieves score > 0**
+- [ ] **Reference solution achieves score > 0**
 
 **Test Results** (if available):
 ```
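The template above requires that `evaluate.sh` outputs a numeric 0-100 score. As a minimal sketch of what an `evaluator.py` could look like, assuming (hypothetically) that the harness reads the score from stdout; the pass/total counts here are dummies standing in for real test results:

```python
# Illustrative-only evaluator sketch: maps pass counts to a 0-100 score
# and prints it, as the template's "outputs a numeric score" check expects.
def score_solution(passed: int, total: int) -> float:
    """Scale the fraction of passing checks to the 0-100 range."""
    if total <= 0:
        return 0.0
    return 100.0 * passed / total

if __name__ == "__main__":
    # In a real problem these counts would come from running the solution.
    print(score_solution(7, 10))
```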

.github/workflows/validate-problems.yml

Lines changed: 32 additions & 4 deletions
@@ -78,25 +78,53 @@ jobs:
       - name: Install dependencies
         run: uv sync
 
+      - name: Setup AWS credentials
+        env:
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+        run: |
+          mkdir -p ~/.aws
+          cat > ~/.aws/credentials << EOF
+          [default]
+          aws_access_key_id = $AWS_ACCESS_KEY_ID
+          aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
+          EOF
+          cat > ~/.aws/config << EOF
+          [default]
+          region = us-east-1
+          EOF
+          echo "AWS credentials configured"
+
       - name: Setup GCP credentials
         env:
           GCP_CREDS: ${{ secrets.GCP_CREDENTIALS }}
         run: |
           if [ -n "$GCP_CREDS" ]; then
             echo "$GCP_CREDS" > /tmp/gcp-key.json
             echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-key.json" >> $GITHUB_ENV
+            gcloud auth activate-service-account --key-file=/tmp/gcp-key.json
+            gcloud config set project ${{ secrets.GCP_PROJECT_ID }}
             echo "GCP credentials configured"
-          else
-            echo "No GCP credentials available, skipping..."
+          fi
+
+      - name: Generate SSH key for SkyPilot
+        run: |
+          mkdir -p ~/.ssh
+          if [ ! -f ~/.ssh/sky-key ]; then
+            ssh-keygen -t rsa -b 4096 -f ~/.ssh/sky-key -N "" -C "sky-ci"
+            echo "Generated SSH key for SkyPilot"
           fi
 
       - name: Setup SkyPilot
         run: |
-          uv run sky check || echo "SkyPilot check failed, continuing..."
+          uv run sky check aws gcp || echo "SkyPilot check failed, continuing..."
 
       - name: Validate problems
+        timeout-minutes: 30
         run: |
           echo "Validating research problems: ${{ needs.detect-changes.outputs.research }}"
           uv run python scripts/validate_problems.py \
             --track research \
-            --problems ${{ needs.detect-changes.outputs.research }}
+            --timeout 1200 \
+            --problems ${{ needs.detect-changes.outputs.research }} \
+            --verbose
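The AWS step above writes plain INI-style credential and config files via heredocs. To see the layout it produces without touching your real `~/.aws`, a safe local sketch using a scratch directory and dummy values:

```shell
#!/bin/sh
# Sketch of the file layout the CI "Setup AWS credentials" step writes,
# redirected into a temp directory so it is safe to run locally.
AWS_DIR="$(mktemp -d)/aws"
mkdir -p "$AWS_DIR"
cat > "$AWS_DIR/credentials" << EOF
[default]
aws_access_key_id = DUMMY_KEY_ID
aws_secret_access_key = DUMMY_SECRET
EOF
cat > "$AWS_DIR/config" << EOF
[default]
region = us-east-1
EOF
echo "wrote dummy AWS config under $AWS_DIR"
```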

.github/workflows/weekly-eval.yml

Lines changed: 2 additions & 5 deletions
@@ -100,9 +100,7 @@ jobs:
             --track research \
             --internal-dir internal \
             --results-repo results-repo \
-            --workers $WORKERS \
-            --clusters $CLUSTERS \
-            --skypilot \
+            -j $CLUSTERS \
             --push
 
       - name: Run algorithmic evaluation
@@ -116,8 +114,7 @@
             --track algorithmic \
             --internal-dir internal \
             --results-repo results-repo \
-            --workers $WORKERS \
-            --skypilot \
+            -j $WORKERS \
             --push
 
       - name: Upload results artifact

CONTRIBUTING.md

Lines changed: 10 additions & 6 deletions
@@ -1,6 +1,8 @@
 # Contributing to Frontier-CS
 
-Frontier-CS is currently an **invitation-only** project for new problems.
+> **For Problem Contributors**: Guidelines for creating and submitting new problems to Frontier-CS.
+
+Frontier-CS is currently an **invitation-only** project for new problems.
 Please create a GitHub pull request (PR) with your proposed problem following the guidelines below. After your PR is reviewed and merged, please send any hidden test data and reference solutions to the contact email provided at the end of this document.
 
 
@@ -130,11 +132,11 @@ research/{problem_name}/
 ├── evaluate.sh           # Evaluation entry point
 ├── evaluator.py          # Scoring logic
 ├── readme                # Problem description
-├── reference.py          # Reference solution (required for CI validation)
+├── reference.{py,cpp}    # Reference solution (required for CI, extension per language)
 └── resources/            # Problem-specific code/data
 ```
 
-> **Note**: The `reference.py` is required for CI validation. When you submit a PR, the CI will automatically run your reference solution and verify it achieves score > 0.
+> **Note**: A reference solution is required for CI validation. Use `reference.py` for Python problems or `reference.cpp` if `language: cpp` in config.yaml. The CI will automatically run your reference solution and verify it achieves score > 0.
 
 ### Solution Interface
 
@@ -331,10 +333,12 @@ When you submit a PR that adds or modifies problems, CI will automatically valid
 | Track | File | Location |
 |-------|------|----------|
 | Algorithmic | `reference.cpp` | `algorithmic/problems/{id}/reference.cpp` |
-| Research | `reference.py` | `research/problems/{name}/reference.py` |
+| Research | `reference.{py,cpp}` | `research/problems/{name}/reference.{ext}` (extension per `language` in config.yaml) |
 
 If the reference solution is missing or scores 0, the PR will be blocked from merging.
 
+> **Important**: The reference solution must achieve score > 0. This is a design choice to ensure the evaluator is working correctly - a score > 0 proves that the evaluation pipeline can successfully compile/run the solution and produce a valid score. If the reference only scores 0, we cannot distinguish between "evaluator error" and "valid solution with no improvement". For problems that measure speedup against a baseline, the reference must be **faster than the baseline**, not just a copy of it.
+
 ### Local Testing
 
 Before submitting a PR, test your reference solution locally:
@@ -343,8 +347,8 @@ Before submitting a PR, test your reference solution locally:
 # Algorithmic
 frontier eval algorithmic {id} algorithmic/problems/{id}/reference.cpp
 
-# Research
-frontier eval research {name} research/problems/{name}/reference.py
+# Research (use .py or .cpp based on problem's language config)
+frontier eval research {name} research/problems/{name}/reference.{ext}
 ```
 
 ## Contact
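The "Important" note added in this diff argues that a reference which merely matches the baseline must score 0. A minimal sketch of that rule for speedup-style problems, assuming a hypothetical scoring function (the name `speedup_score` and the linear scaling are illustrative, not the actual evaluator logic):

```python
# Hedged sketch: score scales with speedup over the baseline; a reference
# that is not strictly faster than the baseline scores 0, so CI would
# block it - exactly the behavior the note above describes.
def speedup_score(baseline_s: float, solution_s: float, cap: float = 100.0) -> float:
    """Return a 0-100 score from measured runtimes (seconds)."""
    if baseline_s <= 0 or solution_s <= 0:
        return 0.0  # invalid measurement: indistinguishable from evaluator error
    speedup = baseline_s / solution_s
    if speedup <= 1.0:
        return 0.0  # a copy of the baseline lands here and blocks the PR
    return min(cap, cap * (speedup - 1.0))
```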

README.md

Lines changed: 9 additions & 13 deletions
@@ -150,9 +150,9 @@ frontier eval algorithmic 1 <your_solution.cpp> --unbounded
 ### Python API
 
 ```python
-from frontier_cs import FrontierCSEvaluator
+from frontier_cs import SingleEvaluator
 
-evaluator = FrontierCSEvaluator()
+evaluator = SingleEvaluator()
 
 # Evaluate a research problem
 result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
@@ -195,28 +195,24 @@ research/solutions/
 
 ```bash
 # Evaluate all research solutions (uses SkyPilot by default)
-uv run frontier-eval batch research
+frontier batch research
 
 # Evaluate all algorithmic solutions (uses Docker by default)
-uv run frontier-eval batch algorithmic
+frontier batch algorithmic
 
 # Filter by model or problem
-uv run frontier-eval batch research --model gpt5.1
-uv run frontier-eval batch research --problem flash_attn
-uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
+frontier batch research --model gpt5.1
+frontier batch research --problem flash_attn
 
 # Override default backend
-uv run frontier-eval batch research --backend docker
-uv run frontier-eval batch algorithmic --backend skypilot
+frontier batch research --backend docker
+frontier batch algorithmic --backend skypilot
 ```
 
 **Custom solutions directory:** You can test solutions from a custom directory with the same structure:
 
 ```bash
-# Your custom directory should have the same structure:
-# my_solutions/{problem}/{model}.py
-
-uv run frontier-eval batch research --solutions-dir ./my_solutions
+frontier batch research --solutions-dir ./my_solutions
 ```
 
 Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
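The state-file behavior described above (skip already-evaluated pairs, re-run when content changes) can be illustrated with a toy hash-based cache. This is a sketch only: the real batch runner's state-file format and field names are not shown in this commit, so `pairs_to_run` and the dict-based state are hypothetical stand-ins.

```python
# Toy model of incremental batch evaluation: a (solution, problem) pair is
# scheduled only if its content hash differs from the one recorded last run.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def pairs_to_run(pairs, state):
    """pairs: {(solution, problem): source_text}; state: {key: hash} from a prior run.
    Returns the pairs needing evaluation and updates state in place."""
    todo = []
    for (solution, problem), src in pairs.items():
        key = f"{solution}:{problem}"
        h = content_hash(src)
        if state.get(key) != h:  # new pair, or solution content changed
            todo.append((solution, problem))
            state[key] = h
    return todo
```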

SUBMIT.md

Lines changed: 19 additions & 17 deletions
@@ -1,6 +1,6 @@
 # Evaluating Your Model
 
-Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
+> **For Model Providers**: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
 
 ## Step 1: Prepare Solutions
 
@@ -19,7 +19,7 @@ research/solutions/gemm_optimization/squares/my_model.py
 algorithmic/solutions/1/my_model.cpp
 ```
 
-- **Research track**: Python (`.py`)
+- **Research track**: Python (`.py`) by default, or C++ (`.cpp`) if problem specifies `language: cpp` in config.yaml
 - **Algorithmic track**: C++17 (`.cpp`)
 - We recommend generating **5 variants per model** to compute Score@5
 
@@ -36,7 +36,7 @@ research/solutions/
 └── ...
 ```
 ```bash
-frontier-eval batch research --model my_model
+frontier batch research --model my_model
 ```
 
 **2. Use your own directory**
@@ -48,7 +48,7 @@ frontier-eval batch research --model my_model
 └── ...
 ```
 ```bash
-frontier-eval batch research --solutions-dir ./my_solutions
+frontier batch research --solutions-dir ./my_solutions
 ```
 
 **3. Explicit pairs file**
@@ -59,39 +59,39 @@ frontier-eval batch research --solutions-dir ./my_solutions
 ./my_solutions/cross_entropy/my_model.py:cross_entropy
 ```
 ```bash
-frontier-eval batch research --pairs-file pairs.txt
+frontier batch research --pairs-file pairs.txt
 ```
 
 ### Backend Options
 
 ```bash
 # Research defaults to SkyPilot, algorithmic defaults to Docker
-frontier-eval batch research --backend docker
-frontier-eval batch algorithmic --backend skypilot
+frontier batch research --backend docker
+frontier batch algorithmic --backend skypilot
 
 # Parallelism
-frontier-eval batch research --workers 20 --clusters 4
+frontier batch research --workers 20 --clusters 4
 ```
 
 ### Result Storage
 
 ```bash
 # Local (default): results saved to ./results/batch/{track}/
-frontier-eval batch research
+frontier batch research
 
 # Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
-frontier-eval batch research --bucket-url s3://my-bucket/results
+frontier batch research --bucket-url s3://my-bucket/results
 
 # Sync from bucket to local
-frontier-eval batch research --bucket-url s3://my-bucket/results --sync-bucket
+frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
 ```
 
 ### Control Options
 
 ```bash
-frontier-eval batch research --status        # Check status
-frontier-eval batch research --no-resume     # Force re-evaluate all
-frontier-eval batch research --retry-failed  # Retry failed (including score=0)
+frontier batch research --status        # Check status
+frontier batch research --no-resume     # Force re-evaluate all
+frontier batch research --retry-failed  # Retry failed (including score=0)
 ```
 
 - Incremental evaluation with hash-based caching (solution/problem changes trigger re-evaluation)
@@ -114,7 +114,7 @@ We welcome submissions from all models and agent frameworks. To have your result
 
 ### Algorithmic Problems
 
-We currently release **1 -- 3 public test case** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
+We currently release **1-3 public test cases** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
 
 #### What to Submit
 
@@ -174,7 +174,7 @@ Problem (e.g., gemm_optimization, poc_generation)
 
 Each variant has a unique **Problem ID** based on its path under `research/`.
 
-The full list of all evaluatable variants is in [`research/problems.txt`](research/problems.txt) (109 variants total, aggregated into ~50 categories for reporting).
+The full list of all evaluatable variants is in [`research/scripts/problems.txt`](research/scripts/problems.txt).
 
 | Type | Example Path | Problem ID |
 |------|-------------|------------|
@@ -309,7 +309,9 @@ export GOOGLE_API_KEY=...
 
 ### Generate Solutions
 
-#### Research Track (Python)
+#### Research Track
+
+Most research problems are Python, but some (e.g., `nbody_simulation`) require C++. The language is configured per-problem via `language` field in `config.yaml`.
 
 ```bash
 # Generate one solution
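SUBMIT.md's "Explicit pairs file" option uses one `path:problem_id` entry per line (e.g. `./my_solutions/cross_entropy/my_model.py:cross_entropy`). A minimal parser sketch for that format; the function name `parse_pairs` and the comment/blank-line handling are assumptions, not the actual frontier_cs implementation:

```python
# Illustrative parser for the pairs-file format shown in SUBMIT.md.
# Splitting on the LAST ":" keeps any colons inside the path intact.
def parse_pairs(text: str):
    """Return [(solution_path, problem_id), ...] from pairs-file text."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed: skip blanks/comments
            continue
        path, _, problem = line.rpartition(":")
        if path and problem:
            pairs.append((path, problem))
    return pairs
```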
