For Model Providers: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
Place your solutions in the correct directory structure:
```
{track}/solutions/{problem}/{model}.{ext}
{track}/solutions/{problem}/{model}_{variant}.{ext}
```
Examples:

```
research/solutions/flash_attn/my_model.py
research/solutions/flash_attn/my_model_1.py   # variant 1
research/solutions/gemm_optimization/squares/my_model.py
algorithmic/solutions/1/my_model.cpp
```
- Research track: Python (`.py`) by default, or C++ (`.cpp`) if the problem specifies `language: cpp` in `config.yaml`
- Algorithmic track: C++17 (`.cpp`)
- We recommend generating 5 variants per model to compute Score@5
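In code, the layout rule above amounts to simple string formatting. A minimal sketch (the `solution_path` helper is illustrative, not part of the repo):

```python
def solution_path(track, problem, model, ext, variant=None):
    """Build {track}/solutions/{problem}/{model}[_{variant}].{ext}.

    Hypothetical helper mirroring the documented naming scheme:
    index-0 solutions omit the variant suffix.
    """
    name = model if variant is None else f"{model}_{variant}"
    return f"{track}/solutions/{problem}/{name}.{ext}"
```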
Suppose you have a new model, `my_model`, that you want to evaluate. There are three ways:
1. Put solutions in the `solutions/` directory

```
research/solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...
```

```shell
frontier batch research --model my_model
```

2. Use your own directory
```
./my_solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...
```

```shell
frontier batch research --solutions-dir ./my_solutions
```

3. Explicit pairs file
```
# pairs.txt
./my_solutions/flash_attn/my_model.py:flash_attn
./my_solutions/cross_entropy/my_model.py:cross_entropy
```
```shell
frontier batch research --pairs-file pairs.txt
```

```shell
# Research defaults to SkyPilot, algorithmic defaults to Docker
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

# Parallelism
frontier batch research --workers 20 --clusters 4
```

```shell
# Local (default): results saved to ./results/batch/{track}/
frontier batch research

# Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
frontier batch research --bucket-url s3://my-bucket/results

# Sync from bucket to local
frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
```

```shell
frontier batch research --status        # Check status
frontier batch research --no-resume     # Force re-evaluate all
frontier batch research --retry-failed  # Retry failed (including score=0)
```

- Incremental evaluation with hash-based caching (changes to a solution or problem trigger re-evaluation)
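The hash-based caching can be pictured as keying results by content hashes. A minimal sketch, where the `file_hash`/`needs_eval` helpers and the cache shape are assumptions, not the repo's actual implementation:

```python
import hashlib

def file_hash(path):
    # Hash file contents, so edits (not timestamps) trigger re-evaluation
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_eval(solution_path, problem_hash, cache):
    """Return True when no cached result matches the current content hashes."""
    key = (file_hash(solution_path), problem_hash)
    return key not in cache
```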
Results from public test case evaluation are saved to `./results/batch/{track}/`:

| File | Content |
|---|---|
| `results.csv` | All evaluation results |
| `by_model.csv` | Score@1, Avg@5, Score@5 per model |
| `by_problem.csv` | Scores per problem |
| `failed.txt` | Failed evaluations |
| `pending.txt` | Pending evaluations |
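As a sketch of how the `by_model.csv` metrics can be derived, assuming Score@1 uses the first variant's score, Avg@5 the mean, and Score@5 the best of the five variant scores (these definitions are an interpretation, not the repo's exact code):

```python
def model_metrics(scores):
    """scores: per-variant scores for one (model, problem) pair, index 0 first."""
    return {
        "Score@1": scores[0],            # first variant only
        "Avg@5": sum(scores) / len(scores),  # mean over variants
        "Score@5": max(scores),          # best-of-n
    }
```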
We welcome submissions from all models and agent frameworks. To have your results included in our leaderboard, please follow the instructions below.
We currently release 1-3 public test cases per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
- Solution files: `{problem_id}_{model_name}_solution.cpp` for each problem
- Model/Agent info: Name and version of the model or agent framework used
- Generation method: Brief description of how solutions were generated (e.g., one-shot, multi-turn, with/without feedback)
Organize your solutions as:

```
submissions/
├── 1_gpt4_solution.cpp
├── 2_gpt4_solution.cpp
├── ...
└── metadata.json
```
metadata.json:

```json
{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "notes": "Optional additional notes"
}
```

Research problems require a `solution.py` file implementing the `Solution` class interface.
Research problems follow a hierarchical structure:

```
Problem (e.g., gemm_optimization, poc_generation)
└── Category (e.g., squares, heap_buffer_overflow)
    └── Variant (e.g., arvo_21000)
```
| Level | Example | Description |
|---|---|---|
| Problem | `gemm_optimization` | Top-level problem domain |
| Category | `gemm_optimization/squares` | Scores are aggregated at this level for leaderboard reporting |
| Variant | `poc_generation/heap_buffer_overflow/arvo_21000` | Each variant is evaluated independently with its own README |
Key distinction:
- Evaluation: Each variant runs independently and produces its own score
- Reporting: Scores are aggregated by category for the leaderboard (e.g., all `heap_buffer_overflow` variants → one score)

Note: Some problems have only one level (e.g., `flash_attn`), which functions as both category and variant.
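The evaluation/reporting split can be sketched as follows, assuming the category score is the mean over its variant scores (the aggregation rule and helper are illustrative, not the repo's code):

```python
from collections import defaultdict

def aggregate_by_category(variant_scores, categories):
    """Average variant scores up to their category.

    variant_scores: {variant_id: score}; categories: known category IDs.
    A variant not under any known category (single-level problems like
    flash_attn) acts as its own category.
    """
    buckets = defaultdict(list)
    for vid, score in variant_scores.items():
        cat = next((c for c in categories if vid == c or vid.startswith(c + "/")), vid)
        buckets[cat].append(score)
    return {c: sum(s) / len(s) for c, s in buckets.items()}
```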
Each variant has a unique Problem ID based on its path under `research/`.
The full list of all evaluatable variants is in `research/scripts/problems.txt`.
| Type | Example Path | Problem ID |
|---|---|---|
| Single problem | `research/flash_attn` | `flash_attn` |
| Problem with variants | `research/gemm_optimization/squares` | `gemm_optimization/squares` |
| Nested variants | `research/poc_generation/heap_buffer_overflow/arvo_21000` | `poc_generation/heap_buffer_overflow/arvo_21000` |
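Since the Problem ID is just the path relative to `research/`, the mapping is a one-liner (a sketch; the helper name is hypothetical):

```python
def problem_id(path):
    """Derive the Problem ID from a path under research/ (illustrative helper)."""
    prefix = "research/"
    return path[len(prefix):] if path.startswith(prefix) else path
```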
- Solution files: `solution.py` for each problem, placed in a directory matching the Problem ID
- Model/Agent info: Name and version of the model or agent framework used
- Local evaluation results (optional but recommended): Score from running the evaluator locally
Your submission zip should mirror the Problem ID directory structure:
```
submission.zip
├── flash_attn/
│   └── solution.py
├── gemm_optimization/
│   └── squares/
│       └── solution.py
├── cant_be_late/
│   └── high_availability_loose_deadline/
│       └── solution.py
├── poc_generation/
│   └── heap_buffer_overflow/
│       └── arvo_21000/
│           └── solution.py
└── metadata.json
```
Important: The directory structure must exactly match the Problem ID. For example:
- `flash_attn/solution.py`
- `gemm_optimization/squares/solution.py`
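A sketch of packaging an archive that mirrors the Problem ID layout, using the stdlib `zipfile` module (the `build_submission` helper, its input shapes, and the metadata content are placeholders):

```python
import json
import zipfile

def build_submission(zip_path, solutions, metadata):
    """solutions: {problem_id: local_solution_file}.

    Writes each file as {problem_id}/solution.py inside the archive,
    plus a metadata.json at the root.
    """
    with zipfile.ZipFile(zip_path, "w") as zf:
        for pid, local_path in solutions.items():
            zf.write(local_path, arcname=f"{pid}/solution.py")
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
```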
Each `solution.py` must implement:

```python
class Solution:
    def __init__(self):
        pass

    def solve(self, *args):
        # Returns: solution output (format varies by problem)
        pass
```

metadata.json:

```json
{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "problems_solved": [
    "flash_attn",
    "gemm_optimization/squares",
    "cant_be_late/high_availability_loose_deadline"
  ],
  "notes": "Optional additional notes"
}
```

Send your submission to:
- Email: qmang@berkeley.edu or wenhao.chai@princeton.edu
Please include:
- A zip/tar archive of your solutions following the format above
- `metadata.json` with model and method information
- (Optional) Local evaluation results if you ran them
Accepted submissions will be evaluated on our full test suite and results will be published on the Frontier-CS Leaderboard.
After you submit, maintainers evaluate your solutions against the full private test suite. This runs automatically via weekly CI or manually by maintainers:
```shell
./scripts/run_eval.sh --track research
./scripts/run_eval.sh --track algorithmic
```

Options:
- `-j N`: Parallelism (default: 10)
- `--force`: Force re-evaluate all
- `--no-push`: Don't push results
Results are saved to the `Frontier-CS-Result/` repository and published to the leaderboard.
If you want to use our scripts to batch-generate solutions with LLMs:
models.txt (`research/scripts/models.txt` or `algorithmic/scripts/models.txt`)
- One model name per line
- Supported formats: `gpt-5`, `claude-sonnet-4-5`, `gemini/gemini-2.5-pro`, `xai/grok-4`, `deepseek/deepseek-reasoner`
indices.txt
- Controls how many variants to generate per (model, problem) pair
- Single number N = generate indices 0 to N-1
- Multiple lines = specify explicit indices
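The two accepted forms of indices.txt can be interpreted as follows (a sketch, not the script's exact parser):

```python
def parse_indices(lines):
    """Single number N -> indices 0..N-1; multiple lines -> explicit indices."""
    values = [int(x) for x in lines if x.strip()]
    if len(values) == 1:
        return list(range(values[0]))
    return values
```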
API Keys
Set environment variables for the providers you need. Multiple keys per provider are supported for load balancing (e.g., OPENAI_API_KEY, OPENAI_API_KEY2, OPENAI_API_KEY_2).
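One way to picture the multiple-keys convention (a sketch; the actual lookup logic lives in the repo, and `provider_keys` is a hypothetical helper):

```python
import os
import re

def provider_keys(prefix, env=None):
    """Collect OPENAI_API_KEY, OPENAI_API_KEY2, OPENAI_API_KEY_2, ...

    Returns every key whose name is the prefix plus an optional
    (possibly underscore-separated) numeric suffix, for load balancing.
    """
    env = os.environ if env is None else env
    pattern = re.compile(rf"^{re.escape(prefix)}(_?\d+)?$")
    return [v for k, v in sorted(env.items()) if pattern.match(k)]
```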
| Provider | Environment Variable | Models |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | gpt-4o, gpt-5, o1, o3, ... |
| Anthropic | `ANTHROPIC_API_KEY` | claude-sonnet-4-5, claude-opus-4, ... |
| Google | `GOOGLE_API_KEY` | gemini-2.5-pro, gemini-2.5-flash, ... |
| xAI | `XAI_API_KEY` | grok-3, grok-3-mini, ... |
| DeepSeek | `DEEPSEEK_API_KEY` | deepseek-r1, deepseek-chat, ... |
| OpenRouter | `OPENROUTER_API_KEY` | openrouter/* models |
```shell
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-...
export GOOGLE_API_KEY=...
```

Most research problems are Python, but some (e.g., nbody_simulation) require C++. The language is configured per problem via the `language` field in `config.yaml`.
```shell
# Generate one solution
python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5 --indices 1

# Preview what would be generated
python research/scripts/generate_solutions.py --dryrun
```

```shell
python algorithmic/scripts/generate_solutions.py --model gpt-5
```

Problem mode (generate new solutions):

```shell
python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5
```

Generates problems × models × indices (Cartesian product):
- Problems: `--problem` patterns or `--problems-file` (default: auto-discover all problems)
- Models: `--model` list or `--models-file` (default: `models.txt`)
- Indices: `--indices N` or `--indices-file` (default: `indices.txt` or single solution)
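The Cartesian expansion can be sketched with `itertools.product` (illustrative only; the script's task representation may differ):

```python
from itertools import product

def generation_tasks(problems, models, indices):
    """Expand problems x models x indices into one task per combination."""
    return [
        {"problem": p, "model": m, "index": i}
        for p, m, i in product(problems, models, indices)
    ]
```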
Solution naming: `{problem}/{model}.py` for index 0, `{problem}/{model}_{i}.py` for index i.
Solution mode (regenerate existing solutions):

```shell
python research/scripts/generate_solutions.py --solution "flash_attn/gpt5*" --force
```

- Matches existing solutions in `solutions/` by pattern
- Model inferred from solution filename (e.g., `flash_attn/gpt5.py` → model `gpt5`)
- Requires `--force` since solutions already exist
- Still needs `models.txt` or `--model` to map prefix to model name
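The model-from-filename inference amounts to stripping an optional `_N` variant suffix. A sketch under that assumption (the script's real logic may differ):

```python
import re

def infer_model(filename):
    """'flash_attn/gpt5_1.py' -> 'gpt5'; index-0 files carry no suffix."""
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return re.sub(r"_\d+$", "", stem)
```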
| Option | Description |
|---|---|
| `--problem` / `--problems-file` | Problem pattern or file (default: auto-discover) |
| `--model` / `--models-file` | Model(s) or file (default: `models.txt`) |
| `--indices` / `--indices-file` | Solution indices count or file (default: `indices.txt`) |
| `--solution PATTERN` | Regenerate existing solutions by pattern (mutually exclusive with `--problem`) |
| `--force` | Overwrite existing solutions |
| `--dryrun` | Preview without generating |
| `--concurrency N` | Parallel API calls |
| `--timeout SECONDS` | API timeout (default: 600s) |
Solutions are saved in nested directories under `solutions/`:

```
solutions/
├── flash_attn/
│   ├── gpt5.py
│   ├── gpt5_1.py
│   └── claude4.5sonnet.py
└── cross_entropy/
    └── gpt5.py
```
```shell
python research/scripts/check_solutions.py
```

Shows:
- Expected: models × problems × variants
- Generated: expected AND exists
- Missing: expected but NOT exists
- Failed: `.FAILED` marker files (generation errors)
- Extra: exists but NOT expected
- Empty: file exists but content is empty
Outputs a coverage progress bar and exports `problems.txt`.
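The coverage categories above amount to set comparisons between expected and on-disk solutions. A sketch (the `classify` helper and its input shapes are assumptions, not the script's code):

```python
def classify(expected, existing, failed_markers, empty):
    """Bucket solution coverage the way check_solutions.py reports it.

    expected/existing: sets of solution IDs; failed_markers: IDs with a
    .FAILED marker file; empty: IDs whose file exists but is empty.
    """
    return {
        "generated": sorted(expected & existing),
        "missing": sorted(expected - existing),
        "extra": sorted(existing - expected),
        "failed": sorted(failed_markers),
        "empty": sorted(empty),
    }
```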
If you want to modify our scripts:

- Use an OpenAI-compatible API (e.g., Azure, local models)
  - Modify the `base_url` parameter in `src/frontier_cs/gen/llm.py` (`instantiate_llm_client`)
  - Or pass `base_url` when initializing the `GPT` class in `llm_interface.py`
  - DeepSeek, Grok, etc. are already implemented using the OpenAI SDK with different `base_url` values

- Add a new LLM provider
  - Add a new class in `src/frontier_cs/gen/llm_interface.py` (inherit `LLMInterface`, implement `call_llm`)
  - Add provider handling in `src/frontier_cs/gen/llm.py` (`instantiate_llm_client`)

- Add model prefix mapping
  - Edit `get_model_prefix()` in `src/frontier_cs/models.py` to map model name → file prefix
  - Example: `claude-sonnet-4-5-20250929` → `claude4.5sonnet`

- Modify prompt templates
  - Research: system prompt in `research/scripts/generate_solutions.py`
  - Algorithmic: `CPP_SYSTEM_PROMPT` in `algorithmic/scripts/generate_solutions.py`

- Customize solution filename format
  - `src/frontier_cs/gen/solution_format.py`
  - `src/frontier_cs/models.py`: `get_solution_filename()`, `get_solution_path()`