
Evaluating Your Model

For Model Providers: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.

Step 1: Prepare Solutions

Place your solutions in the correct directory structure:

{track}/solutions/{problem}/{model}.{ext}
{track}/solutions/{problem}/{model}_{variant}.{ext}

Examples:

research/solutions/flash_attn/my_model.py
research/solutions/flash_attn/my_model_1.py      # variant 1
research/solutions/gemm_optimization/squares/my_model.py
algorithmic/solutions/1/my_model.cpp

  • Research track: Python (.py) by default, or C++ (.cpp) if the problem specifies language: cpp in its config.yaml
  • Algorithmic track: C++17 (.cpp)
  • We recommend generating 5 variants per model to compute Score@5
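Assuming Score@1 is the score of the index-0 variant, Avg@5 the mean over the 5 variants, and Score@5 the best of the 5 (a hedged reading of the metric names; check the evaluator for the exact definitions), the aggregation looks like:

```python
# Hypothetical per-variant scores for one (model, problem) pair.
variant_scores = [0.62, 0.71, 0.58, 0.74, 0.69]

score_at_1 = variant_scores[0]              # score of the first (index-0) variant
avg_at_5 = sum(variant_scores) / len(variant_scores)
score_at_5 = max(variant_scores)            # best of the 5 variants
```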

Step 2: Run Evaluation

Suppose you have a new model, my_model, and want to evaluate it. There are three ways to supply solutions:

1. Put solutions in solutions/ directory

research/solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...

frontier batch research --model my_model

2. Use your own directory

./my_solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...

frontier batch research --solutions-dir ./my_solutions

3. Explicit pairs file

# pairs.txt
./my_solutions/flash_attn/my_model.py:flash_attn
./my_solutions/cross_entropy/my_model.py:cross_entropy

frontier batch research --pairs-file pairs.txt
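If your solutions already live in the {problem}/{model}.py layout shown above, a pairs file can be generated mechanically. A minimal sketch (build_pairs is a hypothetical helper, not part of the CLI):

```python
from pathlib import Path

def build_pairs(solutions_dir, model="my_model"):
    """Return 'path:problem' lines for each {problem}/{model}.py file."""
    root = Path(solutions_dir)
    return [
        f"{p}:{p.parent.name}"
        for p in sorted(root.glob(f"*/{model}.py"))
    ]

# Path("pairs.txt").write_text("\n".join(build_pairs("./my_solutions")) + "\n")
```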

Backend Options

# Research defaults to SkyPilot, algorithmic defaults to Docker
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

# Parallelism
frontier batch research --workers 20 --clusters 4

Result Storage

# Local (default): results saved to ./results/batch/{track}/
frontier batch research

# Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
frontier batch research --bucket-url s3://my-bucket/results

# Sync from bucket to local
frontier batch research --bucket-url s3://my-bucket/results --sync-bucket

Control Options

frontier batch research --status          # Check status
frontier batch research --no-resume       # Force re-evaluate all
frontier batch research --retry-failed    # Retry failed (including score=0)

  • Incremental evaluation with hash-based caching: changes to a solution or problem trigger re-evaluation
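The hash-based caching can be pictured as follows (a sketch, not the CLI's actual implementation): a fingerprint over the solution and problem contents decides whether a cached result is still valid.

```python
import hashlib

def eval_fingerprint(solution_src: str, problem_spec: str) -> str:
    """Hash solution + problem together; any change invalidates the cache."""
    h = hashlib.sha256()
    h.update(solution_src.encode())
    h.update(b"\x00")                      # separator so fields can't bleed together
    h.update(problem_spec.encode())
    return h.hexdigest()

cache = {}  # fingerprint -> score

def evaluate_cached(solution_src, problem_spec, run_eval):
    key = eval_fingerprint(solution_src, problem_spec)
    if key not in cache:                   # re-evaluate only when something changed
        cache[key] = run_eval(solution_src, problem_spec)
    return cache[key]
```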

Step 3: View Results

Results from public test case evaluation are saved to ./results/batch/{track}/:

| File | Content |
| --- | --- |
| results.csv | All evaluation results |
| by_model.csv | Score@1, Avg@5, Score@5 per model |
| by_problem.csv | Scores per problem |
| failed.txt | Failed evaluations |
| pending.txt | Pending evaluations |
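A quick way to inspect the per-model summary (the column names here are assumed from the description above; adjust to the actual CSV header):

```python
import csv

def load_by_model(path):
    """Read by_model.csv into a list of row dicts keyed by the header."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Example: print each model's Score@5.
# for row in load_by_model("results/batch/research/by_model.csv"):
#     print(row["model"], row["Score@5"])
```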

Step 4: Submit to Leaderboard

We welcome submissions from all models and agent frameworks. To have your results included in our leaderboard, please follow the instructions below.

Algorithmic Problems

We currently release 1-3 public test cases per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.

What to Submit

  1. Solution files: {problem_id}_{model_name}_solution.cpp for each problem
  2. Model/Agent info: Name and version of the model or agent framework used
  3. Generation method: Brief description of how solutions were generated (e.g., one-shot, multi-turn, with/without feedback)

Submission Format

Organize your solutions as:

submissions/
├── 1_gpt4_solution.cpp
├── 2_gpt4_solution.cpp
├── ...
└── metadata.json

metadata.json:

{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "notes": "Optional additional notes"
}

Research Problems

Research problems require a solution.py file implementing the Solution class interface.

Problem Structure

Research problems follow a hierarchical structure:

Problem (e.g., gemm_optimization, poc_generation)
└── Category (e.g., squares, heap_buffer_overflow)
    └── Variant (e.g., arvo_21000)
| Level | Example | Description |
| --- | --- | --- |
| Problem | gemm_optimization | Top-level problem domain |
| Category | gemm_optimization/squares | Scores are aggregated at this level for leaderboard reporting |
| Variant | poc_generation/heap_buffer_overflow/arvo_21000 | Each variant is evaluated independently with its own README |

Key distinction:

  • Evaluation: Each variant runs independently and produces its own score
  • Reporting: Scores are aggregated by category for the leaderboard (e.g., all heap_buffer_overflow variants → one score)

Note: Some problems have only one level (e.g., flash_attn), which functions as both category and variant.
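The evaluation/reporting split can be sketched as follows. Taking the category as the first two path components (one component for single-level problems) and the mean as the aggregate are assumptions here; the leaderboard may use a different aggregation:

```python
from collections import defaultdict

def aggregate_by_category(variant_scores):
    """variant_scores: {problem_id: score}. Group independently evaluated
    variants into their category and average within each group."""
    buckets = defaultdict(list)
    for pid, score in variant_scores.items():
        category = "/".join(pid.split("/")[:2])   # assumed category rule
        buckets[category].append(score)
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}
```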

Problem ID Format

Each variant has a unique Problem ID based on its path under research/.

The full list of all evaluatable variants is in research/scripts/problems.txt.

| Type | Example Path | Problem ID |
| --- | --- | --- |
| Single problem | research/flash_attn | flash_attn |
| Problem with variants | research/gemm_optimization/squares | gemm_optimization/squares |
| Nested variants | research/poc_generation/heap_buffer_overflow/arvo_21000 | poc_generation/heap_buffer_overflow/arvo_21000 |
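As the examples show, deriving a Problem ID from a path is just a prefix strip:

```python
def problem_id(path: str) -> str:
    """Map a path like 'research/gemm_optimization/squares' to its Problem ID."""
    prefix = "research/"
    return path[len(prefix):] if path.startswith(prefix) else path
```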

What to Submit

  1. Solution files: solution.py for each problem, placed in a directory matching the Problem ID
  2. Model/Agent info: Name and version of the model or agent framework used
  3. Local evaluation results (optional but recommended): Score from running the evaluator locally

Submission Format

Your submission zip should mirror the Problem ID directory structure:

submission.zip
├── flash_attn/
│   └── solution.py
├── gemm_optimization/
│   └── squares/
│       └── solution.py
├── cant_be_late/
│   └── high_availability_loose_deadline/
│       └── solution.py
├── poc_generation/
│   └── heap_buffer_overflow/
│       └── arvo_21000/
│           └── solution.py
└── metadata.json

Important: The directory structure must exactly match the Problem ID. For example:

  • flash_attn/solution.py
  • gemm_optimization/squares/solution.py

Each solution.py must implement:

class Solution:
    def __init__(self):
        pass

    def solve(self, *args):
        # Returns: solution output (format varies by problem)
        pass
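As a concrete illustration, consider a hypothetical problem whose expected output is the sum of its numeric inputs (real solve signatures and return formats vary by problem):

```python
class Solution:
    """Hypothetical example only; see each problem's README for the real spec."""

    def __init__(self):
        pass

    def solve(self, *args):
        # For this made-up problem, the expected output is the sum of the inputs.
        return sum(args)

# The evaluator instantiates the class and calls solve with problem-specific args:
# Solution().solve(1, 2, 3)  -> 6
```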

metadata.json

{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "problems_solved": [
    "flash_attn",
    "gemm_optimization/squares",
    "cant_be_late/high_availability_loose_deadline"
  ],
  "notes": "Optional additional notes"
}

How to Submit

Send your submission to:

Please include:

  1. A zip/tar archive of your solutions following the format above
  2. metadata.json with model and method information
  3. (Optional) Local evaluation results if you ran them
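One way to package the archive so its internal paths start at the Problem ID level (flash_attn/solution.py, not submission/flash_attn/solution.py) is to zip from the root of your staging directory. A sketch using the standard library (package_submission is a hypothetical helper):

```python
import shutil

def package_submission(src_dir="submission", out="submission"):
    """Create out.zip whose internal paths start at the Problem ID level
    (e.g. flash_attn/solution.py), because src_dir itself is the archive root."""
    return shutil.make_archive(out, "zip", root_dir=src_dir)
```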

Leaderboard

Accepted submissions will be evaluated on our full test suite and results will be published on the Frontier-CS Leaderboard.

How We Evaluate Submissions

After you submit, your solutions are evaluated against the full private test suite, either automatically via weekly CI or manually by maintainers:

./scripts/run_eval.sh --track research
./scripts/run_eval.sh --track algorithmic

Options:

  • -j N: Parallelism (default: 10)
  • --force: Force re-evaluate all
  • --no-push: Don't push results

Results are saved to Frontier-CS-Result/ repository and published to the leaderboard.


Using Our Generation Scripts (Optional)

If you want to use our scripts to batch-generate solutions with LLMs:

Configure

models.txt (research/scripts/models.txt or algorithmic/scripts/models.txt)

  • One model name per line
  • Supported formats: gpt-5, claude-sonnet-4-5, gemini/gemini-2.5-pro, xai/grok-4, deepseek/deepseek-reasoner

indices.txt

  • Controls how many variants to generate per (model, problem) pair
  • Single number N = generate indices 0 to N-1
  • Multiple lines = specify explicit indices
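The two forms of indices.txt described above can be parsed like this (a sketch of the stated behavior, not the scripts' actual code):

```python
def parse_indices(text: str):
    """Single number N -> [0, ..., N-1]; multiple lines -> explicit indices."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) == 1:
        return list(range(int(lines[0])))
    return [int(ln) for ln in lines]
```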

API Keys

Set environment variables for the providers you need. Multiple keys per provider are supported for load balancing (e.g., OPENAI_API_KEY, OPENAI_API_KEY2, OPENAI_API_KEY_2).

| Provider | Environment Variable | Models |
| --- | --- | --- |
| OpenAI | OPENAI_API_KEY | gpt-4o, gpt-5, o1, o3, ... |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5, claude-opus-4, ... |
| Google | GOOGLE_API_KEY | gemini-2.5-pro, gemini-2.5-flash, ... |
| xAI | XAI_API_KEY | grok-3, grok-3-mini, ... |
| DeepSeek | DEEPSEEK_API_KEY | deepseek-r1, deepseek-chat, ... |
| OpenRouter | OPENROUTER_API_KEY | openrouter/* models |

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-...
export GOOGLE_API_KEY=...
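The multi-key convention above (OPENAI_API_KEY, OPENAI_API_KEY2, OPENAI_API_KEY_2) can be collected and rotated like this (a sketch; the framework's actual load-balancing logic may differ):

```python
import itertools
import os

def collect_keys(prefix="OPENAI_API_KEY"):
    """Gather PREFIX, then PREFIX2 / PREFIX_2, PREFIX3 / PREFIX_3, ..."""
    keys = []
    if os.environ.get(prefix):
        keys.append(os.environ[prefix])
    for i in itertools.count(2):
        extra = os.environ.get(f"{prefix}{i}") or os.environ.get(f"{prefix}_{i}")
        if not extra:
            break
        keys.append(extra)
    return keys

# Round-robin over the collected keys for load balancing:
# key_cycle = itertools.cycle(collect_keys())
# api_key = next(key_cycle)
```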

Generate Solutions

Research Track

Most research problems are Python, but some (e.g., nbody_simulation) require C++. The language is configured per problem via the language field in config.yaml.

# Generate one solution
python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5 --indices 1

# Preview what would be generated
python research/scripts/generate_solutions.py --dryrun

Algorithmic Track (C++)

python algorithmic/scripts/generate_solutions.py --model gpt-5

Two Modes

Problem mode (generate new solutions):

python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5

Generates problems × models × indices (Cartesian product):

  • Problems: --problem patterns or --problems-file (default: auto-discover all problems)
  • Models: --model list or --models-file (default: models.txt)
  • Indices: --indices N or --indices-file (default: indices.txt or single solution)

Solution naming: {problem}/{model}.py for index 0, {problem}/{model}_{i}.py for index i.
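The Cartesian product and the naming rule above can be sketched as:

```python
import itertools

def solution_filename(problem, model, index, ext="py"):
    """Index 0 -> {problem}/{model}.{ext}; index i -> {problem}/{model}_{i}.{ext}."""
    suffix = "" if index == 0 else f"_{index}"
    return f"{problem}/{model}{suffix}.{ext}"

def plan_generation(problems, models, indices):
    """All (problem, model, index) combinations with their target filenames."""
    return [
        (p, m, i, solution_filename(p, m, i))
        for p, m, i in itertools.product(problems, models, indices)
    ]
```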

Solution mode (regenerate existing solutions):

python research/scripts/generate_solutions.py --solution "flash_attn/gpt5*" --force

  • Matches existing solutions in solutions/ by pattern
  • Model inferred from solution filename (e.g., flash_attn/gpt5.py → model gpt5)
  • Requires --force since solutions already exist
  • Still needs models.txt or --model to map prefix to model name
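Inferring the model from an existing solution filename, as solution mode does, amounts to stripping the directory, the extension, and any _{i} variant suffix (a sketch of the rule, not the scripts' actual code):

```python
import re
from pathlib import Path

def infer_model(solution_path: str) -> str:
    """'flash_attn/gpt5_2.py' -> 'gpt5'; 'flash_attn/gpt5.py' -> 'gpt5'."""
    stem = Path(solution_path).stem           # drop directory and extension
    return re.sub(r"_\d+$", "", stem)         # drop a trailing _{index} suffix
```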

Options

Option Description
--problem / --problems-file Problem pattern or file (default: auto-discover)
--model / --models-file Model(s) or file (default: models.txt)
--indices / --indices-file Solution indices count or file (default: indices.txt)
--solution PATTERN Regenerate existing solutions by pattern (mutually exclusive with --problem)
--force Overwrite existing solutions
--dryrun Preview without generating
--concurrency N Parallel API calls
--timeout SECONDS API timeout (default: 600s)

Output

Solutions are saved in nested directories under solutions/:

solutions/
├── flash_attn/
│   ├── gpt5.py
│   ├── gpt5_1.py
│   └── claude4.5sonnet.py
└── cross_entropy/
    └── gpt5.py

Check Coverage (Research Only)

python research/scripts/check_solutions.py

Shows:

  • Expected: models × problems × variants
  • Generated: expected AND exists
  • Missing: expected but NOT exists
  • Failed: .FAILED marker files (generation errors)
  • Extra: exists but NOT expected
  • Empty: file exists but content is empty

Outputs a coverage progress bar and exports problems.txt.

Customization Points

If you want to modify our scripts:

  1. Use OpenAI-compatible API (e.g., Azure, local models)

    • Modify base_url parameter in src/frontier_cs/gen/llm.py instantiate_llm_client
    • Or pass base_url when initializing GPT class in llm_interface.py
    • DeepSeek, Grok, etc. are already implemented using OpenAI SDK with different base_url
  2. Add a new LLM provider

    • Add a new class in src/frontier_cs/gen/llm_interface.py (inherit LLMInterface, implement call_llm)
    • Add provider handling in src/frontier_cs/gen/llm.py instantiate_llm_client
  3. Add model prefix mapping

    • Edit src/frontier_cs/models.py get_model_prefix() to map model name → file prefix
    • Example: claude-sonnet-4-5-20250929 → claude4.5sonnet
  4. Modify prompt templates

    • Research: system prompt in research/scripts/generate_solutions.py
    • Algorithmic: CPP_SYSTEM_PROMPT in algorithmic/scripts/generate_solutions.py
  5. Customize solution filename format

    • src/frontier_cs/gen/solution_format.py
    • src/frontier_cs/models.py: get_solution_filename(), get_solution_path()