Commit 427cfdd

Merge pull request #88 from FrontierCS/docs/restructure-and-align-backend
docs: restructure documentation and align eval/batch backends
2 parents 1ece738 + 32f4a43 commit 427cfdd

76 files changed

Lines changed: 9147 additions & 615 deletions


.claude/CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Project Rules for Frontier-CS
+
+## Backend Selection
+
+**NEVER change the backend due to missing credentials or CI configuration issues.**
+
+- Research track: always uses SkyPilot (cloud VMs)
+- Algorithmic track: always uses Docker (local)
+
+If CI fails due to credentials/permissions, fix the credentials - do NOT change the code to use a different backend. The backend choice is intentional for each track's evaluation requirements.
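The fixed track-to-backend mapping above can be sketched as a small helper that fails loudly instead of silently falling back. This is an illustrative sketch only; the names `TRACK_BACKENDS` and `backend_for` are hypothetical and not part of the real frontier_cs codebase.

```python
# Hypothetical sketch of the mandated track-to-backend mapping.
TRACK_BACKENDS = {
    "research": "skypilot",   # cloud VMs
    "algorithmic": "docker",  # local containers
}

def backend_for(track: str) -> str:
    """Return the mandated backend; raise rather than fall back on error."""
    try:
        return TRACK_BACKENDS[track]
    except KeyError:
        raise ValueError(f"unknown track: {track!r}") from None
```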

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 ## Summary
 <!-- Brief description of changes -->
 
+> Please read [CONTRIBUTING.md](../CONTRIBUTING.md) before submitting.
 
 ## Type of Change
 - [ ] New research problem
@@ -21,4 +22,4 @@
 ## CI Validation (for new problems)
 > When adding new problems, CI will automatically validate that your reference solution achieves score > 0.
 > - Algorithmic problems: Include `reference.cpp` in your problem directory
-> - Research problems: Include `reference.py` in your problem directory
+> - Research problems: Include `reference.py` (or `reference.cpp` if `language: cpp` in config.yaml)

.github/PULL_REQUEST_TEMPLATE/research_problem.md

Lines changed: 3 additions & 3 deletions
@@ -28,7 +28,7 @@ labels: research-problem
 - [ ] `evaluate.sh` - Evaluation entry point
 - [ ] `evaluator.py` - Scoring logic (outputs 0-100 score)
 - [ ] `resources/` - Problem-specific code/data
-- [ ] `reference.py` - Reference solution **(required for CI)**
+- [ ] `reference.{py,cpp}` - Reference solution **(required for CI, extension matches `language` in config.yaml)**
 
 ### Problem Structure
 ```
@@ -38,15 +38,15 @@ research/{problem_name}/
 ├── set_up_env.sh
 ├── evaluate.sh
 ├── evaluator.py
-├── reference.py       # Required: CI will validate this achieves score > 0
+├── reference.{py,cpp} # Required: CI validates score > 0 (extension per language)
 └── resources/
     └── ...
 ```
 
 ### Testing
 - [ ] Verified `set_up_env.sh` runs successfully
 - [ ] Verified `evaluate.sh` runs and outputs a numeric score
-- [ ] **Reference solution (`reference.py`) achieves score > 0**
+- [ ] **Reference solution achieves score > 0**
 
 **Test Results** (if available):
 ```
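The template above requires that `evaluate.sh` outputs a numeric 0-100 score. As a minimal sketch of what an `evaluator.py` could look like, assuming (hypothetically) that the harness reads the score from stdout; the pass/total counts here are dummies standing in for real test results:

```python
# Illustrative-only evaluator sketch: maps pass counts to a 0-100 score
# and prints it, as the template's "outputs a numeric score" check expects.
def score_solution(passed: int, total: int) -> float:
    """Scale the fraction of passing checks to the 0-100 range."""
    if total <= 0:
        return 0.0
    return 100.0 * passed / total

if __name__ == "__main__":
    # In a real problem these counts would come from running the solution.
    print(score_solution(7, 10))
```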

.github/workflows/validate-problems.yml

Lines changed: 32 additions & 4 deletions
@@ -78,25 +78,53 @@ jobs:
       - name: Install dependencies
         run: uv sync
 
+      - name: Setup AWS credentials
+        env:
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+        run: |
+          mkdir -p ~/.aws
+          cat > ~/.aws/credentials << EOF
+          [default]
+          aws_access_key_id = $AWS_ACCESS_KEY_ID
+          aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
+          EOF
+          cat > ~/.aws/config << EOF
+          [default]
+          region = us-east-1
+          EOF
+          echo "AWS credentials configured"
+
       - name: Setup GCP credentials
         env:
           GCP_CREDS: ${{ secrets.GCP_CREDENTIALS }}
         run: |
           if [ -n "$GCP_CREDS" ]; then
             echo "$GCP_CREDS" > /tmp/gcp-key.json
             echo "GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-key.json" >> $GITHUB_ENV
+            gcloud auth activate-service-account --key-file=/tmp/gcp-key.json
+            gcloud config set project ${{ secrets.GCP_PROJECT_ID }}
             echo "GCP credentials configured"
-          else
-            echo "No GCP credentials available, skipping..."
+          fi
+
+      - name: Generate SSH key for SkyPilot
+        run: |
+          mkdir -p ~/.ssh
+          if [ ! -f ~/.ssh/sky-key ]; then
+            ssh-keygen -t rsa -b 4096 -f ~/.ssh/sky-key -N "" -C "sky-ci"
+            echo "Generated SSH key for SkyPilot"
           fi
 
       - name: Setup SkyPilot
         run: |
-          uv run sky check || echo "SkyPilot check failed, continuing..."
+          uv run sky check aws gcp || echo "SkyPilot check failed, continuing..."
 
       - name: Validate problems
+        timeout-minutes: 30
         run: |
           echo "Validating research problems: ${{ needs.detect-changes.outputs.research }}"
           uv run python scripts/validate_problems.py \
             --track research \
-            --problems ${{ needs.detect-changes.outputs.research }}
+            --timeout 1200 \
+            --problems ${{ needs.detect-changes.outputs.research }} \
+            --verbose
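The AWS step above writes plain INI-style credential and config files via heredocs. To see the layout it produces without touching your real `~/.aws`, a safe local sketch using a scratch directory and dummy values:

```shell
#!/bin/sh
# Sketch of the file layout the CI "Setup AWS credentials" step writes,
# redirected into a temp directory so it is safe to run locally.
AWS_DIR="$(mktemp -d)/aws"
mkdir -p "$AWS_DIR"
cat > "$AWS_DIR/credentials" << EOF
[default]
aws_access_key_id = DUMMY_KEY_ID
aws_secret_access_key = DUMMY_SECRET
EOF
cat > "$AWS_DIR/config" << EOF
[default]
region = us-east-1
EOF
echo "wrote dummy AWS config under $AWS_DIR"
```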

.github/workflows/weekly-eval.yml

Lines changed: 2 additions & 5 deletions
@@ -100,9 +100,7 @@ jobs:
             --track research \
             --internal-dir internal \
             --results-repo results-repo \
-            --workers $WORKERS \
-            --clusters $CLUSTERS \
-            --skypilot \
+            -j $CLUSTERS \
             --push
 
       - name: Run algorithmic evaluation
@@ -116,8 +114,7 @@
             --track algorithmic \
             --internal-dir internal \
             --results-repo results-repo \
-            --workers $WORKERS \
-            --skypilot \
+            -j $WORKERS \
             --push
 
       - name: Upload results artifact

CONTRIBUTING.md

Lines changed: 10 additions & 6 deletions
@@ -1,6 +1,8 @@
 # Contributing to Frontier-CS
 
-Frontier-CS is currently an **invitation-only** project for new problems.
+> **For Problem Contributors**: Guidelines for creating and submitting new problems to Frontier-CS.
+
+Frontier-CS is currently an **invitation-only** project for new problems.
 Please create a GitHub pull request (PR) with your proposed problem following the guidelines below. After your PR is reviewed and merged, please send any hidden test data and reference solutions to the contact email provided at the end of this document.
 
 
@@ -130,11 +132,11 @@ research/{problem_name}/
 ├── evaluate.sh           # Evaluation entry point
 ├── evaluator.py          # Scoring logic
 ├── readme                # Problem description
-├── reference.py          # Reference solution (required for CI validation)
+├── reference.{py,cpp}    # Reference solution (required for CI, extension per language)
 └── resources/            # Problem-specific code/data
 ```
 
-> **Note**: The `reference.py` is required for CI validation. When you submit a PR, the CI will automatically run your reference solution and verify it achieves score > 0.
+> **Note**: A reference solution is required for CI validation. Use `reference.py` for Python problems or `reference.cpp` if `language: cpp` in config.yaml. The CI will automatically run your reference solution and verify it achieves score > 0.
 
 ### Solution Interface
 
@@ -331,10 +333,12 @@ When you submit a PR that adds or modifies problems, CI will automatically valid
 | Track | File | Location |
 |-------|------|----------|
 | Algorithmic | `reference.cpp` | `algorithmic/problems/{id}/reference.cpp` |
-| Research | `reference.py` | `research/problems/{name}/reference.py` |
+| Research | `reference.{py,cpp}` | `research/problems/{name}/reference.{ext}` (extension per `language` in config.yaml) |
 
 If the reference solution is missing or scores 0, the PR will be blocked from merging.
 
+> **Important**: The reference solution must achieve score > 0. This is a design choice to ensure the evaluator is working correctly - a score > 0 proves that the evaluation pipeline can successfully compile/run the solution and produce a valid score. If the reference only scores 0, we cannot distinguish between "evaluator error" and "valid solution with no improvement". For problems that measure speedup against a baseline, the reference must be **faster than the baseline**, not just a copy of it.
+
 ### Local Testing
 
 Before submitting a PR, test your reference solution locally:
@@ -343,8 +347,8 @@ Before submitting a PR, test your reference solution locally:
 # Algorithmic
 frontier eval algorithmic {id} algorithmic/problems/{id}/reference.cpp
 
-# Research
-frontier eval research {name} research/problems/{name}/reference.py
+# Research (use .py or .cpp based on problem's language config)
+frontier eval research {name} research/problems/{name}/reference.{ext}
 ```
 
 ## Contact
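The "Important" note added in this diff argues that a reference which merely matches the baseline must score 0. A minimal sketch of that rule for speedup-style problems, assuming a hypothetical scoring function (the name `speedup_score` and the linear scaling are illustrative, not the actual evaluator logic):

```python
# Hedged sketch: score scales with speedup over the baseline; a reference
# that is not strictly faster than the baseline scores 0, so CI would
# block it - exactly the behavior the note above describes.
def speedup_score(baseline_s: float, solution_s: float, cap: float = 100.0) -> float:
    """Return a 0-100 score from measured runtimes (seconds)."""
    if baseline_s <= 0 or solution_s <= 0:
        return 0.0  # invalid measurement: indistinguishable from evaluator error
    speedup = baseline_s / solution_s
    if speedup <= 1.0:
        return 0.0  # a copy of the baseline lands here and blocks the PR
    return min(cap, cap * (speedup - 1.0))
```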

README.md

Lines changed: 9 additions & 13 deletions
@@ -150,9 +150,9 @@ frontier eval algorithmic 1 <your_solution.cpp> --unbounded
 ### Python API
 
 ```python
-from frontier_cs import FrontierCSEvaluator
+from frontier_cs import SingleEvaluator
 
-evaluator = FrontierCSEvaluator()
+evaluator = SingleEvaluator()
 
 # Evaluate a research problem
 result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
@@ -195,28 +195,24 @@ research/solutions/
 
 ```bash
 # Evaluate all research solutions (uses SkyPilot by default)
-uv run frontier-eval batch research
+frontier batch research
 
 # Evaluate all algorithmic solutions (uses Docker by default)
-uv run frontier-eval batch algorithmic
+frontier batch algorithmic
 
 # Filter by model or problem
-uv run frontier-eval batch research --model gpt5.1
-uv run frontier-eval batch research --problem flash_attn
-uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
+frontier batch research --model gpt5.1
+frontier batch research --problem flash_attn
 
 # Override default backend
-uv run frontier-eval batch research --backend docker
-uv run frontier-eval batch algorithmic --backend skypilot
+frontier batch research --backend docker
+frontier batch algorithmic --backend skypilot
 ```
 
 **Custom solutions directory:** You can test solutions from a custom directory with the same structure:
 
 ```bash
-# Your custom directory should have the same structure:
-# my_solutions/{problem}/{model}.py
-
-uv run frontier-eval batch research --solutions-dir ./my_solutions
+frontier batch research --solutions-dir ./my_solutions
 ```
 
 Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
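The state-file behavior described above (skip already-evaluated pairs, re-run when content changes) can be illustrated with a toy hash-based cache. This is a sketch only: the real batch runner's state-file format and field names are not shown in this commit, so `pairs_to_run` and the dict-based state are hypothetical stand-ins.

```python
# Toy model of incremental batch evaluation: a (solution, problem) pair is
# scheduled only if its content hash differs from the one recorded last run.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def pairs_to_run(pairs, state):
    """pairs: {(solution, problem): source_text}; state: {key: hash} from a prior run.
    Returns the pairs needing evaluation and updates state in place."""
    todo = []
    for (solution, problem), src in pairs.items():
        key = f"{solution}:{problem}"
        h = content_hash(src)
        if state.get(key) != h:  # new pair, or solution content changed
            todo.append((solution, problem))
            state[key] = h
    return todo
```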

SUBMIT.md

Lines changed: 19 additions & 17 deletions
@@ -1,6 +1,6 @@
 # Evaluating Your Model
 
-Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
+> **For Model Providers**: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.
 
 ## Step 1: Prepare Solutions
 
@@ -19,7 +19,7 @@ research/solutions/gemm_optimization/squares/my_model.py
 algorithmic/solutions/1/my_model.cpp
 ```
 
-- **Research track**: Python (`.py`)
+- **Research track**: Python (`.py`) by default, or C++ (`.cpp`) if problem specifies `language: cpp` in config.yaml
 - **Algorithmic track**: C++17 (`.cpp`)
 - We recommend generating **5 variants per model** to compute Score@5
 
@@ -36,7 +36,7 @@ research/solutions/
 └── ...
 ```
 ```bash
-frontier-eval batch research --model my_model
+frontier batch research --model my_model
 ```
 
 **2. Use your own directory**
@@ -48,7 +48,7 @@ frontier-eval batch research --model my_model
 └── ...
 ```
 ```bash
-frontier-eval batch research --solutions-dir ./my_solutions
+frontier batch research --solutions-dir ./my_solutions
 ```
 
 **3. Explicit pairs file**
@@ -59,39 +59,39 @@ frontier-eval batch research --solutions-dir ./my_solutions
 ./my_solutions/cross_entropy/my_model.py:cross_entropy
 ```
 ```bash
-frontier-eval batch research --pairs-file pairs.txt
+frontier batch research --pairs-file pairs.txt
 ```
 
 ### Backend Options
 
 ```bash
 # Research defaults to SkyPilot, algorithmic defaults to Docker
-frontier-eval batch research --backend docker
-frontier-eval batch algorithmic --backend skypilot
+frontier batch research --backend docker
+frontier batch algorithmic --backend skypilot
 
 # Parallelism
-frontier-eval batch research --workers 20 --clusters 4
+frontier batch research --workers 20 --clusters 4
 ```
 
 ### Result Storage
 
 ```bash
 # Local (default): results saved to ./results/batch/{track}/
-frontier-eval batch research
+frontier batch research
 
 # Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
-frontier-eval batch research --bucket-url s3://my-bucket/results
+frontier batch research --bucket-url s3://my-bucket/results
 
 # Sync from bucket to local
-frontier-eval batch research --bucket-url s3://my-bucket/results --sync-bucket
+frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
 ```
 
 ### Control Options
 
 ```bash
-frontier-eval batch research --status        # Check status
-frontier-eval batch research --no-resume     # Force re-evaluate all
-frontier-eval batch research --retry-failed  # Retry failed (including score=0)
+frontier batch research --status        # Check status
+frontier batch research --no-resume     # Force re-evaluate all
+frontier batch research --retry-failed  # Retry failed (including score=0)
 ```
 
 - Incremental evaluation with hash-based caching (solution/problem changes trigger re-evaluation)
@@ -114,7 +114,7 @@ We welcome submissions from all models and agent frameworks. To have your result
 
 ### Algorithmic Problems
 
-We currently release **1 -- 3 public test case** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
+We currently release **1-3 public test cases** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.
 
 #### What to Submit
 
@@ -174,7 +174,7 @@ Problem (e.g., gemm_optimization, poc_generation)
 
 Each variant has a unique **Problem ID** based on its path under `research/`.
 
-The full list of all evaluatable variants is in [`research/problems.txt`](research/problems.txt) (109 variants total, aggregated into ~50 categories for reporting).
+The full list of all evaluatable variants is in [`research/scripts/problems.txt`](research/scripts/problems.txt).
 
 | Type | Example Path | Problem ID |
 |------|-------------|------------|
@@ -309,7 +309,9 @@ export GOOGLE_API_KEY=...
 
 ### Generate Solutions
 
-#### Research Track (Python)
+#### Research Track
+
+Most research problems are Python, but some (e.g., `nbody_simulation`) require C++. The language is configured per-problem via `language` field in `config.yaml`.
 
 ```bash
 # Generate one solution
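SUBMIT.md's "Explicit pairs file" option uses one `path:problem_id` entry per line (e.g. `./my_solutions/cross_entropy/my_model.py:cross_entropy`). A minimal parser sketch for that format; the function name `parse_pairs` and the comment/blank-line handling are assumptions, not the actual frontier_cs implementation:

```python
# Illustrative parser for the pairs-file format shown in SUBMIT.md.
# Splitting on the LAST ":" keeps any colons inside the path intact.
def parse_pairs(text: str):
    """Return [(solution_path, problem_id), ...] from pairs-file text."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed: skip blanks/comments
            continue
        path, _, problem = line.rpartition(":")
        if path and problem:
            pairs.append((path, problem))
    return pairs
```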
