45 commits
cfcaa84  feat: switch evals to 8k1k and restart server with native max context (Oseltamivir, Mar 7, 2026)
8d98b94  fix: cap gen_kwargs max_tokens to leave room for prompt (Oseltamivir, Mar 7, 2026)
4452868  fix: reserve 30% of context for prompt in eval gen_kwargs (Oseltamivir, Mar 7, 2026)
83cd7de  change eval calling (Oseltamivir, Mar 9, 2026)
161e7f6  fix: eval-only server wait and PEP 668 pip install (Oseltamivir, Mar 9, 2026)
73b6846  change gsm8k to 8-shot (Oseltamivir, Mar 9, 2026)
83e185d  refactor: decouple eval concurrency, cap gen tokens, fix eval config (Oseltamivir, Mar 10, 2026)
aad5336  decouple (Oseltamivir, Mar 10, 2026)
375f14e  fix: uninstall torchvision before lm_eval to fix ATOM container impor… (Oseltamivir, Mar 11, 2026)
1cdc72a  add gpqa (Oseltamivir, Mar 11, 2026)
49b0b90  fix: pass multiple eval tasks as separate args for older lm-eval compat (Oseltamivir, Mar 11, 2026)
cc59d50  fix: run eval tasks sequentially for cross-version lm-eval compat (Oseltamivir, Mar 11, 2026)
c8b5858  fix: use directory-based task discovery for multi-eval in single lm_e… (Oseltamivir, Mar 11, 2026)
ffee6c5  gsm8k only (Oseltamivir, Mar 12, 2026)
c577ca2  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 12, 2026)
dc25ccd  fix: cap max_gen_tokens to server's max_model_len to avoid request re… (Oseltamivir, Mar 12, 2026)
285b662  fix: add eval context length override to remaining 18 scripts (Oseltamivir, Mar 12, 2026)
2d9d7ba  fix: default EVAL_TASKS_DIR to utils/evals directory, not single yaml… (Oseltamivir, Mar 13, 2026)
1ca2173  fix: reduce eval context multiplier to 5x and increase request timeou… (Oseltamivir, Mar 13, 2026)
7458eea  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 13, 2026)
6999263  test other prompt (Oseltamivir, Mar 15, 2026)
ba45203  pr (Oseltamivir, Mar 15, 2026)
826035c  test evals (Oseltamivir, Mar 15, 2026)
34bc7c4  resolve claude issues (Oseltamivir, Mar 15, 2026)
4978aed  torchvision (Oseltamivir, Mar 16, 2026)
c038e1b  make stuff neater, ready for merge (Oseltamivir, Mar 16, 2026)
29d69fa  resolve issues, add --no-evals, change default to flag-less (Oseltamivir, Mar 18, 2026)
327fd6d  ctxt len (Oseltamivir, Mar 18, 2026)
6350b6b  resolve claude (Oseltamivir, Mar 18, 2026)
9ae1ae4  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 18, 2026)
dec6f60  h200 change (Oseltamivir, Mar 18, 2026)
b08e063  final touches (Oseltamivir, Mar 19, 2026)
f04881d  test normal perf-changelog (Oseltamivir, Mar 19, 2026)
86764fa  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 19, 2026)
c17619f  test normal perf-changelog (Oseltamivir, Mar 20, 2026)
d30f807  all evals (Oseltamivir, Mar 20, 2026)
766a742  remove pycache (Oseltamivir, Mar 20, 2026)
5d5dd7b  argmax error (Oseltamivir, Mar 20, 2026)
f54b09c  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 20, 2026)
b9551f6  merge main (Oseltamivir, Mar 21, 2026)
f0fff18  blocking rm (Oseltamivir, Mar 21, 2026)
8bf41f0  Merge branch 'main' into eval-8k1k-server-restart (Oseltamivir, Mar 21, 2026)
38b80a8  standardize (Oseltamivir, Mar 22, 2026)
5beca55  reduce ctxt OOM (Oseltamivir, Mar 22, 2026)
4dcfc92  reduce ctxt OOM (Oseltamivir, Mar 22, 2026)
55 changes: 38 additions & 17 deletions .github/workflows/benchmark-tmpl.yml
@@ -54,6 +54,11 @@ on:
type: boolean
required: true
default: false
eval-only:
description: "Run only evals (skip throughput benchmark)"
type: boolean
required: false
default: false
random-range-ratio:
required: false
type: string
@@ -83,6 +88,8 @@ env:
SPEC_DECODING: ${{ inputs.spec-decoding }}
DISAGG: ${{ inputs.disagg }}
RUN_EVAL: ${{ inputs.run-eval }}
EVAL_ONLY: ${{ inputs.eval-only }}
PYTHONDONTWRITEBYTECODE: '1'

permissions:
contents: read
@@ -91,7 +98,7 @@ jobs:
benchmark:
runs-on: ${{ inputs.runner }}
timeout-minutes: 300
name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.run-eval && ' | eval' || '' }}"
name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.eval-only && ' | eval-only' || (inputs.run-eval && ' | eval' || '') }}"
steps:
- name: Resource cleanup (pre-run)
run: &resource-cleanup |
@@ -145,28 +152,42 @@ jobs:
echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV

bash ./runners/launch_${RUNNER_NAME%%_*}.sh
FOUND_RESULT_FILE=
for i in {1..10}; do
if [ -f "$RESULT_FILENAME.json" ]; then
FOUND_RESULT_FILE=true
break

if [ "${{ inputs.eval-only }}" = "true" ]; then
echo "Eval-only mode: skipping benchmark result file check"
# Verify eval produced results
if ! ls results*.json 1>/dev/null 2>&1; then
echo "Eval-only run failed: no results*.json files found." >&2
exit 1
fi
echo "Waiting for result file... (attempt $i)"
sleep 1
done
# Verify eval scores meet minimum threshold (85%)
python3 utils/evals/validate_scores.py
else
FOUND_RESULT_FILE=
for i in {1..10}; do
if [ -f "$RESULT_FILENAME.json" ]; then
FOUND_RESULT_FILE=true
break
fi
echo "Waiting for result file... (attempt $i)"
sleep 1
done

if [ -z "$FOUND_RESULT_FILE" ]; then
echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
exit 1
if [ -z "$FOUND_RESULT_FILE" ]; then
echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
exit 1
fi
fi
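The non-eval branch above polls a bounded number of times before declaring failure. A minimal Python sketch of the same bounded-poll idea (`wait_for_result` is a hypothetical helper for illustration, not part of the repo):

```python
import pathlib
import time

def wait_for_result(path, attempts=10, delay=1.0):
    """Poll for the benchmark result file a bounded number of times,
    mirroring the shell loop: succeed as soon as the file appears,
    fail after the final attempt."""
    target = pathlib.Path(path)
    for attempt in range(1, attempts + 1):
        if target.is_file():
            return True
        print(f"Waiting for result file... (attempt {attempt})")
        time.sleep(delay)
    return False
```

The bounded retry keeps a hung server launch from stalling the job indefinitely while still tolerating a short delay between server exit and result-file flush.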

- name: Process result
if: ${{ !inputs.eval-only }}
env:
RUNNER_TYPE: ${{ inputs.runner }}
run: |
python3 utils/process_result.py

- name: Upload result
if: ${{ !inputs.eval-only }}
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: bmk_${{ env.RESULT_FILENAME }}
@@ -176,31 +197,31 @@ jobs:
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: server_logs_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
path: server.log
if-no-files-found: ignore

- name: Upload GPU metrics
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: gpu_metrics_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_gpu_metrics_' || 'gpu_metrics_' }}${{ env.RESULT_FILENAME }}
path: gpu_metrics.csv
if-no-files-found: ignore

- name: Upload eval results (if any)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: eval_${{ env.EXP_NAME }}_${{ env.RESULT_FILENAME }}
path: |
meta_env.json
results*.json
sample*.jsonl
if-no-files-found: ignore
if-no-files-found: ${{ inputs.eval-only && 'error' || 'ignore' }}

- name: Cleanup eval outputs (post-upload)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
run: |
rm -f meta_env.json || true
Comment on lines 197 to 226
Contributor

🟡 The Cleanup eval outputs step removes meta_env.json and results*.json but never removes sample*.jsonl files, which are also moved into the workspace CWD by append_lm_eval_summary and then uploaded as artifacts. On persistent self-hosted GPU runners, these files accumulate across runs, causing subsequent artifact uploads to mix stale sample data from prior runs with current results. Fix: add 'rm -f sample*.jsonl || true' to the cleanup step.

Extended reasoning...

Analysis of bug_004: Cleanup step omits sample*.jsonl files

What the bug is and how it manifests

The 'Cleanup eval outputs (post-upload)' step in benchmark-tmpl.yml (lines 223-228) removes meta_env.json and results*.json after uploading eval artifacts, but it does NOT remove sample*.jsonl files. These sample files are explicitly included in the artifact upload path (lines 217-220) and are moved into the workspace CWD during benchmark execution. Because persistent self-hosted GPU runners reuse the same workspace directory across jobs, any sample*.jsonl files left behind will accumulate across runs.

The specific code path that triggers it

  1. run_lm_eval is invoked with --log_samples (benchmark_lib.sh line 681), causing lm-eval to write samples_{task}_*.jsonl files into a temporary output directory.
  2. append_lm_eval_summary uses find "${out_dir}" -type f -name "*.json*" (benchmark_lib.sh line 755) -- the pattern *.json* matches both .json and .jsonl extensions -- and moves all matched files into the workspace CWD.
  3. The upload step (lines 212-221) correctly includes sample*.jsonl in its path block.
  4. The cleanup step (lines 223-228) only runs 'rm -f meta_env.json' and 'rm -f results*.json', leaving sample*.jsonl files on disk.

Why existing code does not prevent it

There is no pre-run workspace sweep for sample*.jsonl. The resource cleanup steps (pre-run and post-run) only handle Docker containers and SLURM jobs, not leftover workspace files. The comment on line 227 says 'Remove any eval results JSONs that were moved into workspace' but the implementation only removes .json files, missing the .jsonl sample files that were also moved in by the same .json glob.

What the impact would be

On persistent self-hosted GPU runners, each eval run that produces sample logs will leave sample*.jsonl files in the workspace. After N runs, N sets of sample files accumulate. The next artifact upload will pick up all of them (since sample*.jsonl is a glob), bundling stale samples from previous runs into the current run's eval artifact. This corrupts eval artifact provenance -- a reviewer examining eval samples cannot tell which run they belong to.

How to fix it

Add one line to the cleanup step:

- name: Cleanup eval outputs (post-upload)
  if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
  run: |
    rm -f meta_env.json || true
    rm -f results*.json || true
    rm -f sample*.jsonl || true   # add this line

Step-by-step proof

  1. Run A executes with run-eval: true. run_lm_eval --log_samples writes samples_mmlu_run_a.jsonl to a temp output dir.
  2. append_lm_eval_summary runs find matching .json, moves samples_mmlu_run_a.jsonl to workspace CWD.
  3. Upload step uploads meta_env.json, results_mmlu.json, and samples_mmlu_run_a.jsonl -- correct.
  4. Cleanup step removes meta_env.json and results_mmlu.json but NOT samples_mmlu_run_a.jsonl.
  5. Run B starts on the same runner. Workspace still contains samples_mmlu_run_a.jsonl from Run A.
  6. Run B produces samples_mmlu_run_b.jsonl (its own results).
  7. Upload step now uploads BOTH samples_mmlu_run_a.jsonl (stale, from Run A) and samples_mmlu_run_b.jsonl (current) -- artifact is contaminated with stale data from prior run.
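The accumulation described in the proof can be reproduced with a small Python simulation (illustrative only; the `run_eval` helper and the sample file names are hypothetical stand-ins for the workflow's behavior):

```python
import pathlib
import tempfile

def upload_glob(workspace):
    # Mirrors the artifact upload path: the sample*.jsonl glob in the workspace CWD
    return sorted(p.name for p in workspace.glob("sample*.jsonl"))

def cleanup(workspace, remove_samples):
    # Current cleanup step removes meta_env.json and results*.json only;
    # remove_samples=True models the proposed extra `rm -f sample*.jsonl`.
    for p in workspace.glob("results*.json"):
        p.unlink()
    (workspace / "meta_env.json").unlink(missing_ok=True)
    if remove_samples:
        for p in workspace.glob("sample*.jsonl"):
            p.unlink()

def run_eval(workspace, run_id, remove_samples):
    # Each eval run drops its own sample file into the persistent workspace,
    # then uploads whatever the glob matches, then cleans up.
    (workspace / f"samples_mmlu_{run_id}.jsonl").write_text("{}")
    uploaded = upload_glob(workspace)
    cleanup(workspace, remove_samples)
    return uploaded

ws = pathlib.Path(tempfile.mkdtemp())
run_eval(ws, "run_a", remove_samples=False)
print(run_eval(ws, "run_b", remove_samples=False))  # stale run_a file included
```

With `remove_samples=True`, the second run uploads only its own sample file, which is the behavior the one-line fix restores.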

# Remove any eval results JSONs that were moved into workspace
40 changes: 37 additions & 3 deletions .github/workflows/e2e-tests.yml
@@ -37,6 +37,7 @@ jobs:
outputs:
single-node-config: ${{ steps.get-jobs.outputs.single-node-config }}
multi-node-config: ${{ steps.get-jobs.outputs.multi-node-config }}
eval-config: ${{ steps.get-jobs.outputs.eval-config }}
steps:
- name: Checkout code (ref)
if: ${{ inputs.ref && inputs.ref != '' }}
@@ -53,10 +54,12 @@
pip install pydantic
CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
MULTI=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' in x]))")
EVALS=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('run-eval', False)]))")
echo "single-node-config=$SINGLE" >> $GITHUB_OUTPUT
echo "multi-node-config=$MULTI" >> $GITHUB_OUTPUT
echo "eval-config=$EVALS" >> $GITHUB_OUTPUT

test-sweep-multi-node:
needs: get-jobs
@@ -123,7 +126,38 @@ jobs:
conc: ${{ matrix.config.conc }}
spec-decoding: ${{ matrix.config.spec-decoding }}
disagg: ${{ matrix.config.disagg }}
run-eval: ${{ matrix.config.run-eval }}
run-eval: false
ref: ${{ inputs.ref }}

test-sweep-evals:
needs: get-jobs
if: ${{ needs.get-jobs.outputs.eval-config != '[]' }}
uses: ./.github/workflows/benchmark-tmpl.yml
name: eval /
strategy:
fail-fast: false
matrix:
config: ${{ fromJson(needs.get-jobs.outputs.eval-config) }}
secrets: inherit
with:
exp-name: ${{ matrix.config.exp-name }}
isl: ${{ matrix.config.isl }}
osl: ${{ matrix.config.osl }}
max-model-len: ${{ matrix.config.max-model-len }}
runner: ${{ matrix.config.runner }}
image: ${{ matrix.config.image }}
model: ${{ matrix.config.model }}
model-prefix: ${{ matrix.config.model-prefix }}
framework: ${{ matrix.config.framework }}
precision: ${{ matrix.config.precision }}
tp: ${{ matrix.config.tp }}
ep: ${{ matrix.config.ep }}
dp-attn: ${{ matrix.config.dp-attn }}
conc: ${{ matrix.config.conc }}
spec-decoding: ${{ matrix.config.spec-decoding }}
disagg: ${{ matrix.config.disagg }}
run-eval: true
eval-only: true
ref: ${{ inputs.ref }}

collect-results:
@@ -135,7 +169,7 @@
result-prefix: "bmk"

collect-evals:
needs: [test-sweep-multi-node, test-sweep-single-node]
needs: [test-sweep-evals]
if: ${{ always() }}
uses: ./.github/workflows/collect-evals.yml
secrets: inherit
49 changes: 36 additions & 13 deletions .github/workflows/run-sweep.yml
@@ -183,6 +183,36 @@ jobs:
secrets: inherit
with: *single-node-inputs

sweep-evals:
needs: setup
if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && toJson(fromJson(needs.setup.outputs.search-space-config).evals) != 'null' }}
uses: ./.github/workflows/benchmark-tmpl.yml
name: eval /
strategy:
fail-fast: false
matrix:
config: ${{ fromJson(needs.setup.outputs.search-space-config).evals }}
secrets: inherit
with:
exp-name: ${{ matrix.config.exp-name }}
isl: ${{ matrix.config.isl }}
osl: ${{ matrix.config.osl }}
max-model-len: ${{ matrix.config.max-model-len }}
runner: ${{ matrix.config.runner }}
image: ${{ matrix.config.image }}
model: ${{ matrix.config.model }}
model-prefix: ${{ matrix.config.model-prefix }}
framework: ${{ matrix.config.framework }}
precision: ${{ matrix.config.precision }}
tp: ${{ matrix.config.tp }}
ep: ${{ matrix.config.ep }}
dp-attn: ${{ matrix.config.dp-attn }}
conc: ${{ matrix.config.conc }}
spec-decoding: ${{ matrix.config.spec-decoding }}
disagg: ${{ matrix.config.disagg }}
run-eval: true
eval-only: true

collect-results:
needs:
[
@@ -201,16 +231,7 @@
result-prefix: "bmk"

collect-evals:
needs:
[
sweep-single-node-1k1k,
sweep-single-node-1k8k,
sweep-single-node-8k1k,
sweep-multi-node-1k1k,
sweep-multi-node-1k8k,
sweep-multi-node-8k1k,
setup,
]
needs: [sweep-evals, setup]
if: ${{ always() && needs.setup.result != 'skipped' }}
uses: ./.github/workflows/collect-evals.yml
secrets: inherit
Comment on lines 233 to 237
Contributor

🔴 The collect-evals job in run-sweep.yml (line 235) uses if: ${{ always() && needs.setup.result != 'skipped' }} which does not check whether sweep-evals was skipped. When a PR adds no 8k1k single-node eval configs, sweep-evals is skipped and no eval_* artifacts are uploaded, but collect-evals still runs and collect-evals.yml calls actions/download-artifact@v8 without error-on-missing-artifacts: false, causing a spurious CI failure. The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no check on test-sweep-evals.result. Fix: add && needs.sweep-evals.result != 'skipped' to the condition in run-sweep.yml, and similarly gate collect-evals in e2e-tests.yml on test-sweep-evals.result != 'skipped'.

Extended reasoning...

What the bug is and how it manifests

In run-sweep.yml line 235, the collect-evals job was updated to depend on [sweep-evals, setup] instead of all the sweep jobs. However, its condition remains if: ${{ always() && needs.setup.result != 'skipped' }}. The sweep-evals job itself has a guard:

if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && ... != 'null' }}

When a PR only adds multi-node configs, or only adds non-8k1k single-node configs, mark_eval_entries() in generate_sweep_configs.py (which exclusively marks 8k1k entries) produces an empty evals list, process_changelog.py generates no eval entries, and sweep-evals is skipped. Because collect-evals only checks needs.setup.result != 'skipped' and not needs.sweep-evals.result != 'skipped', collect-evals still runs.

The specific code path

  1. PR is opened that adds only multi-node or non-8k1k benchmark configs
  2. process_changelog.py runs with --evals-only, produces an empty list
  3. sweep-evals evaluates its condition: evals == [] → skipped
  4. No eval_* artifacts are uploaded to the workflow run
  6. collect-evals evaluates always() && needs.setup.result != 'skipped' → true (setup ran successfully)
  6. collect-evals runs and invokes collect-evals.yml
  7. Inside collect-evals.yml, actions/download-artifact@v8 with pattern eval_* finds no matching artifacts
  8. In actions/download-artifact v4+, the default behavior is to fail when no matching artifacts are found (unlike v3 which silently succeeded)
  9. The step exits with a non-zero code → spurious CI failure for an otherwise-valid PR

Why existing code doesn't prevent it

The condition needs.setup.result != 'skipped' guards against the case where setup itself is skipped (e.g. if the workflow was cancelled before it started), but it does nothing to guard against sweep-evals being skipped due to an empty eval matrix. The always() function forces the job to run regardless of upstream job statuses, so even if sweep-evals is skipped, collect-evals runs. actions/download-artifact@v8 does not have error-on-missing-artifacts: false set in collect-evals.yml, so it fails hard when no eval_* artifacts exist.

Impact

Any PR that does not produce 8k1k single-node eval configs will experience a spurious CI failure in the collect-evals job, despite the actual benchmark and eval logic working correctly. Additionally, since collect-evals.result is 'failure' (not 'skipped'), the trigger-ingest job (line 295-318 in run-sweep.yml) may still fire because its condition checks collect-evals.result != 'skipped', potentially triggering an ingest with no eval data.

The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no guard on test-sweep-evals.result.

How to fix

In run-sweep.yml, change line 235 from:

if: ${{ always() && needs.setup.result != 'skipped' }}

to:

if: ${{ always() && needs.setup.result != 'skipped' && needs.sweep-evals.result != 'skipped' }}

In e2e-tests.yml, change:

if: ${{ always() }}

to:

if: ${{ always() && needs.test-sweep-evals.result != 'skipped' }}

Alternatively, add error-on-missing-artifacts: false to the actions/download-artifact@v8 call in collect-evals.yml and add null-handling in collect_eval_results.py for the case of no artifacts.

Step-by-step proof

  1. A PR is submitted that adds only dsr1-fp4-mi355x-sglang (which has ISL=1024, OSL=1024 — a 1k1k config, not 8k1k)
  2. process_changelog.py runs generate_sweep_configs.py --evals-only for this config
  3. mark_eval_entries() finds no entries with isl=8192, osl=1024 → returns empty list
  4. sweep-evals condition: evals == [] → skipped, no artifacts uploaded
  5. collect-evals condition: always() && setup.result != 'skipped' → true → runs
  6. actions/download-artifact@v8 searches for eval_* artifacts → finds none → exits 1
  7. collect-evals fails, PR shows a red CI check for an otherwise-valid benchmark PR
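The gating problem and the proposed fix can be modeled directly (a sketch only; job results are represented as plain strings, and `guard_on_evals` toggles the proposed extra condition):

```python
def collect_evals_should_run(setup_result, sweep_evals_result, guard_on_evals):
    """Model collect-evals' `if:` gate under always().

    Current condition:  always() && needs.setup.result != 'skipped'
    Proposed fix adds:  && needs.sweep-evals.result != 'skipped'
    """
    runs = setup_result != "skipped"
    if guard_on_evals:
        runs = runs and sweep_evals_result != "skipped"
    return runs

# PR with no 8k1k configs: sweep-evals is skipped, no eval_* artifacts exist.
print(collect_evals_should_run("success", "skipped", guard_on_evals=False))  # True (spurious run)
print(collect_evals_should_run("success", "skipped", guard_on_evals=True))   # False (correctly skipped)
```

Under the fix, collect-evals only fires when sweep-evals actually produced artifacts, so the download step never fails on an empty `eval_*` pattern.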

@@ -221,10 +242,12 @@
runs-on: ubuntu-latest
steps:
- name: Extract and save changelog metadata
env:
CONFIG_JSON: ${{ needs.setup.outputs.search-space-config }}
run: |
echo "$CONFIG_JSON" | jq '.changelog_metadata' > changelog_metadata.json
cat <<'CONFIGEOF' > _full_config.json
${{ needs.setup.outputs.search-space-config }}
CONFIGEOF
jq '.changelog_metadata' _full_config.json > changelog_metadata.json
rm -f _full_config.json

- name: Upload changelog artifact
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0