
Separate eval-only workflow and change to 8k1k #911

Open
Oseltamivir wants to merge 45 commits into main from
eval-8k1k-server-restart

Conversation

Collaborator

@Oseltamivir Oseltamivir commented Mar 15, 2026

Summary

  • Decouple eval jobs from benchmark jobs into a dedicated eval / workflow stage, so evals can run independently with evals-only: true in perf-changelog
  • Add eval-only input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
  • Fix GSM8k prompt template: change #### <answer> to #### [number] to prevent models (e.g. Kimi-K2.5) from outputting the literal <answer> tag, which caused strict-match to return [invalid]
  • Add eval support to all 49 single-node benchmark scripts (server restart with native max context, eval context override, request timeout increase)
  • Update process_changelog.py and generate_sweep_configs.py to emit a separate evals matrix for eval-only jobs
  • Change eval workload from 1k8k to 8k1k
  • Groups now include dp-attn (previously missing, which caused dp-attn variants to be merged together)
  • Instead of picking max concurrency at the highest/lowest TP, selection now takes all TPs at the highest concurrency plus all TPs at the median concurrency

Separate jobs on a normal run tested with:

- config-keys:
    - gptoss-fp4-h100-vllm
    - dsr1-fp8-mi325x-sglang

Eval Context Override (per-framework patterns)

Each script calls compute_eval_context_length(model, baseline_ctx) which computes
baseline_ctx × 5 capped at the model's native max (read via transformers.AutoConfig).
The calling convention varies by framework because each passes context length differently:

| Framework | Input to compute_eval_context_length | How it reaches the server |
| --- | --- | --- |
| SGLang | ISL + OSL + 20 (dynamic) | Captured into _eval_ctx, appended as --context-length $_eval_ctx via EVAL_CONTEXT_ARGS |
| vLLM / TRT-LLM | $MAX_MODEL_LEN (pre-existing variable) | Overrides MAX_MODEL_LEN in-place; flows into the YAML config or --max_seq_len flag |
| ATOM | 10240 (hardcoded baseline) | Written into the CALCULATED_MAX_MODEL_LEN string as --max-model-len $eval_ctx |
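The sizing rule itself reduces to a small helper. A minimal sketch (hypothetical standalone form: the real implementation reads the native max via transformers.AutoConfig, so here it is passed in as a plain argument for illustration):

```shell
# Sketch of the eval-context sizing rule: baseline_ctx * 5, capped at the
# model's native max. In the real script the native max comes from the
# model's HuggingFace config; here it is an explicit second argument.
compute_eval_context_length() {
  local baseline_ctx="$1"
  local native_max="$2"
  local eval_ctx=$(( baseline_ctx * 5 ))
  if (( eval_ctx > native_max )); then
    eval_ctx="$native_max"   # cap at the model's native context window
  fi
  echo "$eval_ctx"
}

# 8k1k workload: ISL + OSL + 20 = 9236, and 9236 * 5 = 46180 (under the cap)
compute_eval_context_length 9236 163840   # -> 46180
```

With a smaller native max (say 40960), the cap branch takes over and the function returns the native max unchanged.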

Implementation Details

  • Standardize eval-only context sizing across single-node scripts: compute EVAL_MAX_MODEL_LEN
    as min((ISL + OSL + 20) * 5, model_native_max) and plumb it into each framework's server
    args/config (--context-length, max-model-len, etc.)
  • get_native_max_context_length(model) — reads max_position_embeddings / max_sequence_length / seq_length / n_positions from the model's HuggingFace config.
  • max_output_tokens: dynamically computed as eval_context_len - 4096 (capped at
    16384); previously fixed at 8192
  • Score validation: new validate_scores.py — post-eval gate checking exact_match metrics ≥ 0.85, exits non-zero on failure or zero metrics found
  • Failure mode: eval-only uses if-no-files-found: error (was ignore)
  • GSM8k prompt: #### <answer> → #### [number]
  • ATOM torchvision workaround: uninstalls torchvision before lm-eval install to avoid circular import
  • PYTHONDONTWRITEBYTECODE=1 added to workflow env to avoid EACCES error on pycache.
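The dynamic max_output_tokens rule above is plain integer arithmetic; a minimal sketch (hypothetical function name, assuming the described formula):

```shell
# Sketch of the dynamic max_output_tokens rule described above:
# eval_context_len - 4096, capped at 16384 (previously a fixed 8192).
compute_max_output_tokens() {
  local eval_ctx="$1"
  local budget=$(( eval_ctx - 4096 ))   # reserve room for the input prompt
  if (( budget > 16384 )); then
    budget=16384                        # hard cap on generation length
  fi
  echo "$budget"
}

compute_max_output_tokens 46180   # large eval context -> capped at 16384
compute_max_output_tokens 9236    # small eval context -> 5140
```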

New job step: evals

Oseltamivir and others added 22 commits March 8, 2026 08:31
- Switch eval selection from 1k8k to 8k1k in mark_eval_entries()
- After throughput benchmark, kill server and restart with model's
  native max context length (max_position_embeddings) for eval
- Replace hardcoded gen_max_tokens=16384 and max_tokens=8192 with
  the native max model length
- Add _start_eval_server() supporting sglang, vllm, trtllm, and atom
- Add EVAL_SERVER_EXTRA_ARGS to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
max_tokens in gen_kwargs must be less than max_length to leave room
for the input prompt. Without this cap, the server rejects requests
where input_tokens + max_tokens > max_model_len.

Uses max_length - 4096 as the generation cap (min 8192).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30%
for the prompt. Echo the budget for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
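The 70/30 split described in that commit reduces to one line of integer arithmetic; a standalone sketch (values illustrative, not taken from a real run):

```shell
# 70% of max_length goes to generation; the remaining 30% is left
# for the prompt, per the commit message above.
max_length=46180
max_tokens=$(( max_length * 70 / 100 ))
echo "generation budget: $max_tokens of $max_length"
```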
- Add --no-skip-eval-only flag to wait_for_server_ready so
  _start_eval_server properly waits for the eval server
- Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via EVAL_CONCURRENT_REQUESTS, independent of benchmark CONC
- Cap eval max_gen_tokens to 8192 to avoid KV cache issues
- Read num_fewshot from eval YAML instead of CLI override
- Add 'trt' as alias for trtllm in eval server case
- Reduce seq len configs for OOM prevention
- Change eval sweep to top of curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error

The ATOM container has a broken torchvision that causes circular import
errors when lm_eval loads. Since we use local-chat-completions (API-based),
torchvision is not needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call

Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files
and runs gsm8k + gpqa_diamond consecutively in one invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection

max_gen_tokens was hardcoded to 16384 but servers with smaller
EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length
when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly.
Reverts the max_gen_tokens cap since the server should have sufficient context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file

Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only
run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s

5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents
single-request timeouts on slow models like Kimi K2.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	benchmarks/single_node/dsr1_fp4_b200.sh
#	benchmarks/single_node/dsr1_fp4_mi355x.sh
#	benchmarks/single_node/dsr1_fp8_b200.sh
#	benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#	benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_mi300x.sh
#	benchmarks/single_node/dsr1_fp8_mi355x.sh
#	benchmarks/single_node/glm5_fp8_mi355x.sh
#	benchmarks/single_node/gptoss_fp4_mi300x.sh
#	benchmarks/single_node/gptoss_fp4_mi325x.sh
#	benchmarks/single_node/gptoss_fp4_mi355x.sh
#	benchmarks/single_node/kimik2.5_fp4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_mi325x.sh
#	benchmarks/single_node/kimik2.5_int4_mi355x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi355x.sh
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

Contributor

@functionstackx functionstackx left a comment


thanks for the PR @Oseltamivir ! seems great that we now run evals in parallel instead of sequentially

mostly LGTM

just had a couple of questions

  1. does this PR change the evals to use the industry standard num of shots for gsm8k?
  2. previously when evals hard crashed, the script still exited with a success exit code; is this fixed now?

@Oseltamivir Oseltamivir requested a review from a team March 15, 2026 22:08
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 600582e to 826035c on March 15, 2026 22:08
@Oseltamivir
Collaborator Author

  1. Changed to 5-shot
  2. Yes, and also added the suggested gate that fails when a result is < 85%

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from cfe5ca1 to 766a742 on March 20, 2026 04:25
Contributor

@claude claude bot left a comment


All critical bugs from prior review rounds have been addressed — LGTM.

Extended reasoning...

Overview

PR #911 separates eval-only runs into a dedicated workflow path, switches evals to the 8k1k workload, adds validate_scores.py, and updates 49 benchmark scripts with EVAL_ONLY blocks, context length expansion, and --concurrent-requests removal. 69 files changed across GitHub Actions workflows, benchmark shell scripts, Python utilities, and documentation.

Security Risks

No security-sensitive code is touched. The use of subprocess.run in process_changelog.py and git diff in get_added_lines operates on local filesystem paths and repository refs passed via argparse, with no shell=True — no injection risk. No auth, crypto, or permissions code is modified.

Level of Scrutiny

This PR required high scrutiny due to its size and the critical nature of eval correctness. The review spanned 10 rounds over six days (2026-03-15 through 2026-03-20) and identified six 🔴 critical bugs that have all been fixed: EVAL_ONLY env var propagation to Docker containers, EVAL_MAX_MODEL_LEN export in subshell contexts, validate_scores.py silent pass when no metrics matched, missing --max_seq_len in gptoss_fp4_h200_trt.sh, missing always() guard on eval artifact upload, and stale --concurrent-requests in glm5_fp8_b200.sh and qwen3.5_fp8_b200_mtp.sh (the last two confirmed fixed in the current diff).

Other Factors

Remaining 🟡 issues are minor or acknowledged: exit vs return in run_eval() is a latent artifact-stranding edge case on eval failure; median concurrency selecting index len//2 collapses to highest for 2-element lists (documented in AGENTS.md); e2e SINGLE filter duplication when --evals-only is passed wastes CI resources but produces correct results. None of these block merging. All inline comments are resolved and the PR has a human LGTM.

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from e6a8801 to 4dcfc92 on March 22, 2026 05:44
if [ "$benchmark_ctx" -eq 0 ] 2>/dev/null; then
benchmark_ctx="${native_max:-0}"
fi
local eval_ctx=$(( benchmark_ctx * 1 ))
Contributor

🔴 The compute_eval_context_length function uses a multiplier of 1 instead of 5, so the eval context is never expanded. For a typical 8k1k workload (ISL=8192, OSL=1024), this produces an eval context of 9236 tokens instead of 46180, completely defeating the purpose of eval-only mode. Fix: change benchmark_ctx * 1 to benchmark_ctx * 5 on line 617 of benchmarks/benchmark_lib.sh.

Extended reasoning...

What the bug is

In benchmarks/benchmark_lib.sh line 617, the newly introduced compute_eval_context_length() function contains:

local eval_ctx=$(( benchmark_ctx * 1 ))

The multiplier is 1 instead of 5. The function comment directly above (lines 601-603) explicitly states: "Uses 5x the benchmark context capped at the model's native max." The PR description also confirms the intent: compute baseline_ctx x 5 capped at the model native max.

The specific code path that triggers it

Every benchmark script introduced or modified by this PR calls setup_eval_context(), which in turn calls compute_eval_context_length(). The computed value is then used to set --context-length, MAX_MODEL_LEN, or CALCULATED_MAX_MODEL_LEN for the inference server. With multiplier=1, the server starts with the exact same context window as the benchmark workload — the eval-context expansion is a no-op.

Why existing code does not prevent it

The function was newly introduced in this PR, so there was no prior implementation to compare against. The bug is a simple typo (* 1 instead of * 5). The cap logic checking whether eval_ctx > native_max is correct but irrelevant since the expansion never occurs in the first place.

Impact

For a typical 8k1k workload (ISL=8192, OSL=1024), the benchmark context is ISL + OSL + 20 = 9236 tokens. With the correct multiplier, eval_ctx = min(9236 x 5, native_max) = 46180. With the current multiplier of 1, eval_ctx = 9236 x 1 = 9236. The server starts with only ~9236 tokens of context instead of ~46180. The whole point of eval-only mode is to give lm-eval a generous context budget to avoid truncation failures — all 49 benchmark scripts that call setup_eval_context() are affected, making eval context expansion completely non-functional.

Step-by-step proof

  1. Eval-only job dispatches with EVAL_ONLY=true, ISL=8192, OSL=1024.
  2. Benchmark script calls setup_eval_context(), which calls compute_eval_context_length "$MODEL" "$((8192 + 1024 + 20))".
  3. Inside the function: benchmark_ctx=9236, eval_ctx = 9236 * 1 = 9236.
  4. Since 9236 < native_max (e.g., 163840 for DeepSeek-R1), no cap is applied.
  5. EVAL_MAX_MODEL_LEN=9236 is exported; server starts with --context-length 9236.
  6. Expected: eval_ctx = 9236 * 5 = 46180; server should start with --context-length 46180.
  7. Result: eval context is identical to benchmark context — eval-only mode's core feature is broken.

Fix

Change line 617 from benchmark_ctx * 1 to benchmark_ctx * 5.
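As a sanity check of the numbers cited in the proof above, the broken and corrected arithmetic can be compared directly (a standalone sketch, not the actual benchmark_lib.sh):

```shell
# 8k1k workload inputs, per the proof above.
isl=8192; osl=1024
benchmark_ctx=$(( isl + osl + 20 ))   # 9236
broken=$(( benchmark_ctx * 1 ))       # what the buggy multiplier produces
fixed=$(( benchmark_ctx * 5 ))        # what the function comment promises
echo "broken=$broken fixed=$fixed"
```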

Comment on lines 233 to 237
collect-evals:
needs:
[
sweep-single-node-1k1k,
sweep-single-node-1k8k,
sweep-single-node-8k1k,
sweep-multi-node-1k1k,
sweep-multi-node-1k8k,
sweep-multi-node-8k1k,
setup,
]
needs: [sweep-evals, setup]
if: ${{ always() && needs.setup.result != 'skipped' }}
uses: ./.github/workflows/collect-evals.yml
secrets: inherit
Contributor

🔴 The collect-evals job in run-sweep.yml (line 235) uses if: ${{ always() && needs.setup.result != 'skipped' }} which does not check whether sweep-evals was skipped. When a PR adds no 8k1k single-node eval configs, sweep-evals is skipped and no eval_* artifacts are uploaded, but collect-evals still runs and collect-evals.yml calls actions/download-artifact@v8 without error-on-missing-artifacts: false, causing a spurious CI failure. The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no check on test-sweep-evals.result. Fix: add && needs.sweep-evals.result != 'skipped' to the condition in run-sweep.yml, and similarly gate collect-evals in e2e-tests.yml on test-sweep-evals.result != 'skipped'.

Extended reasoning...

What the bug is and how it manifests

In run-sweep.yml line 235, the collect-evals job was updated to depend on [sweep-evals, setup] instead of all the sweep jobs. However, its condition remains if: ${{ always() && needs.setup.result != 'skipped' }}. The sweep-evals job itself has a guard:

if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && ... != 'null' }}

When a PR only adds multi-node configs, or only adds non-8k1k single-node configs, mark_eval_entries() in generate_sweep_configs.py (which exclusively marks 8k1k entries) produces an empty evals list, process_changelog.py generates no eval entries, and sweep-evals is skipped. Because collect-evals only checks needs.setup.result != 'skipped' and not needs.sweep-evals.result != 'skipped', collect-evals still runs.

The specific code path

  1. PR is opened that adds only multi-node or non-8k1k benchmark configs
  2. process_changelog.py runs with --evals-only, produces an empty list
  3. sweep-evals evaluates its condition: evals == [] → skipped
  4. No eval_* artifacts are uploaded to the workflow run
  5. collect-evals evaluates always() && needs.setup.result != 'skipped' → true (setup ran successfully)
  6. collect-evals runs and invokes collect-evals.yml
  7. Inside collect-evals.yml, actions/download-artifact@v8 with pattern eval_* finds no matching artifacts
  8. In actions/download-artifact v4+, the default behavior is to fail when no matching artifacts are found (unlike v3 which silently succeeded)
  9. The step exits with a non-zero code → spurious CI failure for an otherwise-valid PR

Why existing code doesn't prevent it

The condition needs.setup.result != 'skipped' guards against the case where setup itself is skipped (e.g. if the workflow was cancelled before it started), but it does nothing to guard against sweep-evals being skipped due to an empty eval matrix. The always() function forces the job to run regardless of upstream job statuses, so even if sweep-evals is skipped, collect-evals runs. actions/download-artifact@v8 does not have error-on-missing-artifacts: false set in collect-evals.yml, so it fails hard when no eval_* artifacts exist.

Impact

Any PR that does not produce 8k1k single-node eval configs will experience a spurious CI failure in the collect-evals job, despite the actual benchmark and eval logic working correctly. Additionally, since collect-evals.result is 'failure' (not 'skipped'), the trigger-ingest job (line 295-318 in run-sweep.yml) may still fire because its condition checks collect-evals.result != 'skipped', potentially triggering an ingest with no eval data.

The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no guard on test-sweep-evals.result.

How to fix

In run-sweep.yml, change line 235 from:

if: ${{ always() && needs.setup.result != 'skipped' }}

to:

if: ${{ always() && needs.setup.result != 'skipped' && needs.sweep-evals.result != 'skipped' }}

In e2e-tests.yml, change:

if: ${{ always() }}

to:

if: ${{ always() && needs.test-sweep-evals.result != 'skipped' }}

Alternatively, add error-on-missing-artifacts: false to the actions/download-artifact@v8 call in collect-evals.yml and add null-handling in collect_eval_results.py for the case of no artifacts.

Step-by-step proof

  1. A PR is submitted that adds only dsr1-fp4-mi355x-sglang (which has ISL=1024, OSL=1024 — a 1k1k config, not 8k1k)
  2. process_changelog.py runs generate_sweep_configs.py --evals-only for this config
  3. mark_eval_entries() finds no entries with isl=8192, osl=1024 → returns empty list
  4. sweep-evals condition: evals == [] → skipped, no artifacts uploaded
  5. collect-evals condition: always() && setup.result != 'skipped' → true → runs
  6. actions/download-artifact@v8 searches for eval_* artifacts → finds none → exits 1
  7. collect-evals fails, PR shows a red CI check for an otherwise-valid benchmark PR

Comment on lines 197 to 226
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: server_logs_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
path: server.log
if-no-files-found: ignore

- name: Upload GPU metrics
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: gpu_metrics_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_gpu_metrics_' || 'gpu_metrics_' }}${{ env.RESULT_FILENAME }}
path: gpu_metrics.csv
if-no-files-found: ignore

- name: Upload eval results (if any)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: eval_${{ env.EXP_NAME }}_${{ env.RESULT_FILENAME }}
path: |
meta_env.json
results*.json
sample*.jsonl
if-no-files-found: ignore
if-no-files-found: ${{ inputs.eval-only && 'error' || 'ignore' }}

- name: Cleanup eval outputs (post-upload)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
run: |
rm -f meta_env.json || true
Contributor

🟡 The Cleanup eval outputs step removes meta_env.json and results*.json but never removes sample*.jsonl files, which are also moved into the workspace CWD by append_lm_eval_summary and then uploaded as artifacts. On persistent self-hosted GPU runners, these files accumulate across runs, causing subsequent artifact uploads to mix stale sample data from prior runs with current results. Fix: add 'rm -f sample*.jsonl || true' to the cleanup step.

Extended reasoning...

Analysis of bug_004: Cleanup step omits sample*.jsonl files

What the bug is and how it manifests

The 'Cleanup eval outputs (post-upload)' step in benchmark-tmpl.yml (lines 223-228) removes meta_env.json and results*.json after uploading eval artifacts, but it does NOT remove sample*.jsonl files. These sample files are explicitly included in the artifact upload path (lines 217-220) and are moved into the workspace CWD during benchmark execution. Because persistent self-hosted GPU runners reuse the same workspace directory across jobs, any sample*.jsonl files left behind will accumulate across runs.

The specific code path that triggers it

  1. run_lm_eval is invoked with --log_samples (benchmark_lib.sh line 681), causing lm-eval to write samples_{task}_*.jsonl files into a temporary output directory.
  2. append_lm_eval_summary uses find "${out_dir}" -type f -name "*.json*" (benchmark_lib.sh line 755) -- the glob *.json* matches both .json and .jsonl extensions -- and moves all matched files into the workspace CWD.
  3. The upload step (lines 212-221) correctly includes sample*.jsonl in its path block.
  4. The cleanup step (lines 223-228) only runs 'rm -f meta_env.json' and 'rm -f results*.json', leaving sample*.jsonl files on disk.

Why existing code does not prevent it

There is no pre-run workspace sweep for sample*.jsonl. The resource cleanup steps (pre-run and post-run) only handle Docker containers and SLURM jobs, not leftover workspace files. The comment on line 227 says 'Remove any eval results JSONs that were moved into workspace' but the implementation only removes .json files, missing the .jsonl sample files that were also moved in by the same *.json* glob.

What the impact would be

On persistent self-hosted GPU runners, each eval run that produces sample logs will leave sample*.jsonl files in the workspace. After N runs, N sets of sample files accumulate. The next artifact upload will pick up all of them (since sample*.jsonl is a glob), bundling stale samples from previous runs into the current run's eval artifact. This corrupts eval artifact provenance -- a reviewer examining eval samples cannot tell which run they belong to.

How to fix it

Add one line to the cleanup step:

- name: Cleanup eval outputs (post-upload)
  if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
  run: |
    rm -f meta_env.json || true
    rm -f results*.json || true
    rm -f sample*.jsonl || true   # add this line

Step-by-step proof

  1. Run A executes with run-eval: true. run_lm_eval --log_samples writes samples_mmlu_run_a.jsonl to a temp output dir.
  2. append_lm_eval_summary runs find matching .json, moves samples_mmlu_run_a.jsonl to workspace CWD.
  3. Upload step uploads meta_env.json, results_mmlu.json, and samples_mmlu_run_a.jsonl -- correct.
  4. Cleanup step removes meta_env.json and results_mmlu.json but NOT samples_mmlu_run_a.jsonl.
  5. Run B starts on the same runner. Workspace still contains samples_mmlu_run_a.jsonl from Run A.
  6. Run B produces samples_mmlu_run_b.jsonl (its own results).
  7. Upload step now uploads BOTH samples_mmlu_run_a.jsonl (stale, from Run A) and samples_mmlu_run_b.jsonl (current) -- artifact is contaminated with stale data from prior run.
