
Separate eval-only workflow and change to 8k1k #911

Open
Oseltamivir wants to merge 45 commits into main from
eval-8k1k-server-restart

Conversation

Collaborator

@Oseltamivir Oseltamivir commented Mar 15, 2026

Summary

  • Decouple eval jobs from benchmark jobs into a dedicated eval / workflow stage, so evals can run independently with evals-only: true in perf-changelog
  • Add eval-only input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
  • Fix GSM8k prompt template: change #### <answer> to #### [number] to prevent models (e.g. Kimi-K2.5) from outputting the literal <answer> tag, which caused strict-match to return [invalid]
  • Add eval support to all 49 single-node benchmark scripts (server restart with native max context, eval context override, request timeout increase)
  • Update process_changelog.py and generate_sweep_configs.py to emit a separate evals matrix for eval-only jobs
  • Change eval workload from 1k8k to 8k1k
  • Groups now include dp-attn (previously missing, which caused dp-attn variants to be merged together)
  • Instead of picking max concurrency at the highest/lowest TP, selection now takes all TPs at the highest concurrency plus all TPs at the median concurrency

Separate jobs on a normal run tested with:

- config-keys:
    - gptoss-fp4-h100-vllm
    - dsr1-fp8-mi325x-sglang

Eval Context Override (per-framework patterns)

Each script calls compute_eval_context_length(model, baseline_ctx) which computes
baseline_ctx × 5 capped at the model's native max (read via transformers.AutoConfig).
The calling convention varies by framework because each passes context length differently:

| Framework | Input to compute_eval_context_length | How it reaches the server |
| --- | --- | --- |
| SGLang | ISL + OSL + 20 (dynamic) | Captured into _eval_ctx, appended as --context-length $_eval_ctx via EVAL_CONTEXT_ARGS |
| vLLM / TRT-LLM | $MAX_MODEL_LEN (pre-existing variable) | Overrides MAX_MODEL_LEN in-place; flows into the YAML config or --max_seq_len flag |
| ATOM | 10240 (hardcoded baseline) | Written into the CALCULATED_MAX_MODEL_LEN string as --max-model-len $eval_ctx |
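The sizing rule itself reduces to a small helper. A minimal sketch (hypothetical standalone form: the real implementation reads the native max via transformers.AutoConfig, so here it is passed in as a plain argument for illustration):

```shell
# Sketch of the eval-context sizing rule: baseline_ctx * 5, capped at the
# model's native max. In the real script the native max comes from the
# model's HuggingFace config; here it is an explicit second argument.
compute_eval_context_length() {
  local baseline_ctx="$1"
  local native_max="$2"
  local eval_ctx=$(( baseline_ctx * 5 ))
  if (( eval_ctx > native_max )); then
    eval_ctx="$native_max"   # cap at the model's native context window
  fi
  echo "$eval_ctx"
}

# 8k1k workload: ISL + OSL + 20 = 9236, and 9236 * 5 = 46180 (under the cap)
compute_eval_context_length 9236 163840   # -> 46180
```

With a smaller native max (say 40960), the cap branch takes over and the function returns the native max unchanged.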

Implementation Details

  • Standardize eval-only context sizing across single-node scripts: compute EVAL_MAX_MODEL_LEN
    as min((ISL + OSL + 20) * 5, model_native_max) and plumb it into each framework's server
    args/config (--context-length, max-model-len, etc.)
  • get_native_max_context_length(model) — reads max_position_embeddings / max_sequence_length / seq_length / n_positions from the model's HuggingFace config.
  • max_output_tokens: dynamically computed as eval_context_len - 4096 (capped at
    16384); previously fixed at 8192
  • Score validation: new validate_scores.py — post-eval gate checking exact_match metrics ≥ 0.85, exits non-zero on failure or zero metrics found
  • Failure mode: eval-only uses if-no-files-found: error (was ignore)
  • GSM8k prompt: #### <answer> → #### [number]
  • ATOM torchvision workaround: uninstalls torchvision before lm-eval install to avoid circular import
  • PYTHONDONTWRITEBYTECODE=1 added to workflow env to avoid EACCES error on pycache.
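The dynamic max_output_tokens rule above is plain integer arithmetic; a minimal sketch (hypothetical function name, assuming the described formula):

```shell
# Sketch of the dynamic max_output_tokens rule described above:
# eval_context_len - 4096, capped at 16384 (previously a fixed 8192).
compute_max_output_tokens() {
  local eval_ctx="$1"
  local budget=$(( eval_ctx - 4096 ))   # reserve room for the input prompt
  if (( budget > 16384 )); then
    budget=16384                        # hard cap on generation length
  fi
  echo "$budget"
}

compute_max_output_tokens 46180   # large eval context -> capped at 16384
compute_max_output_tokens 9236    # small eval context -> 5140
```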

New job step: evals

Oseltamivir and others added 22 commits March 8, 2026 08:31
- Switch eval selection from 1k8k to 8k1k in mark_eval_entries()
- After throughput benchmark, kill server and restart with model's
  native max context length (max_position_embeddings) for eval
- Replace hardcoded gen_max_tokens=16384 and max_tokens=8192 with
  the native max model length
- Add _start_eval_server() supporting sglang, vllm, trtllm, and atom
- Add EVAL_SERVER_EXTRA_ARGS to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
max_tokens in gen_kwargs must be less than max_length to leave room
for the input prompt. Without this cap, the server rejects requests
where input_tokens + max_tokens > max_model_len.

Uses max_length - 4096 as the generation cap (min 8192).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30%
for the prompt. Echo the budget for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
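The 70/30 split described in that commit reduces to one line of integer arithmetic; a standalone sketch (values illustrative, not taken from a real run):

```shell
# 70% of max_length goes to generation; the remaining 30% is left
# for the prompt, per the commit message above.
max_length=46180
max_tokens=$(( max_length * 70 / 100 ))
echo "generation budget: $max_tokens of $max_length"
```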
- Add --no-skip-eval-only flag to wait_for_server_ready so
  _start_eval_server properly waits for the eval server
- Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via EVAL_CONCURRENT_REQUESTS, independent of benchmark CONC
- Cap eval max_gen_tokens to 8192 to avoid KV cache issues
- Read num_fewshot from eval YAML instead of CLI override
- Add 'trt' as alias for trtllm in eval server case
- Reduce seq len configs for OOM prevention
- Change eval sweep to top of curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error

The ATOM container has a broken torchvision that causes circular import
errors when lm_eval loads. Since we use local-chat-completions (API-based),
torchvision is not needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call

Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files
and runs gsm8k + gpqa_diamond consecutively in one invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection

max_gen_tokens was hardcoded to 16384 but servers with smaller
EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length
when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly.
Reverts the max_gen_tokens cap since the server should have sufficient context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file

Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only
run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s

5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents
single-request timeouts on slow models like Kimi K2.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	benchmarks/single_node/dsr1_fp4_b200.sh
#	benchmarks/single_node/dsr1_fp4_mi355x.sh
#	benchmarks/single_node/dsr1_fp8_b200.sh
#	benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#	benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_mi300x.sh
#	benchmarks/single_node/dsr1_fp8_mi355x.sh
#	benchmarks/single_node/glm5_fp8_mi355x.sh
#	benchmarks/single_node/gptoss_fp4_mi300x.sh
#	benchmarks/single_node/gptoss_fp4_mi325x.sh
#	benchmarks/single_node/gptoss_fp4_mi355x.sh
#	benchmarks/single_node/kimik2.5_fp4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_mi325x.sh
#	benchmarks/single_node/kimik2.5_int4_mi355x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi355x.sh
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

Contributor

@functionstackx functionstackx left a comment


thanks for the PR @Oseltamivir ! seems great that we now run evals in parallel instead of sequentially

mostly LGTM

just had a couple of questions

  1. does this PR change the evals to use the industry standard num of shots for gsm8k?
  2. previously when evals hard crashed, the script still exited with a success exit code; is this fixed now?

@Oseltamivir Oseltamivir requested a review from a team March 15, 2026 22:08
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 600582e to 826035c on March 15, 2026 22:08
@Oseltamivir
Collaborator Author

  1. Changed to 5-shot
  2. Yes, and also added the suggested gate that fails when a result is < 85%

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from cfe5ca1 to 766a742 on March 20, 2026 04:25
Contributor

@claude claude bot left a comment


All critical bugs from prior review rounds have been addressed — LGTM.

Extended reasoning...

Overview

PR #911 separates eval-only runs into a dedicated workflow path, switches evals to the 8k1k workload, adds validate_scores.py, and updates 49 benchmark scripts with EVAL_ONLY blocks, context length expansion, and --concurrent-requests removal. 69 files changed across GitHub Actions workflows, benchmark shell scripts, Python utilities, and documentation.

Security Risks

No security-sensitive code is touched. The use of subprocess.run in process_changelog.py and git diff in get_added_lines operates on local filesystem paths and repository refs passed via argparse, with no shell=True — no injection risk. No auth, crypto, or permissions code is modified.

Level of Scrutiny

This PR required high scrutiny due to its size and the critical nature of eval correctness. The review spanned 10 rounds over six days (2026-03-15 through 2026-03-20) and identified six 🔴 critical bugs that have all been fixed: EVAL_ONLY env var propagation to Docker containers, EVAL_MAX_MODEL_LEN export in subshell contexts, validate_scores.py silent pass when no metrics matched, missing --max_seq_len in gptoss_fp4_h200_trt.sh, missing always() guard on eval artifact upload, and stale --concurrent-requests in glm5_fp8_b200.sh and qwen3.5_fp8_b200_mtp.sh (the last two confirmed fixed in the current diff).

Other Factors

Remaining 🟡 issues are minor or acknowledged: exit vs return in run_eval() is a latent artifact-stranding edge case on eval failure; median concurrency selecting index len//2 collapses to highest for 2-element lists (documented in AGENTS.md); e2e SINGLE filter duplication when --evals-only is passed wastes CI resources but produces correct results. None of these block merging. All inline comments are resolved and the PR has a human LGTM.

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from e6a8801 to 4dcfc92 on March 22, 2026 05:44
if [ "$benchmark_ctx" -eq 0 ] 2>/dev/null; then
benchmark_ctx="${native_max:-0}"
fi
local eval_ctx=$(( benchmark_ctx * 1 ))
Contributor

🔴 The compute_eval_context_length function uses a multiplier of 1 instead of 5, so the eval context is never expanded. For a typical 8k1k workload (ISL=8192, OSL=1024), this produces an eval context of 9236 tokens instead of 46180, completely defeating the purpose of eval-only mode. Fix: change benchmark_ctx * 1 to benchmark_ctx * 5 on line 617 of benchmarks/benchmark_lib.sh.

Extended reasoning...

What the bug is

In benchmarks/benchmark_lib.sh line 617, the newly introduced compute_eval_context_length() function contains:

local eval_ctx=$(( benchmark_ctx * 1 ))

The multiplier is 1 instead of 5. The function comment directly above (lines 601-603) explicitly states: "Uses 5x the benchmark context capped at the model's native max." The PR description also confirms the intent: compute baseline_ctx x 5 capped at the model native max.

The specific code path that triggers it

Every benchmark script introduced or modified by this PR calls setup_eval_context(), which in turn calls compute_eval_context_length(). The computed value is then used to set --context-length, MAX_MODEL_LEN, or CALCULATED_MAX_MODEL_LEN for the inference server. With multiplier=1, the server starts with the exact same context window as the benchmark workload — the eval-context expansion is a no-op.

Why existing code does not prevent it

The function was newly introduced in this PR, so there was no prior implementation to compare against. The bug is a simple typo (* 1 instead of * 5). The cap logic checking whether eval_ctx > native_max is correct but irrelevant since the expansion never occurs in the first place.

Impact

For a typical 8k1k workload (ISL=8192, OSL=1024), the benchmark context is ISL + OSL + 20 = 9236 tokens. With the correct multiplier, eval_ctx = min(9236 x 5, native_max) = 46180. With the current multiplier of 1, eval_ctx = 9236 x 1 = 9236. The server starts with only ~9236 tokens of context instead of ~46180. The whole point of eval-only mode is to give lm-eval a generous context budget to avoid truncation failures — all 49 benchmark scripts that call setup_eval_context() are affected, making eval context expansion completely non-functional.

Step-by-step proof

  1. Eval-only job dispatches with EVAL_ONLY=true, ISL=8192, OSL=1024.
  2. Benchmark script calls setup_eval_context(), which calls compute_eval_context_length "$MODEL" "$((8192 + 1024 + 20))".
  3. Inside the function: benchmark_ctx=9236, eval_ctx = 9236 * 1 = 9236.
  4. Since 9236 < native_max (e.g., 163840 for DeepSeek-R1), no cap is applied.
  5. EVAL_MAX_MODEL_LEN=9236 is exported; server starts with --context-length 9236.
  6. Expected: eval_ctx = 9236 * 5 = 46180; server should start with --context-length 46180.
  7. Result: eval context is identical to benchmark context — eval-only mode's core feature is broken.

Fix

Change line 617 from benchmark_ctx * 1 to benchmark_ctx * 5.
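As a sanity check of the numbers cited in the proof above, the broken and corrected arithmetic can be compared directly (a standalone sketch, not the actual benchmark_lib.sh):

```shell
# 8k1k workload inputs, per the proof above.
isl=8192; osl=1024
benchmark_ctx=$(( isl + osl + 20 ))   # 9236
broken=$(( benchmark_ctx * 1 ))       # what the buggy multiplier produces
fixed=$(( benchmark_ctx * 5 ))        # what the function comment promises
echo "broken=$broken fixed=$fixed"
```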

Comment on lines 233 to 237
collect-evals:
needs:
[
sweep-single-node-1k1k,
sweep-single-node-1k8k,
sweep-single-node-8k1k,
sweep-multi-node-1k1k,
sweep-multi-node-1k8k,
sweep-multi-node-8k1k,
setup,
]
needs: [sweep-evals, setup]
if: ${{ always() && needs.setup.result != 'skipped' }}
uses: ./.github/workflows/collect-evals.yml
secrets: inherit
Contributor

🔴 The collect-evals job in run-sweep.yml (line 235) uses if: ${{ always() && needs.setup.result != 'skipped' }} which does not check whether sweep-evals was skipped. When a PR adds no 8k1k single-node eval configs, sweep-evals is skipped and no eval_* artifacts are uploaded, but collect-evals still runs and collect-evals.yml calls actions/download-artifact@v8 without error-on-missing-artifacts: false, causing a spurious CI failure. The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no check on test-sweep-evals.result. Fix: add && needs.sweep-evals.result != 'skipped' to the condition in run-sweep.yml, and similarly gate collect-evals in e2e-tests.yml on test-sweep-evals.result != 'skipped'.

Extended reasoning...

What the bug is and how it manifests

In run-sweep.yml line 235, the collect-evals job was updated to depend on [sweep-evals, setup] instead of all the sweep jobs. However, its condition remains if: ${{ always() && needs.setup.result != 'skipped' }}. The sweep-evals job itself has a guard:

if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && ... != 'null' }}

When a PR only adds multi-node configs, or only adds non-8k1k single-node configs, mark_eval_entries() in generate_sweep_configs.py (which exclusively marks 8k1k entries) produces an empty evals list, process_changelog.py generates no eval entries, and sweep-evals is skipped. Because collect-evals only checks needs.setup.result != 'skipped' and not needs.sweep-evals.result != 'skipped', collect-evals still runs.

The specific code path

  1. PR is opened that adds only multi-node or non-8k1k benchmark configs
  2. process_changelog.py runs with --evals-only, produces an empty list
  3. sweep-evals evaluates its condition: evals == [] → skipped
  4. No eval_* artifacts are uploaded to the workflow run
  5. collect-evals evaluates always() && needs.setup.result != 'skipped' → true (setup ran successfully)
  6. collect-evals runs and invokes collect-evals.yml
  7. Inside collect-evals.yml, actions/download-artifact@v8 with pattern eval_* finds no matching artifacts
  8. In actions/download-artifact v4+, the default behavior is to fail when no matching artifacts are found (unlike v3 which silently succeeded)
  9. The step exits with a non-zero code → spurious CI failure for an otherwise-valid PR

Why existing code doesn't prevent it

The condition needs.setup.result != 'skipped' guards against the case where setup itself is skipped (e.g. if the workflow was cancelled before it started), but it does nothing to guard against sweep-evals being skipped due to an empty eval matrix. The always() function forces the job to run regardless of upstream job statuses, so even if sweep-evals is skipped, collect-evals runs. actions/download-artifact@v8 does not have error-on-missing-artifacts: false set in collect-evals.yml, so it fails hard when no eval_* artifacts exist.

Impact

Any PR that does not produce 8k1k single-node eval configs will experience a spurious CI failure in the collect-evals job, despite the actual benchmark and eval logic working correctly. Additionally, since collect-evals.result is 'failure' (not 'skipped'), the trigger-ingest job (line 295-318 in run-sweep.yml) may still fire because its condition checks collect-evals.result != 'skipped', potentially triggering an ingest with no eval data.

The same issue exists in e2e-tests.yml where collect-evals uses if: ${{ always() }} with no guard on test-sweep-evals.result.

How to fix

In run-sweep.yml, change line 235 from:

if: ${{ always() && needs.setup.result != 'skipped' }}

to:

if: ${{ always() && needs.setup.result != 'skipped' && needs.sweep-evals.result != 'skipped' }}

In e2e-tests.yml, change:

if: ${{ always() }}

to:

if: ${{ always() && needs.test-sweep-evals.result != 'skipped' }}

Alternatively, add error-on-missing-artifacts: false to the actions/download-artifact@v8 call in collect-evals.yml and add null-handling in collect_eval_results.py for the case of no artifacts.

Step-by-step proof

  1. A PR is submitted that adds only dsr1-fp4-mi355x-sglang (which has ISL=1024, OSL=1024 — a 1k1k config, not 8k1k)
  2. process_changelog.py runs generate_sweep_configs.py --evals-only for this config
  3. mark_eval_entries() finds no entries with isl=8192, osl=1024 → returns empty list
  4. sweep-evals condition: evals == [] → skipped, no artifacts uploaded
  5. collect-evals condition: always() && setup.result != 'skipped' → true → runs
  6. actions/download-artifact@v8 searches for eval_* artifacts → finds none → exits 1
  7. collect-evals fails, PR shows a red CI check for an otherwise-valid benchmark PR

Comment on lines 197 to 226
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: server_logs_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
path: server.log
if-no-files-found: ignore

- name: Upload GPU metrics
if: always()
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: gpu_metrics_${{ env.RESULT_FILENAME }}
name: ${{ inputs.eval-only && 'eval_gpu_metrics_' || 'gpu_metrics_' }}${{ env.RESULT_FILENAME }}
path: gpu_metrics.csv
if-no-files-found: ignore

- name: Upload eval results (if any)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: eval_${{ env.EXP_NAME }}_${{ env.RESULT_FILENAME }}
path: |
meta_env.json
results*.json
sample*.jsonl
if-no-files-found: ignore
if-no-files-found: ${{ inputs.eval-only && 'error' || 'ignore' }}

- name: Cleanup eval outputs (post-upload)
if: ${{ env.RUN_EVAL == 'true' }}
if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
run: |
rm -f meta_env.json || true
Contributor

🟡 The Cleanup eval outputs step removes meta_env.json and results*.json but never removes sample*.jsonl files, which are also moved into the workspace CWD by append_lm_eval_summary and then uploaded as artifacts. On persistent self-hosted GPU runners, these files accumulate across runs, causing subsequent artifact uploads to mix stale sample data from prior runs with current results. Fix: add 'rm -f sample*.jsonl || true' to the cleanup step.

Extended reasoning...

Analysis of bug_004: Cleanup step omits sample*.jsonl files

What the bug is and how it manifests

The 'Cleanup eval outputs (post-upload)' step in benchmark-tmpl.yml (lines 223-228) removes meta_env.json and results*.json after uploading eval artifacts, but it does NOT remove sample*.jsonl files. These sample files are explicitly included in the artifact upload path (lines 217-220) and are moved into the workspace CWD during benchmark execution. Because persistent self-hosted GPU runners reuse the same workspace directory across jobs, any sample*.jsonl files left behind will accumulate across runs.

The specific code path that triggers it

  1. run_lm_eval is invoked with --log_samples (benchmark_lib.sh line 681), causing lm-eval to write samples_{task}_*.jsonl files into a temporary output directory.
  2. append_lm_eval_summary uses find "${out_dir}" -type f -name "*.json*" (benchmark_lib.sh line 755) -- the glob *.json* matches both .json and .jsonl extensions -- and moves all matched files into the workspace CWD.
  3. The upload step (lines 212-221) correctly includes sample*.jsonl in its path block.
  4. The cleanup step (lines 223-228) only runs 'rm -f meta_env.json' and 'rm -f results*.json', leaving sample*.jsonl files on disk.

Why existing code does not prevent it

There is no pre-run workspace sweep for sample*.jsonl. The resource cleanup steps (pre-run and post-run) only handle Docker containers and SLURM jobs, not leftover workspace files. The comment on line 227 says 'Remove any eval results JSONs that were moved into workspace' but the implementation only removes .json files, missing the .jsonl sample files that were also moved in by the same *.json* glob.

What the impact would be

On persistent self-hosted GPU runners, each eval run that produces sample logs will leave sample*.jsonl files in the workspace. After N runs, N sets of sample files accumulate. The next artifact upload will pick up all of them (since sample*.jsonl is a glob), bundling stale samples from previous runs into the current run's eval artifact. This corrupts eval artifact provenance -- a reviewer examining eval samples cannot tell which run they belong to.

How to fix it

Add one line to the cleanup step:

- name: Cleanup eval outputs (post-upload)
  if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
  run: |
    rm -f meta_env.json || true
    rm -f results*.json || true
    rm -f sample*.jsonl || true   # add this line

Step-by-step proof

  1. Run A executes with run-eval: true. run_lm_eval --log_samples writes samples_mmlu_run_a.jsonl to a temp output dir.
  2. append_lm_eval_summary runs find matching .json, moves samples_mmlu_run_a.jsonl to workspace CWD.
  3. Upload step uploads meta_env.json, results_mmlu.json, and samples_mmlu_run_a.jsonl -- correct.
  4. Cleanup step removes meta_env.json and results_mmlu.json but NOT samples_mmlu_run_a.jsonl.
  5. Run B starts on the same runner. Workspace still contains samples_mmlu_run_a.jsonl from Run A.
  6. Run B produces samples_mmlu_run_b.jsonl (its own results).
  7. Upload step now uploads BOTH samples_mmlu_run_a.jsonl (stale, from Run A) and samples_mmlu_run_b.jsonl (current) -- artifact is contaminated with stale data from prior run.
