Separate eval-only workflow and change to 8k1k #911
base: main
Changes from all commits
**First file: `.github/workflows/benchmark-tmpl.yml`**

```diff
@@ -54,6 +54,11 @@ on:
       type: boolean
       required: true
       default: false
+    eval-only:
+      description: "Run only evals (skip throughput benchmark)"
+      type: boolean
+      required: false
+      default: false
     random-range-ratio:
       required: false
       type: string
```
```diff
@@ -83,6 +88,8 @@ env:
   SPEC_DECODING: ${{ inputs.spec-decoding }}
   DISAGG: ${{ inputs.disagg }}
   RUN_EVAL: ${{ inputs.run-eval }}
+  EVAL_ONLY: ${{ inputs.eval-only }}
+  PYTHONDONTWRITEBYTECODE: '1'

 permissions:
   contents: read
```
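The new `PYTHONDONTWRITEBYTECODE: '1'` env entry is presumably there to keep the job's repeated `python3` invocations from writing `.pyc` caches into a reused workspace. A minimal sketch of the mechanism (standard Python behavior, not repo code):

```shell
#!/bin/sh
# PYTHONDONTWRITEBYTECODE: any non-empty value disables writing .pyc files;
# inside the interpreter it is visible as sys.dont_write_bytecode.
PYTHONDONTWRITEBYTECODE=1 python3 -c 'import sys; print(sys.dont_write_bytecode)'  # True
# An empty value does NOT enable it -- the variable must be non-empty:
PYTHONDONTWRITEBYTECODE= python3 -c 'import sys; print(sys.dont_write_bytecode)'   # False
```

This is why the workflow sets the value to the string `'1'` rather than leaving it blank.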
```diff
@@ -91,7 +98,7 @@ jobs:
   benchmark:
     runs-on: ${{ inputs.runner }}
     timeout-minutes: 300
-    name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.run-eval && ' | eval' || '' }}"
+    name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.eval-only && ' | eval-only' || (inputs.run-eval && ' | eval' || '') }}"
     steps:
       - name: Resource cleanup (pre-run)
         run: &resource-cleanup |
```
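GitHub Actions expressions have no ternary operator, so the updated `name:` nests two `cond && x || y` chains to get three branches (`eval-only`, `eval`, or nothing). The idiom picks `x` when `cond` is truthy, but falls through to `y` whenever the `x` branch itself is falsy; it is safe here because both string literals are non-empty. A sketch of the same fallthrough behavior using the shell's own `&&`/`||` (illustration only, not repo code):

```shell
#!/bin/sh
# `cond && A || B` runs A when cond succeeds, B otherwise -- with the caveat
# that B also runs if A itself fails, mirroring the expression idiom's
# falsy-x pitfall.
true  && echo "x-branch" || echo "y-branch"   # prints x-branch
false && echo "x-branch" || echo "y-branch"   # prints y-branch
```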
```diff
@@ -145,28 +152,42 @@ jobs:
           echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV

           bash ./runners/launch_${RUNNER_NAME%%_*}.sh
-          FOUND_RESULT_FILE=
-          for i in {1..10}; do
-            if [ -f "$RESULT_FILENAME.json" ]; then
-              FOUND_RESULT_FILE=true
-              break
-            fi
-            echo "Waiting for result file... (attempt $i)"
-            sleep 1
-          done
-
-          if [ -z "$FOUND_RESULT_FILE" ]; then
-            echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
-            exit 1
-          fi
+          if [ "${{ inputs.eval-only }}" = "true" ]; then
+            echo "Eval-only mode: skipping benchmark result file check"
+            # Verify eval produced results
+            if ! ls results*.json 1>/dev/null 2>&1; then
+              echo "Eval-only run failed: no results*.json files found." >&2
+              exit 1
+            fi
+            # Verify eval scores meet minimum threshold (85%)
+            python3 utils/evals/validate_scores.py
+          else
+            FOUND_RESULT_FILE=
+            for i in {1..10}; do
+              if [ -f "$RESULT_FILENAME.json" ]; then
+                FOUND_RESULT_FILE=true
+                break
+              fi
+              echo "Waiting for result file... (attempt $i)"
+              sleep 1
+            done
+
+            if [ -z "$FOUND_RESULT_FILE" ]; then
+              echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
+              exit 1
+            fi
+          fi

       - name: Process result
+        if: ${{ !inputs.eval-only }}
         env:
           RUNNER_TYPE: ${{ inputs.runner }}
         run: |
           python3 utils/process_result.py

       - name: Upload result
+        if: ${{ !inputs.eval-only }}
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
         with:
           name: bmk_${{ env.RESULT_FILENAME }}
```
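The benchmark branch above polls for the result file with a bounded retry loop. A standalone sketch of the same pattern (hypothetical `wait_for_file` helper, not part of the repo):

```shell
#!/bin/sh
# Poll for a file up to N times, sleeping 1s between attempts; return 0 when
# the file appears, 1 if it never does -- the shape of the workflow's loop.
wait_for_file() {
  file="$1"; attempts="${2:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if [ -f "$file" ]; then
      return 0
    fi
    echo "Waiting for $file... (attempt $i)"
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Demo: a background job produces the file after 2s; the wait loop finds it.
tmp=$(mktemp -d)
( sleep 2; : > "$tmp/result.json" ) &
if wait_for_file "$tmp/result.json" 10; then
  echo "found"
else
  echo "missing" >&2
fi
rm -rf "$tmp"
```

Bounding the attempts keeps a missing result from hanging the job until the 300-minute timeout.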
```diff
@@ -176,31 +197,31 @@ jobs:
         if: always()
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
         with:
-          name: server_logs_${{ env.RESULT_FILENAME }}
+          name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
           path: server.log
           if-no-files-found: ignore

       - name: Upload GPU metrics
         if: always()
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
         with:
-          name: gpu_metrics_${{ env.RESULT_FILENAME }}
+          name: ${{ inputs.eval-only && 'eval_gpu_metrics_' || 'gpu_metrics_' }}${{ env.RESULT_FILENAME }}
           path: gpu_metrics.csv
           if-no-files-found: ignore

       - name: Upload eval results (if any)
-        if: ${{ env.RUN_EVAL == 'true' }}
+        if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
         with:
           name: eval_${{ env.EXP_NAME }}_${{ env.RESULT_FILENAME }}
           path: |
             meta_env.json
             results*.json
             sample*.jsonl
-          if-no-files-found: ignore
+          if-no-files-found: ${{ inputs.eval-only && 'error' || 'ignore' }}

       - name: Cleanup eval outputs (post-upload)
-        if: ${{ env.RUN_EVAL == 'true' }}
+        if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
         run: |
           rm -f meta_env.json || true
           # Remove any eval results JSONs that were moved into workspace
```

**Review comment (Contributor), on lines 197 to 226:**

🟡 The "Cleanup eval outputs" step removes `meta_env.json` and `results*.json` but never removes `sample*.jsonl` files, which are also moved into the workspace CWD by `append_lm_eval_summary` and then uploaded as artifacts. On persistent self-hosted GPU runners, these files accumulate across runs, causing subsequent artifact uploads to mix stale sample data from prior runs with current results. Fix: add `rm -f sample*.jsonl || true` to the cleanup step.

*Extended reasoning:* The "Cleanup eval outputs (post-upload)" step in `benchmark-tmpl.yml` (lines 223–228) removes `meta_env.json` and `results*.json` after uploading eval artifacts, but not `sample*.jsonl`, even though those files are explicitly included in the upload path (lines 217–220) and are moved into the workspace CWD during benchmark execution. Nothing else prevents the buildup: there is no pre-run workspace sweep for `sample*.jsonl`, and the resource-cleanup steps (pre-run and post-run) only handle Docker containers and SLURM jobs, not leftover workspace files. The in-step comment on line 227 says "Remove any eval results JSONs that were moved into workspace", but the implementation only removes `.json` files, missing the `.jsonl` sample files moved in by the same flow. The impact: after N eval runs on a persistent runner, N sets of sample files remain, and because `sample*.jsonl` is a glob, the next upload bundles stale samples from previous runs into the current run's eval artifact — corrupting eval artifact provenance, since a reviewer examining eval samples cannot tell which run they belong to. Suggested fix, adding one line to the cleanup step:

```yaml
- name: Cleanup eval outputs (post-upload)
  if: ${{ always() && (env.RUN_EVAL == 'true' || inputs.eval-only) }}
  run: |
    rm -f meta_env.json || true
    rm -f results*.json || true
    rm -f sample*.jsonl || true  # add this line
```
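The reviewer's accumulation scenario can be reproduced in isolation. A minimal sketch (hypothetical filenames, simulating two consecutive runs in one reused workspace directory):

```shell
#!/bin/sh
# On a persistent runner, files matching an upload glob accumulate across
# runs unless each run cleans them up.
ws=$(mktemp -d)   # stands in for the reused workspace
cd "$ws"

# Run 1 leaves its eval outputs behind; cleanup as written removes JSONs only.
: > results_run1.json
: > sample_run1.jsonl
rm -f results*.json

# Run 2 produces fresh outputs.
: > results_run2.json
: > sample_run2.jsonl

# Run 2's upload glob now matches run 1's stale sample file too.
echo "uploaded: $(ls sample*.jsonl | tr '\n' ' ')"
```

Adding `rm -f sample*.jsonl` to run 1's cleanup would leave only `sample_run2.jsonl` matching.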
**Second file: the sweep workflow**
```diff
@@ -183,6 +183,36 @@ jobs:
     secrets: inherit
     with: *single-node-inputs

+  sweep-evals:
+    needs: setup
+    if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && toJson(fromJson(needs.setup.outputs.search-space-config).evals) != 'null' }}
+    uses: ./.github/workflows/benchmark-tmpl.yml
+    name: eval /
+    strategy:
+      fail-fast: false
+      matrix:
+        config: ${{ fromJson(needs.setup.outputs.search-space-config).evals }}
+    secrets: inherit
+    with:
+      exp-name: ${{ matrix.config.exp-name }}
+      isl: ${{ matrix.config.isl }}
+      osl: ${{ matrix.config.osl }}
+      max-model-len: ${{ matrix.config.max-model-len }}
+      runner: ${{ matrix.config.runner }}
+      image: ${{ matrix.config.image }}
+      model: ${{ matrix.config.model }}
+      model-prefix: ${{ matrix.config.model-prefix }}
+      framework: ${{ matrix.config.framework }}
+      precision: ${{ matrix.config.precision }}
+      tp: ${{ matrix.config.tp }}
+      ep: ${{ matrix.config.ep }}
+      dp-attn: ${{ matrix.config.dp-attn }}
+      conc: ${{ matrix.config.conc }}
+      spec-decoding: ${{ matrix.config.spec-decoding }}
+      disagg: ${{ matrix.config.disagg }}
+      run-eval: true
+      eval-only: true
+
   collect-results:
     needs:
       [
```
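The `sweep-evals` gate has to treat both an empty list and a missing or null `evals` key as "no evals to run", which is why it compares the re-serialized value against both `'[]'` and `'null'`. A sketch of the same predicate (hypothetical `has_evals` helper, using `python3` for JSON handling instead of GitHub's expression functions):

```shell
#!/bin/sh
# Mirror of the workflow gate: run evals only when the config's `evals`
# entry is a non-empty list (serialized form is neither "[]" nor "null").
has_evals() {
  python3 -c 'import json, sys
e = json.loads(sys.argv[1]).get("evals")
print(json.dumps(e) not in ("[]", "null"))' "$1"
}

has_evals '{"evals": [{"exp-name": "demo"}]}'  # True
has_evals '{"evals": []}'                      # False
has_evals '{}'                                 # False
```

Checking only one of the two sentinel values would let a config with no `evals` key (serialized as `null`) slip past the gate.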
```diff
@@ -201,16 +231,7 @@ jobs:
       result-prefix: "bmk"

   collect-evals:
-    needs:
-      [
-        sweep-single-node-1k1k,
-        sweep-single-node-1k8k,
-        sweep-single-node-8k1k,
-        sweep-multi-node-1k1k,
-        sweep-multi-node-1k8k,
-        sweep-multi-node-8k1k,
-        setup,
-      ]
+    needs: [sweep-evals, setup]
     if: ${{ always() && needs.setup.result != 'skipped' }}
     uses: ./.github/workflows/collect-evals.yml
     secrets: inherit
```
**Review comment (Contributor), on lines 233 to 237:**

🔴 `collect-evals` now depends only on `sweep-evals` and `setup`, but its condition `if: ${{ always() && needs.setup.result != 'skipped' }}` never checks whether `sweep-evals` itself was skipped. `sweep-evals` is gated on `if: ${{ toJson(fromJson(needs.setup.outputs.search-space-config).evals) != '[]' && toJson(...) != 'null' }}`, so when a PR only adds multi-node configs, or only adds non-8k1k single-node configs, the `evals` list is empty, `sweep-evals` is skipped — yet `collect-evals` still runs with no eval artifacts to collect. Any such PR will see a spurious CI failure in `collect-evals`. The same issue exists in the test workflow, where the collect job's condition is `if: ${{ always() }}` over `test-sweep-evals`.

Fix: in `collect-evals`, change

`if: ${{ always() && needs.setup.result != 'skipped' }}`

to:

`if: ${{ always() && needs.setup.result != 'skipped' && needs.sweep-evals.result != 'skipped' }}`

and in the test workflow change `if: ${{ always() }}` to `if: ${{ always() && needs.test-sweep-evals.result != 'skipped' }}`.
```diff
@@ -221,10 +242,12 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Extract and save changelog metadata
-        env:
-          CONFIG_JSON: ${{ needs.setup.outputs.search-space-config }}
         run: |
-          echo "$CONFIG_JSON" | jq '.changelog_metadata' > changelog_metadata.json
+          cat <<'CONFIGEOF' > _full_config.json
+          ${{ needs.setup.outputs.search-space-config }}
+          CONFIGEOF
+          jq '.changelog_metadata' _full_config.json > changelog_metadata.json
+          rm -f _full_config.json

       - name: Upload changelog artifact
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
```
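The new step quotes the heredoc delimiter (`'CONFIGEOF'`), which is what keeps the shell from performing any expansion on the interpolated JSON body before it reaches `jq`. A minimal sketch of that quoting behavior (plain shell, hypothetical filenames, not repo code):

```shell
#!/bin/sh
# With a quoted delimiter ('EOF'), the shell does no parameter or command
# expansion inside the heredoc, so text containing $, backticks, or quotes
# is written to the file verbatim.
cat <<'EOF' > quoted.json
{"path": "$HOME/model", "cmd": "`id`"}
EOF

# With an unquoted delimiter, $HOME is expanded before writing.
cat <<EOF > unquoted.json
{"note": "expanded: $HOME"}
EOF

grep -c '\$HOME' quoted.json            # prints 1: literal $HOME survived
grep -c '\$HOME' unquoted.json || true  # prints 0: it was expanded away
rm -f quoted.json unquoted.json
```

Note that `${{ … }}` interpolation happens on GitHub's side before the shell ever sees the script; the quoted delimiter only guards against the shell re-expanding characters inside the already-interpolated JSON.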