Fix(eval): Ignore flaky and otherwise unsuitable tests #40
Merged
Port the June 2026 test-ignore updates from the internal reference harness into ProgramBench across 41 instances (+270 ignored tests total), keeping all metadata verbatim, and update CLAUDE.md accordingly.

Branch-hash keys and per-branch test name-sets are unchanged, so no test blob re-upload is needed; only ignore metadata differs. The testorg__calculator fixture is untouched.

Internal-reference-commit: 1ca7719a46b9fc521af2353e287b4f07f4c071ea
Internal-reference-commit: a8f153ad485daf95e2041b5e872b5b715b7f5410
Internal-reference-commit: b6466b9e33c2abd88ce8ca06037c876b1edcc864
Internal-reference-commit: d7f13854873fd23b60606729d649f31a4dc7d7ac
Internal-reference-commit: 7b3a00e057cc625c86bcb9988a439ae50adb0828
Internal-reference-commit: 83488589c5237b4b9af904e1809997df735106f0
Internal-reference-commit: 3bbf4632e7d5e978b817bc168fa0897d0bbc2161
Internal-reference-commit: 691dcaa75a89d8d6f22fc7253c641375df9c3bdf
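The statement that only ignore metadata differs could be checked mechanically. The following is a minimal sketch, not the project's actual tooling: it assumes each `tests.json` is a JSON object with a top-level `branches` mapping whose values are objects, and the helper names `without_ignores` and `only_ignore_metadata_differs` are hypothetical.

```python
import copy


def without_ignores(doc: dict) -> dict:
    """Return a deep copy of a tests.json document with every
    branches.<hash>.ignored_tests list removed.

    Assumption: `branches` maps a branch hash to a JSON object.
    """
    stripped = copy.deepcopy(doc)
    for branch in stripped.get("branches", {}).values():
        branch.pop("ignored_tests", None)
    return stripped


def only_ignore_metadata_differs(old: dict, new: dict) -> bool:
    """True when the two documents are identical once ignore metadata
    is stripped, i.e. branch-hash keys and all other per-branch data match."""
    return without_ignores(old) == without_ignores(new)
```

If this returns `True` for the old and new version of every task file, the branch-hash keys and everything stored under them apart from `ignored_tests` are unchanged, which is the condition the description relies on to skip re-uploading test blobs.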
Pull request overview
This PR updates ProgramBench’s task evaluation metadata to ignore newly identified flaky, non-discriminating, or otherwise unsuitable behavioral tests, and documents the ignore-reason taxonomy in CLAUDE.md.
Changes:
- Added new `branches.<hash>.ignored_tests[]` entries across many task `tests.json` files to exclude flaky/deterministically-failing tests from scoring (with detailed reason metadata).
- Extended ignore coverage for several TUI/network/timing-sensitive test suites and golden drift/build-stamp/toolchain-sensitive cases.
- Documented ignore reason IDs and semantics in `CLAUDE.md`.
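To illustrate what excluding tests from scoring could look like, here is a minimal sketch. Only the `branches.<hash>.ignored_tests[]` path comes from the PR; the entry fields `name` and `reason`, the reason value `"flaky"`, the test names, and the helper functions are hypothetical and may not match the real schema or scorer.

```python
def ignored_names(tests_json: dict, branch_hash: str) -> set[str]:
    """Collect the names listed under branches.<hash>.ignored_tests[].

    Assumption: each entry is either a bare test name or an object
    with a "name" field plus reason metadata.
    """
    branch = tests_json.get("branches", {}).get(branch_hash, {})
    names = set()
    for entry in branch.get("ignored_tests", []):
        names.add(entry if isinstance(entry, str) else entry["name"])
    return names


def pass_rate(results: dict[str, bool], ignored: set[str]) -> float:
    """Fraction of non-ignored tests that passed (0.0 if none remain)."""
    counted = [ok for name, ok in results.items() if name not in ignored]
    return sum(counted) / len(counted) if counted else 0.0


# Hypothetical data: two of the four tests are ignored for this branch.
tests_json = {
    "branches": {
        "abc123": {
            "ignored_tests": [
                {"name": "test_tui_render", "reason": "flaky"},
                "test_version_exact",
            ]
        }
    }
}
results = {
    "test_tui_render": False,
    "test_version_exact": False,
    "test_add": True,
    "test_sub": False,
}
rate = pass_rate(results, ignored_names(tests_json, "abc123"))
```

With this data `rate` is 0.5: the two ignored failures drop out of the denominator, so only `test_add` and `test_sub` are counted.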
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/programbench/data/tasks/zk-org__zk.10d93d5/tests.json | Adds ignore metadata for a golden-output drift failure in list formatting. |
| src/programbench/data/tasks/ys-l__flamelens.0b4dc33/tests.json | Adds ignore metadata for a flaky TUI render/timing test. |
| src/programbench/data/tasks/yassinebridi__serpl.c48a9d7/tests.json | Adds many ignored tests due to flaky TUI/tmux/db-lock/test-isolation behavior. |
| src/programbench/data/tasks/y2z__monolith.8702e66/tests.json | Ignores a deterministic gold failure due to external resource non-reproducibility. |
| src/programbench/data/tasks/unhappychoice__gittype.34b72d0/tests.json | Adds multiple ignored tests flagged as flaky under gold runs. |
| src/programbench/data/tasks/tstack__lnav.ee34494/tests.json | Ignores a deterministic build-stamp-dependent usage/version mismatch. |
| src/programbench/data/tasks/svenstaro__genact.16f96e3/tests.json | Ignores a randomized-output flaky test module case. |
| src/programbench/data/tasks/stranger6667__jsonschema.d52e881/tests.json | Ignores a deterministic gold failure tied to toolchain drift/external TLS endpoint. |
| src/programbench/data/tasks/sqlite__sqlite.839433d/tests.json | Ignores several flaky tests due to shared DB state / isolation races. |
| src/programbench/data/tasks/sheepla__pingu.926d475/tests.json | Ignores deterministic toolchain/build-stamp-sensitive failures in CLI/version outputs. |
| src/programbench/data/tasks/rust-embedded__svd2rust.1760b5e/tests.json | Adds ignored tests for build-stamp/toolchain-sensitive output expectations. |
| src/programbench/data/tasks/rs__curlie.5dfcbb1/tests.json | Ignores network-dependent/flaky curl connection tests and snapshot flakes. |
| src/programbench/data/tasks/rhysd__kiro-editor.4157485/tests.json | Adds ignored tests for flaky TUI render timing and UTF-8 input state capture. |
| src/programbench/data/tasks/rcoh__angle-grinder.9c2fc88/tests.json | Ignores flaky progressive/ANSI TTY rendering snapshot cases. |
| src/programbench/data/tasks/raviqqe__muffet.a882908/tests.json | Ignores a flaky local HTTP test-server dial race. |
| src/programbench/data/tasks/pls-rs__pls.4e1ae50/tests.json | Ignores a flaky filtering cutoff test. |
| src/programbench/data/tasks/peco__peco.4e58dad/tests.json | Ignores multiple flaky tcell/TUI screen-init and tty-dependent cases. |
| src/programbench/data/tasks/orf__gping.26eb5b9/tests.json | Ignores flaky TUI ping-stats rendering timing races. |
| src/programbench/data/tasks/noborus__ov.b96c2ba/tests.json | Adds several ignored tests due to flaky TUI/snapshot behavior. |
| src/programbench/data/tasks/nikolassv__bartib.6b9b5ce/tests.json | Ignores a deterministic toolchain formatting drift in table output. |
| src/programbench/data/tasks/mkj__dropbear.75f699b/tests.json | Ignores numerous SSH socket/port/host-key race flakes in integration tests. |
| src/programbench/data/tasks/kyoheiu__felix.95df390/tests.json | Ignores flaky narrow-terminal layout calculation behavior. |
| src/programbench/data/tasks/kisielk__errcheck.dacab89/tests.json | Ignores a deterministic toolchain-dependent exit-code/output expectation. |
| src/programbench/data/tasks/junegunn__fzf.b56d614/tests.json | Ignores build-stamp-dependent version exactness tests. |
| src/programbench/data/tasks/jrnxf__thokr.09375ef/tests.json | Ignores timing-dependent elapsed-time and TUI timer rendering flakes. |
| src/programbench/data/tasks/jgm__pandoc.5caad90/tests.json | Ignores deterministic pandoc-types API/toolchain drift failures in JSON AST output. |
| src/programbench/data/tasks/isona__dirble.e2dea9f/tests.json | Ignores network-dependent/flaky local HTTP test-server connection races. |
| src/programbench/data/tasks/htop-dev__htop.523600b/tests.json | Ignores a flaky ultra-intensive TUI/tree test. |
| src/programbench/data/tasks/hatoo__oha.8dc6349/tests.json | Ignores multiple flaky output/TUI timing cases and a deterministic toolchain drift. |
| src/programbench/data/tasks/gromacs__gromacs.665ea4c/tests.json | Ignores deterministic build-stamp-dependent structure-related tests. |
| src/programbench/data/tasks/ffmpeg__ffmpeg.360a402/tests.json | Ignores deterministic feature-set/build-configuration dependent help/list outputs and a setup race. |
| src/programbench/data/tasks/elkowar__pipr.fae0b17/tests.json | Ignores a flaky TUI command list window interaction test. |
| src/programbench/data/tasks/ekzhang__bore.8e059cd/tests.json | Ignores a flaky harvest/proxy test. |
| src/programbench/data/tasks/duckdb__duckdb.bdb65ec/tests.json | Ignores a flaky SQL nondeterminism case (row order without ORDER BY). |
| src/programbench/data/tasks/dandavison__delta.acd758f/tests.json | Ignores a flaky git-grep output-related test. |
| src/programbench/data/tasks/chirlu__sox.42b3557/tests.json | Ignores deterministic build feature-set dependent CLI option/help expectation tests. |
| src/programbench/data/tasks/canop__broot.d6c798e/tests.json | Ignores a flaky TUI panel size toggle test. |
| src/programbench/data/tasks/burntsushi__ripgrep.3b7fd44/tests.json | Ignores nondeterministic output ordering flakes in CLI/vimgrep tests. |
| src/programbench/data/tasks/bensadeh__tailspin.6278437/tests.json | Ignores a flaky follow-mode test. |
| src/programbench/data/tasks/ast-grep__ast-grep.dde0fe0/tests.json | Ignores transient nondeterminism in HTML injection matching and diagnostic ordering. |
| src/programbench/data/tasks/antonmedv__walk.bf802ef/tests.json | Ignores multiple flaky TUI/listing output snapshot cases and delete/undo races. |
| CLAUDE.md | Documents ignored-test storage location and the meaning of ignore reason IDs. |