Fix(eval): Ignore flaky and otherwise unsuitable tests #40

Merged: klieret merged 1 commit into main from sync-june-fixes on Jun 18, 2026

@klieret commented Jun 18, 2026

Port the June 2026 test-ignore updates from the internal reference harness into ProgramBench across 41 instances (+270 ignored tests total), keeping all metadata verbatim, and update CLAUDE.md accordingly.

Branch-hash keys and per-branch test name-sets are unchanged, so no test blob re-upload is needed; only ignore metadata differs. The testorg__calculator fixture is untouched.

Internal-reference-commit: 1ca7719a46b9fc521af2353e287b4f07f4c071ea
Internal-reference-commit: a8f153ad485daf95e2041b5e872b5b715b7f5410
Internal-reference-commit: b6466b9e33c2abd88ce8ca06037c876b1edcc864
Internal-reference-commit: d7f13854873fd23b60606729d649f31a4dc7d7ac
Internal-reference-commit: 7b3a00e057cc625c86bcb9988a439ae50adb0828
Internal-reference-commit: 83488589c5237b4b9af904e1809997df735106f0
Internal-reference-commit: 3bbf4632e7d5e978b817bc168fa0897d0bbc2161
Internal-reference-commit: 691dcaa75a89d8d6f22fc7253c641375df9c3bdf
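
The claim that only ignore metadata differs can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: it assumes each tests.json holds a top-level `branches` mapping keyed by branch hash, and that each branch has an `ignored_tests` list. The `tests` key, the `name` and `reason` fields, and the reason value `"flaky"` are hypothetical placeholders used only for illustration.

```python
import copy


def only_ignores_changed(old: dict, new: dict) -> bool:
    """Return True if two tests.json documents differ at most in `ignored_tests`."""

    def strip_ignores(doc: dict) -> dict:
        # Work on a copy so the caller's data is not mutated.
        doc = copy.deepcopy(doc)
        for branch in doc.get("branches", {}).values():
            branch.pop("ignored_tests", None)
        return doc

    # With the ignore metadata removed, branch-hash keys and per-branch
    # test name-sets must compare equal.
    return strip_ignores(old) == strip_ignores(new)


# Hypothetical before/after documents (field names are assumptions).
old_doc = {"branches": {"abc123": {"tests": ["test_a", "test_b"], "ignored_tests": []}}}
new_doc = {
    "branches": {
        "abc123": {
            "tests": ["test_a", "test_b"],
            "ignored_tests": [{"name": "test_b", "reason": "flaky"}],
        }
    }
}
renamed_doc = {"branches": {"abc123": {"tests": ["test_a", "test_c"], "ignored_tests": []}}}

print(only_ignores_changed(old_doc, new_doc))      # ignore metadata added only
print(only_ignores_changed(old_doc, renamed_doc))  # a test name changed
```

A check of this shape, run over the old and new versions of each changed file, would confirm that no test blob re-upload is needed.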

@meta-cla (bot) added the "CLA Signed" label Jun 18, 2026
@klieret changed the title from "Data: sync test-ignore updates from internal reference" to "Fix(eval): Ignore flaky and otherwise unsuitable tests" Jun 18, 2026
@klieret requested a review from Copilot Jun 18, 2026, 02:37

Copilot AI left a comment

Pull request overview

This PR updates ProgramBench’s task evaluation metadata to ignore newly identified flaky, non-discriminating, or otherwise unsuitable behavioral tests, and documents the ignore-reason taxonomy in CLAUDE.md.

Changes:

  • Added new branches.<hash>.ignored_tests[] entries across many task tests.json files to exclude flaky/deterministically-failing tests from scoring (with detailed reason metadata).
  • Extended ignore coverage for several TUI/network/timing-sensitive test suites and golden drift/build-stamp/toolchain-sensitive cases.
  • Documented ignore reason IDs and semantics in CLAUDE.md.
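
To illustrate how `branches.<hash>.ignored_tests[]` entries could exclude tests from scoring, here is a minimal sketch. It is not ProgramBench's evaluation code: the `tests` key, the `name` and `reason` fields, the reason value `"flaky"`, and both function names are assumptions made for the example.

```python
def scored_tests(task: dict, branch_hash: str) -> list[str]:
    """Return the branch's test names minus those listed in `ignored_tests`."""
    branch = task["branches"][branch_hash]
    # Assumed shape: each ignore entry is a dict with at least a "name" field.
    ignored = {entry["name"] for entry in branch.get("ignored_tests", [])}
    return [name for name in branch["tests"] if name not in ignored]


def pass_rate(results: dict[str, bool], scored: list[str]) -> float:
    """Fraction of scored tests that passed; ignored tests never count."""
    if not scored:
        return 0.0
    return sum(results.get(name, False) for name in scored) / len(scored)


# Hypothetical task metadata and run results.
task = {
    "branches": {
        "abc123": {
            "tests": ["test_a", "test_b", "test_c"],
            "ignored_tests": [{"name": "test_b", "reason": "flaky"}],
        }
    }
}
results = {"test_a": True, "test_b": False, "test_c": False}

kept = scored_tests(task, "abc123")
print(kept)                      # ['test_a', 'test_c']
print(pass_rate(results, kept))  # 0.5
```

In this sketch the failing `test_b` does not affect the score because it is filtered out before the pass rate is computed.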

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/programbench/data/tasks/zk-org__zk.10d93d5/tests.json | Adds ignore metadata for a golden-output drift failure in list formatting. |
| src/programbench/data/tasks/ys-l__flamelens.0b4dc33/tests.json | Adds ignore metadata for a flaky TUI render/timing test. |
| src/programbench/data/tasks/yassinebridi__serpl.c48a9d7/tests.json | Adds many ignored tests due to flaky TUI/tmux/db-lock/test-isolation behavior. |
| src/programbench/data/tasks/y2z__monolith.8702e66/tests.json | Ignores a deterministic gold failure due to external resource non-reproducibility. |
| src/programbench/data/tasks/unhappychoice__gittype.34b72d0/tests.json | Adds multiple ignored tests flagged as flaky under gold runs. |
| src/programbench/data/tasks/tstack__lnav.ee34494/tests.json | Ignores a deterministic build-stamp-dependent usage/version mismatch. |
| src/programbench/data/tasks/svenstaro__genact.16f96e3/tests.json | Ignores a randomized-output flaky test module case. |
| src/programbench/data/tasks/stranger6667__jsonschema.d52e881/tests.json | Ignores a deterministic gold failure tied to toolchain drift/external TLS endpoint. |
| src/programbench/data/tasks/sqlite__sqlite.839433d/tests.json | Ignores several flaky tests due to shared DB state / isolation races. |
| src/programbench/data/tasks/sheepla__pingu.926d475/tests.json | Ignores deterministic toolchain/build-stamp-sensitive failures in CLI/version outputs. |
| src/programbench/data/tasks/rust-embedded__svd2rust.1760b5e/tests.json | Adds ignored tests for build-stamp/toolchain-sensitive output expectations. |
| src/programbench/data/tasks/rs__curlie.5dfcbb1/tests.json | Ignores network-dependent/flaky curl connection tests and snapshot flakes. |
| src/programbench/data/tasks/rhysd__kiro-editor.4157485/tests.json | Adds ignored tests for flaky TUI render timing and UTF-8 input state capture. |
| src/programbench/data/tasks/rcoh__angle-grinder.9c2fc88/tests.json | Ignores flaky progressive/ANSI TTY rendering snapshot cases. |
| src/programbench/data/tasks/raviqqe__muffet.a882908/tests.json | Ignores a flaky local HTTP test-server dial race. |
| src/programbench/data/tasks/pls-rs__pls.4e1ae50/tests.json | Ignores a flaky filtering cutoff test. |
| src/programbench/data/tasks/peco__peco.4e58dad/tests.json | Ignores multiple flaky tcell/TUI screen-init and tty-dependent cases. |
| src/programbench/data/tasks/orf__gping.26eb5b9/tests.json | Ignores flaky TUI ping-stats rendering timing races. |
| src/programbench/data/tasks/noborus__ov.b96c2ba/tests.json | Adds several ignored tests due to flaky TUI/snapshot behavior. |
| src/programbench/data/tasks/nikolassv__bartib.6b9b5ce/tests.json | Ignores a deterministic toolchain formatting drift in table output. |
| src/programbench/data/tasks/mkj__dropbear.75f699b/tests.json | Ignores numerous SSH socket/port/host-key race flakes in integration tests. |
| src/programbench/data/tasks/kyoheiu__felix.95df390/tests.json | Ignores flaky narrow-terminal layout calculation behavior. |
| src/programbench/data/tasks/kisielk__errcheck.dacab89/tests.json | Ignores a deterministic toolchain-dependent exit-code/output expectation. |
| src/programbench/data/tasks/junegunn__fzf.b56d614/tests.json | Ignores build-stamp-dependent version exactness tests. |
| src/programbench/data/tasks/jrnxf__thokr.09375ef/tests.json | Ignores timing-dependent elapsed-time and TUI timer rendering flakes. |
| src/programbench/data/tasks/jgm__pandoc.5caad90/tests.json | Ignores deterministic pandoc-types API/toolchain drift failures in JSON AST output. |
| src/programbench/data/tasks/isona__dirble.e2dea9f/tests.json | Ignores network-dependent/flaky local HTTP test-server connection races. |
| src/programbench/data/tasks/htop-dev__htop.523600b/tests.json | Ignores a flaky ultra-intensive TUI/tree test. |
| src/programbench/data/tasks/hatoo__oha.8dc6349/tests.json | Ignores multiple flaky output/TUI timing cases and a deterministic toolchain drift. |
| src/programbench/data/tasks/gromacs__gromacs.665ea4c/tests.json | Ignores deterministic build-stamp-dependent structure-related tests. |
| src/programbench/data/tasks/ffmpeg__ffmpeg.360a402/tests.json | Ignores deterministic feature-set/build-configuration dependent help/list outputs and a setup race. |
| src/programbench/data/tasks/elkowar__pipr.fae0b17/tests.json | Ignores a flaky TUI command list window interaction test. |
| src/programbench/data/tasks/ekzhang__bore.8e059cd/tests.json | Ignores a flaky harvest/proxy test. |
| src/programbench/data/tasks/duckdb__duckdb.bdb65ec/tests.json | Ignores a flaky SQL nondeterminism case (row order without ORDER BY). |
| src/programbench/data/tasks/dandavison__delta.acd758f/tests.json | Ignores a flaky git-grep output-related test. |
| src/programbench/data/tasks/chirlu__sox.42b3557/tests.json | Ignores deterministic build feature-set dependent CLI option/help expectation tests. |
| src/programbench/data/tasks/canop__broot.d6c798e/tests.json | Ignores a flaky TUI panel size toggle test. |
| src/programbench/data/tasks/burntsushi__ripgrep.3b7fd44/tests.json | Ignores nondeterministic output ordering flakes in CLI/vimgrep tests. |
| src/programbench/data/tasks/bensadeh__tailspin.6278437/tests.json | Ignores a flaky follow-mode test. |
| src/programbench/data/tasks/ast-grep__ast-grep.dde0fe0/tests.json | Ignores transient nondeterminism in HTML injection matching and diagnostic ordering. |
| src/programbench/data/tasks/antonmedv__walk.bf802ef/tests.json | Ignores multiple flaky TUI/listing output snapshot cases and delete/undo races. |
| CLAUDE.md | Documents ignored-test storage location and the meaning of ignore reason IDs. |

@klieret merged commit 102c952 into main Jun 18, 2026
6 checks passed