Fix(eval): Ignore flaky and otherwise unsuitable tests by klieret · Pull Request #40 · facebookresearch/ProgramBench

klieret · 2026-06-18T02:32:47Z

Port the June 2026 test-ignore updates from the internal reference harness into ProgramBench across 41 instances (+270 ignored tests total), keeping all metadata verbatim, and update CLAUDE.md accordingly.

Branch-hash keys and per-branch test name-sets are unchanged, so no test blob re-upload is needed; only ignore metadata differs. The testorg__calculator fixture is untouched.

Internal-reference-commit: 1ca7719a46b9fc521af2353e287b4f07f4c071ea
Internal-reference-commit: a8f153ad485daf95e2041b5e872b5b715b7f5410
Internal-reference-commit: b6466b9e33c2abd88ce8ca06037c876b1edcc864
Internal-reference-commit: d7f13854873fd23b60606729d649f31a4dc7d7ac
Internal-reference-commit: 7b3a00e057cc625c86bcb9988a439ae50adb0828
Internal-reference-commit: 83488589c5237b4b9af904e1809997df735106f0
Internal-reference-commit: 3bbf4632e7d5e978b817bc168fa0897d0bbc2161
Internal-reference-commit: 691dcaa75a89d8d6f22fc7253c641375df9c3bdf
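
The claim that only ignore metadata differs can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: it assumes each tests.json holds a top-level `branches` mapping keyed by branch hash, and that each branch lists its test names under a `tests` key. The `tests` key, the entry fields, and the function name are hypothetical and chosen only for illustration.

```python
def only_ignore_metadata_differs(old: dict, new: dict) -> bool:
    """Return True if branch-hash keys and per-branch test name-sets match."""
    old_branches = old["branches"]
    new_branches = new["branches"]
    # Branch-hash keys must be identical in both versions.
    if set(old_branches) != set(new_branches):
        return False
    for branch_hash, old_branch in old_branches.items():
        # "tests" is a hypothetical key holding the branch's test names.
        old_names = set(old_branch.get("tests", []))
        new_names = set(new_branches[branch_hash].get("tests", []))
        if old_names != new_names:
            return False
    return True


# Hypothetical before/after metadata for one task.
before = {"branches": {"abc123": {"tests": ["test_a", "test_b"],
                                  "ignored_tests": []}}}
after = {"branches": {"abc123": {"tests": ["test_a", "test_b"],
                                 "ignored_tests": [{"name": "test_b",
                                                    "reason": "flaky"}]}}}

print(only_ignore_metadata_differs(before, after))
```

If a check like this passes for every task, the stored test blobs stay valid and only the JSON metadata needs to ship.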


Copilot

Pull request overview

This PR updates ProgramBench’s task evaluation metadata to ignore newly identified flaky, non-discriminating, or otherwise unsuitable behavioral tests, and documents the ignore-reason taxonomy in CLAUDE.md.

Changes:

- Added new branches.<hash>.ignored_tests[] entries across many task tests.json files to exclude flaky or deterministically failing tests from scoring (with detailed reason metadata).
- Extended ignore coverage for several TUI-, network-, and timing-sensitive test suites, and for cases sensitive to golden-output drift, build stamps, or the toolchain.
- Documented ignore reason IDs and semantics in CLAUDE.md.
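
To illustrate how ignored_tests entries could keep tests out of scoring, here is a minimal sketch. Only the `branches.<hash>.ignored_tests[]` path comes from this PR. The entry fields `name` and `reason`, the branch hash `abc123`, and the function `scored_results` are assumptions made for the example and may not match the real harness.

```python
def scored_results(task_meta: dict, branch_hash: str,
                   results: dict[str, bool]) -> dict[str, bool]:
    """Drop ignored tests from raw results before computing a score."""
    branch = task_meta["branches"][branch_hash]
    # Assumed entry shape: {"name": ..., "reason": ...} (hypothetical).
    ignored = {entry["name"] for entry in branch.get("ignored_tests", [])}
    return {name: passed for name, passed in results.items()
            if name not in ignored}


# Hypothetical metadata and raw results for one branch.
example_meta = {
    "branches": {
        "abc123": {
            "ignored_tests": [
                {"name": "test_tui_render", "reason": "flaky"},
            ]
        }
    }
}
raw = {"test_tui_render": False, "test_parse": True}

kept = scored_results(example_meta, "abc123", raw)
print(kept)
```

Filtering by name before scoring means a flaky failure neither penalizes nor rewards a submission, which matches the stated goal of excluding such tests.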

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated no comments.


File	Description
src/programbench/data/tasks/zk-org__zk.10d93d5/tests.json	Adds ignore metadata for a golden-output drift failure in list formatting.
src/programbench/data/tasks/ys-l__flamelens.0b4dc33/tests.json	Adds ignore metadata for a flaky TUI render/timing test.
src/programbench/data/tasks/yassinebridi__serpl.c48a9d7/tests.json	Adds many ignored tests due to flaky TUI/tmux/db-lock/test-isolation behavior.
src/programbench/data/tasks/y2z__monolith.8702e66/tests.json	Ignores a deterministic gold failure due to external resource non-reproducibility.
src/programbench/data/tasks/unhappychoice__gittype.34b72d0/tests.json	Adds multiple ignored tests flagged as flaky under gold runs.
src/programbench/data/tasks/tstack__lnav.ee34494/tests.json	Ignores a deterministic build-stamp-dependent usage/version mismatch.
src/programbench/data/tasks/svenstaro__genact.16f96e3/tests.json	Ignores a randomized-output flaky test module case.
src/programbench/data/tasks/stranger6667__jsonschema.d52e881/tests.json	Ignores a deterministic gold failure tied to toolchain drift/external TLS endpoint.
src/programbench/data/tasks/sqlite__sqlite.839433d/tests.json	Ignores several flaky tests due to shared DB state / isolation races.
src/programbench/data/tasks/sheepla__pingu.926d475/tests.json	Ignores deterministic toolchain/build-stamp-sensitive failures in CLI/version outputs.
src/programbench/data/tasks/rust-embedded__svd2rust.1760b5e/tests.json	Adds ignored tests for build-stamp/toolchain-sensitive output expectations.
src/programbench/data/tasks/rs__curlie.5dfcbb1/tests.json	Ignores network-dependent/flaky curl connection tests and snapshot flakes.
src/programbench/data/tasks/rhysd__kiro-editor.4157485/tests.json	Adds ignored tests for flaky TUI render timing and UTF-8 input state capture.
src/programbench/data/tasks/rcoh__angle-grinder.9c2fc88/tests.json	Ignores flaky progressive/ANSI TTY rendering snapshot cases.
src/programbench/data/tasks/raviqqe__muffet.a882908/tests.json	Ignores a flaky local HTTP test-server dial race.
src/programbench/data/tasks/pls-rs__pls.4e1ae50/tests.json	Ignores a flaky filtering cutoff test.
src/programbench/data/tasks/peco__peco.4e58dad/tests.json	Ignores multiple flaky tcell/TUI screen-init and tty-dependent cases.
src/programbench/data/tasks/orf__gping.26eb5b9/tests.json	Ignores flaky TUI ping-stats rendering timing races.
src/programbench/data/tasks/noborus__ov.b96c2ba/tests.json	Adds several ignored tests due to flaky TUI/snapshot behavior.
src/programbench/data/tasks/nikolassv__bartib.6b9b5ce/tests.json	Ignores a deterministic toolchain formatting drift in table output.
src/programbench/data/tasks/mkj__dropbear.75f699b/tests.json	Ignores numerous SSH socket/port/host-key race flakes in integration tests.
src/programbench/data/tasks/kyoheiu__felix.95df390/tests.json	Ignores flaky narrow-terminal layout calculation behavior.
src/programbench/data/tasks/kisielk__errcheck.dacab89/tests.json	Ignores a deterministic toolchain-dependent exit-code/output expectation.
src/programbench/data/tasks/junegunn__fzf.b56d614/tests.json	Ignores build-stamp-dependent version exactness tests.
src/programbench/data/tasks/jrnxf__thokr.09375ef/tests.json	Ignores timing-dependent elapsed-time and TUI timer rendering flakes.
src/programbench/data/tasks/jgm__pandoc.5caad90/tests.json	Ignores deterministic pandoc-types API/toolchain drift failures in JSON AST output.
src/programbench/data/tasks/isona__dirble.e2dea9f/tests.json	Ignores network-dependent/flaky local HTTP test-server connection races.
src/programbench/data/tasks/htop-dev__htop.523600b/tests.json	Ignores a flaky ultra-intensive TUI/tree test.
src/programbench/data/tasks/hatoo__oha.8dc6349/tests.json	Ignores multiple flaky output/TUI timing cases and a deterministic toolchain drift.
src/programbench/data/tasks/gromacs__gromacs.665ea4c/tests.json	Ignores deterministic build-stamp-dependent structure-related tests.
src/programbench/data/tasks/ffmpeg__ffmpeg.360a402/tests.json	Ignores deterministic feature-set/build-configuration dependent help/list outputs and a setup race.
src/programbench/data/tasks/elkowar__pipr.fae0b17/tests.json	Ignores a flaky TUI command list window interaction test.
src/programbench/data/tasks/ekzhang__bore.8e059cd/tests.json	Ignores a flaky harvest/proxy test.
src/programbench/data/tasks/duckdb__duckdb.bdb65ec/tests.json	Ignores a flaky SQL nondeterminism case (row order without ORDER BY).
src/programbench/data/tasks/dandavison__delta.acd758f/tests.json	Ignores a flaky git-grep output-related test.
src/programbench/data/tasks/chirlu__sox.42b3557/tests.json	Ignores deterministic build feature-set dependent CLI option/help expectation tests.
src/programbench/data/tasks/canop__broot.d6c798e/tests.json	Ignores a flaky TUI panel size toggle test.
src/programbench/data/tasks/burntsushi__ripgrep.3b7fd44/tests.json	Ignores nondeterministic output ordering flakes in CLI/vimgrep tests.
src/programbench/data/tasks/bensadeh__tailspin.6278437/tests.json	Ignores a flaky follow-mode test.
src/programbench/data/tasks/ast-grep__ast-grep.dde0fe0/tests.json	Ignores transient nondeterminism in HTML injection matching and diagnostic ordering.
src/programbench/data/tasks/antonmedv__walk.bf802ef/tests.json	Ignores multiple flaky TUI/listing output snapshot cases and delete/undo races.
CLAUDE.md	Documents ignored-test storage location and the meaning of ignore reason IDs.


meta-cla bot added the CLA Signed label Jun 18, 2026

klieret changed the title ~~Data: sync test-ignore updates from internal reference~~ Fix(eval): Ignore flaky and otherwise unsuitable tests Jun 18, 2026

klieret requested a review from Copilot June 18, 2026 02:37

Copilot started reviewing on behalf of klieret June 18, 2026 02:38

Copilot AI reviewed Jun 18, 2026


klieret merged commit 102c952 into main Jun 18, 2026
6 checks passed

klieret merged 1 commit into main from sync-june-fixes
