
Stop the brain from coercing NA/missing padj to 0 when filtering stats tables #360

Closed
dannon wants to merge 1 commit into galaxyproject:main from dannon:fix/355-na-padj-coercion

dannon commented Jun 25, 2026

Member

Closes #355

DESeq2 writes `NA` in the `padj` column for genes dropped by independent filtering or outlier detection. A bare awk filter like `$7+0 < 0.05` coerces those `NA` rows to 0 (`"NA"+0 == 0` in awk), and `0 < 0.05` is true, so every NA-padj gene is silently counted as significant. In the run that prompted this, one contrast was reported as 8,738 significant genes where the NA-aware count was ~1,560, and that wrong number flowed into downstream shared-DEG and TF analyses before anyone caught it. Same class of silent scientific-correctness gap as #318 and #220.
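To make the coercion concrete, here is a minimal Python sketch that imitates awk's string-to-number rule (take the leading numeric prefix, else 0) and applies the bare threshold. `awk_like_number` and the sample `padj_column` values are hypothetical illustrations, not code or data from this PR, and the regex only approximates awk's behavior.

```python
import re

# Rough emulation of awk's string-to-number rule: use the leading numeric
# prefix of the field, or 0 when there is none. Illustration only, not awk.
_NUM_PREFIX = re.compile(
    r"\s*[+-]?(\d+\.?\d*([eE][+-]?\d+)?|\.\d+([eE][+-]?\d+)?)"
)


def awk_like_number(field: str) -> float:
    """Convert a field the way `field + 0` roughly would in awk."""
    match = _NUM_PREFIX.match(field)
    return float(match.group(0)) if match else 0.0


# Hypothetical padj values: three real tests plus NA, blank, and "." rows.
padj_column = ["0.001", "0.2", "NA", "", ".", "0.04"]

# Bare threshold, equivalent in spirit to `$7+0 < 0.05`.
naive_hits = [p for p in padj_column if awk_like_number(p) < 0.05]

# Only "0.001" and "0.04" are genuinely below the threshold, but the
# NA, blank, and "." rows coerce to 0 and pass the test as well.
print(len(naive_hits), naive_hits)
```

In this toy column the naive filter keeps five rows where only two are real hits, which is the same failure mode as the inflated count described above, just at a smaller scale.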

This adds a guidance subsection to the brain's verification-discipline prompt block (`buildVerificationDisciplineBlock` in `extensions/loom/context.ts`), which is the shipped system prompt, not the human-facing docs. The rule: when thresholding a p-value/padj/FDR column, drop non-numeric/missing rows explicitly (prefer Python/R with real NA handling, or guard the column in awk) and sanity-check the surviving count against expectation.
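One way the rule could look in practice is sketched below in plain Python. The function name `filter_significant`, the `padj` key, and the sample rows are assumptions for illustration; the actual prompt text added by this PR is not reproduced here.

```python
import math


def filter_significant(rows, padj_key="padj", alpha=0.05):
    """Return (significant_rows, n_missing).

    Rows whose padj is absent, non-numeric (e.g. "NA", "", "."), or NaN
    are dropped explicitly and counted, never treated as 0.
    """
    significant = []
    n_missing = 0
    for row in rows:
        try:
            padj = float(row[padj_key])
        except (KeyError, TypeError, ValueError):
            n_missing += 1
            continue
        if math.isnan(padj):
            n_missing += 1
            continue
        if padj < alpha:
            significant.append(row)
    return significant, n_missing


# Hypothetical rows standing in for a parsed DESeq2 results table.
rows = [
    {"gene": "g1", "padj": "0.001"},
    {"gene": "g2", "padj": "NA"},
    {"gene": "g3", "padj": "0.2"},
    {"gene": "g4", "padj": ""},
    {"gene": "g5", "padj": "nan"},
    {"gene": "g6", "padj": "0.04"},
]

sig, missing = filter_significant(rows)

# Sanity check: report both counts so an unexpectedly large "significant"
# set or a large missing fraction is visible instead of silent.
print(f"{len(sig)} significant, {missing} missing, {len(rows)} total")
```

Returning the missing count alongside the hits is a deliberate choice in this sketch: it gives the caller something to compare against expectation, which is the sanity-check half of the rule.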

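A few deliberate choices:

- Scoped to significance filtering, with an escape hatch for a deliberate zero-imputation convention.
- The awk example flags its own limitation (it only excludes literal NA; header, blanks, and `.` coerce the same way) so it doesn't get cargo-culted.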

Ran a skeptical adversarial review pass over the diff; it caught the awk-example-too-narrow and scope-too-broad points above, which are folded in.

Tests: added a case to `tests/verification-context.test.ts` asserting the load-bearing strings ship in the prompt. Full root suite green (1238 passing), root + app typecheck clean. Not manually checked in a running session.

DESeq2 writes NA in the padj column for independent-filtered and outlier
genes, and a bare awk filter like `$7+0 < 0.05` coerces those NA rows to
0, so every one slips through as "significant" with no error to catch --
in galaxyproject#355 that turned a ~1,560-gene contrast into a reported 8,738 that
then propagated downstream. Added a subsection to the verification
discipline prompt block: when thresholding a p-value/padj/FDR column,
drop non-numeric/missing rows explicitly (prefer Python/R with real NA
handling, or guard the column in awk) and sanity-check the surviving
count. Scoped to significance filtering with an escape hatch for a
deliberate zero-imputation convention, and the awk example flags its own
limitation (it only excludes literal NA -- header, blanks, and `.` coerce
the same way) so it doesn't get cargo-culted. Mechanical test asserts the
load-bearing strings ship in the prompt.
Development

Successfully merging this pull request may close these issues.

Orbit: $7+0 < 0.05 awk filter counts NA/missing padj as significant, inflating DEG counts
