Stop the brain from coercing NA/missing padj to 0 when filtering stats tables#360
Closed
dannon wants to merge 1 commit into
Closed
Stop the brain from coercing NA/missing padj to 0 when filtering stats tables#360dannon wants to merge 1 commit into
dannon wants to merge 1 commit into
Conversation
DESeq2 writes NA in the padj column for independent-filtered and outlier genes, and a bare awk filter like `$7+0 < 0.05` coerces those NA rows to 0, so every one slips through as "significant" with no error to catch -- in galaxyproject#355 that turned a ~1,560-gene contrast into a reported 8,738 that then propagated downstream. Added a subsection to the verification discipline prompt block: when thresholding a p-value/padj/FDR column, drop non-numeric/missing rows explicitly (prefer Python/R with real NA handling, or guard the column in awk) and sanity-check the surviving count. Scoped to significance filtering with an escape hatch for a deliberate zero-imputation convention, and the awk example flags its own limitation (it only excludes literal NA -- header, blanks, and `.` coerce the same way) so it doesn't get cargo-culted. Mechanical test asserts the load-bearing strings ship in the prompt.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #355
DESeq2 writes
NAin thepadjcolumn for genes dropped by independent filtering / outlier detection. A bare awk filter like$7+0 < 0.05coerces thoseNArows to0("NA"+0 == 0in awk), and0 < 0.05is true, so every NA-padj gene gets silently counted as significant. In the run that prompted this, one contrast was reported as 8,738 significant genes where the NA-aware count was ~1,560 -- and that wrong number flowed into downstream shared-DEG and TF analyses before anyone caught it. Same class of silent scientific-correctness gap as #318 and #220.This adds a guidance subsection to the brain's verification-discipline prompt block (
buildVerificationDisciplineBlockinextensions/loom/context.ts) -- the shipped system prompt, not the human-facing docs. The rule: when thresholding a p-value/padj/FDR column, drop non-numeric/missing rows explicitly (prefer Python/R with real NA handling, or guard the column in awk) and sanity-check the surviving count against expectation.A few deliberate choices:
context.ts, each with a mechanical test. This follows that.$7 != "NA"only excludes the literalNA; the header row, blanks, and.coerce to0the same way, so the text says so and points at Python/R as the reliable default rather than handing over one incantation to cargo-cult.Ran a skeptical adversarial review pass over the diff; it caught the awk-example-too-narrow and scope-too-broad points above, which are folded in.
Tests: added a case to
tests/verification-context.test.tsasserting the load-bearing strings ship in the prompt. Full root suite green (1238 passing), root + app typecheck clean. Not live-eyeballed in a running session.