Add RCT historical validation reproducibility package by Airwhale · Pull Request #37 · Ely-S/PatientPunk

Airwhale · 2026-04-30T18:46:37Z

Summary

Adds a self-contained reproducibility package for the paper's core claim: that pre-publication community sentiment on r/covidlonghaulers predicted clinical trial outcomes for 6 Long COVID drugs.

The package lives at docs/RCT_historical_validation/ and reproduces Figure 1, Table 2, and Table 3 from "A Methodology for Gathering Real-World Evidence at Scale: Using NLP-Extracted Community Treatment Reports to Predict Clinical Trial Outcomes."

What's inside

File	Purpose
`output/paper_figures.html`	Open in any browser to see all figures — no setup required
`output/figure1.png`	Figure 1 standalone (forest plot: responders + non-responders, 95% Wilson CIs, 6 drugs)
`output/figure0.png`	Figure 0 standalone (full sentiment breakdown)
`output/paper_figures_executed.ipynb`	Executed Jupyter notebook with all outputs
`output/provenance.json`	Build manifest (git commit, DB SHA-256, model env, full extraction-runs table)
`_build_paper_figures.py`	Main script — reads the analysis SQLite DB, runs the analysis, generates notebook + HTML + PNG
`verify.py`	One-command reproducibility gate (PASS/FAIL summary, exits 0/1)
`paths.py`	Canonical DB-path resolver (env-var override + marker-based anchor walk-up)
`build_notebook.py`	Helper for notebook creation/execution/export
`DRUG_ALIASES.md` / `THREAD_AUDIT.md` / `DEDUP_AUDIT.md` / `ARCTIC_SHIFT_PROVENANCE.md`	Reviewer-facing audit artifacts
`requirements.txt`	10 Python packages
`README.md`	Full methodology walkthrough, download links, expected output for spot-checking

What the analysis shows

For each of 6 drugs, we count what proportion of r/covidlonghaulers users reported a positive experience — using only posts from before the relevant trial published. Per-(user, drug) deduplication: most recent post wins (full UTC timestamp); signal strength acts as a near-vestigial tiebreaker on exact-timestamp ties. Result:

Drug	n	% Positive	p-value	Trial Result
loratadine	90	81.1%	< 1e-9	✅ positive
famotidine	232	77.2%	< 1e-16	✅ positive
naltrexone	154	65.6%	1.36e-4	✅ positive
colchicine	91	53.8%	0.530	⬜ null
paxlovid	196	54.1%	0.284	⬜ null
prednisone	343	48.7%	0.666	⬜ null

All 6 match: drugs the community clearly favored → trials found effective. Drugs the community was split on → trials found no effect.

How to use

Just viewing: Clone the repo → open docs/RCT_historical_validation/output/paper_figures.html in a browser.

Full re-run: Download the single ~314 MB analysis DB from S3 into data/, then run the build:

cd docs/RCT_historical_validation/
pip install -r requirements.txt

mkdir -p data
curl -fL -o data/historical_validation_2020-07_to_2022-12.db \
    https://patientpunk.s3.amazonaws.com/scientific_validation/rct_historical/processed/historical_validation_2020-07_to_2022-12.db

python verify.py            # one-shot PASS/FAIL gate (~10s)
python _build_paper_figures.py

verify.py runs every build-time assertion (DB integrity, SHA-256 match, per-drug window, thread reconstruction, expected outputs) and exits 0/1. Run it immediately after the download to catch HTTP errors at the download step rather than mid-build.

The build script and notebook find the DB through the canonical resolver in paths.py — they work from any cwd (workspace root, package directory, anywhere). Override with RCT_DB_PATH=/path/to/db if needed.

Data access

A single self-sufficient SQLite DB (historical_validation_2020-07_to_2022-12.db, 326 MB, SHA-256 c50fcacd…) is publicly hosted on S3, along with the raw Reddit JSON and per-drug merged + deduped CSVs. All download links in the README. Bucket policy grants anonymous s3:GetObject on the scientific_validation/rct_historical/* prefix.

Reviewer findings addressed in this PR

Eli's blocking review (resolved):

✅ Wrong script name + false "identical file" claim in DEDUP_AUDIT.md — script's output template fixed; "identical file" softened to acknowledge the regenerated Generated timestamp; audit regenerated.
✅ "Same date" wording vs full-timestamp implementation — wording reconciled across all four files (script docstring, README, generated audit body, build script docstring) to "full UTC timestamp" / "exact-timestamp ties." Implementation unchanged; clarified that the signal-strength tiebreaker is near-vestigial in practice.
✅ Stale Windows-backslash DB path in DEDUP_AUDIT.md — regenerated against the package-local posix path.

Sonnet's deep-pass findings (resolved):
4. ✅ provenance.json dirty=true — README "Notes on output/provenance.json" subsection documents the chicken-and-egg constraint and what reviewers should do.
5. ✅ DB built at extraction commit 16c2569b… ≠ HEAD — README subsection enumerates the diff: SQL predicate reformalization, deleted-user exclusion, colchicine date correction, audit/verify scripts, parent_id prefix fix; explicitly notes no alias-resolution or sentiment-classification logic was changed.
6. ✅ scripts/dump_per_drug_csvs.py default DB path mismatch — fixed by routing through paths.py.
7. ✅ README "every comparison reaches p ≤ 0.0001" — corrected to "p < 0.001" (naltrexone is 1.358e-4).
8. ✅ Windows backslashes + undocumented master_gap in provenance.json — provenance writer now normalizes path separators; README subsection explains master_gap was the original Windows extraction working directory.

Codex's reproduction-pass findings (resolved):
9. ✅ Provenance could record an unrelated repo's commit — _git_metadata() now anchors git to the package root and verifies the script file is tracked there. Records outside-git (visibly broken) instead of an unrelated SHA (silently misleading) when the package is extracted outside its own checkout.
10. ✅ curl -L exits 0 on HTTP errors — README now uses curl -fL with a comment, and explicitly tells reviewers to run python verify.py immediately after the download.
11. ✅ Alias sensitivity claim was unsupported — dump_drug_aliases.py text softened. Now states explicitly that no per-alias quantitative sensitivity analysis has been run; producing one is documented as the open methodology task before publication.

Bonus parallel-bug catches in adjacent audit scripts (dump_drug_aliases.py, thread_audit.py): same false "produces an identical file" claim and Windows-backslash DB-path renderings → both fixed; both audits regenerated.

Test plan

Open output/paper_figures.html in a browser — Figure 1 forest plot, Table 2, Table 3 render correctly
Spot-check Table 3 values against the expected output table in README (famotidine 232, loratadine 90, naltrexone 154, paxlovid 196, prednisone 343, colchicine 91)
Download the DB via the README curl command — should succeed (200 OK) and the file should be ~326 MB
python verify.py from any cwd — should report 5/5 PASSED and the resolution-mode line should reflect how the DB was located
(Optional) Full re-run: python _build_paper_figures.py → numbers identical to pre-built
(Optional) Test outside-git handling: copy the package to a scratch directory, run the build → provenance.json git.commit should record outside-git rather than an unrelated repo's SHA

🤖 Generated with Claude Code

Self-contained folder for reproducing Figure 1, Table 2, and Table 3 from the paper. Includes build script, notebook helper, README with plain-language methodology, and requirements. Databases and outputs are gitignored — share via OneDrive. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Force-add HTML, notebooks, and figure so reviewers can view results without the databases. README now explains how to view pre-built outputs vs re-run from scratch, and directs users to contact authors for the underlying data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

qodo-code-review · 2026-04-30T18:46:42Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

gemini-code-assist

Code Review

This pull request introduces a reproducibility package for RCT historical validation, including documentation, build scripts, and pre-built Jupyter notebooks. It also enhances the sentiment pipeline to support hand-curated drug alias lists via a new --drug-file argument. My feedback focuses on a portability issue where the notebook build process hardcodes absolute local file paths into the generated .ipynb files, which prevents them from being executed on other machines without manual modification.

Addresses Gemini review: pre-built notebooks contained C:/Users/scgee/... paths. Now uses ../data/ relative paths so notebooks are portable across machines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Airwhale · 2026-04-30T18:53:07Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a reproducibility package for RCT historical validation, featuring comprehensive documentation, analysis scripts, and pre-built Jupyter notebooks. Additionally, it enhances the sentiment pipeline by allowing drug aliases to be provided via a text file and centralizing alias resolution logic. Feedback identifies an inconsistency in alias handling during classification, a non-standard Byte Order Mark in a new script, and recommends explicitly specifying UTF-8 encoding when reading the drug aliases file.

- classify.py: use resolve_aliases() for consistent alias handling with extract/canonicalize (fixes missed aliases when skipping canon) - build_notebook.py: strip UTF-8 BOM - run_sentiment_pipeline.py: explicit encoding="utf-8" on file read Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Airwhale · 2026-04-30T18:58:48Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a reproducibility package for RCT historical validation, including scripts to generate figures and tables from SQLite databases. It also refactors the drug alias resolution logic in the sentiment pipeline to support both LLM-based lookups and manual alias files. I have identified critical syntax errors in the build_notebook.py file where double curly braces were incorrectly used for dictionary and set definitions, which will cause runtime TypeErrors.

gemini-code-assist · 2026-04-30T19:03:09Z

+conn = sqlite3.connect(DB_PATH)
+
+# ── Sentiment mapping ──
+SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}


The use of double curly braces {{...}} is incorrect for defining a dictionary. This syntax attempts to create a set containing a dictionary, which will raise a TypeError because dictionaries are not hashable. To define a dictionary, please use single curly braces {...}.

Suggested change

SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}

SENTIMENT_SCORE = {"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}

I disagree. Double braces are how you produce literal { and } in format strings. Doing this causes an error.

Ely-S · 2026-05-01T06:07:58Z

Concern about the dedupelication logic:

The Deduplication Code

  SENT_RANK = {"positive": 3, "mixed": 2, "neutral": 1, "negative": 0}                                                                                                                                                                            
  SIG_RANK  = {"strong": 2, "moderate": 1, "weak": 0, ...}                                                                                                                                                                                          
                                                                                                                                                                                                                                                    
  for uid, drug, sent, sig, _d in rows:
      k = (SENT_RANK.get(sent, -1), SIG_RANK.get(sig, 0))                                                                                                                                                                                           
      key = (uid, drug)                                                                                                                                                                                                                             
      if key not in best or k > best[key][0]:
          best[key] = (k, sent, drug)

Each (user, drug) pair keeps exactly one record — the one with the highest rank tuple (sentiment_rank, signal_strength_rank).

The Problem

The ranking is a peak-performance rule, not a summary. A user's final classification is determined by their single best report about a drug, ever.

Consider a user who posts the folloing naltrexone:

  │ "really helped me!"   │ positive  │ strong   │
  │ "not working anymore" │ negative  │ strong   │                                                                                                                                                                                                  
  │ "gave up on it"       │ negative  │ moderate │

This would be counted as a positive outcome.

It's asymmetric in only one direction: a single positive report can override any number of negatives, but no number of negatives can override a single positive.

Ely-S · 2026-05-01T22:17:29Z

To make it easier to verify, can you modify the SQL that collects the posts for analysis to include

WHERE post_date < study_date

Re-runs Figure 1 / Table 2 / Table 3 with three substantive changes: 1. Data sources: adds master_gap_2020-07_to_2022-12.db, which classifies all six target drugs across the full corpus from r/covidlonghaulers inception (2020-07-24) through end of 2022. Built by running the pipeline once per drug with --drug, leveraging the substring-prefilter for cost efficiency. Existing per-drug DBs are still used; cross-DB rows are deduped on post_id with master_gap taking priority. 2. Cutoff dates: corrected to "day before earliest public release" using verified preprint and journal online dates. Glynne now uses 2021-06-06 (medRxiv preprint) instead of 2021-10-01 (journal). Prednisone comparator updated to Utrero-Rico 2021. Naltrexone, paxlovid, colchicine cutoffs verified against publication metadata. 3. Per-(user, drug) dedup rule: replaces the previous "best-report" peak-rank rule (positive > mixed > neutral > negative; signal as tiebreaker) with a "most recent + signal-strength tiebreaker" rule. The new rule is symmetric on sentiment direction and addresses the asymmetry concern raised in code review (single positive could override any number of later non-positives). Headline numbers are tighter and more defensible: - Trial-positive drugs (loratadine, famotidine, naltrexone) all clear 50% with p < 0.0001. - Trial-null drugs (colchicine, paxlovid, prednisone) all hover near 50% with no per-drug significance against the null. - Prednisone, the most-debated case in the original analysis, settles at 49.3% with the larger sample, exactly at chance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Table 3 previously combined the comparator paper, journal/preprint source, and date into one parenthesized string per row. Splitting them into two columns makes the per-drug comparator easier to scan and the publication date explicit. DRUG_CUTOFFS now stores three fields per drug: (cutoff_date, paper_short, source_date) Figure 1 unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the build script pulled classifications from five databases (master_gap plus four older per-drug runs) and deduped on post_id at analysis time. About 3% of relevant classifications lived only in the older DBs and were missing from master_gap, so cross-DB merging was required to reproduce the headline numbers. To simplify reproduction, scripts/build_combined_db.py was added to copy the missing rows into the master_gap DB and write out a new self-sufficient SQLite file: historical_validation_2020-07_to_2022-12.db (314 MB). It has been uploaded to S3 at: s3://patientpunk/scientific_validation/rct_historical/processed/ historical_validation_2020-07_to_2022-12.db (replacing the older master_gap_2020-07_to_2022-12.db). Reviewers reproducing the paper now need to download a single 314 MB DB instead of five DBs totalling ~580 MB. The build script reads from this single DB; numbers are bit-for-bit identical to the prior multi-DB version. The original per-drug DBs are still published on S3 as optional source data so their contents can be independently verified against the combined DB if desired. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A few spots in the README still talked about cross-DB merging or pointed at the old master_gap_*.json filename. Cleaned those up: - Reproduce section now includes the curl step to download the single required DB. - Per-drug CSV section no longer references the deleted scripts/merge_and_analyze_historical.py. - Databases table reframes the older per-drug DBs as "Optional, transparency only" with a clear "Required" marker on the combined DB. - Why-multiple-DBs paragraph rewritten to reflect the actual provenance (master pipeline run + small backfill from earlier per-drug runs). - Renamed master_gap_2020-07_to_2022-12.json to historical_validation_2020-07_to_2022-12.json on S3 for naming consistency with the DB; README link updated. - Step 1 now reads "Pull all reports per drug" from a single DB rather than "Merge all databases per drug". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Incorporates the two charts requested in PR #38 (Eli's working-files PR, explicitly marked Do-Not-Merge). The PR description asked for two specific items to be merged into our notebook; everything else in that PR is V&V audit work and unrelated artifacts. Figure 0 (NEW) — Full sentiment breakdown per drug: Stacked horizontal bar showing the four-way sentiment split (positive / mixed / neutral / negative) per drug, with each segment labeled by percentage and raw count. Y-axis labels include drug name + trial-direction tag. Plotted from our resp_df, so it inherits the combined-DB single-source workflow and the most-recent-plus-strength dedup rule. Figure 1 — cosmetic upgrades only: - figsize 12x5.5 -> 13.5x6.8 (more breathing room, prints larger) - value-label fontsize 9 -> 11 (fontweight bold preserved) - y-tick labels 10 -> 12; xlabel 11 -> 12; title 12 -> 14 - trial-direction tags switched from monospace to sans-serif at fontsize 11 - title prefixed with "Figure 1 — " for symmetry with Figure 0 Numbers in Figure 1 are unchanged. The data source (combined DB) and dedup rule are unchanged. Only styling and the new Figure 0 are added. Other items from PR #38 deliberately not merged: - Eli's _gen_analysis.py / analysis.ipynb use the old multi-DB merge and old peak-rank dedup; adopting them would undo the methodology fixes landed in 2208e2d, ad7a74c, 6424e03, d1d7e1a. - The V&V audit in vv/*.md is substantive review material but not code to merge; several of its critiques are already addressed by recent work (V2 sampling-window cutoffs, V8 cross-DB dedup, V9 hard-coded Table 2). Open items (V1 build manifest, V3 cycle guard, V7 dedup sensitivity, V10 expected-output assertions) are tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Airwhale · 2026-05-03T18:36:48Z

Re: dedup : as we discussed, changed it so that it picks most recent highest signal sentiment. So if there is a strong postive and then a strong negitive, will show up as a negative. If there is a strong positive and then a moderate negative, will show up as positive.

I re-ran (and repopulated S3) with a larger dataset, from the beginning of the dataset to a day before publication of the relevant paper, or to two years of the dataset, whichever is less. Effect sizes and significance both increase. I need to confirm reproducability on this dataset and make sure documentation matches.

I made the requested merge from #38, but there are some methodological critiques I need to go through: it's like third on the list of things I need to do.

Will look into the SQL change requested above, seems pretty straightforward.

Closing and will reopen once those check marks are ticked off.

…master_gap The build_combined_db.py script (introduced in 6424e03) backfilled rows from the older per-drug DBs into the master_gap DB. That backfill was buggy: it copied treatment_reports.user_id from the source DBs while leaving the posts table's user_id as master_gap's. Because the older DBs and master_gap use different username-hashing conventions, every one of the 262 backfilled rows ended up with a treatment_reports.user_id that did not match the posts.user_id for the same post. That breaks dedup, which keys on user_id. Verified: in the contaminated DB, all 262 backfilled rows had mismatched user_ids (3.30% of 7,949 total). Fix: replace the contaminated DB with a clean copy of master_gap (the direct output of the master pipeline run, with no backfill). This loses the small amount of classification data that lived only in older DBs (sample sizes shrink ~6-12% for prednisone, naltrexone, colchicine), but the headline finding is unchanged: every trial-positive drug is significantly above the 50% null at p ≤ 0.0001, and no trial-null drug reaches significance. Updated values in Table 3 / Figure 1: loratadine n=91 81.3% p<0.0001 trial: positive famotidine n=233 77.3% p<0.0001 trial: positive naltrexone n=155 65.8% p=0.0001 trial: positive colchicine n=92 54.3% p=0.47 trial: null paxlovid n=196 54.1% p=0.28 trial: null prednisone n=344 48.8% p=0.71 trial: null Changes in this commit: - Local + S3 historical_validation_2020-07_to_2022-12.db replaced with the clean master_gap snapshot (313 MB; 7,687 reports; 0 user_id mismatches). - merged/*.csv on S3 regenerated from clean DB. - README updated: the Provenance section now states the DB is the direct output of one master pipeline run with no backfill, and explains why cross-DB merging is unsafe given different username-hashing conventions. Expected-Output table updated with the new (clean) values. - Outputs (figure0.png, figure1.png, paper_figures.{html,ipynb}) rebuilt from clean DB. - scripts/build_combined_db.py and scripts/merge_and_analyze_historical.py removed; superseded by scripts/dump_per_drug_csvs.py which reads only the single combined DB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…older Three follow-ups from a reviewer who downloaded the (then-contaminated) DB before the c5c4957 fix landed: 1. Publish SHA-256 of the required database in the README and provide verification commands for macOS/Linux/Windows. Hash: FA67BE6D10FFE0B98365390A5223F6DFBC708360E830C725B74B443337AB4952 2. Add an integrity assertion at the top of the build script that fails loudly if any treatment_reports.user_id != posts.user_id. The build now prints "DB integrity check: 0 mismatched user_ids (must be 0). PASS." on every successful run. If the bug ever re-enters, the build aborts before producing figures, so a contaminated DB cannot silently produce numbers. 3. Stop creating a 0-byte data/corpus_baseline_onemonth.db when the reviewer doesn't have that file. The build was passing "../data/corpus_baseline_onemonth.db" to build_notebook.py purely as the cosmetic placeholder DB path that the helper's setup cell connects to. sqlite3.connect() silently creates an empty file when the path doesn't exist, which left a confusing 0-byte artifact behind. Switched the path to "../data/historical_validation_2020-07_to_2022-12.db" so the helper opens a real DB. No analysis cell uses the helper's conn variable, so the change is purely cosmetic but eliminates the misleading file. The contaminated-DB issue itself was already fixed in c5c4957; the S3 file is clean and matches the SHA-256 above. The reviewer's hash (4BBF06E7...) was for the old contaminated upload, which is no longer on S3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two reviewer follow-ups, both intentional methodology fixes: 1. Deleted-user policy (was: implicit collapse; now: explicit exclusion) The DB has 52,603 posts where the Reddit author field was [deleted] or [removed] at scrape time, all mapped to the placeholder user_id "deleted". 122 of those posts have classified treatment reports. Per-user dedup would collapse them all into one pseudo-user (one vote per drug for the entire deleted population) — methodologically wrong because they come from many distinct real users we cannot identify. New policy: exclude every post with user_id = "deleted" from analysis. Verified empirically that the shift is at most 0.5pp and 1 unit of n per drug, so no headline conclusion changes: loratadine n=91 -> 90 (81.3% -> 81.1%) famotidine n=233 -> 232 (77.3% -> 77.2%) naltrexone n=155 -> 154 (65.8% -> 65.6%) colchicine n=92 -> 91 (54.3% -> 53.8%) paxlovid n=196 -> 196 (54.1% -> 54.1%) prednisone n=344 -> 343 (48.8% -> 48.7%) Implementation: AND p.user_id != 'deleted' added to the SQL query in both _build_paper_figures.py and scripts/dump_per_drug_csvs.py, so the filter propagates to figures, tables, and the published per-drug CSVs uniformly. README adds a "Step 1a: Deleted-user exclusion" section documenting the policy and the empirical shift. 2. Colchicine first-public date correction Reviewer flagged that JAMA's article page lists Bassi et al. 2025 as "Published Online: October 20, 2025"; the December 2025 date in the README was the print issue date, not the online date. Fixed: cutoff 2025-11-30 -> 2025-10-19 source JAMA Intern Med 2025-12-01 -> JAMA Intern Med 2025-10-20 Does not affect numbers because colchicine is bound by the end-2022 analysis cap, but "first public date" is now accurate. Re-uploaded merged/*.csv to S3 with the new (post-exclusion) values. The combined database itself is unchanged — the exclusion happens at query time, not in the data — so its SHA-256 is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two cosmetic-but-real consistency fixes flagged in an internal-consistency audit: 1. The naltrexone comparator label used a Unicode prime character (U+2032, "O′Kelly") in the build script's DRUG_CUTOFFS dict. The README used a regular apostrophe ("O'Kelly"). Fixed by quoting the paper field in double quotes inside the r-string raw cell, which allows a regular apostrophe inside the value: 'O′Kelly et al. 2022' -> "O'Kelly et al. 2022" Now both files use the same character; rebuilt notebook output reflects the corrected label. 2. README never mentioned scripts/dump_per_drug_csvs.py, the script that regenerates the published per-drug CSVs from the combined database. Added a one-sentence pointer to it in the "Per-drug merged + deduped CSVs" subsection so reviewers know how to reproduce that step locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nces Two related cleanups, no analytical change. (1) Switch SQL from `post_date <= eod-of-day-before-publication` to `post_date < midnight-utc-of-publication-date`. Same rows in/out (verified bit-for-bit by re-running dump_per_drug_csvs.py — same n, same %pos, same CIs, same p-values), but the predicate now matches the natural-language "strictly before publication" much more transparently. Renames `cutoff` -> `pub_date` and `effective_cutoff` -> `window_end_exclusive` throughout. Adds `epoch_midnight()` (replacing `epoch_eod()`) and changes `END_2022 = '2022-12-31'` to `END_2022_EXCLUSIVE = '2023-01-01'`. summary.csv column rename re-uploaded to S3. (2) Drop README references to the 5 optional smaller .db files. Verified nothing in the build flow reads them — both _build_paper_figures.py and scripts/dump_per_drug_csvs.py use only the combined DB. The smaller DBs remain on S3 (cheap to keep) but are no longer documented as part of the reproducibility package, removing ~40 lines of stale "optional / for transparency" tables and trimming the Provenance section. Also updates build_notebook.py docstring example to use the canonical DB name. Verified: dump_per_drug_csvs.py and _build_paper_figures.py both produce identical headline numbers (famotidine 232/77.2%, loratadine 90/81.1%, naltrexone 154/65.6%, paxlovid 196/54.1%, colchicine 91/53.8%, prednisone 343/48.7%). DB integrity check still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes addressing review feedback: 1. README post-count reconciliation. Reviewer noted the README's "47,434 top-level posts + 684,092 comments" line was inconsistent with verify.py / THREAD_AUDIT.md / the DB itself reporting 47,442 submissions + 684,084 comments. Both numbers are correct but describe different stages: 47,434/684,092 is the JSON-level reality (actual posts vs comments), 47,442/684,084 is the DB-level reality after the import script's dangling-parent cleanup nulls parent_ids for 8 comments whose parents fell outside the scrape window. Same total. README now explains the 8-row delta inline so the apparent inconsistency is documented as a known scrape-window edge effect. 2. Drop internal audit codes from public-facing files. The "V1/V2/V3/ V7/V10" prefixes in section headers, check names, and prose came from an internal audit framework that shouldn't be referenced in the public reproducibility package. Renamed: - V3_THREAD_AUDIT.md -> THREAD_AUDIT.md - V7_DEDUP_AUDIT.md -> DEDUP_AUDIT.md - scripts/v3_thread_audit.py -> scripts/thread_audit.py - scripts/v7_dedup_sample_audit.py -> scripts/dedup_sample_audit.py Plus stripped V-codes from verify.py check names ("V1: DB integrity" -> "DB integrity"), build-script cell headings ("## Window verification (V2)" -> "## Window verification"), build-script stdout ("V2 window verification: ..." -> "Window verification: ..."), DRUG_ALIASES.md prose, and a few other call sites. Section structure unchanged; only the audit-framework labels are gone. Verified: build runs cleanly, all assertions still PASS, headline numbers unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both small clarifications, addressing reviewer comments that pointed at genuine ambiguity in the docs (not at any methodological issue): 1. README "Raw Reddit JSON" section — clarify scope of import_posts.py. The previous wording ("rebuild the SQLite DB from this JSON file using src/import_posts.py") implied that one script rebuilds the full analysis DB. It does not: import_posts.py rebuilds only the `users` and `posts` tables. The `treatment_reports` table (LLM classifications) is the frozen output of running src/run_sentiment_pipeline.py per target drug — Anthropic API credentials and ~2 hr wall time required. README now states this explicitly. 2. DRUG_ALIASES.md "Epistemic status" section — make the LLM-aliases methodology choice explicit. Previously the file noted that aliases are LLM-generated and not manually adjudicated, but didn't address the obvious reviewer question: is this a workaround, an unintentional limitation, or a deliberate choice? Added a short paragraph stating it's deliberate — evaluating the LLM canonicalization step is part of the method being demonstrated, so we treat alias generation as a pipeline output rather than a hand- authored input, and publish the generated aliases for auditability. Updated in scripts/dump_drug_aliases.py and regenerated DRUG_ALIASES.md. No code logic changed; no numbers shifted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Airwhale · 2026-05-04T19:20:51Z

/gemini review

gemini-code-assist · 2026-05-04T19:36:30Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Airwhale · 2026-05-04T19:36:42Z

Update: substantial methodology and transparency pass

Pushed a substantial round of changes in response to PR review feedback (Eli, Qodo, Gemini, plus a more recent doc review). Headline conclusion is unchanged — every trial-positive drug still cleanly above 50% with strong CIs, every null-trial drug still hugging 50% — but the path to those numbers is now documented, asserted, and reviewable end-to-end.

TL;DR

Larger corpus. Pulled the full r/covidlonghaulers history from Arctic Shift (2020-07-24 → 2022-12-31, ~2.5 years) and run every drug query against that single corpus instead of the per-drug short-window scrapes used earlier.
Three additional substantive methodology changes (single-DB consolidation, "most recent + signal-strength" dedup, strict-< pre-publication SQL).
One real bug found and fixed during review (parent_id loss on import → analysis DB rebuilt with prefix fix; published numbers unaffected).
Build-time assertions wired into the notebook execution: DB integrity, per-drug window, expected-output match, dedup audit. Build fails on drift.
Standalone reviewer artifacts: DRUG_ALIASES.md, THREAD_AUDIT.md, DEDUP_AUDIT.md, ARCTIC_SHIFT_PROVENANCE.md, output/provenance.json.
One-command reviewer check: python verify.py — runs every assertion, prints PASS/FAIL summary, exits 0/1.
Headline numbers stable across every re-run.

Methodology changes

Change	Why	Effect on numbers
Larger corpus pull from Arctic Shift. Earlier per-drug runs used short scrape windows (~50 days around each publication date, run separately per drug). Now: one corpus-wide pull (2020-07-24 → 2022-12-31, ~2.5 years) that every drug query reads from.	Short windows around each pub date were enough to draw the original samples but threw away most of the available pre-publication discussion. The full-corpus pull lets every drug see whatever discussion exists in the 30-month corpus.	Pre-publication sample sizes substantially larger; naltrexone now has ~21 months of pre-publication data (2020-10 → 2022-07, n=154); colchicine has ~28 months (2020-08 → 2022-12, n=91, capped at end-of-2022). See per-drug timeframe table below.
Single-DB consolidation (`historical_validation_2020-07_to_2022-12.db`, 311 MB, SHA `c50fcacd…`). Was 5 separate DBs being merged at analysis time.	Cleaner reproducibility story: one file, one hash, no cross-DB hashing-convention concerns.	Numbers stable; reviewers download 1 file instead of 5.
Dedup rule changed from peak-rank `(SENT_RANK, SIG_RANK)` to "most recent post wins; signal-strength breaks ties on same date."	Direct response to Eli's asymmetry concern that a single positive could override any number of later non-positives under the old rule. New rule is symmetric on sentiment direction.	Headline numbers shifted (e.g., famotidine 207 → 232); shifts are documented and consistent across `_build_paper_figures.py` and `dump_per_drug_csvs.py`.
SQL predicate changed from `post_date <= eod-of-day-before-publication` to `post_date < midnight-utc-of-publication-date`.	Eli's request: cleaner natural-language mapping. The strict-`<` predicate maps directly to "strictly before publication."	Bit-for-bit identical numbers (same rows in/out); semantically clearer.
Deleted-user exclusion added (`AND p.user_id != 'deleted'`).	Reddit's `[deleted]`/`[removed]` author placeholder mapped to one pseudo-user, giving the entire deleted population one vote per drug. Documented policy now: exclude.	Each drug's headline shifted ≤0.5pp and ≤1 unit of n.
Colchicine date corrected from 2025-12-01 (PubMed print issue) to 2025-10-20 (JAMA online-first).	Caught during reviewer audit.	End-of-2022 cap is binding for colchicine, so cap-bound numbers unchanged; the documented cutoff is now accurate.

Bug found and fixed during the audit pass

src/import_posts.py was inserting comment.parent_id verbatim from the JSON (t1_<id> / t3_<id> Reddit prefixes) but storing post_id bare. The cleanup query UPDATE parent_id = NULL WHERE parent_id NOT IN (SELECT post_id ...) then nulled every prefixed parent as "dangling." Result: in the analysis DB every comment had parent_id IS NULL and the thread structure was lost.

Impact on the published numbers: zero. Each pipeline run used --drug substring filter, so upstream-context inheritance was a no-op even before the fix.
Fix: 1-line strip_reddit_prefix() helper in import_posts.py. Analysis DB rebuilt via scripts/rebuild_analysis_db_with_parent_ids.py (fresh import + copy LLM classifications across); 0 user_id mismatches, treatment_reports count unchanged, headline numbers identical bit-for-bit. New SHA-256 published in README.
Cycle guard added to extract.py's upstream-traversal recursion. idx_posts_parent index added to schema.sql.

New transparency / reviewer artifacts

All under docs/RCT_historical_validation/:

File	What it is
`README.md`	Adds Window verification table, Drug aliases section, Dedup audit section, Verify section, Arctic Shift provenance pointer
`verify.py`	Single-command reproducibility runner (PASS/FAIL summary, exit 0/1)
`ARCTIC_SHIFT_PROVENANCE.md`	How the raw JSON was acquired (URL, scrape date, conversion process, byte sizes)
`DRUG_ALIASES.md`	Every alias the pipeline substring-matched against (170 across 6 drugs, 0 cross-drug collisions, reviewer notes flagging entries worth a closer look). Includes an explicit Epistemic Status section: alias generation is a deliberate pipeline output, not a hand-authored input, because evaluating the LLM canonicalization step is part of the method this paper demonstrates.
`THREAD_AUDIT.md`	Orphan parent count, cycle detection, 50 sampled reply chains with timestamp monotonicity
`DEDUP_AUDIT.md`	36 sampled multi-report (user, drug) pairs showing every candidate report with the rule's retained pick marked
`output/provenance.json`	Build-time manifest: git commit + dirty flag, DB SHA-256, model env, full extraction_runs table

Plus standalone reproducible scripts (scripts/thread_audit.py, scripts/dedup_sample_audit.py, scripts/dump_drug_aliases.py, scripts/rebuild_analysis_db_with_parent_ids.py) — all deterministic, all idempotent.

Build now has hard gates that fire on every notebook execution

Ely-S · 2026-05-05T16:58:10Z

Great results!

Ely-S · 2026-05-05T16:59:17Z

@@ -0,0 +1,128 @@
+{


This is a great solution to documenting what was run and how !

Ely-S · 2026-05-05T16:59:47Z

+   comments" downloads, save the resulting `.jsonl` files locally.
+3. Run `scripts/convert_jsonl_to_source.py` to filter to the analysis
+   window and produce `historical_validation_2020-07_to_2022-12.json`.
+4. Verify SHA-256 matches `298d5bc7...abfe6a`.


perfect and reproducible!

Ely-S · 2026-05-05T17:00:04Z

+guaranteed against the snapshot Arctic Shift had on 2026-05-02. Earlier
+or later Arctic Shift snapshots may differ.
+
+## Why the date range stops at 2022-12-31


perfect explanation

Ely-S · 2026-05-05T17:00:45Z

+the analysis SQLite DB. The lists are what every pipeline run
+substring-matched posts and comments against during the extraction
+step and what every canonicalization step normalized to.
+Reviewers can audit these lists directly without running anything.


awesome visibility

Ely-S · 2026-05-05T17:02:22Z

+- `h2 blocker`  *(multi-word)*
+- `h2 antagonist`  *(multi-word)*
+- `acid reducer`  *(multi-word)*
+- `heartburn relief`  *(multi-word)*


not sure if heartburn relief and acid reducer are correct.

Ely-S · 2026-05-05T17:06:49Z

+- `50mg naltrexone`  *(multi-word)*
+- `oral naltrexone`  *(multi-word)*
+- `injectable naltrexone`  *(multi-word)*
+- `extended release naltrexone`  *(multi-word)*
+- `naltrexone er`  *(multi-word)*
+- `naltrexone xr`  *(multi-word)*


these are definitely not the same thing as LDN

50mg naltrexone

injectable naltrexone

extended release naltrexone

naltrexone er

naltrexone xr

vivitrol

revia

depade

the brand name naltrexones are all at 50mg

The 50mg dose is only for opioid and alcohol use disorder

While 50mg might be effective for long covid, it's essentially a different treatment

The Injectable and XR formulations are definitely not the same thing

LDN works because it has different effects at a lower dosage.

Ely-S · 2026-05-05T17:07:18Z

+- `famotidins`  *(misspelling)*
+- `h2 blocker`  *(multi-word)*
+- `h2 antagonist`  *(multi-word)*
+- `acid reducer`  *(multi-word)*


I'm not sure if I would use acid reducer and heartburn relief as aliases to pepcid.

that category would include Tums

Ely-S · 2026-05-05T17:08:07Z

+- **loratadine** includes `loratab` — Lortab is a brand of
+  hydrocodone/acetaminophen, a different (opioid) drug; this is
+  likely an LLM error and should be reviewed.


Ely-S · 2026-05-05T17:08:30Z

+- `predni`
+- `prednisona`  *(misspelling)*
+- `corticosteroid`
+- `steroid`


i would probably remove steroid as an alias to prednisone lol

Ely-S · 2026-05-05T17:09:24Z

+To turn any of these into actual corrections, the path is: edit the
+alias list, re-run canonicalization, re-run classification, regenerate
+the analysis DB and figures. None of this changes the headline
+conclusion (every drug's responder rate stays in its current bucket
+when individual ambiguous aliases are removed) but it is the open
+methodology task of manually reviewing every alias before publication.


fixing any canonicalization errors would strengthen our results

Ely-S · 2026-05-05T17:16:15Z

+
+## Reproducibility
+
+Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6` against the same DB produces an identical file. Total sampled (user, drug) pairs: **36**.


scripts/dedup_sample_audit.py ?

Ely-S · 2026-05-05T17:18:44Z

Issue 1: the dedup audit reproducibility command is not runnable as written, and the “identical file” claim is too strong.

docs/RCT_historical_validation/DEDUP_AUDIT.md says:

Re-running scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6 against the same DB produces an identical file.

Two problems:

scripts/v7_dedup_sample_audit.py does not exist in this repo. The actual script is scripts/dedup_sample_audit.py.
Re-running the current script does not produce a byte-identical file because the Generated timestamp changes. It does reproduce the same sampled rows and still reports 36 sampled (user, drug) pairs.

Suggested fix: update the sentence to something like:

Re-running scripts/dedup_sample_audit.py --db docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db --out <out> --seed 42 --per-drug 6 against the same DB reproduces the same sampled rows, aside from generated metadata. Total sampled (user, drug) pairs: 36.

Ely-S · 2026-05-05T17:18:46Z

Issue 2: the dedup rule wording says signal strength breaks ties on the “same date,” but the implementation uses the full post_date timestamp. In other words, the most recent post wins over Signal Strength.

In scripts/dedup_sample_audit.py, rows are sorted by:

(r["date"], SIG_RANK.get(r["sig"], 0))

with reverse=True. That means the most recent timestamp wins first, and signal strength only breaks ties when the full timestamp value is identical. Same-calendar-day rows are still ordered by time before signal strength.

This shows up in DEDUP_AUDIT.md for loratadine: a later moderate report at 2020-12-24 16:52 is retained over an earlier strong report at 2020-12-24 15:20.

Suggested fix: change the wording in both the script-generated intro and the checked-in audit from “ties on the same date” to “ties on the same timestamp” or “ties on the same post_date value.”

Ely-S · 2026-05-05T17:18:47Z

Minor Issue 3: the DB path recorded in DEDUP_AUDIT.md is stale and does not match the current reproducibility package layout.

DEDUP_AUDIT.md currently records:

data\historical_validation\historical_validation_2020-07_to_2022-12.db

But the README now instructs reviewers to place the DB in the data/ subfolder under docs/RCT_historical_validation/, i.e.:

docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db

The current build/verification flow also uses that package-local path. This stale metadata will confuse reviewers trying to rerun the audit.

Suggested fix: regenerate DEDUP_AUDIT.md using the current DB location, or manually update the metadata and reproduction command so they point at docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db.

Ely-S · 2026-05-05T17:29:04Z

Here is the same writeup from claude code with nicer fromatting, if this helps.

Code review

Found 3 issues (re-checking the 3 items from the previous changes-requested review):

Wrong script name (and incorrect reproducibility claim) in DEDUP_AUDIT.md — the Reproducibility section says to run scripts/v7_dedup_sample_audit.py, but that file does not exist; the actual script is scripts/dedup_sample_audit.py. The bug is in the script's own output string, not just the committed markdown. Additionally, the claim "produces an identical file" is false: the script writes **Generated:** <current UTC time> on every run (line 77 of the script), so re-runs differ in the timestamp header.

PatientPunk/scripts/dedup_sample_audit.py

Lines 173 to 179 in 384e0d0

    
           lines.append("") 
        
           lines.append( 
        
               f"Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> " 
        
               f"--seed {args.seed} --per-drug {args.per_drug}` against the same DB " 
        
               f"produces an identical file. Total sampled (user, drug) pairs: " 
        
               f"**{total_sampled}**." 
        
           )

Dedup rule wording says "same date" but implementation uses full timestamp — the docstring and generated audit text both say "signal_strength breaks ties on the same date," but the sort key is (r["date"], SIG_RANK) where r["date"] is a Unix epoch in seconds, not a calendar date. A later post on the same calendar day beats an earlier stronger post. The loratadine user 87b65a2f… in the committed DEDUP_AUDIT.md confirms this: moderate at 16:52 is retained over strong at 15:20, both on 2020-12-24. The wording in README.md ("For posts on the same date, the strongest signal wins") has the same issue.

PatientPunk/scripts/dedup_sample_audit.py

Lines 7 to 9 in 384e0d0

    
           Dedup rule under audit: "most recent post wins; signal_strength breaks 
        
           ties on the same date (strong > moderate > weak > n/a)."

PatientPunk/docs/RCT_historical_validation/DEDUP_AUDIT.md

Lines 112 to 119 in 384e0d0

    
           ### user `87b65a2f4371463c…` (2 reports) 
        
           | | post_date (UTC) | sentiment | signal | post_id | 
        
           |---|---|---|---|---| 
        
           | **RETAINED** | 2020-12-24 16:52 | positive | moderate | `ggwtglb` | 
        
           |  | 2020-12-24 15:20 | mixed | strong | `ggwjynp` |

Stale DB path in DEDUP_AUDIT.md — line 4 of the committed file records data\historical_validation\historical_validation_2020-07_to_2022-12.db (Windows backslashes; includes a historical_validation/ subdirectory). The README.md instructs reviewers to place the DB at data/historical_validation_2020-07_to_2022-12.db (flat inside docs/RCT_historical_validation/data/, no subdirectory). The path recorded in the audit file does not match the path the README tells reviewers to use.

PatientPunk/docs/RCT_historical_validation/DEDUP_AUDIT.md

Lines 3 to 5 in 384e0d0

    
           **Generated:** 2026-05-04 19:10 UTC 
        
           **DB:** `data\historical_validation\historical_validation_2020-07_to_2022-12.db` 
        
           **Sample seed:** 42

Issues 4–6 from earlier comments are resolved: dump_per_drug_csvs.py uses Path(__file__).resolve().parents[1] (no hardcoded Windows paths), the post_date < window_end SQL predicate is present in both scripts, and the dedup logic correctly implements most-recent-timestamp-wins.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Ely-S · 2026-05-05T17:36:54Z

Here are some of Sonnet's comments. I don't know that any of these are meaningful, but leave it up to you.

Code review (OPTIONAL deep pass — reproducibility, data quality, robustness)

Found 5 new issues:

Committed output artifacts were built from a dirty working tree — provenance.json records "dirty": true, meaning figure1.png, paper_figures.html, and the executed notebook were generated from code not captured in any commit. A reviewer who checks out 384e0d0 and runs _build_paper_figures.py on a clean tree cannot expect to produce byte-identical outputs.

PatientPunk/docs/RCT_historical_validation/output/provenance.json

Lines 4 to 8 in 384e0d0

    
           "git": { 
        
             "commit": "8a05a31a116f0206e78531dfa280337e6846acb1", 
        
             "dirty": true 
        
           }, 
        
           "db": {

The frozen DB was built at a different commit than the code under review — all six extraction_runs in provenance.json record "commit_hash": "16c2569b0273e793f7d5f8894c912073a0a85753" (commit title: "address Gemini review: alias resolution, BOM, encoding"), while the current HEAD is 384e0d0. The Gemini review that was being addressed raised alias-resolution concerns; it's unclear whether the alias logic used during the actual extraction run differs from the logic now in the PR. There is no diff between those commits in the documented provenance.

PatientPunk/docs/RCT_historical_validation/output/provenance.json

Lines 18 to 36 in 384e0d0

    
           "extraction_runs": [ 
        
             { 
        
               "run_id": 1, 
        
               "run_at": 1777759720, 
        
               "commit_hash": "16c2569b0273e793f7d5f8894c912073a0a85753", 
        
               "extraction_type": "treatment_sentiment", 
        
               "config": { 
        
                 "models": { 
        
                   "fast": "anthropic/claude-haiku-4.5", 
        
                   "strong": "anthropic/claude-sonnet-4.6" 
        
                 }, 
        
                 "limit": 0, 
        
                 "reclassify": false, 
        
                 "skip_canonicalize": false, 
        
                 "output_dir": "data\\historical_validation\\master_gap\\pipeline_output", 
        
                 "drug": "famotidine" 
        
               }, 
        
               "run_at_iso": "2026-05-02T22:08:40+00:00" 
        
             },

dump_per_drug_csvs.py default DB path does not match the path the README tells reviewers to use — the README's "Reproduce" section instructs downloading the DB to docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db (no subdirectory, confirmed by provenance.json's db.path). But dump_per_drug_csvs.py's default resolves to <repo-root>/data/historical_validation/historical_validation_2020-07_to_2022-12.db — a different location entirely. Running the script without --db after following the README download instructions will fail with "DB not found." The README also internally describes the script as reading from data/historical_validation/... (a third path variant).

PatientPunk/scripts/dump_per_drug_csvs.py

Lines 39 to 44 in 384e0d0

    
           # Repo-relative defaults: this file lives at <repo>/scripts/dump_per_drug_csvs.py, 
        
           # so parents[1] is the repo root regardless of where the script is invoked from. 
        
           REPO_ROOT   = Path(__file__).resolve().parents[1] 
        
           DEFAULT_DB  = REPO_ROOT / "data" / "historical_validation" / "historical_validation_2020-07_to_2022-12.db" 
        
           DEFAULT_OUT = REPO_ROOT / "data" / "historical_validation" / "merged"

README claims "every comparison reaches p ≤ 0.0001" but naltrexone's p-value is 1.358e-4 — verify.py's EXPECTED_OUTPUTS records naltrexone at p: 1.358e-4 = 0.0001358, which is greater than 0.0001. The expected-output table in the README correctly shows naltrexone as 0.0001 (rounded), but the prose is factually wrong.

PatientPunk/docs/RCT_historical_validation/README.md

Lines 379 to 382 in 384e0d0

    
           | prednisone | 343 | 48.7% | [43.4, 54.0] |   0.67   | null | 
        
           The pattern: every drug where the community clearly leaned positive (loratadine, famotidine, naltrexone) corresponds to a trial that found a positive result — every comparison reaches p ≤ 0.0001. Every drug where the community was roughly split (colchicine, paxlovid, prednisone) corresponds to a trial that found no significant effect — none reach significance against the 50% null. All 6 directional comparisons match the eventual trial outcome.

PatientPunk/docs/RCT_historical_validation/verify.py

Lines 63 to 68 in 384e0d0

    
           "famotidine": {"n": 232, "pos": 179, "pos_pct": 77.155, "p": 3.565e-17}, 
        
           "loratadine": {"n":  90, "pos":  73, "pos_pct": 81.111, "p": 1.948e-9}, 
        
           "prednisone": {"n": 343, "pos": 167, "pos_pct": 48.688, "p": 0.6658}, 
        
           "naltrexone": {"n": 154, "pos": 101, "pos_pct": 65.584, "p": 1.358e-4}, 
        
           "paxlovid":   {"n": 196, "pos": 106, "pos_pct": 54.082, "p": 0.2839}, 
        
           "colchicine": {"n":  91, "pos":  49, "pos_pct": 53.846, "p": 0.5296},

provenance.json extraction-run paths use Windows backslashes and an undocumented master_gap directory — every extraction_runs[*].config.output_dir records a Windows-style path such as data\\historical_validation\\master_gap\\pipeline_output_loratadine. The master_gap subdirectory does not appear in the README, the documented data layout, or any script. Reviewers who inspect provenance.json to understand how the DB was built cannot trace these intermediate artifacts.

PatientPunk/docs/RCT_historical_validation/output/provenance.json

Lines 29 to 36 in 384e0d0

    
               "limit": 0, 
        
               "reclassify": false, 
        
               "skip_canonicalize": false, 
        
               "output_dir": "data\\historical_validation\\master_gap\\pipeline_output", 
        
               "drug": "famotidine" 
        
             }, 
        
             "run_at_iso": "2026-05-02T22:08:40+00:00" 
        
           },

Minor (not blocking): _build_paper_figures.py's module docstring (line 18) still says "Signal-strength is the tiebreaker for posts on the same date" — the same wording issue flagged in the previous review for DEDUP_AUDIT.md and README.md also appears here.

PatientPunk/docs/RCT_historical_validation/_build_paper_figures.py

Lines 16 to 20 in 384e0d0

    
           Per-(user, drug) deduplication rule: 
        
               1) Most recent report wins (post_date desc). 
        
               2) Signal-strength is the tiebreaker for posts on the same date 
        
                  (strong > moderate > weak > n/a).

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

PR #37 reviewer findings (Eli + Sonnet): - Eli #1: dedup_sample_audit.py output template fixed — script name was 'v7_dedup_sample_audit.py' (file doesn't exist), now correctly references 'scripts/dedup_sample_audit.py'; "produces an identical file" claim softened to acknowledge the regenerated 'Generated' timestamp at the top of the file. - Eli #2 (Option A — wording, not implementation): "same date" reconciled with the full-timestamp implementation across _build_paper_figures.py (4 spots), README, dedup_sample_audit.py, and the regenerated DEDUP_AUDIT.md. The rule in practice reduces to "most recent post wins (full UTC timestamp)" because exact- timestamp ties are rare; signal_strength acts as a near-vestigial tiebreaker on those rare ties. Implementation unchanged. - Eli #3: stale Windows-backslash DB path in DEDUP_AUDIT.md ('data\historical_validation\...') replaced by the package-local posix path on regeneration. - Sonnet #4 + #5: new README "Notes on output/provenance.json" subsection documents the dirty=true chicken-and-egg constraint and the extraction commit (16c2569) vs HEAD (this PR) diff — reformalized SQL predicate, deleted-user exclusion, colchicine date correction, audit/verify scripts, parent_id prefix fix; explicitly notes no alias-resolution or sentiment-classification logic was changed in this range. - Sonnet #6: dump_per_drug_csvs.py default DB path mismatch with README — fixed by routing through paths.py (see refactor below). - Sonnet #7: README "every comparison reaches p ≤ 0.0001" prose corrected to "p < 0.001" (naltrexone is 1.358e-4 = 0.0001358). - Sonnet #8: provenance writer now normalizes Windows backslashes in extraction_runs.config; new README subsection explains what the 'master_gap' working directory was. Path resolution refactor (paths.py): - New paths.py module with marker-based anchor walk-up + RCT_DB_PATH env var override. Single source of truth. - All four entry points (_build_paper_figures.py, verify.py, scripts/dump_per_drug_csvs.py, the executed notebook) use the same resolver. The notebook embeds an inlined copy of the resolver in its first setup cell so reviewers can read exactly how the DB is found without chasing imports. - build_notebook.py grew an optional db_path_block kwarg so callers can inject richer DB resolution (used by _build_paper_figures.py); default behavior unchanged for any other notebook builders. - provenance.json db.path is now package-relative ("data/historical_validation_2020-07_to_2022-12.db") — no more leaky absolute Windows-user paths. - verify.py header surfaces resolution mode ("package anchor (default)" vs "RCT_DB_PATH=..." vs "--db CLI flag") so a stale env var failure mode is obvious at a glance. Bonus parallel-bug catches in adjacent audit scripts (same false "produces an identical file" claim and Windows-backslash DB paths): - dump_drug_aliases.py and thread_audit.py corrected; DRUG_ALIASES.md and THREAD_AUDIT.md regenerated with posix DB paths and accurate reproducibility wording. Figure 1 redesign: - Replaced the paired horizontal bar chart with a two-marker forest plot per drug (responder green circle + non-responder red square), trial outcome tagged in the row label, paper citation on line two of the y-tick label. Each marker has a matching-color 95% Wilson CI line; numeric values printed at the right end. Verification: - All headline numbers unchanged: famotidine 232, loratadine 90, naltrexone 154, paxlovid 196, prednisone 343, colchicine 91. - DB SHA-256 unchanged: c50fcacd7ce366f397152f5fe4dbb59d5eaf64ba32627faef91dad86fbf6c6f4 - verify.py: ALL 5 CHECKS PASSED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rl, soften alias claim Three findings from a fresh-checkout reproduction pass: P2 — provenance could record the wrong git commit _git_metadata() ran `git rev-parse HEAD` and `git status` without anchoring to the package root. If a reviewer extracted the package into an unrelated dirty repo checkout (Codex's exact scenario), output/provenance.json silently recorded that *other* repo's commit rather than the package's commit — silent misdirection that's worse than recording 'unknown'. Fix: run git with cwd=Path(__file__).resolve().parent, then verify _build_paper_figures.py is actually tracked in that worktree via `git ls-files --error-unmatch`. If git can't resolve HEAD or the script isn't tracked, record 'outside-git' (visibly broken) instead of an unrelated commit (silently misleading). 'unknown' is now reserved for the case where git itself isn't installed. Verified across three scenarios: - inside the package's own checkout → reports actual SHA - extracted to a plain directory → 'outside-git' + warning - extracted into an unrelated repo → 'outside-git' + warning (was: returned unrelated SHA) P3 — README curl command silent on HTTP errors `curl -L -o data/...` exits 0 on 4xx, writes the error response body to the destination filename, and the next step crashes with a confusing `sqlite3.DatabaseError: file is not a database`. Changed to `curl -fL -o ...` with a comment explaining why, and added a `python verify.py` step right after the download so HTTP/transfer errors fail at the download step rather than mid-build. P3 — alias sensitivity claim was unsupported DRUG_ALIASES.md previously said "None of this changes the headline conclusion" about alias edits — a quantitative sensitivity claim with no per-alias ablation table to back it. Softened the wording in scripts/dump_drug_aliases.py: explicitly state we have NOT run a quantitative sensitivity analysis, that this file documents inspectability rather than robustness, and that producing a per-drug ablation table is the open methodology task before publication. DRUG_ALIASES.md regenerated. Also fixed (parallel hygiene): - dump_drug_aliases.py reproducibility command rendered Windows backslashes for args.db / args.out; now uses .as_posix(). Verification: - provenance.json `git.commit` correctly = 665e910… (HEAD before this commit), not the parent OneDrive repo's commit - All headline numbers unchanged (verify.py: ALL 5 CHECKS PASSED) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ce lag P3 — README DB table count was wrong README said "four tables" (users, posts, treatment, treatment_reports) but the DB actually has seven: those four plus extraction_runs (used by provenance), plus conditions and user_profiles (both empty, reserved for future extensions). Reviewers inspecting the DB directly would have been surprised. Replaced the prose count with a full inventory table showing every table, its row count at the published SHA, and its role. P3 — committed provenance.json lagged HEAD The committed provenance.json recorded git.commit = 665e910 (the prior PR commit) while HEAD was 0e1f187 — an artifact of the chicken-and-egg constraint: the build runs at commit N, records git.commit = N, and gets committed as commit N+1. The committed artifact always points at its own parent commit. Two changes: - Regenerated outputs at current HEAD so the lag drops from 2 commits to exactly 1 (parent). provenance.json now records 0e1f187 (the immediate parent of this fix commit). - Added an explicit README subsection under Notes on output/provenance.json explaining the 1-commit lag, naming the reviewer-regenerated file as authoritative, and documenting git.commit = "outside-git" semantics introduced in the previous commit. Verification: - All headline numbers unchanged - verify.py: ALL 5 CHECKS PASSED at HEAD before this commit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The original paper's data-sources table included a Total-posts column (corpus posts in each drug's window, the denominator the substring + LLM extraction ran against). The current package's Table 2 dropped it. Adding it back so reviewers can see the analysis denominator alongside the per-drug user/report counts. Definition (matches the per-drug analysis filters): COUNT(*) FROM posts WHERE post_date IS NOT NULL AND post_date < <window_end_exclusive> AND user_id != 'deleted' Values for the published SHA c50fcacd…: famotidine 139,075 (window: 2020-07-24 → 2021-06-06) loratadine 139,075 (same window) prednisone 218,822 (2020-07-24 → 2021-10-25) naltrexone 448,114 (2020-07-24 → 2022-07-02) paxlovid 678,923 (2020-07-24 → 2022-12-31, end-2022 cap) colchicine 678,923 (same as paxlovid, capped) These are larger than the original paper's narrower-window post counts (famotidine 88,077, naltrexone 244,036, etc.) because this PR switched to one corpus-wide pull — already documented in Provenance. The added footnote explicitly spells out what Total posts counts and why it differs from "raw post-row count" (the deleted-user and NULL-post_date exclusions match the analysis filters, so the column reads as the actual analysis denominator rather than a rawer count that would mismatch downstream rows). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Table 3 formatter used to bin p-values into one of three buckets: - p < 0.0001 → "<0.0001" (lossy — discards the actual order of magnitude; loratadine and famotidine both rendered as the same string despite differing by 8 orders of magnitude) - p < 0.001 → 4 decimal places (e.g. "0.0001") - p ≥ 0.001 → 3 decimal places (e.g. "0.284") New behavior: - p < 0.001 → 3-sig-fig scientific notation (e.g. "3.56e-17", "1.95e-9", "1.36e-4") - p ≥ 0.001 → 3 decimal places (unchanged) Single-digit exponents have their leading zero stripped (1.95e-9 not 1.95e-09) for readability. Both Table 3 in the executed notebook and the README's Expected Output table now use the same formatter and show the same values. Verification: - verify.py: ALL 5 CHECKS PASSED (formatter change is presentation-only; numeric assertions unchanged) - Rendered Table 3 p-values: loratadine 1.95e-9 famotidine 3.56e-17 naltrexone 1.36e-4 paxlovid 0.284 colchicine 0.530 prednisone 0.666 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Airwhale · 2026-05-07T22:03:01Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a comprehensive reproducibility package for the RCT historical validation analysis, including documentation, audit logs, and automated build scripts. Key changes include fixing a parent_id prefix issue in the Reddit import logic, adding cycle detection to the extraction pipeline, and providing a suite of verification tools to ensure data integrity and numerical consistency. Feedback focuses on improving code maintainability by reusing existing database connections and constants, removing redundant implementations of statistical functions, and consolidating duplicated graph traversal logic into shared utilities.

gemini-code-assist · 2026-05-07T22:05:41Z

+DB_DIR = Path(DB_PATH).parent
+COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"
+combined_conn = sqlite3.connect(COMBINED_DB.as_posix())


The generated notebook code creates a redundant database connection combined_conn. The setup block (injected via build_notebook) already establishes a global conn object using the same database file. It is better to reuse the existing conn to avoid confusion and resource overhead in the executed notebook.

Suggested change

DB_DIR = Path(DB_PATH).parent

COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"

combined_conn = sqlite3.connect(COMBINED_DB.as_posix())

# Reusing the 'conn' established in the setup block.

combined_conn = conn

gemini-code-assist · 2026-05-07T22:05:42Z

+NOTEBOOK_DB_RESOLVER = '''import os
+from pathlib import Path
+
+_DB_FILENAME = "historical_validation_2020-07_to_2022-12.db"
+_PACKAGE_MARKERS = ("_build_paper_figures.py", "verify.py")
+_ENV_VAR = "RCT_DB_PATH"


The database filename and package markers are hardcoded here, duplicating constants already defined in paths.py. While inlining the resolver into the notebook is a good design for portability, the generation of this string should ideally use the constants from paths.py to ensure they stay in sync.

gemini-code-assist · 2026-05-07T22:05:42Z

+def wilson_ci(k, n, z=1.96):
+    """Wilson score confidence interval for a proportion."""
+    if n == 0:
+        return 0.0, 0.0
+    p = k / n
+    denom = 1 + z**2 / n
+    center = (p + z**2 / (2 * n)) / denom
+    margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denom
+    return max(0, center - margin), min(1, center + margin)


The wilson_ci function is manually implemented here, but the reproducibility package already depends on statsmodels (as seen in requirements.txt and _build_paper_figures.py), which provides a more robust implementation via proportion_confint. This manual implementation is redundant and potentially less accurate than the library version.

gemini-code-assist · 2026-05-07T22:05:42Z

+    """V10: re-derive the same per-drug stats the build does, compare to
+    EXPECTED_OUTPUTS. Same logic as the build's V10 cell, but standalone."""
+    from scipy.stats import binomtest
+    from statsmodels.stats.proportion import proportion_confint as wilson  # noqa: F401


The import of proportion_confint as wilson is unused in this function. Removing unused imports keeps the code clean and avoids unnecessary dependencies at runtime.

gemini-code-assist · 2026-05-07T22:05:42Z

+def _detect_parent_cycles(id_to_parent: dict) -> None:
+    """Raise ValueError if id_to_parent contains any cycle.
+
+    The upstream-mentioned-drugs recursion below assumes id_to_parent is a
+    forest (each node has at most one parent and chains terminate). A cycle
+    in malformed imported data would cause the recursion to memoize on
+    (eid, remaining) and still never terminate (cache stores final values,
+    not in-progress visits). Detect once up front and fail loudly.
+
+    Iterative DFS with white/gray/black coloring; O(N).
+    """
+    UNVISITED, VISITING, DONE = 0, 1, 2
+    color = {}
+    for start in id_to_parent:
+        if color.get(start, UNVISITED) != UNVISITED:
+            continue
+        stack = [start]
+        path = []
+        while stack:
+            node = stack[-1]
+            c = color.get(node, UNVISITED)
+            if c == UNVISITED:
+                color[node] = VISITING
+                path.append(node)
+                parent = id_to_parent.get(node)
+                if parent and parent in id_to_parent:
+                    pcol = color.get(parent, UNVISITED)
+                    if pcol == VISITING:
+                        # back-edge into the active path -> cycle
+                        i = path.index(parent)
+                        raise ValueError(
+                            "parent_id cycle detected: "
+                            + " -> ".join(path[i:] + [parent])
+                        )
+                    if pcol == UNVISITED:
+                        stack.append(parent)
+                        continue
+                # parent missing, NULL, or already DONE -> this node is finished
+                color[node] = DONE
+                if path and path[-1] == node:
+                    path.pop()
+                stack.pop()
+            else:
+                if path and path[-1] == node:
+                    path.pop()
+                stack.pop()
+                color[node] = DONE
+


The cycle detection logic in _detect_parent_cycles is implemented using an iterative DFS. While correct, this logic is duplicated in docs/RCT_historical_validation/verify.py. Consider moving this to a shared utility module (e.g., src/utilities/graph.py) to improve maintainability, especially since both locations are part of the same repository.

H4 (substantive correctness, classify.py alias-resolution gap) When the pipeline runs with --drug but without --drug-file, AND skip_canonicalize is set (or the DB hasn't been canonicalized yet), synonyms_for is empty. Aliased mentions like "LDN" or "low dose naltrexone" still get classified — they pass the target_aliases filter — but the system_prompt called for them with synonyms_for.get(alias) = None, so the prompt lacks the canonical- drug context. Less obvious aliases (e.g., "low dose 3mg pills") could be misclassified by a model that doesn't know the alias refers to naltrexone. Fix: when target_aliases is populated (i.e., --drug mode), use the resolved alias set (minus the matched drug itself) as the synonym hint for system_prompt instead of synonyms_for.get(drug). Falls back to synonyms_for.get(drug) when target_aliases is None (whole-corpus runs). Doesn't affect the published numbers — extraction was done with canonicalize ON, so synonyms_for was populated at run time. Future reviewers re-running cold benefit. M3 (redundant DB connection in _build_paper_figures.py) combined_conn = sqlite3.connect(...) opened a second connection to the same DB the setup cell already opened as `conn`. Replaced with `combined_conn = conn`. M5 (redundant wilson_ci implementation in build_notebook.py) Hand-rolled Wilson CI replaced with statsmodels.proportion_confint (already a transitive dep). Same numerical result; less code to defend. M6 (unused import in verify.py) Dropped the dead `from statsmodels.stats.proportion import proportion_confint as wilson` and its `# noqa: F401` muzzle. M2 (utf-8 encoding on read_text) run_sentiment_pipeline.py was already fixed in earlier commits. Two more read_text() calls without encoding remained (classify.py prefilter cache, utilities/__init__.py drug-aliases cache); added encoding="utf-8" to both for cross-platform robustness. M1 (BOM in build_notebook.py) Already stripped by an earlier rewrite. Verified no BOM remains. M4 (DRY: notebook resolver constants vs paths.py) The inlined notebook resolver in _build_paper_figures.py duplicated DB_FILENAME / PACKAGE_MARKERS / RCT_DB_PATH literals already defined in paths.py — drift risk. Now interpolates those constants from `paths.DB_FILENAME` etc. into the resolver template at build time. Single source of truth; verified the rebuilt notebook has the correct values inlined. M7 (DRY: parent-cycle detection) src/pipeline/extract.py:_detect_parent_cycles and docs/RCT_historical_validation/verify.py:_detect_cycles were two copies of the same iterative DFS. Extracted the shared cycle-finding logic into a new src/utilities/graph.py: find_parent_cycles (returns all cycles); extract.py now wraps it for fail-fast raise. verify.py keeps its inline copy on purpose — the reproducibility package can't import from src/ — with a docstring noting the duplication is deliberate and instructing future maintainers to mirror changes. False-positive "critical" findings dismissed (no fix) Three Gemini findings claimed `{{...}}` in build_notebook.py's SETUP_CODE template would create unhashable set-of-dict TypeErrors. These are intentional Python format-string escapes (the whole string is passed to .format(db_path_block=...)); the rendered notebook code has correct single-brace literals. No fix needed. Verification - verify.py: ALL 5 CHECKS PASSED - Full build: clean, all numbers unchanged - Inlined notebook resolver constants confirmed pulled from paths.py: _DB_FILENAME = "historical_validation_2020-07_to_2022-12.db" _PACKAGE_MARKERS = ('_build_paper_figures.py', 'verify.py') _ENV_VAR = "RCT_DB_PATH" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Airwhale · 2026-05-07T22:21:40Z

/gemini review

gemini-code-assist

Code Review

This pull request establishes a reproducibility package for the RCT historical validation analysis, providing the necessary scripts, documentation, and audit trails to regenerate paper figures and tables. Significant improvements were made to the data import pipeline, including a fix for Reddit ID prefix handling to preserve thread structure and the addition of cycle detection. The reviewer feedback highlights the need for stricter determinism in data extraction and deduplication by adding explicit ordering to SQL queries and sort logic, alongside a recommendation to specify minimum versions for dependencies like Scipy to ensure environment compatibility.

gemini-code-assist · 2026-05-07T22:23:42Z

+        WHERE lower(t.canonical_name) = ?
+          AND p.post_date IS NOT NULL
+          AND p.post_date < ?
+          AND p.user_id != 'deleted'
+    ''', (drug, window_end_exclusive_ts)).fetchall()


The SQL query lacks an ORDER BY clause. Since the deduplication logic in dedup_recent_then_strength (line 351) is order-dependent for ties (where both post_date and signal_strength are identical), the results may be non-deterministic depending on the database's internal row ordering. Adding a unique identifier like tr.post_id to the ordering ensures absolute reproducibility, which is critical for this package.

Suggested change

WHERE lower(t.canonical_name) = ?

AND p.post_date IS NOT NULL

AND p.post_date < ?

AND p.user_id != 'deleted'

''', (drug, window_end_exclusive_ts)).fetchall()

WHERE lower(t.canonical_name) = ?

AND p.post_date IS NOT NULL

AND p.post_date < ?

AND p.user_id != 'deleted'

ORDER BY tr.post_id

''', (drug, window_end_exclusive_ts)).fetchall()

gemini-code-assist · 2026-05-07T22:23:42Z

+        WHERE lower(t.canonical_name) = ?
+          AND p.post_date IS NOT NULL
+          AND p.post_date < ?
+          AND p.user_id != 'deleted'
+        """,
+        (drug, cutoff_ts),
+    ).fetchall()


The SQL query lacks an ORDER BY clause, which can lead to non-deterministic results during deduplication if multiple reports share the same timestamp and signal strength. Adding ORDER BY tr.post_id ensures consistent behavior across different environments and database snapshots.

WHERE lower(t.canonical_name) = ? AND p.post_date IS NOT NULL AND p.post_date < ? AND p.user_id != 'deleted' ORDER BY tr.post_id "", (drug, cutoff_ts), ).fetchall()

gemini-code-assist · 2026-05-07T22:23:42Z

+    df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
+                        ascending=[True, True, False, False])


To ensure absolute determinism in deduplication, post_id should be included as a final tie-breaker in the sort order. This prevents drop_duplicates from picking an arbitrary row when all other sorted columns are identical, ensuring the generated CSVs are bit-identical across runs.

Suggested change

df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],

ascending=[True, True, False, False])

df = df.sort_values(["user_id", "drug", "post_date", "sig_rank", "post_id"],

ascending=[True, True, False, False, False])

good thinking, gemini

gemini-code-assist · 2026-05-07T22:23:42Z

+pandas
+matplotlib
+seaborn
+scipy


The analysis scripts use scipy.stats.binomtest, which was introduced in Scipy 1.7.0. Specifying a minimum version ensures that the reproducibility package works correctly in fresh environments without users encountering an ImportError.

scipy>=1.7.0

This file is missing package versions. That means this will fail to build in a few years as packages get updated.

We should be using uv and pyproject.toml.

uv builds are more reproducible because uv.lock specifies the specific package versions actually installed, along with the python version desired.

Todo: take the requirements from this file and move them to pyproject.toml. run uv sync Then update the docs to remove mention of this file.

…06-06 The Glynne et al. medRxiv preprint went live on 2021-06-06 (URL medrxiv.org/content/10.1101/2021.06.06.21258272v1). The previous cutoff 2021-06-07 (strict-<) was off-by-one: it INCLUDED posts made on 2021-06-06 itself, after the preprint was public. That's a small but real contamination window — the methodology paragraph promises "strictly before midnight UTC of the comparator paper's publication date," and for Glynne the first-public-availability date is the preprint, not the day after. Every other drug in the panel uses cutoff = pub_date (strict-< then excludes the pub date itself). Glynne is now consistent with that convention. Empirical effect: 3 treatment_report rows on 2021-06-06 (1 famotidine, 2 loratadine) are now excluded. famotidine user 3cc039f2... had no other pre-preprint reports for the drug (10 lifetime mentions, the 2021-06-06 one was the first). Drops from analysis: n_famotidine 232 -> 231 (-1 positive user). loratadine user 336f9e181f85... had only one lifetime loratadine report, on 2021-06-06. Drops: n_loratadine 90 -> 89 (-1 positive user). loratadine user 93d34b804d58... had 3 earlier pre-preprint reports (Mar, Apr, May 2021). Stays in analysis under most-recent-rule fallback to May 28; no effect on n. New numbers (verified by verify.py, EXPECTED_OUTPUTS updated in both _build_paper_figures.py and verify.py): famotidine: n=231 pos=178 pos_pct=77.056% p=5.51e-17 (was n=232 pos=179 pos_pct=77.155% p=3.57e-17) loratadine: n= 89 pos= 72 pos_pct=80.899% p=3.17e-9 (was n= 90 pos= 73 pos_pct=81.111% p=1.95e-9) Headline pattern unchanged; both still extreme-low p-values, CIs still mutually exclusive from non-responder CIs, all 6 trial-outcome comparisons still match. Tightening makes the methodology paragraph self-consistent rather than off-by-one. Files updated: - DRUG_CUTOFFS in _build_paper_figures.py, verify.py, scripts/dedup_sample_audit.py, scripts/dump_per_drug_csvs.py - EXPECTED_OUTPUTS in _build_paper_figures.py and verify.py - README: pre-publication-cutoffs table, window-verification table (MIN/MAX/reports for famotidine + loratadine refreshed against the new cutoff), Expected Output table - Regenerated DEDUP_AUDIT.md (now reflects fam/lor raw=692/188 instead of 693/190), DRUG_ALIASES.md - Regenerated paper_figures.html, .ipynb, .executed.ipynb, figure0.png, figure1.png, provenance.json Also corrected stale citation form: "Glynne et al. 2021" -> "Glynne et al. 2022" everywhere (the journal version is 2022; the preprint is 2021-06-06, but we cite the journal). Aligned with Tables 2/3/4 in the paper draft. verify.py: ALL 5 CHECKS PASSED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t plot) -> Figure 2 The script previously labelled the per-drug sentiment-breakdown stacked-bar chart as "Figure 0" and the responder/non-responder forest plot as "Figure 1". Publication convention is to number from 1; the existing "Figure 0" reads as a draft artifact. Renumbered so that: - Figure 1 = per-drug full sentiment breakdown (stacked bar) - Figure 2 = forest plot (responders vs non-responders, 95% Wilson CIs) Cascade: - All "Figure 0" -> "Figure 1" (build script titles, captions, intro markdown describing the notebook artifacts) - All "Figure 1" -> "Figure 2" (where Figure 1 was previously the forest plot; this includes body-text references in markdown cells, comments, and references in README) - PNG filenames: figure0.png -> figure1.png (stacked bar); figure1.png -> figure2.png (forest plot). The old figure0.png was removed from tracking. - README updates: * Intro: "Reproduces Figure 1, Table 2, and Table 3" -> "Figure 1, Figure 2, Table 2, and Table 3" * Output-files list: lists both figure1.png and figure2.png with what each shows * "What's reproduced" table: split the old Figure 1 row into Figure 1 (stacked bar) + Figure 2 (forest plot), with Figure 2 carrying the responder-vs-non-responder narrative * Methodology references to "Figure 1" updated to "Figure 2" where they mean the forest plot (error bars, n column, Wilson CI step, deleted-user-filter propagation) - Docstring of the build script updated to describe both figures. Verification: - verify.py: ALL 5 CHECKS PASSED (renumber is presentation-only; numeric assertions unchanged) - Rebuild produces output/figure1.png (157 KB, stacked bar) and output/figure2.png (128 KB, forest plot); old figure0.png removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ely-S · 2026-05-11T23:23:02Z

+- `loradatine`  *(misspelling)*
+- `loratadinе`  *(misspelling)*
+- `claritin redi-tabs`  *(multi-word)*
+- `loratab`


https://www.google.com/search?q=loratab

this looks like different drug

Ely-S · 2026-05-11T23:26:33Z

+    lines.append("# Drug aliases used in extraction\n")
+    lines.append(f"**Generated:** {now} from `{args.db.as_posix()}`")
+    lines.append("**Generator:** `scripts/dump_drug_aliases.py`")
+    lines.append("")


If we add something like this, I bet Gemini will catch the errors:

lines.append("Reviewers: Check that all aliases are correct by ensuring they actually refer to the same treatment. ")

polya-b and others added 5 commits April 30, 2026 11:07

pass in list

0ebb650

add S3 download links for databases and raw data to README

3f26086

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

move reproducibility package to docs/RCT_historical_validation

8cfc099

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread docs/RCT_historical_validation/build_notebook.py Outdated

Comment thread docs/RCT_historical_validation/output/paper_figures.ipynb Outdated

Comment thread docs/RCT_historical_validation/output/paper_figures_executed.ipynb Outdated

fix hardcoded absolute paths in notebooks (use relative paths)

3d64492

Addresses Gemini review: pre-built notebooks contained C:/Users/scgee/... paths. Now uses ../data/ relative paths so notebooks are portable across machines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Airwhale requested review from Ely-S and polya-b April 30, 2026 18:56

gemini-code-assist Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread src/pipeline/classify.py Outdated

Comment thread docs/RCT_historical_validation/build_notebook.py Outdated

Comment thread src/run_sentiment_pipeline.py Outdated

gemini-code-assist Bot reviewed Apr 30, 2026

View reviewed changes

Shaun and others added 5 commits May 2, 2026 16:18

Airwhale closed this May 3, 2026

Shaun and others added 5 commits May 3, 2026 15:03

Airwhale reopened this May 3, 2026

Shaun and others added 2 commits May 4, 2026 12:12

Ely-S approved these changes May 5, 2026

View reviewed changes

Ely-S reviewed May 5, 2026

View reviewed changes

Ely-S requested changes May 5, 2026

View reviewed changes

Shaun and others added 5 commits May 7, 2026 09:51

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

Airwhale requested a review from Ely-S May 7, 2026 22:27

Shaun and others added 2 commits May 10, 2026 15:06

Ely-S reviewed May 11, 2026

View reviewed changes

	SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}
	SENTIMENT_SCORE = {"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}


		## Reproducibility

		Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6` against the same DB produces an identical file. Total sampled (user, drug) pairs: 36. No newline at end of file

		df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
		ascending=[True, True, False, False])

Conversation

Airwhale commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's inside

What the analysis shows

How to use

Data access

Reviewer findings addressed in this PR

Test plan

Uh oh!

qodo-code-review Bot commented Apr 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Airwhale commented Apr 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Airwhale commented Apr 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Ely-S commented May 1, 2026

Uh oh!

Ely-S commented May 1, 2026

Uh oh!

Airwhale commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Airwhale commented May 4, 2026

Uh oh!

gemini-code-assist Bot commented May 4, 2026

Uh oh!

Airwhale commented May 4, 2026

Update: substantial methodology and transparency pass

TL;DR

Methodology changes

Bug found and fixed during the audit pass

New transparency / reviewer artifacts

Build now has hard gates that fire on every notebook execution

Uh oh!

Ely-S commented May 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Airwhale commented Apr 30, 2026 •

edited

Loading

Airwhale commented May 3, 2026 •

edited

Loading

Ely-S commented May 5, 2026 •

edited

Loading

Ely-S commented May 5, 2026 •

edited

Loading

Ely-S commented May 5, 2026 •

edited

Loading

Ely-S commented May 5, 2026 •

edited

Loading

Ely-S commented May 5, 2026 •

edited

Loading