Add RCT historical validation reproducibility package #37
Conversation
Self-contained folder for reproducing Figure 1, Table 2, and Table 3 from the paper. Includes build script, notebook helper, README with plain-language methodology, and requirements. Databases and outputs are gitignored — share via OneDrive. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-add HTML, notebooks, and figure so reviewers can view results without the databases. README now explains how to view pre-built outputs vs re-run from scratch, and directs users to contact authors for the underlying data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a reproducibility package for RCT historical validation, including documentation, build scripts, and pre-built Jupyter notebooks. It also enhances the sentiment pipeline to support hand-curated drug alias lists via a new --drug-file argument. My feedback focuses on a portability issue where the notebook build process hardcodes absolute local file paths into the generated .ipynb files, which prevents them from being executed on other machines without manual modification.
Addresses Gemini review: pre-built notebooks contained C:/Users/scgee/... paths. Now uses ../data/ relative paths so notebooks are portable across machines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces a reproducibility package for RCT historical validation, featuring comprehensive documentation, analysis scripts, and pre-built Jupyter notebooks. Additionally, it enhances the sentiment pipeline by allowing drug aliases to be provided via a text file and centralizing alias resolution logic. Feedback identifies an inconsistency in alias handling during classification, a non-standard Byte Order Mark in a new script, and recommends explicitly specifying UTF-8 encoding when reading the drug aliases file.
- classify.py: use resolve_aliases() for consistent alias handling with extract/canonicalize (fixes missed aliases when skipping canon) - build_notebook.py: strip UTF-8 BOM - run_sentiment_pipeline.py: explicit encoding="utf-8" on file read Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces a reproducibility package for RCT historical validation, including scripts to generate figures and tables from SQLite databases. It also refactors the drug alias resolution logic in the sentiment pipeline to support both LLM-based lookups and manual alias files. I have identified critical syntax errors in the build_notebook.py file where double curly braces were incorrectly used for dictionary and set definitions, which will cause runtime TypeErrors.
```python
conn = sqlite3.connect(DB_PATH)

# ── Sentiment mapping ──
SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}
```
The use of double curly braces {{...}} is incorrect for defining a dictionary. This syntax attempts to create a set containing a dictionary, which will raise a TypeError because dictionaries are not hashable. To define a dictionary, please use single curly braces {...}.
```suggestion
SENTIMENT_SCORE = {"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}
```
There was a problem hiding this comment.
I disagree. Double braces are how you produce literal { and } in format strings; applying this suggestion would cause an error in the generated notebook.
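A minimal sketch of the escaping behavior in question (the template string here is invented for illustration, not the project's actual cell template): cell sources built with `str.format()` must write `{{` and `}}` to emit literal braces.

```python
# The template goes through str.format(), so {{ and }} are escapes that
# render as single literal braces in the generated notebook cell.
template = 'SENTIMENT_SCORE = {{"positive": 1.0, "negative": -1.0}}  # db: {db}'
rendered = template.format(db="../data/example.db")
print(rendered)
# → SENTIMENT_SCORE = {"positive": 1.0, "negative": -1.0}  # db: ../data/example.db
```

Collapsing the doubled braces to single braces, as suggested, would make `str.format()` treat the dictionary body as a replacement field and fail.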
Concern about the deduplication logic. The deduplication code:

```python
SENT_RANK = {"positive": 3, "mixed": 2, "neutral": 1, "negative": 0}
SIG_RANK = {"strong": 2, "moderate": 1, "weak": 0, ...}

for uid, drug, sent, sig, _d in rows:
    k = (SENT_RANK.get(sent, -1), SIG_RANK.get(sig, 0))
    key = (uid, drug)
    if key not in best or k > best[key][0]:
        best[key] = (k, sent, drug)
```

Each (user, drug) pair keeps exactly one record — the one with the highest rank tuple (sentiment_rank, signal_strength_rank).

The problem: the ranking is a peak-performance rule, not a summary. A user's final classification is determined by their single best report about a drug, ever. Consider a user who posts one early positive report about naltrexone followed by several negatives: that user would be counted as a positive outcome. It's asymmetric in only one direction: a single positive report can override any number of negatives, but no number of negatives can override a single positive.
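A hedged, runnable sketch of the peak-rank rule quoted above on a toy case (the row layout follows the snippet; the sample rows are invented for illustration):

```python
# Peak-rank dedup: rank by (sentiment_rank, signal_rank) and keep the max.
SENT_RANK = {"positive": 3, "mixed": 2, "neutral": 1, "negative": 0}
SIG_RANK = {"strong": 2, "moderate": 1, "weak": 0}

rows = [
    ("u1", "naltrexone", "positive", "weak", "2021-01-01"),    # one early positive
    ("u1", "naltrexone", "negative", "strong", "2021-06-01"),  # then only negatives
    ("u1", "naltrexone", "negative", "strong", "2021-09-01"),
]

best = {}
for uid, drug, sent, sig, _d in rows:
    k = (SENT_RANK.get(sent, -1), SIG_RANK.get(sig, 0))
    key = (uid, drug)
    if key not in best or k > best[key][0]:
        best[key] = (k, sent, drug)

# The weak early positive outranks both later strong negatives.
print(best[("u1", "naltrexone")][1])  # → positive
```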
To make it easier to verify, can you modify the SQL that collects the posts for analysis to include `WHERE post_date < study_date`?
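A hedged sketch of the requested predicate: collect only posts dated strictly before the study's first-public date. Table and column names here are illustrative stand-ins, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, post_date TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [("pre", "2021-05-30"), ("post", "2021-07-15")])

# ISO-8601 date strings compare correctly as text in SQLite.
study_date = "2021-06-06"  # e.g. the Glynne preprint date used in this thread
rows = conn.execute(
    "SELECT post_id FROM posts WHERE post_date < ?", (study_date,)
).fetchall()
print(rows)  # → [('pre',)]
```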
Re-runs Figure 1 / Table 2 / Table 3 with three substantive changes:

1. Data sources: adds master_gap_2020-07_to_2022-12.db, which classifies all six target drugs across the full corpus from r/covidlonghaulers inception (2020-07-24) through end of 2022. Built by running the pipeline once per drug with --drug, leveraging the substring-prefilter for cost efficiency. Existing per-drug DBs are still used; cross-DB rows are deduped on post_id with master_gap taking priority.

2. Cutoff dates: corrected to "day before earliest public release" using verified preprint and journal online dates. Glynne now uses 2021-06-06 (medRxiv preprint) instead of 2021-10-01 (journal). Prednisone comparator updated to Utrero-Rico 2021. Naltrexone, paxlovid, colchicine cutoffs verified against publication metadata.

3. Per-(user, drug) dedup rule: replaces the previous "best-report" peak-rank rule (positive > mixed > neutral > negative; signal as tiebreaker) with a "most recent + signal-strength tiebreaker" rule. The new rule is symmetric on sentiment direction and addresses the asymmetry concern raised in code review (single positive could override any number of later non-positives).

Headline numbers are tighter and more defensible:

- Trial-positive drugs (loratadine, famotidine, naltrexone) all clear 50% with p < 0.0001.
- Trial-null drugs (colchicine, paxlovid, prednisone) all hover near 50% with no per-drug significance against the null.
- Prednisone, the most-debated case in the original analysis, settles at 49.3% with the larger sample, exactly at chance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Table 3 previously combined the comparator paper, journal/preprint source, and date into one parenthesized string per row. Splitting them into two columns makes the per-drug comparator easier to scan and the publication date explicit.

DRUG_CUTOFFS now stores three fields per drug: (cutoff_date, paper_short, source_date)

Figure 1 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the build script pulled classifications from five databases
(master_gap plus four older per-drug runs) and deduped on post_id at
analysis time. About 3% of relevant classifications lived only in the
older DBs and were missing from master_gap, so cross-DB merging was
required to reproduce the headline numbers.
To simplify reproduction, scripts/build_combined_db.py was added to copy
the missing rows into the master_gap DB and write out a new self-sufficient
SQLite file: historical_validation_2020-07_to_2022-12.db (314 MB). It
has been uploaded to S3 at:
s3://patientpunk/scientific_validation/rct_historical/processed/
historical_validation_2020-07_to_2022-12.db
(replacing the older master_gap_2020-07_to_2022-12.db).
Reviewers reproducing the paper now need to download a single 314 MB DB
instead of five DBs totalling ~580 MB. The build script reads from this
single DB; numbers are bit-for-bit identical to the prior multi-DB version.
The original per-drug DBs are still published on S3 as optional source
data so their contents can be independently verified against the combined
DB if desired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A few spots in the README still talked about cross-DB merging or pointed at the old master_gap_*.json filename. Cleaned those up: - Reproduce section now includes the curl step to download the single required DB. - Per-drug CSV section no longer references the deleted scripts/merge_and_analyze_historical.py. - Databases table reframes the older per-drug DBs as "Optional, transparency only" with a clear "Required" marker on the combined DB. - Why-multiple-DBs paragraph rewritten to reflect the actual provenance (master pipeline run + small backfill from earlier per-drug runs). - Renamed master_gap_2020-07_to_2022-12.json to historical_validation_2020-07_to_2022-12.json on S3 for naming consistency with the DB; README link updated. - Step 1 now reads "Pull all reports per drug" from a single DB rather than "Merge all databases per drug". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Incorporates the two charts requested in PR #38 (Eli's working-files PR, explicitly marked Do-Not-Merge). The PR description asked for two specific items to be merged into our notebook; everything else in that PR is V&V audit work and unrelated artifacts.

Figure 0 (NEW) — Full sentiment breakdown per drug: Stacked horizontal bar showing the four-way sentiment split (positive / mixed / neutral / negative) per drug, with each segment labeled by percentage and raw count. Y-axis labels include drug name + trial-direction tag. Plotted from our resp_df, so it inherits the combined-DB single-source workflow and the most-recent-plus-strength dedup rule.

Figure 1 — cosmetic upgrades only:

- figsize 12x5.5 -> 13.5x6.8 (more breathing room, prints larger)
- value-label fontsize 9 -> 11 (fontweight bold preserved)
- y-tick labels 10 -> 12; xlabel 11 -> 12; title 12 -> 14
- trial-direction tags switched from monospace to sans-serif at fontsize 11
- title prefixed with "Figure 1 — " for symmetry with Figure 0

Numbers in Figure 1 are unchanged. The data source (combined DB) and dedup rule are unchanged. Only styling and the new Figure 0 are added.

Other items from PR #38 deliberately not merged:

- Eli's _gen_analysis.py / analysis.ipynb use the old multi-DB merge and old peak-rank dedup; adopting them would undo the methodology fixes landed in 2208e2d, ad7a74c, 6424e03, d1d7e1a.
- The V&V audit in vv/*.md is substantive review material but not code to merge; several of its critiques are already addressed by recent work (V2 sampling-window cutoffs, V8 cross-DB dedup, V9 hard-coded Table 2). Open items (V1 build manifest, V3 cycle guard, V7 dedup sensitivity, V10 expected-output assertions) are tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re: dedup: as we discussed, changed it so that it picks the most recent highest-signal sentiment. So if there is a strong positive and then a strong negative, it will show up as a negative. If there is a strong positive and then a moderate negative, it will show up as positive. I re-ran (and repopulated S3) with a larger dataset, from the beginning of the dataset to a day before publication of the relevant paper, or to two years of the dataset, whichever is less. Effect sizes and significance both increase. I need to confirm reproducibility on this dataset and make sure documentation matches. I made the requested merge from #38, but there are some methodological critiques I need to go through: it's like third on the list of things I need to do. Will look into the SQL change requested above, seems pretty straightforward. Closing and will reopen once those check marks are ticked off.
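A hedged sketch of the revised dedup rule as it is finally worded later in the thread (most recent post wins on the full UTC timestamp; signal strength breaks only exact-timestamp ties). The field layout and rank table are assumptions for illustration.

```python
# Most-recent dedup: rank by (timestamp, signal_rank) and keep the max,
# so a later report always beats an earlier one regardless of sentiment.
SIG_RANK = {"strong": 2, "moderate": 1, "weak": 0}

def dedup_most_recent(rows):
    best = {}
    for uid, drug, sent, sig, ts in rows:
        k = (ts, SIG_RANK.get(sig, 0))  # ISO timestamps sort lexicographically
        key = (uid, drug)
        if key not in best or k > best[key][0]:
            best[key] = (k, sent)
    return {key: v[1] for key, v in best.items()}

rows = [
    ("u1", "naltrexone", "positive", "strong", "2021-03-01T10:00:00"),
    ("u1", "naltrexone", "negative", "strong", "2021-08-01T09:00:00"),
]
print(dedup_most_recent(rows))  # → {('u1', 'naltrexone'): 'negative'}
```

Unlike the peak-rank rule, this is symmetric: a later negative overrides an earlier positive just as readily as the reverse.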
…master_gap

The build_combined_db.py script (introduced in 6424e03) backfilled rows from the older per-drug DBs into the master_gap DB. That backfill was buggy: it copied treatment_reports.user_id from the source DBs while leaving the posts table's user_id as master_gap's. Because the older DBs and master_gap use different username-hashing conventions, every one of the 262 backfilled rows ended up with a treatment_reports.user_id that did not match the posts.user_id for the same post. That breaks dedup, which keys on user_id.

Verified: in the contaminated DB, all 262 backfilled rows had mismatched user_ids (3.30% of 7,949 total).

Fix: replace the contaminated DB with a clean copy of master_gap (the direct output of the master pipeline run, with no backfill). This loses the small amount of classification data that lived only in older DBs (sample sizes shrink ~6-12% for prednisone, naltrexone, colchicine), but the headline finding is unchanged: every trial-positive drug is significantly above the 50% null at p ≤ 0.0001, and no trial-null drug reaches significance.

Updated values in Table 3 / Figure 1:

- loratadine n=91 81.3% p<0.0001 trial: positive
- famotidine n=233 77.3% p<0.0001 trial: positive
- naltrexone n=155 65.8% p=0.0001 trial: positive
- colchicine n=92 54.3% p=0.47 trial: null
- paxlovid n=196 54.1% p=0.28 trial: null
- prednisone n=344 48.8% p=0.71 trial: null

Changes in this commit:

- Local + S3 historical_validation_2020-07_to_2022-12.db replaced with the clean master_gap snapshot (313 MB; 7,687 reports; 0 user_id mismatches).
- merged/*.csv on S3 regenerated from clean DB.
- README updated: the Provenance section now states the DB is the direct output of one master pipeline run with no backfill, and explains why cross-DB merging is unsafe given different username-hashing conventions. Expected-Output table updated with the new (clean) values.
- Outputs (figure0.png, figure1.png, paper_figures.{html,ipynb}) rebuilt from clean DB.
- scripts/build_combined_db.py and scripts/merge_and_analyze_historical.py removed; superseded by scripts/dump_per_drug_csvs.py which reads only the single combined DB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…older

Three follow-ups from a reviewer who downloaded the (then-contaminated) DB before the c5c4957 fix landed:

1. Publish SHA-256 of the required database in the README and provide verification commands for macOS/Linux/Windows. Hash: FA67BE6D10FFE0B98365390A5223F6DFBC708360E830C725B74B443337AB4952

2. Add an integrity assertion at the top of the build script that fails loudly if any treatment_reports.user_id != posts.user_id. The build now prints "DB integrity check: 0 mismatched user_ids (must be 0). PASS." on every successful run. If the bug ever re-enters, the build aborts before producing figures, so a contaminated DB cannot silently produce numbers.

3. Stop creating a 0-byte data/corpus_baseline_onemonth.db when the reviewer doesn't have that file. The build was passing "../data/corpus_baseline_onemonth.db" to build_notebook.py purely as the cosmetic placeholder DB path that the helper's setup cell connects to. sqlite3.connect() silently creates an empty file when the path doesn't exist, which left a confusing 0-byte artifact behind. Switched the path to "../data/historical_validation_2020-07_to_2022-12.db" so the helper opens a real DB. No analysis cell uses the helper's conn variable, so the change is purely cosmetic but eliminates the misleading file.

The contaminated-DB issue itself was already fixed in c5c4957; the S3 file is clean and matches the SHA-256 above. The reviewer's hash (4BBF06E7...) was for the old contaminated upload, which is no longer on S3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
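The integrity assertion in item 2 can be sketched as follows; the schema here is a minimal stand-in for the real DB, and the message string follows the commit text.

```python
# Count rows where treatment_reports.user_id disagrees with posts.user_id
# for the same post; abort the build unless the count is zero.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id TEXT PRIMARY KEY, user_id TEXT);
CREATE TABLE treatment_reports (post_id TEXT, user_id TEXT);
INSERT INTO posts VALUES ('p1', 'hash_a'), ('p2', 'hash_b');
INSERT INTO treatment_reports VALUES ('p1', 'hash_a'), ('p2', 'hash_b');
""")

(mismatches,) = conn.execute("""
    SELECT COUNT(*)
    FROM treatment_reports tr
    JOIN posts p ON p.post_id = tr.post_id
    WHERE tr.user_id != p.user_id
""").fetchone()
assert mismatches == 0, f"{mismatches} mismatched user_ids (must be 0)"
print(f"DB integrity check: {mismatches} mismatched user_ids (must be 0). PASS.")
```

On the contaminated DB this assertion would have fired on the 262 backfilled rows instead of letting them silently corrupt the dedup.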
Two reviewer follow-ups, both intentional methodology fixes:
1. Deleted-user policy (was: implicit collapse; now: explicit exclusion)
The DB has 52,603 posts where the Reddit author field was [deleted] or
[removed] at scrape time, all mapped to the placeholder user_id "deleted".
122 of those posts have classified treatment reports. Per-user dedup
would collapse them all into one pseudo-user (one vote per drug for the
entire deleted population) — methodologically wrong because they come
from many distinct real users we cannot identify.
New policy: exclude every post with user_id = "deleted" from analysis.
Verified empirically that the shift is at most 0.5pp and 1 unit of n per
drug, so no headline conclusion changes:
loratadine n=91 -> 90 (81.3% -> 81.1%)
famotidine n=233 -> 232 (77.3% -> 77.2%)
naltrexone n=155 -> 154 (65.8% -> 65.6%)
colchicine n=92 -> 91 (54.3% -> 53.8%)
paxlovid n=196 -> 196 (54.1% -> 54.1%)
prednisone n=344 -> 343 (48.8% -> 48.7%)
Implementation: AND p.user_id != 'deleted' added to the SQL query in
both _build_paper_figures.py and scripts/dump_per_drug_csvs.py, so the
filter propagates to figures, tables, and the published per-drug CSVs
uniformly. README adds a "Step 1a: Deleted-user exclusion" section
documenting the policy and the empirical shift.
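The exclusion described above can be sketched as a query-time filter; the schema is a minimal stand-in, and the `'deleted'` placeholder value follows the commit text.

```python
# Deleted-user exclusion applied in SQL at query time: the stored DB is
# unchanged, so its SHA-256 is unchanged too.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id TEXT, user_id TEXT);
INSERT INTO posts VALUES ('p1', 'hash_a'), ('p2', 'deleted'), ('p3', 'hash_b');
""")

rows = conn.execute(
    "SELECT post_id FROM posts p WHERE p.user_id != 'deleted' ORDER BY post_id"
).fetchall()
print(rows)  # → [('p1',), ('p3',)]
```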
2. Colchicine first-public date correction
Reviewer flagged that JAMA's article page lists Bassi et al. 2025 as
"Published Online: October 20, 2025"; the December 2025 date in the
README was the print issue date, not the online date. Fixed:
cutoff 2025-11-30 -> 2025-10-19
source JAMA Intern Med 2025-12-01 -> JAMA Intern Med 2025-10-20
Does not affect numbers because colchicine is bound by the end-2022
analysis cap, but "first public date" is now accurate.
Re-uploaded merged/*.csv to S3 with the new (post-exclusion) values. The
combined database itself is unchanged — the exclusion happens at query
time, not in the data — so its SHA-256 is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cosmetic-but-real consistency fixes flagged in an internal-consistency audit:
1. The naltrexone comparator label used a Unicode prime character (U+2032,
"O′Kelly") in the build script's DRUG_CUTOFFS dict. The README used a
regular apostrophe ("O'Kelly"). Fixed by quoting the paper field in
double quotes inside the r-string raw cell, which allows a regular
apostrophe inside the value:
'O′Kelly et al. 2022' -> "O'Kelly et al. 2022"
Now both files use the same character; rebuilt notebook output reflects
the corrected label.
2. README never mentioned scripts/dump_per_drug_csvs.py, the script that
regenerates the published per-drug CSVs from the combined database.
Added a one-sentence pointer to it in the "Per-drug merged + deduped
CSVs" subsection so reviewers know how to reproduce that step locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nces

Two related cleanups, no analytical change.

(1) Switch SQL from `post_date <= eod-of-day-before-publication` to `post_date < midnight-utc-of-publication-date`. Same rows in/out (verified bit-for-bit by re-running dump_per_drug_csvs.py — same n, same %pos, same CIs, same p-values), but the predicate now matches the natural-language "strictly before publication" much more transparently. Renames `cutoff` -> `pub_date` and `effective_cutoff` -> `window_end_exclusive` throughout. Adds `epoch_midnight()` (replacing `epoch_eod()`) and changes `END_2022 = '2022-12-31'` to `END_2022_EXCLUSIVE = '2023-01-01'`. summary.csv column rename re-uploaded to S3.

(2) Drop README references to the 5 optional smaller .db files. Verified nothing in the build flow reads them — both _build_paper_figures.py and scripts/dump_per_drug_csvs.py use only the combined DB. The smaller DBs remain on S3 (cheap to keep) but are no longer documented as part of the reproducibility package, removing ~40 lines of stale "optional / for transparency" tables and trimming the Provenance section. Also updates build_notebook.py docstring example to use the canonical DB name.

Verified: dump_per_drug_csvs.py and _build_paper_figures.py both produce identical headline numbers (famotidine 232/77.2%, loratadine 90/81.1%, naltrexone 154/65.6%, paxlovid 196/54.1%, colchicine 91/53.8%, prednisone 343/48.7%). DB integrity check still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
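The equivalence of the two predicates can be sketched as follows. `epoch_midnight()` is reimplemented here as an assumption about the helper's behavior (epoch seconds at 00:00:00 UTC on the given date); the Glynne date is the one used elsewhere in this thread.

```python
# "strictly before publication" as post_ts < midnight UTC of pub_date.
from datetime import datetime, timezone

def epoch_midnight(date_str: str) -> int:
    """Epoch seconds at 00:00:00 UTC on the given YYYY-MM-DD date."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

window_end_exclusive = epoch_midnight("2021-06-06")        # pub_date
last_second_before = epoch_midnight("2021-06-05") + 86399  # 23:59:59 day before
first_second_of_pub_day = epoch_midnight("2021-06-06")

print(last_second_before < window_end_exclusive)       # → True
print(first_second_of_pub_day < window_end_exclusive)  # → False
```

The last second of the day before passes and the first second of publication day is excluded, which is exactly what `<= eod-of-day-before-publication` admitted, so the row set is unchanged.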
Two fixes addressing review feedback:
1. README post-count reconciliation. Reviewer noted the README's
"47,434 top-level posts + 684,092 comments" line was inconsistent
with verify.py / THREAD_AUDIT.md / the DB itself reporting
47,442 submissions + 684,084 comments. Both numbers are correct
but describe different stages: 47,434/684,092 is the JSON-level
reality (actual posts vs comments), 47,442/684,084 is the DB-level
reality after the import script's dangling-parent cleanup nulls
parent_ids for 8 comments whose parents fell outside the scrape
window. Same total. README now explains the 8-row delta inline
so the apparent inconsistency is documented as a known
scrape-window edge effect.
2. Drop internal audit codes from public-facing files. The "V1/V2/V3/
V7/V10" prefixes in section headers, check names, and prose came
from an internal audit framework that shouldn't be referenced in
the public reproducibility package. Renamed:
- V3_THREAD_AUDIT.md -> THREAD_AUDIT.md
- V7_DEDUP_AUDIT.md -> DEDUP_AUDIT.md
- scripts/v3_thread_audit.py -> scripts/thread_audit.py
- scripts/v7_dedup_sample_audit.py -> scripts/dedup_sample_audit.py
Plus stripped V-codes from verify.py check names ("V1: DB integrity"
-> "DB integrity"), build-script cell headings ("## Window
verification (V2)" -> "## Window verification"), build-script
stdout ("V2 window verification: ..." -> "Window verification: ..."),
DRUG_ALIASES.md prose, and a few other call sites. Section
structure unchanged; only the audit-framework labels are gone.
Verified: build runs cleanly, all assertions still PASS, headline
numbers unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both small clarifications, addressing reviewer comments that pointed at
genuine ambiguity in the docs (not at any methodological issue):
1. README "Raw Reddit JSON" section — clarify scope of import_posts.py.
The previous wording ("rebuild the SQLite DB from this JSON file
using src/import_posts.py") implied that one script rebuilds the
full analysis DB. It does not: import_posts.py rebuilds only the
`users` and `posts` tables. The `treatment_reports` table (LLM
classifications) is the frozen output of running
src/run_sentiment_pipeline.py per target drug — Anthropic API
credentials and ~2 hr wall time required. README now states this
explicitly.
2. DRUG_ALIASES.md "Epistemic status" section — make the LLM-aliases
methodology choice explicit. Previously the file noted that aliases
are LLM-generated and not manually adjudicated, but didn't address
the obvious reviewer question: is this a workaround, an
unintentional limitation, or a deliberate choice? Added a short
paragraph stating it's deliberate — evaluating the LLM
canonicalization step is part of the method being demonstrated, so
we treat alias generation as a pipeline output rather than a hand-
authored input, and publish the generated aliases for auditability.
Updated in scripts/dump_drug_aliases.py and regenerated DRUG_ALIASES.md.
No code logic changed; no numbers shifted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/gemini review
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.
Update: substantial methodology and transparency pass

Pushed a substantial round of changes in response to PR review feedback (Eli, Qodo, Gemini, plus a more recent doc review). Headline conclusion is unchanged — every trial-positive drug still cleanly above 50% with strong CIs, every null-trial drug still hugging 50% — but the path to those numbers is now documented, asserted, and reviewable end-to-end.

TL;DR

Methodology changes

Bug found and fixed during the audit pass

New transparency / reviewer artifacts

Plus standalone reproducible scripts. Build now has hard gates that fire on every notebook execution.
Great results!
This is a great solution to documenting what was run and how!
> comments" downloads, save the resulting `.jsonl` files locally.
> 3. Run `scripts/convert_jsonl_to_source.py` to filter to the analysis
>    window and produce `historical_validation_2020-07_to_2022-12.json`.
> 4. Verify SHA-256 matches `298d5bc7...abfe6a`.
> guaranteed against the snapshot Arctic Shift had on 2026-05-02. Earlier
> or later Arctic Shift snapshots may differ.
>
> ## Why the date range stops at 2022-12-31
> the analysis SQLite DB. The lists are what every pipeline run
> substring-matched posts and comments against during the extraction
> step and what every canonicalization step normalized to.
> Reviewers can audit these lists directly without running anything.
> - `h2 blocker` *(multi-word)*
> - `h2 antagonist` *(multi-word)*
> - `acid reducer` *(multi-word)*
> - `heartburn relief` *(multi-word)*
not sure if heartburn relief and acid reducer are correct.
> - `50mg naltrexone` *(multi-word)*
> - `oral naltrexone` *(multi-word)*
> - `injectable naltrexone` *(multi-word)*
> - `extended release naltrexone` *(multi-word)*
> - `naltrexone er` *(multi-word)*
> - `naltrexone xr` *(multi-word)*
These are definitely not the same thing as LDN:

- `50mg naltrexone`
- `injectable naltrexone`
- `extended release naltrexone`
- `naltrexone er`
- `naltrexone xr`
- `vivitrol`
- `revia`
- `depade`

A few reasons:

- The brand-name naltrexones are all at 50mg
- The 50mg dose is only for opioid and alcohol use disorder
- While 50mg might be effective for long covid, it's essentially a different treatment
- The injectable and XR formulations are definitely not the same thing
LDN works because it has different effects at a lower dosage.
> - `famotidins` *(misspelling)*
> - `h2 blocker` *(multi-word)*
> - `h2 antagonist` *(multi-word)*
> - `acid reducer` *(multi-word)*
I'm not sure if I would use acid reducer and heartburn relief as aliases to pepcid.
> - **loratadine** includes `loratab` — Lortab is a brand of
>   hydrocodone/acetaminophen, a different (opioid) drug; this is
>   likely an LLM error and should be reviewed.
> - `predni`
> - `prednisona` *(misspelling)*
> - `corticosteroid`
> - `steroid`
i would probably remove steroid as an alias to prednisone lol
> To turn any of these into actual corrections, the path is: edit the
> alias list, re-run canonicalization, re-run classification, regenerate
> the analysis DB and figures. None of this changes the headline
> conclusion (every drug's responder rate stays in its current bucket
> when individual ambiguous aliases are removed) but it is the open
> methodology task of manually reviewing every alias before publication.
fixing any canonicalization errors would strengthen our results
> ## Reproducibility
> Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6` against the same DB produces an identical file. Total sampled (user, drug) pairs: **36**.
Issue 1: the dedup audit reproducibility command is not runnable as written, and the “identical file” claim is too strong.
Two problems:
Suggested fix: update the sentence to something like:
Issue 2: the dedup rule wording says signal strength breaks ties on the “same date,” but the implementation uses the full timestamp.

Suggested fix: change the wording in both the script-generated intro and the checked-in audit from “ties on the same date” to “ties on the same timestamp.”
Minor Issue 3: the DB path recorded in the audit file is a stale Windows-backslash path.

But the README now instructs reviewers to place the DB in the package-local `data/` directory.

The current build/verification flow also uses that package-local path. This stale metadata will confuse reviewers trying to rerun the audit. Suggested fix: regenerate the audit file so its recorded DB path matches the package-local path.
Here is the same writeup from Claude Code with nicer formatting, if this helps.

Code review

Found 3 issues (re-checking the 3 items from the previous changes-requested review):
PatientPunk/scripts/dedup_sample_audit.py Lines 173 to 179 in 384e0d0
PatientPunk/scripts/dedup_sample_audit.py Lines 7 to 9 in 384e0d0

PatientPunk/docs/RCT_historical_validation/DEDUP_AUDIT.md Lines 112 to 119 in 384e0d0

PatientPunk/docs/RCT_historical_validation/DEDUP_AUDIT.md Lines 3 to 5 in 384e0d0

Issues 4–6 from earlier comments are resolved.

🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
Here are some of Sonnet's comments. I don't know that any of these are meaningful, but leave it up to you.

Code review (OPTIONAL deep pass — reproducibility, data quality, robustness)

Found 5 new issues:
PatientPunk/docs/RCT_historical_validation/output/provenance.json Lines 18 to 36 in 384e0d0
PatientPunk/scripts/dump_per_drug_csvs.py Lines 39 to 44 in 384e0d0
PatientPunk/docs/RCT_historical_validation/README.md Lines 379 to 382 in 384e0d0

PatientPunk/docs/RCT_historical_validation/verify.py Lines 63 to 68 in 384e0d0
PatientPunk/docs/RCT_historical_validation/output/provenance.json Lines 29 to 36 in 384e0d0

Minor (not blocking):

PatientPunk/docs/RCT_historical_validation/_build_paper_figures.py Lines 16 to 20 in 384e0d0

🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
PR #37 reviewer findings (Eli + Sonnet):

- Eli #1: dedup_sample_audit.py output template fixed — script name was 'v7_dedup_sample_audit.py' (file doesn't exist), now correctly references 'scripts/dedup_sample_audit.py'; "produces an identical file" claim softened to acknowledge the regenerated 'Generated' timestamp at the top of the file.
- Eli #2 (Option A — wording, not implementation): "same date" reconciled with the full-timestamp implementation across _build_paper_figures.py (4 spots), README, dedup_sample_audit.py, and the regenerated DEDUP_AUDIT.md. The rule in practice reduces to "most recent post wins (full UTC timestamp)" because exact-timestamp ties are rare; signal_strength acts as a near-vestigial tiebreaker on those rare ties. Implementation unchanged.
- Eli #3: stale Windows-backslash DB path in DEDUP_AUDIT.md ('data\historical_validation\...') replaced by the package-local posix path on regeneration.
- Sonnet #4 + #5: new README "Notes on output/provenance.json" subsection documents the dirty=true chicken-and-egg constraint and the extraction commit (16c2569) vs HEAD (this PR) diff — reformalized SQL predicate, deleted-user exclusion, colchicine date correction, audit/verify scripts, parent_id prefix fix; explicitly notes no alias-resolution or sentiment-classification logic was changed in this range.
- Sonnet #6: dump_per_drug_csvs.py default DB path mismatch with README — fixed by routing through paths.py (see refactor below).
- Sonnet #7: README "every comparison reaches p ≤ 0.0001" prose corrected to "p < 0.001" (naltrexone is 1.358e-4 = 0.0001358).
- Sonnet #8: provenance writer now normalizes Windows backslashes in extraction_runs.config; new README subsection explains what the 'master_gap' working directory was.

Path resolution refactor (paths.py):

- New paths.py module with marker-based anchor walk-up + RCT_DB_PATH env var override. Single source of truth.
- All four entry points (_build_paper_figures.py, verify.py, scripts/dump_per_drug_csvs.py, the executed notebook) use the same resolver. The notebook embeds an inlined copy of the resolver in its first setup cell so reviewers can read exactly how the DB is found without chasing imports.
- build_notebook.py grew an optional db_path_block kwarg so callers can inject richer DB resolution (used by _build_paper_figures.py); default behavior unchanged for any other notebook builders.
- provenance.json db.path is now package-relative ("data/historical_validation_2020-07_to_2022-12.db") — no more leaky absolute Windows-user paths.
- verify.py header surfaces resolution mode ("package anchor (default)" vs "RCT_DB_PATH=..." vs "--db CLI flag") so a stale env var failure mode is obvious at a glance.

Bonus parallel-bug catches in adjacent audit scripts (same false "produces an identical file" claim and Windows-backslash DB paths):

- dump_drug_aliases.py and thread_audit.py corrected; DRUG_ALIASES.md and THREAD_AUDIT.md regenerated with posix DB paths and accurate reproducibility wording.

Figure 1 redesign:

- Replaced the paired horizontal bar chart with a two-marker forest plot per drug (responder green circle + non-responder red square), trial outcome tagged in the row label, paper citation on line two of the y-tick label. Each marker has a matching-color 95% Wilson CI line; numeric values printed at the right end.

Verification:

- All headline numbers unchanged: famotidine 232, loratadine 90, naltrexone 154, paxlovid 196, prednisone 343, colchicine 91.
- DB SHA-256 unchanged: c50fcacd7ce366f397152f5fe4dbb59d5eaf64ba32627faef91dad86fbf6c6f4
- verify.py: ALL 5 CHECKS PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
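A hedged sketch of a marker-based walk-up resolver with an RCT_DB_PATH override, in the spirit of the paths.py module described above. The marker filename and DB layout here are illustrative assumptions, not the module's actual contents.

```python
# Walk up from a starting directory until a marker file is found, then
# resolve the DB relative to that anchor; an env var overrides everything.
import os
from pathlib import Path

DB_REL = "data/historical_validation_2020-07_to_2022-12.db"

def resolve_db(start: Path, marker: str = "README.md") -> Path:
    override = os.environ.get("RCT_DB_PATH")
    if override:
        return Path(override)  # env var wins over the anchor walk-up
    for anchor in (start, *start.parents):
        if (anchor / marker).exists():
            return anchor / DB_REL
    raise FileNotFoundError(f"no {marker} found above {start}")
```

Usage would look like `resolve_db(Path(__file__).parent)` from any script inside the package, giving every entry point the same answer regardless of the caller's working directory.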
…rl, soften alias claim
Three findings from a fresh-checkout reproduction pass:
P2 — provenance could record the wrong git commit
_git_metadata() ran `git rev-parse HEAD` and `git status` without
anchoring to the package root. If a reviewer extracted the package
into an unrelated dirty repo checkout (Codex's exact scenario),
output/provenance.json silently recorded that *other* repo's commit
rather than the package's commit — silent misdirection that's worse
than recording 'unknown'.
Fix: run git with cwd=Path(__file__).resolve().parent, then verify
_build_paper_figures.py is actually tracked in that worktree via
`git ls-files --error-unmatch`. If git can't resolve HEAD or the
script isn't tracked, record 'outside-git' (visibly broken) instead
of an unrelated commit (silently misleading). 'unknown' is now
reserved for the case where git itself isn't installed.
Verified across three scenarios:
- inside the package's own checkout → reports actual SHA
- extracted to a plain directory → 'outside-git' + warning
- extracted into an unrelated repo → 'outside-git' + warning
(was: returned unrelated SHA)
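A sketch of the anchored-git approach described above, with the stated 'outside-git' vs 'unknown' semantics. The two git commands are the ones named in the fix; the helper name `git_commit` and the surrounding structure are assumptions:

```python
import subprocess
from pathlib import Path

# Anchor every git call to the package's own directory, not the caller's cwd.
# (globals().get guards the rare case where __file__ is undefined.)
PACKAGE_ROOT = Path(globals().get("__file__", ".")).resolve().parent


def git_commit() -> str:
    """Commit SHA of the package's own checkout, or a visible sentinel."""
    try:
        head = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=PACKAGE_ROOT, capture_output=True, text=True,
        )
        tracked = subprocess.run(
            ["git", "ls-files", "--error-unmatch", "_build_paper_figures.py"],
            cwd=PACKAGE_ROOT, capture_output=True, text=True,
        )
    except FileNotFoundError:
        return "unknown"       # git itself is not installed
    if head.returncode != 0 or tracked.returncode != 0:
        return "outside-git"   # visibly broken beats silently misleading
    return head.stdout.strip()
```

The `ls-files --error-unmatch` check is what rules out the unrelated-dirty-repo scenario: HEAD may resolve, but the package's own script is not tracked there.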
P3 — README curl command silent on HTTP errors
`curl -L -o data/...` exits 0 on 4xx, writes the error response body
to the destination filename, and the next step crashes with a
confusing `sqlite3.DatabaseError: file is not a database`. Changed
to `curl -fL -o ...` with a comment explaining why, and added a
`python verify.py` step right after the download so HTTP/transfer
errors fail at the download step rather than mid-build.
P3 — alias sensitivity claim was unsupported
DRUG_ALIASES.md previously said "None of this changes the headline
conclusion" about alias edits — a quantitative sensitivity claim
with no per-alias ablation table to back it. Softened the wording
in scripts/dump_drug_aliases.py: explicitly state we have NOT run a
quantitative sensitivity analysis, that this file documents
inspectability rather than robustness, and that producing a
per-drug ablation table is the open methodology task before
publication. DRUG_ALIASES.md regenerated.
Also fixed (parallel hygiene):
- dump_drug_aliases.py reproducibility command rendered Windows
backslashes for args.db / args.out; now uses .as_posix().
Verification:
- provenance.json `git.commit` correctly = 665e910… (HEAD before
this commit), not the parent OneDrive repo's commit
- All headline numbers unchanged (verify.py: ALL 5 CHECKS PASSED)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce lag

P3 — README DB table count was wrong

README said "four tables" (users, posts, treatment, treatment_reports) but
the DB actually has seven: those four plus extraction_runs (used by
provenance), plus conditions and user_profiles (both empty, reserved for
future extensions). Reviewers inspecting the DB directly would have been
surprised. Replaced the prose count with a full inventory table showing
every table, its row count at the published SHA, and its role.

P3 — committed provenance.json lagged HEAD

The committed provenance.json recorded git.commit = 665e910 (the prior PR
commit) while HEAD was 0e1f187 — an artifact of the chicken-and-egg
constraint: the build runs at commit N, records git.commit = N, and gets
committed as commit N+1. The committed artifact always points at its own
parent commit. Two changes:

- Regenerated outputs at current HEAD so the lag drops from 2 commits to
  exactly 1 (parent). provenance.json now records 0e1f187 (the immediate
  parent of this fix commit).
- Added an explicit README subsection under Notes on output/provenance.json
  explaining the 1-commit lag, naming the reviewer-regenerated file as
  authoritative, and documenting git.commit = "outside-git" semantics
  introduced in the previous commit.

Verification:

- All headline numbers unchanged
- verify.py: ALL 5 CHECKS PASSED at HEAD before this commit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original paper's data-sources table included a Total-posts column
(corpus posts in each drug's window, the denominator the substring +
LLM extraction ran against). The current package's Table 2 dropped it.
Adding it back so reviewers can see the analysis denominator alongside
the per-drug user/report counts.
Definition (matches the per-drug analysis filters):
COUNT(*) FROM posts
WHERE post_date IS NOT NULL
AND post_date < <window_end_exclusive>
AND user_id != 'deleted'
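The three filters in that definition can be exercised end-to-end on a toy in-memory DB. This is an illustration only; the table here has just the columns the predicate touches, and the integer timestamps stand in for the real epoch values:

```python
import sqlite3

# Toy DB exercising the Total-posts filters: non-NULL post_date,
# strictly before the window end, and no deleted users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, user_id TEXT, post_date INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("p1", "u1", 100),        # counted
        ("p2", "deleted", 100),   # excluded: deleted user
        ("p3", "u2", None),       # excluded: NULL post_date
        ("p4", "u3", 200),        # excluded: on/after window end (strict <)
    ],
)
window_end_exclusive = 200
(total,) = conn.execute(
    """
    SELECT COUNT(*) FROM posts
    WHERE post_date IS NOT NULL
      AND post_date < ?
      AND user_id != 'deleted'
    """,
    (window_end_exclusive,),
).fetchone()
```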
Values for the published SHA c50fcacd…:
famotidine 139,075 (window: 2020-07-24 → 2021-06-06)
loratadine 139,075 (same window)
prednisone 218,822 (2020-07-24 → 2021-10-25)
naltrexone 448,114 (2020-07-24 → 2022-07-02)
paxlovid 678,923 (2020-07-24 → 2022-12-31, end-2022 cap)
colchicine 678,923 (same as paxlovid, capped)
These are larger than the original paper's narrower-window post counts
(famotidine 88,077, naltrexone 244,036, etc.) because this PR switched
to one corpus-wide pull — already documented in Provenance.
The added footnote explicitly spells out what Total posts counts and
why it differs from "raw post-row count" (the deleted-user and
NULL-post_date exclusions match the analysis filters, so the column
reads as the actual analysis denominator rather than a rawer count
that would mismatch downstream rows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Table 3 formatter used to bin p-values into one of three buckets:
- p < 0.0001 → "<0.0001" (lossy — discards the actual order of magnitude;
  loratadine and famotidine both rendered as the same string despite
  differing by 8 orders of magnitude)
- p < 0.001 → 4 decimal places (e.g. "0.0001")
- p ≥ 0.001 → 3 decimal places (e.g. "0.284")

New behavior:
- p < 0.001 → 3-sig-fig scientific notation (e.g. "3.56e-17", "1.95e-9",
  "1.36e-4")
- p ≥ 0.001 → 3 decimal places (unchanged)
Single-digit exponents have their leading zero stripped (1.95e-9 not
1.95e-09) for readability. Both Table 3 in the executed notebook and
the README's Expected Output table now use the same formatter and
show the same values.
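A sketch of a formatter matching the behavior described above; the name `format_p` is assumed, and the zero-stripping uses `int()` on the exponent rather than whatever the package actually does:

```python
def format_p(p: float) -> str:
    """p >= 0.001: three decimals; p < 0.001: 3-sig-fig scientific notation."""
    if p >= 0.001:
        return f"{p:.3f}"
    mantissa, exponent = f"{p:.2e}".split("e")   # e.g. "1.95", "-09"
    return f"{mantissa}e{int(exponent)}"          # int() strips the zero padding
```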
Verification:
- verify.py: ALL 5 CHECKS PASSED (formatter change is
presentation-only; numeric assertions unchanged)
- Rendered Table 3 p-values:
loratadine 1.95e-9
famotidine 3.56e-17
naltrexone 1.36e-4
paxlovid 0.284
colchicine 0.530
prednisone 0.666
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces a comprehensive reproducibility package for the RCT historical validation analysis, including documentation, audit logs, and automated build scripts. Key changes include fixing a parent_id prefix issue in the Reddit import logic, adding cycle detection to the extraction pipeline, and providing a suite of verification tools to ensure data integrity and numerical consistency. Feedback focuses on improving code maintainability by reusing existing database connections and constants, removing redundant implementations of statistical functions, and consolidating duplicated graph traversal logic into shared utilities.
    DB_DIR = Path(DB_PATH).parent
    COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"
    combined_conn = sqlite3.connect(COMBINED_DB.as_posix())
The generated notebook code creates a redundant database connection combined_conn. The setup block (injected via build_notebook) already establishes a global conn object using the same database file. It is better to reuse the existing conn to avoid confusion and resource overhead in the executed notebook.
Suggested change:

    - DB_DIR = Path(DB_PATH).parent
    - COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"
    - combined_conn = sqlite3.connect(COMBINED_DB.as_posix())
    + # Reusing the 'conn' established in the setup block.
    + combined_conn = conn
    NOTEBOOK_DB_RESOLVER = '''import os
    from pathlib import Path

    _DB_FILENAME = "historical_validation_2020-07_to_2022-12.db"
    _PACKAGE_MARKERS = ("_build_paper_figures.py", "verify.py")
    _ENV_VAR = "RCT_DB_PATH"
    def wilson_ci(k, n, z=1.96):
        """Wilson score confidence interval for a proportion."""
        if n == 0:
            return 0.0, 0.0
        p = k / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denom
        return max(0, center - margin), min(1, center + margin)
The wilson_ci function is manually implemented here, but the reproducibility package already depends on statsmodels (as seen in requirements.txt and _build_paper_figures.py), which provides a more robust implementation via proportion_confint. This manual implementation is redundant and potentially less accurate than the library version.
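For illustration (hypothetical counts, not the package's code), the hand-rolled formula and `proportion_confint(..., method="wilson")` agree to within the rounding of z = 1.96:

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint


def wilson_ci(k, n, z=1.96):
    """Hand-rolled Wilson interval, mirroring the snippet under review."""
    if n == 0:
        return 0.0, 0.0
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denom
    return max(0, center - margin), min(1, center + margin)


# statsmodels' equivalent of the same interval:
lo_sm, hi_sm = proportion_confint(178, 231, alpha=0.05, method="wilson")
lo_hr, hi_hr = wilson_ci(178, 231)
```

The tiny residual difference comes from statsmodels using the exact normal quantile rather than the rounded 1.96.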
    """V10: re-derive the same per-drug stats the build does, compare to
    EXPECTED_OUTPUTS. Same logic as the build's V10 cell, but standalone."""
    from scipy.stats import binomtest
    from statsmodels.stats.proportion import proportion_confint as wilson  # noqa: F401
    def _detect_parent_cycles(id_to_parent: dict) -> None:
        """Raise ValueError if id_to_parent contains any cycle.

        The upstream-mentioned-drugs recursion below assumes id_to_parent is a
        forest (each node has at most one parent and chains terminate). A cycle
        in malformed imported data would cause the recursion to memoize on
        (eid, remaining) and still never terminate (cache stores final values,
        not in-progress visits). Detect once up front and fail loudly.

        Iterative DFS with white/gray/black coloring; O(N).
        """
        UNVISITED, VISITING, DONE = 0, 1, 2
        color = {}
        for start in id_to_parent:
            if color.get(start, UNVISITED) != UNVISITED:
                continue
            stack = [start]
            path = []
            while stack:
                node = stack[-1]
                c = color.get(node, UNVISITED)
                if c == UNVISITED:
                    color[node] = VISITING
                    path.append(node)
                    parent = id_to_parent.get(node)
                    if parent and parent in id_to_parent:
                        pcol = color.get(parent, UNVISITED)
                        if pcol == VISITING:
                            # back-edge into the active path -> cycle
                            i = path.index(parent)
                            raise ValueError(
                                "parent_id cycle detected: "
                                + " -> ".join(path[i:] + [parent])
                            )
                        if pcol == UNVISITED:
                            stack.append(parent)
                            continue
                    # parent missing, NULL, or already DONE -> this node is finished
                    color[node] = DONE
                    if path and path[-1] == node:
                        path.pop()
                    stack.pop()
                else:
                    if path and path[-1] == node:
                        path.pop()
                    stack.pop()
                    color[node] = DONE
The cycle detection logic in _detect_parent_cycles is implemented using an iterative DFS. While correct, this logic is duplicated in docs/RCT_historical_validation/verify.py. Consider moving this to a shared utility module (e.g., src/utilities/graph.py) to improve maintainability, especially since both locations are part of the same repository.
H4 (substantive correctness, classify.py alias-resolution gap)
When the pipeline runs with --drug but without --drug-file, AND
skip_canonicalize is set (or the DB hasn't been canonicalized yet),
synonyms_for is empty. Aliased mentions like "LDN" or "low dose
naltrexone" still get classified — they pass the target_aliases
filter — but the system_prompt called for them with
synonyms_for.get(alias) = None, so the prompt lacks the canonical-
drug context. Less obvious aliases (e.g., "low dose 3mg pills")
could be misclassified by a model that doesn't know the alias
refers to naltrexone.
Fix: when target_aliases is populated (i.e., --drug mode), use the
resolved alias set (minus the matched drug itself) as the synonym
hint for system_prompt instead of synonyms_for.get(drug). Falls
back to synonyms_for.get(drug) when target_aliases is None
(whole-corpus runs).
Doesn't affect the published numbers — extraction was done with
canonicalize ON, so synonyms_for was populated at run time. Future
reviewers re-running cold benefit.
M3 (redundant DB connection in _build_paper_figures.py)
combined_conn = sqlite3.connect(...) opened a second connection to
the same DB the setup cell already opened as `conn`. Replaced with
`combined_conn = conn`.
M5 (redundant wilson_ci implementation in build_notebook.py)
Hand-rolled Wilson CI replaced with statsmodels.proportion_confint
(already a transitive dep). Same numerical result; less code to
defend.
M6 (unused import in verify.py)
Dropped the dead `from statsmodels.stats.proportion import
proportion_confint as wilson` and its `# noqa: F401` muzzle.
M2 (utf-8 encoding on read_text)
run_sentiment_pipeline.py was already fixed in earlier commits.
Two more read_text() calls without encoding remained (classify.py
prefilter cache, utilities/__init__.py drug-aliases cache); added
encoding="utf-8" to both for cross-platform robustness.
M1 (BOM in build_notebook.py)
Already stripped by an earlier rewrite. Verified no BOM remains.
M4 (DRY: notebook resolver constants vs paths.py)
The inlined notebook resolver in _build_paper_figures.py duplicated
DB_FILENAME / PACKAGE_MARKERS / RCT_DB_PATH literals already
defined in paths.py — drift risk. Now interpolates those constants
from `paths.DB_FILENAME` etc. into the resolver template at build
time. Single source of truth; verified the rebuilt notebook has
the correct values inlined.
M7 (DRY: parent-cycle detection)
src/pipeline/extract.py:_detect_parent_cycles and
docs/RCT_historical_validation/verify.py:_detect_cycles were two
copies of the same iterative DFS. Extracted the shared
cycle-finding logic into a new src/utilities/graph.py:
find_parent_cycles (returns all cycles); extract.py now wraps it
for fail-fast raise. verify.py keeps its inline copy on purpose —
the reproducibility package can't import from src/ — with a
docstring noting the duplication is deliberate and instructing
future maintainers to mirror changes.
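A sketch of what the shared helper could look like. The name `find_parent_cycles` and its return-all-cycles contract come from the commit message; the body below is an assumed adaptation of the fail-fast DFS quoted earlier, collecting cycles instead of raising on the first one:

```python
def find_parent_cycles(id_to_parent: dict) -> list:
    """Return every cycle in a child -> parent map as a list of node lists."""
    UNVISITED, VISITING, DONE = 0, 1, 2
    color, cycles = {}, []
    for start in id_to_parent:
        if color.get(start, UNVISITED) != UNVISITED:
            continue
        stack, path = [start], []
        while stack:
            node = stack[-1]
            if color.get(node, UNVISITED) == UNVISITED:
                color[node] = VISITING
                path.append(node)
                parent = id_to_parent.get(node)
                if parent and parent in id_to_parent:
                    pcol = color.get(parent, UNVISITED)
                    if pcol == VISITING:
                        # back-edge into the active path -> record the cycle
                        i = path.index(parent)
                        cycles.append(path[i:] + [parent])
                    elif pcol == UNVISITED:
                        stack.append(parent)
                        continue
                # parent missing, NULL, or already finished: node is done
                color[node] = DONE
                path.pop()
                stack.pop()
            else:
                # returning to a node after its parent chain finished
                if path and path[-1] == node:
                    path.pop()
                stack.pop()
                color[node] = DONE
    return cycles
```

extract.py's fail-fast wrapper then reduces to raising if the returned list is non-empty.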
False-positive "critical" findings dismissed (no fix)
Three Gemini findings claimed `{{...}}` in build_notebook.py's
SETUP_CODE template would create unhashable set-of-dict TypeErrors.
These are intentional Python format-string escapes (the whole
string is passed to .format(db_path_block=...)); the rendered
notebook code has correct single-brace literals. No fix needed.
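A minimal demonstration of why the doubled braces are safe (toy template, not the package's SETUP_CODE):

```python
# Inside a str.format template, doubled braces are escapes for literal braces,
# not set/dict syntax; only single-brace fields are substituted.
template = "cfg = {{'db': '{db_path}'}}"
rendered = template.format(db_path="data/x.db")
# rendered == "cfg = {'db': 'data/x.db'}"
```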
Verification
- verify.py: ALL 5 CHECKS PASSED
- Full build: clean, all numbers unchanged
- Inlined notebook resolver constants confirmed pulled from paths.py:
_DB_FILENAME = "historical_validation_2020-07_to_2022-12.db"
_PACKAGE_MARKERS = ('_build_paper_figures.py', 'verify.py')
_ENV_VAR = "RCT_DB_PATH"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/gemini review
Code Review
This pull request establishes a reproducibility package for the RCT historical validation analysis, providing the necessary scripts, documentation, and audit trails to regenerate paper figures and tables. Significant improvements were made to the data import pipeline, including a fix for Reddit ID prefix handling to preserve thread structure and the addition of cycle detection. The reviewer feedback highlights the need for stricter determinism in data extraction and deduplication by adding explicit ordering to SQL queries and sort logic, alongside a recommendation to specify minimum versions for dependencies like Scipy to ensure environment compatibility.
    WHERE lower(t.canonical_name) = ?
      AND p.post_date IS NOT NULL
      AND p.post_date < ?
      AND p.user_id != 'deleted'
    ''', (drug, window_end_exclusive_ts)).fetchall()
The SQL query lacks an ORDER BY clause. Since the deduplication logic in dedup_recent_then_strength (line 351) is order-dependent for ties (where both post_date and signal_strength are identical), the results may be non-deterministic depending on the database's internal row ordering. Adding a unique identifier like tr.post_id to the ordering ensures absolute reproducibility, which is critical for this package.
Suggested change:

      WHERE lower(t.canonical_name) = ?
        AND p.post_date IS NOT NULL
        AND p.post_date < ?
        AND p.user_id != 'deleted'
    + ORDER BY tr.post_id
      ''', (drug, window_end_exclusive_ts)).fetchall()
    WHERE lower(t.canonical_name) = ?
      AND p.post_date IS NOT NULL
      AND p.post_date < ?
      AND p.user_id != 'deleted'
    """,
    (drug, cutoff_ts),
    ).fetchall()
The SQL query lacks an ORDER BY clause, which can lead to non-deterministic results during deduplication if multiple reports share the same timestamp and signal strength. Adding ORDER BY tr.post_id ensures consistent behavior across different environments and database snapshots.
Suggested change:

      WHERE lower(t.canonical_name) = ?
        AND p.post_date IS NOT NULL
        AND p.post_date < ?
        AND p.user_id != 'deleted'
    + ORDER BY tr.post_id
      """,
      (drug, cutoff_ts),
      ).fetchall()

    df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
                        ascending=[True, True, False, False])
To ensure absolute determinism in deduplication, post_id should be included as a final tie-breaker in the sort order. This prevents drop_duplicates from picking an arbitrary row when all other sorted columns are identical, ensuring the generated CSVs are bit-identical across runs.
Suggested change:

    - df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
    -                     ascending=[True, True, False, False])
    + df = df.sort_values(["user_id", "drug", "post_date", "sig_rank", "post_id"],
    +                     ascending=[True, True, False, False, False])
    pandas
    matplotlib
    seaborn
    scipy
- This file is missing package versions. That means this will fail to build in a few years as packages get updated.
- We should be using uv and pyproject.toml.
uv builds are more reproducible because uv.lock specifies the specific package versions actually installed, along with the python version desired.
Todo: take the requirements from this file and move them to pyproject.toml, run uv sync, then update the docs to remove mention of this file.
…06-06
The Glynne et al. medRxiv preprint went live on 2021-06-06 (URL
medrxiv.org/content/10.1101/2021.06.06.21258272v1). The previous
cutoff 2021-06-07 (strict-<) was off-by-one: it INCLUDED posts made
on 2021-06-06 itself, after the preprint was public. That's a small
but real contamination window — the methodology paragraph promises
"strictly before midnight UTC of the comparator paper's publication
date," and for Glynne the first-public-availability date is the
preprint, not the day after.
Every other drug in the panel uses cutoff = pub_date (strict-< then
excludes the pub date itself). Glynne is now consistent with that
convention.
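The convention can be pinned down in a couple of lines (the helper name `pre_publication` is hypothetical; the date is the Glynne preprint date from this commit):

```python
from datetime import datetime, timezone

# Cutoff = midnight UTC of the comparator paper's first public availability,
# compared with strict <. For Glynne et al., that is the 2021-06-06 preprint.
GLYNNE_CUTOFF = datetime(2021, 6, 6, tzinfo=timezone.utc)


def pre_publication(post_ts: datetime, cutoff: datetime) -> bool:
    """True iff the post falls strictly before the cutoff midnight."""
    return post_ts < cutoff
```

Under this rule a post late on 2021-06-05 UTC is in-window, while any post made on 2021-06-06 itself, after the preprint went live, is excluded.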
Empirical effect: 3 treatment_report rows on 2021-06-06 (1 famotidine,
2 loratadine) are now excluded.
famotidine user 3cc039f2... had no other pre-preprint reports for
the drug (10 lifetime mentions, the 2021-06-06 one was the first).
Drops from analysis: n_famotidine 232 -> 231 (-1 positive user).
loratadine user 336f9e181f85... had only one lifetime loratadine
report, on 2021-06-06. Drops: n_loratadine 90 -> 89 (-1 positive user).
loratadine user 93d34b804d58... had 3 earlier pre-preprint reports
(Mar, Apr, May 2021). Stays in analysis under most-recent-rule
fallback to May 28; no effect on n.
New numbers (verified by verify.py, EXPECTED_OUTPUTS updated in both
_build_paper_figures.py and verify.py):
famotidine: n=231 pos=178 pos_pct=77.056% p=5.51e-17
(was n=232 pos=179 pos_pct=77.155% p=3.57e-17)
loratadine: n= 89 pos= 72 pos_pct=80.899% p=3.17e-9
(was n= 90 pos= 73 pos_pct=81.111% p=1.95e-9)
Headline pattern unchanged; both still extreme-low p-values, CIs
still mutually exclusive from non-responder CIs, all 6 trial-outcome
comparisons still match. Tightening makes the methodology paragraph
self-consistent rather than off-by-one.
Files updated:
- DRUG_CUTOFFS in _build_paper_figures.py, verify.py,
scripts/dedup_sample_audit.py, scripts/dump_per_drug_csvs.py
- EXPECTED_OUTPUTS in _build_paper_figures.py and verify.py
- README: pre-publication-cutoffs table, window-verification
table (MIN/MAX/reports for famotidine + loratadine refreshed
against the new cutoff), Expected Output table
- Regenerated DEDUP_AUDIT.md (now reflects fam/lor raw=692/188
instead of 693/190), DRUG_ALIASES.md
- Regenerated paper_figures.html, .ipynb, .executed.ipynb,
figure0.png, figure1.png, provenance.json
Also corrected stale citation form: "Glynne et al. 2021" -> "Glynne
et al. 2022" everywhere (the journal version is 2022; the preprint
is 2021-06-06, but we cite the journal). Aligned with Tables 2/3/4
in the paper draft.
verify.py: ALL 5 CHECKS PASSED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t plot) -> Figure 2
The script previously labelled the per-drug sentiment-breakdown
stacked-bar chart as "Figure 0" and the responder/non-responder
forest plot as "Figure 1". Publication convention is to number from
1; the existing "Figure 0" reads as a draft artifact. Renumbered so
that:
- Figure 1 = per-drug full sentiment breakdown (stacked bar)
- Figure 2 = forest plot (responders vs non-responders, 95% Wilson CIs)
Cascade:
- All "Figure 0" -> "Figure 1" (build script titles, captions,
intro markdown describing the notebook artifacts)
- All "Figure 1" -> "Figure 2" (where Figure 1 was previously the
forest plot; this includes body-text references in markdown
cells, comments, and references in README)
- PNG filenames: figure0.png -> figure1.png (stacked bar);
figure1.png -> figure2.png (forest plot). The old figure0.png
was removed from tracking.
- README updates:
* Intro: "Reproduces Figure 1, Table 2, and Table 3" -> "Figure 1,
Figure 2, Table 2, and Table 3"
* Output-files list: lists both figure1.png and figure2.png with
what each shows
* "What's reproduced" table: split the old Figure 1 row into
Figure 1 (stacked bar) + Figure 2 (forest plot), with
Figure 2 carrying the responder-vs-non-responder narrative
* Methodology references to "Figure 1" updated to "Figure 2"
where they mean the forest plot (error bars, n column,
Wilson CI step, deleted-user-filter propagation)
- Docstring of the build script updated to describe both figures.
Verification:
- verify.py: ALL 5 CHECKS PASSED (renumber is presentation-only;
numeric assertions unchanged)
- Rebuild produces output/figure1.png (157 KB, stacked bar) and
output/figure2.png (128 KB, forest plot); old figure0.png removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    - `loradatine` *(misspelling)*
    - `loratadinе` *(misspelling)*
    - `claritin redi-tabs` *(multi-word)*
    - `loratab`
https://www.google.com/search?q=loratab
This looks like a different drug.
    lines.append("# Drug aliases used in extraction\n")
    lines.append(f"**Generated:** {now} from `{args.db.as_posix()}`")
    lines.append("**Generator:** `scripts/dump_drug_aliases.py`")
    lines.append("")
If we add something like this, I bet Gemini will catch the errors:
    lines.append("Reviewers: Check that all aliases are correct by ensuring they actually refer to the same treatment. ")
Summary
Adds a self-contained reproducibility package for the paper's core claim: that pre-publication community sentiment on r/covidlonghaulers predicted clinical trial outcomes for 6 Long COVID drugs.
The package lives at docs/RCT_historical_validation/ and reproduces Figure 1, Table 2, and Table 3 from "A Methodology for Gathering Real-World Evidence at Scale: Using NLP-Extracted Community Treatment Reports to Predict Clinical Trial Outcomes."

What's inside

- output/paper_figures.html
- output/figure1.png
- output/figure0.png
- output/paper_figures_executed.ipynb
- output/provenance.json
- _build_paper_figures.py
- verify.py
- paths.py
- build_notebook.py
- DRUG_ALIASES.md / THREAD_AUDIT.md / DEDUP_AUDIT.md / ARCTIC_SHIFT_PROVENANCE.md
- requirements.txt
- README.md
For each of 6 drugs, we count what proportion of r/covidlonghaulers users reported a positive experience — using only posts from before the relevant trial was published. Per-(user, drug) deduplication: most recent post wins (full UTC timestamp); signal_strength acts as a near-vestigial tiebreaker on exact-timestamp ties. Result:
All 6 match: drugs the community clearly favored → trials found effective. Drugs the community was split on → trials found no effect.
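A toy sketch of the deduplication rule (hypothetical rows; the column names and sort follow the audit-script snippets quoted in the review comments above):

```python
import pandas as pd

# Toy rows: user u1 has an older negative report and a newer positive one.
reports = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2"],
    "drug":      ["famotidine"] * 3,
    "post_date": pd.to_datetime(
        ["2021-03-01 10:00", "2021-05-02 09:30", "2021-04-01 00:00"], utc=True),
    "sentiment": ["negative", "positive", "positive"],
    "sig_rank":  [3, 1, 2],
})

# Most recent post wins; sig_rank only breaks exact-timestamp ties.
deduped = (reports
           .sort_values(["user_id", "drug", "post_date", "sig_rank"],
                        ascending=[True, True, False, False])
           .drop_duplicates(["user_id", "drug"], keep="first"))
```

Note that u1's newer positive report survives even though its sig_rank is lower, which is exactly the "near-vestigial tiebreaker" behavior described above.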
How to use
Just viewing: Clone the repo → open docs/RCT_historical_validation/output/paper_figures.html in a browser.

Full re-run: Download the single ~314 MB analysis DB from S3 into data/, then run the build. verify.py runs every build-time assertion (DB integrity, SHA-256 match, per-drug window, thread reconstruction, expected outputs) and exits 0/1. Run it immediately after the download to catch HTTP errors at the download step rather than mid-build.

The build script and notebook find the DB through the canonical resolver in paths.py — they work from any cwd (workspace root, package directory, anywhere). Override with RCT_DB_PATH=/path/to/db if needed.

Data access

A single self-sufficient SQLite DB (historical_validation_2020-07_to_2022-12.db, 326 MB, SHA-256 c50fcacd…) is publicly hosted on S3, along with the raw Reddit JSON and per-drug merged + deduped CSVs. All download links are in the README. The bucket policy grants anonymous s3:GetObject on the scientific_validation/rct_historical/* prefix.

Reviewer findings addressed in this PR
Eli's blocking review (resolved):

- ✅ DEDUP_AUDIT.md — script's output template fixed; "identical file" softened to acknowledge the regenerated Generated timestamp; audit regenerated.
- ✅ DEDUP_AUDIT.md — regenerated against the package-local posix path.

Sonnet's deep-pass findings (resolved):

4. ✅ provenance.json dirty=true — README "Notes on output/provenance.json" subsection documents the chicken-and-egg constraint and what reviewers should do.
5. ✅ DB built at extraction commit 16c2569b… ≠ HEAD — README subsection enumerates the diff: SQL predicate reformalization, deleted-user exclusion, colchicine date correction, audit/verify scripts, parent_id prefix fix; explicitly notes no alias-resolution or sentiment-classification logic was changed.
6. ✅ scripts/dump_per_drug_csvs.py default DB path mismatch — fixed by routing through paths.py.
7. ✅ README "every comparison reaches p ≤ 0.0001" — corrected to "p < 0.001" (naltrexone is 1.358e-4).
8. ✅ Windows backslashes + undocumented master_gap in provenance.json — provenance writer now normalizes path separators; README subsection explains master_gap was the original Windows extraction working directory.

Codex's reproduction-pass findings (resolved):

9. ✅ Provenance could record an unrelated repo's commit — _git_metadata() now anchors git to the package root and verifies the script file is tracked there. Records outside-git (visibly broken) instead of an unrelated SHA (silently misleading) when the package is extracted outside its own checkout.
10. ✅ curl -L exits 0 on HTTP errors — README now uses curl -fL with a comment, and explicitly tells reviewers to run python verify.py immediately after the download.
11. ✅ Alias sensitivity claim was unsupported — dump_drug_aliases.py text softened. Now states explicitly that no per-alias quantitative sensitivity analysis has been run; producing one is documented as the open methodology task before publication.

Bonus parallel-bug catches in adjacent audit scripts (dump_drug_aliases.py, thread_audit.py): same false "produces an identical file" claim and Windows-backslash DB-path renderings → both fixed; both audits regenerated.

Test plan

- Open output/paper_figures.html in a browser — Figure 1 forest plot, Table 2, Table 3 render correctly
- Run python verify.py from any cwd — should report 5/5 PASSED and the resolution-mode line should reflect how the DB was located
- Run python _build_paper_figures.py → numbers identical to pre-built
- When extracted outside its own checkout, provenance.json git.commit should record outside-git rather than an unrelated repo's SHA

🤖 Generated with Claude Code