Skip to content

Add RCT historical validation reproducibility package#37

Open
Airwhale wants to merge 37 commits into
mainfrom
shaun/rct-historical-validation
Open

Add RCT historical validation reproducibility package#37
Airwhale wants to merge 37 commits into
mainfrom
shaun/rct-historical-validation

Conversation

@Airwhale
Copy link
Copy Markdown
Collaborator

@Airwhale Airwhale commented Apr 30, 2026

Summary

Adds a self-contained reproducibility package for the paper's core claim: that pre-publication community sentiment on r/covidlonghaulers predicted clinical trial outcomes for 6 Long COVID drugs.

The package lives at docs/RCT_historical_validation/ and reproduces Figure 1, Table 2, and Table 3 from "A Methodology for Gathering Real-World Evidence at Scale: Using NLP-Extracted Community Treatment Reports to Predict Clinical Trial Outcomes."

What's inside

File Purpose
output/paper_figures.html Open in any browser to see all figures — no setup required
output/figure1.png Figure 1 standalone (forest plot: responders + non-responders, 95% Wilson CIs, 6 drugs)
output/figure0.png Figure 0 standalone (full sentiment breakdown)
output/paper_figures_executed.ipynb Executed Jupyter notebook with all outputs
output/provenance.json Build manifest (git commit, DB SHA-256, model env, full extraction-runs table)
_build_paper_figures.py Main script — reads the analysis SQLite DB, runs the analysis, generates notebook + HTML + PNG
verify.py One-command reproducibility gate (PASS/FAIL summary, exits 0/1)
paths.py Canonical DB-path resolver (env-var override + marker-based anchor walk-up)
build_notebook.py Helper for notebook creation/execution/export
DRUG_ALIASES.md / THREAD_AUDIT.md / DEDUP_AUDIT.md / ARCTIC_SHIFT_PROVENANCE.md Reviewer-facing audit artifacts
requirements.txt 10 Python packages
README.md Full methodology walkthrough, download links, expected output for spot-checking

What the analysis shows

For each of 6 drugs, we count what proportion of r/covidlonghaulers users reported a positive experience — using only posts from before the relevant trial published. Per-(user, drug) deduplication: most recent post wins (full UTC timestamp); signal strength acts as a near-vestigial tiebreaker on exact-timestamp ties. Result:

Drug n % Positive p-value Trial Result
loratadine 90 81.1% < 1e-9 ✅ positive
famotidine 232 77.2% < 1e-16 ✅ positive
naltrexone 154 65.6% 1.36e-4 ✅ positive
colchicine 91 53.8% 0.530 ⬜ null
paxlovid 196 54.1% 0.284 ⬜ null
prednisone 343 48.7% 0.666 ⬜ null

All 6 match: drugs the community clearly favored → trials found effective. Drugs the community was split on → trials found no effect.

How to use

Just viewing: Clone the repo → open docs/RCT_historical_validation/output/paper_figures.html in a browser.

Full re-run: Download the single ~314 MB analysis DB from S3 into data/, then run the build:

cd docs/RCT_historical_validation/
pip install -r requirements.txt

mkdir -p data
curl -fL -o data/historical_validation_2020-07_to_2022-12.db \
    https://patientpunk.s3.amazonaws.com/scientific_validation/rct_historical/processed/historical_validation_2020-07_to_2022-12.db

python verify.py            # one-shot PASS/FAIL gate (~10s)
python _build_paper_figures.py

verify.py runs every build-time assertion (DB integrity, SHA-256 match, per-drug window, thread reconstruction, expected outputs) and exits 0/1. Run it immediately after the download to catch HTTP errors at the download step rather than mid-build.

The build script and notebook find the DB through the canonical resolver in paths.py — they work from any cwd (workspace root, package directory, anywhere). Override with RCT_DB_PATH=/path/to/db if needed.

Data access

A single self-sufficient SQLite DB (historical_validation_2020-07_to_2022-12.db, 326 MB, SHA-256 c50fcacd…) is publicly hosted on S3, along with the raw Reddit JSON and per-drug merged + deduped CSVs. All download links in the README. Bucket policy grants anonymous s3:GetObject on the scientific_validation/rct_historical/* prefix.

Reviewer findings addressed in this PR

Eli's blocking review (resolved):

  1. Wrong script name + false "identical file" claim in DEDUP_AUDIT.md — script's output template fixed; "identical file" softened to acknowledge the regenerated Generated timestamp; audit regenerated.
  2. "Same date" wording vs full-timestamp implementation — wording reconciled across all four files (script docstring, README, generated audit body, build script docstring) to "full UTC timestamp" / "exact-timestamp ties." Implementation unchanged; clarified that the signal-strength tiebreaker is near-vestigial in practice.
  3. Stale Windows-backslash DB path in DEDUP_AUDIT.md — regenerated against the package-local posix path.

Sonnet's deep-pass findings (resolved):
4. ✅ provenance.json dirty=true — README "Notes on output/provenance.json" subsection documents the chicken-and-egg constraint and what reviewers should do.
5. ✅ DB built at extraction commit 16c2569b… ≠ HEAD — README subsection enumerates the diff: SQL predicate reformalization, deleted-user exclusion, colchicine date correction, audit/verify scripts, parent_id prefix fix; explicitly notes no alias-resolution or sentiment-classification logic was changed.
6. ✅ scripts/dump_per_drug_csvs.py default DB path mismatch — fixed by routing through paths.py.
7. ✅ README "every comparison reaches p ≤ 0.0001" — corrected to "p < 0.001" (naltrexone is 1.358e-4).
8. ✅ Windows backslashes + undocumented master_gap in provenance.json — provenance writer now normalizes path separators; README subsection explains master_gap was the original Windows extraction working directory.

Codex's reproduction-pass findings (resolved):
9. ✅ Provenance could record an unrelated repo's commit_git_metadata() now anchors git to the package root and verifies the script file is tracked there. Records outside-git (visibly broken) instead of an unrelated SHA (silently misleading) when the package is extracted outside its own checkout.
10. ✅ curl -L exits 0 on HTTP errors — README now uses curl -fL with a comment, and explicitly tells reviewers to run python verify.py immediately after the download.
11. ✅ Alias sensitivity claim was unsupporteddump_drug_aliases.py text softened. Now states explicitly that no per-alias quantitative sensitivity analysis has been run; producing one is documented as the open methodology task before publication.

Bonus parallel-bug catches in adjacent audit scripts (dump_drug_aliases.py, thread_audit.py): same false "produces an identical file" claim and Windows-backslash DB-path renderings → both fixed; both audits regenerated.

Test plan

  • Open output/paper_figures.html in a browser — Figure 1 forest plot, Table 2, Table 3 render correctly
  • Spot-check Table 3 values against the expected output table in README (famotidine 232, loratadine 90, naltrexone 154, paxlovid 196, prednisone 343, colchicine 91)
  • Download the DB via the README curl command — should succeed (200 OK) and the file should be ~326 MB
  • python verify.py from any cwd — should report 5/5 PASSED and the resolution-mode line should reflect how the DB was located
  • (Optional) Full re-run: python _build_paper_figures.py → numbers identical to pre-built
  • (Optional) Test outside-git handling: copy the package to a scratch directory, run the build → provenance.json git.commit should record outside-git rather than an unrelated repo's SHA

🤖 Generated with Claude Code

polya-b and others added 5 commits April 30, 2026 11:07
Self-contained folder for reproducing Figure 1, Table 2, and Table 3
from the paper. Includes build script, notebook helper, README with
plain-language methodology, and requirements. Databases and outputs
are gitignored — share via OneDrive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-add HTML, notebooks, and figure so reviewers can view
results without the databases. README now explains how to
view pre-built outputs vs re-run from scratch, and directs
users to contact authors for the underlying data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a reproducibility package for RCT historical validation, including documentation, build scripts, and pre-built Jupyter notebooks. It also enhances the sentiment pipeline to support hand-curated drug alias lists via a new --drug-file argument. My feedback focuses on a portability issue where the notebook build process hardcodes absolute local file paths into the generated .ipynb files, which prevents them from being executed on other machines without manual modification.

Comment thread docs/RCT_historical_validation/build_notebook.py Outdated
Comment thread docs/RCT_historical_validation/output/paper_figures.ipynb Outdated
Comment thread docs/RCT_historical_validation/output/paper_figures_executed.ipynb Outdated
Addresses Gemini review: pre-built notebooks contained
C:/Users/scgee/... paths. Now uses ../data/ relative paths
so notebooks are portable across machines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

/gemini review

@Airwhale Airwhale requested review from Ely-S and polya-b April 30, 2026 18:56
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a reproducibility package for RCT historical validation, featuring comprehensive documentation, analysis scripts, and pre-built Jupyter notebooks. Additionally, it enhances the sentiment pipeline by allowing drug aliases to be provided via a text file and centralizing alias resolution logic. Feedback identifies an inconsistency in alias handling during classification, a non-standard Byte Order Mark in a new script, and recommends explicitly specifying UTF-8 encoding when reading the drug aliases file.

Comment thread src/pipeline/classify.py Outdated
Comment thread docs/RCT_historical_validation/build_notebook.py Outdated
Comment thread src/run_sentiment_pipeline.py Outdated
- classify.py: use resolve_aliases() for consistent alias handling
  with extract/canonicalize (fixes missed aliases when skipping canon)
- build_notebook.py: strip UTF-8 BOM
- run_sentiment_pipeline.py: explicit encoding="utf-8" on file read

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a reproducibility package for RCT historical validation, including scripts to generate figures and tables from SQLite databases. It also refactors the drug alias resolution logic in the sentiment pipeline to support both LLM-based lookups and manual alias files. I have identified critical syntax errors in the build_notebook.py file where double curly braces were incorrectly used for dictionary and set definitions, which will cause runtime TypeErrors.

conn = sqlite3.connect(DB_PATH)

# ── Sentiment mapping ──
SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The use of double curly braces {{...}} is incorrect for defining a dictionary. This syntax attempts to create a set containing a dictionary, which will raise a TypeError because dictionaries are not hashable. To define a dictionary, please use single curly braces {...}.

Suggested change
SENTIMENT_SCORE = {{"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}}
SENTIMENT_SCORE = {"positive": 1.0, "mixed": 0.5, "neutral": 0.0, "negative": -1.0}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Double braces are how you produce literal { and } in format strings. Doing this causes an error.

Comment thread docs/RCT_historical_validation/build_notebook.py
Comment thread docs/RCT_historical_validation/build_notebook.py
@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 1, 2026

Concern about the dedupelication logic:


The Deduplication Code

  SENT_RANK = {"positive": 3, "mixed": 2, "neutral": 1, "negative": 0}                                                                                                                                                                            
  SIG_RANK  = {"strong": 2, "moderate": 1, "weak": 0, ...}                                                                                                                                                                                          
                                                                                                                                                                                                                                                    
  for uid, drug, sent, sig, _d in rows:
      k = (SENT_RANK.get(sent, -1), SIG_RANK.get(sig, 0))                                                                                                                                                                                           
      key = (uid, drug)                                                                                                                                                                                                                             
      if key not in best or k > best[key][0]:
          best[key] = (k, sent, drug)                                                                                                                                                                                                               

Each (user, drug) pair keeps exactly one record — the one with the highest rank tuple (sentiment_rank, signal_strength_rank).


The Problem

The ranking is a peak-performance rule, not a summary. A user's final classification is determined by their single best report about a drug, ever.

Consider a user who posts the folloing naltrexone:

  │ "really helped me!"   │ positive  │ strong   │
  │ "not working anymore" │ negative  │ strong   │                                                                                                                                                                                                  
  │ "gave up on it"       │ negative  │ moderate │                                                                                                                                                                                                  

This would be counted as a positive outcome.

It's asymmetric in only one direction: a single positive report can override any number of negatives, but no number of negatives can override a single positive.

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 1, 2026

To make it easier to verify, can you modify the SQL that collects the posts for analysis to include

WHERE post_date < study_date

Shaun and others added 5 commits May 2, 2026 16:18
Re-runs Figure 1 / Table 2 / Table 3 with three substantive changes:

1. Data sources: adds master_gap_2020-07_to_2022-12.db, which classifies all
   six target drugs across the full corpus from r/covidlonghaulers inception
   (2020-07-24) through end of 2022. Built by running the pipeline once per
   drug with --drug, leveraging the substring-prefilter for cost efficiency.
   Existing per-drug DBs are still used; cross-DB rows are deduped on
   post_id with master_gap taking priority.

2. Cutoff dates: corrected to "day before earliest public release" using
   verified preprint and journal online dates. Glynne now uses 2021-06-06
   (medRxiv preprint) instead of 2021-10-01 (journal). Prednisone comparator
   updated to Utrero-Rico 2021. Naltrexone, paxlovid, colchicine cutoffs
   verified against publication metadata.

3. Per-(user, drug) dedup rule: replaces the previous "best-report" peak-rank
   rule (positive > mixed > neutral > negative; signal as tiebreaker) with
   a "most recent + signal-strength tiebreaker" rule. The new rule is
   symmetric on sentiment direction and addresses the asymmetry concern
   raised in code review (single positive could override any number of
   later non-positives).

Headline numbers are tighter and more defensible:
- Trial-positive drugs (loratadine, famotidine, naltrexone) all clear 50%
  with p < 0.0001.
- Trial-null drugs (colchicine, paxlovid, prednisone) all hover near 50%
  with no per-drug significance against the null.
- Prednisone, the most-debated case in the original analysis, settles at
  49.3% with the larger sample, exactly at chance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Table 3 previously combined the comparator paper, journal/preprint source,
and date into one parenthesized string per row. Splitting them into two
columns makes the per-drug comparator easier to scan and the publication
date explicit.

DRUG_CUTOFFS now stores three fields per drug:
  (cutoff_date, paper_short, source_date)

Figure 1 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the build script pulled classifications from five databases
(master_gap plus four older per-drug runs) and deduped on post_id at
analysis time. About 3% of relevant classifications lived only in the
older DBs and were missing from master_gap, so cross-DB merging was
required to reproduce the headline numbers.

To simplify reproduction, scripts/build_combined_db.py was added to copy
the missing rows into the master_gap DB and write out a new self-sufficient
SQLite file: historical_validation_2020-07_to_2022-12.db (314 MB). It
has been uploaded to S3 at:

  s3://patientpunk/scientific_validation/rct_historical/processed/
      historical_validation_2020-07_to_2022-12.db

(replacing the older master_gap_2020-07_to_2022-12.db).

Reviewers reproducing the paper now need to download a single 314 MB DB
instead of five DBs totalling ~580 MB. The build script reads from this
single DB; numbers are bit-for-bit identical to the prior multi-DB version.

The original per-drug DBs are still published on S3 as optional source
data so their contents can be independently verified against the combined
DB if desired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A few spots in the README still talked about cross-DB merging or pointed
at the old master_gap_*.json filename. Cleaned those up:

- Reproduce section now includes the curl step to download the single
  required DB.
- Per-drug CSV section no longer references the deleted
  scripts/merge_and_analyze_historical.py.
- Databases table reframes the older per-drug DBs as "Optional,
  transparency only" with a clear "Required" marker on the combined DB.
- Why-multiple-DBs paragraph rewritten to reflect the actual provenance
  (master pipeline run + small backfill from earlier per-drug runs).
- Renamed master_gap_2020-07_to_2022-12.json to
  historical_validation_2020-07_to_2022-12.json on S3 for naming
  consistency with the DB; README link updated.
- Step 1 now reads "Pull all reports per drug" from a single DB rather
  than "Merge all databases per drug".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Incorporates the two charts requested in PR #38 (Eli's working-files PR,
explicitly marked Do-Not-Merge). The PR description asked for two specific
items to be merged into our notebook; everything else in that PR is V&V
audit work and unrelated artifacts.

Figure 0 (NEW) — Full sentiment breakdown per drug:
  Stacked horizontal bar showing the four-way sentiment split
  (positive / mixed / neutral / negative) per drug, with each segment
  labeled by percentage and raw count. Y-axis labels include drug name +
  trial-direction tag. Plotted from our resp_df, so it inherits the
  combined-DB single-source workflow and the most-recent-plus-strength
  dedup rule.

Figure 1 — cosmetic upgrades only:
  - figsize 12x5.5 -> 13.5x6.8 (more breathing room, prints larger)
  - value-label fontsize 9 -> 11 (fontweight bold preserved)
  - y-tick labels 10 -> 12; xlabel 11 -> 12; title 12 -> 14
  - trial-direction tags switched from monospace to sans-serif at fontsize 11
  - title prefixed with "Figure 1 — " for symmetry with Figure 0

Numbers in Figure 1 are unchanged. The data source (combined DB) and
dedup rule are unchanged. Only styling and the new Figure 0 are added.

Other items from PR #38 deliberately not merged:
  - Eli's _gen_analysis.py / analysis.ipynb use the old multi-DB merge and
    old peak-rank dedup; adopting them would undo the methodology fixes
    landed in 2208e2d, ad7a74c, 6424e03, d1d7e1a.
  - The V&V audit in vv/*.md is substantive review material but not code
    to merge; several of its critiques are already addressed by recent
    work (V2 sampling-window cutoffs, V8 cross-DB dedup, V9 hard-coded
    Table 2). Open items (V1 build manifest, V3 cycle guard, V7 dedup
    sensitivity, V10 expected-output assertions) are tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

Airwhale commented May 3, 2026

Re: dedup : as we discussed, changed it so that it picks most recent highest signal sentiment. So if there is a strong postive and then a strong negitive, will show up as a negative. If there is a strong positive and then a moderate negative, will show up as positive.

I re-ran (and repopulated S3) with a larger dataset, from the beginning of the dataset to a day before publication of the relevant paper, or to two years of the dataset, whichever is less. Effect sizes and significance both increase. I need to confirm reproducability on this dataset and make sure documentation matches.

I made the requested merge from #38, but there are some methodological critiques I need to go through: it's like third on the list of things I need to do.

Will look into the SQL change requested above, seems pretty straightforward.

Closing and will reopen once those check marks are ticked off.

@Airwhale Airwhale closed this May 3, 2026
Shaun and others added 5 commits May 3, 2026 15:03
…master_gap

The build_combined_db.py script (introduced in 6424e03) backfilled rows from
the older per-drug DBs into the master_gap DB. That backfill was buggy: it
copied treatment_reports.user_id from the source DBs while leaving the posts
table's user_id as master_gap's. Because the older DBs and master_gap use
different username-hashing conventions, every one of the 262 backfilled rows
ended up with a treatment_reports.user_id that did not match the posts.user_id
for the same post. That breaks dedup, which keys on user_id.

Verified: in the contaminated DB, all 262 backfilled rows had mismatched
user_ids (3.30% of 7,949 total).

Fix: replace the contaminated DB with a clean copy of master_gap (the direct
output of the master pipeline run, with no backfill). This loses the small
amount of classification data that lived only in older DBs (sample sizes
shrink ~6-12% for prednisone, naltrexone, colchicine), but the headline
finding is unchanged: every trial-positive drug is significantly above the
50% null at p ≤ 0.0001, and no trial-null drug reaches significance.

Updated values in Table 3 / Figure 1:
  loratadine    n=91   81.3%   p<0.0001  trial: positive
  famotidine    n=233  77.3%   p<0.0001  trial: positive
  naltrexone    n=155  65.8%   p=0.0001  trial: positive
  colchicine    n=92   54.3%   p=0.47    trial: null
  paxlovid      n=196  54.1%   p=0.28    trial: null
  prednisone    n=344  48.8%   p=0.71    trial: null

Changes in this commit:
- Local + S3 historical_validation_2020-07_to_2022-12.db replaced with the
  clean master_gap snapshot (313 MB; 7,687 reports; 0 user_id mismatches).
- merged/*.csv on S3 regenerated from clean DB.
- README updated: the Provenance section now states the DB is the direct
  output of one master pipeline run with no backfill, and explains why
  cross-DB merging is unsafe given different username-hashing conventions.
  Expected-Output table updated with the new (clean) values.
- Outputs (figure0.png, figure1.png, paper_figures.{html,ipynb}) rebuilt
  from clean DB.
- scripts/build_combined_db.py and scripts/merge_and_analyze_historical.py
  removed; superseded by scripts/dump_per_drug_csvs.py which reads only
  the single combined DB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…older

Three follow-ups from a reviewer who downloaded the (then-contaminated) DB
before the c5c4957 fix landed:

1. Publish SHA-256 of the required database in the README and provide
   verification commands for macOS/Linux/Windows.
   Hash:
     FA67BE6D10FFE0B98365390A5223F6DFBC708360E830C725B74B443337AB4952

2. Add an integrity assertion at the top of the build script that fails
   loudly if any treatment_reports.user_id != posts.user_id. The build
   now prints "DB integrity check: 0 mismatched user_ids (must be 0). PASS."
   on every successful run. If the bug ever re-enters, the build aborts
   before producing figures, so a contaminated DB cannot silently produce
   numbers.

3. Stop creating a 0-byte data/corpus_baseline_onemonth.db when the
   reviewer doesn't have that file. The build was passing
   "../data/corpus_baseline_onemonth.db" to build_notebook.py purely as
   the cosmetic placeholder DB path that the helper's setup cell connects
   to. sqlite3.connect() silently creates an empty file when the path
   doesn't exist, which left a confusing 0-byte artifact behind. Switched
   the path to "../data/historical_validation_2020-07_to_2022-12.db" so
   the helper opens a real DB. No analysis cell uses the helper's conn
   variable, so the change is purely cosmetic but eliminates the
   misleading file.

The contaminated-DB issue itself was already fixed in c5c4957; the S3
file is clean and matches the SHA-256 above. The reviewer's hash
(4BBF06E7...) was for the old contaminated upload, which is no longer
on S3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reviewer follow-ups, both intentional methodology fixes:

1. Deleted-user policy (was: implicit collapse; now: explicit exclusion)
   The DB has 52,603 posts where the Reddit author field was [deleted] or
   [removed] at scrape time, all mapped to the placeholder user_id "deleted".
   122 of those posts have classified treatment reports. Per-user dedup
   would collapse them all into one pseudo-user (one vote per drug for the
   entire deleted population) — methodologically wrong because they come
   from many distinct real users we cannot identify.

   New policy: exclude every post with user_id = "deleted" from analysis.
   Verified empirically that the shift is at most 0.5pp and 1 unit of n per
   drug, so no headline conclusion changes:
     loratadine    n=91  -> 90   (81.3% -> 81.1%)
     famotidine    n=233 -> 232  (77.3% -> 77.2%)
     naltrexone    n=155 -> 154  (65.8% -> 65.6%)
     colchicine    n=92  -> 91   (54.3% -> 53.8%)
     paxlovid      n=196 -> 196  (54.1% -> 54.1%)
     prednisone    n=344 -> 343  (48.8% -> 48.7%)

   Implementation: AND p.user_id != 'deleted' added to the SQL query in
   both _build_paper_figures.py and scripts/dump_per_drug_csvs.py, so the
   filter propagates to figures, tables, and the published per-drug CSVs
   uniformly. README adds a "Step 1a: Deleted-user exclusion" section
   documenting the policy and the empirical shift.

2. Colchicine first-public date correction
   Reviewer flagged that JAMA's article page lists Bassi et al. 2025 as
   "Published Online: October 20, 2025"; the December 2025 date in the
   README was the print issue date, not the online date. Fixed:
     cutoff       2025-11-30 -> 2025-10-19
     source       JAMA Intern Med 2025-12-01 -> JAMA Intern Med 2025-10-20
   Does not affect numbers because colchicine is bound by the end-2022
   analysis cap, but "first public date" is now accurate.

Re-uploaded merged/*.csv to S3 with the new (post-exclusion) values. The
combined database itself is unchanged — the exclusion happens at query
time, not in the data — so its SHA-256 is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cosmetic-but-real consistency fixes flagged in an internal-consistency audit:

1. The naltrexone comparator label used a Unicode prime character (U+2032,
   "O′Kelly") in the build script's DRUG_CUTOFFS dict. The README used a
   regular apostrophe ("O'Kelly"). Fixed by quoting the paper field in
   double quotes inside the r-string raw cell, which allows a regular
   apostrophe inside the value:
     'O′Kelly et al. 2022'  ->  "O'Kelly et al. 2022"
   Now both files use the same character; rebuilt notebook output reflects
   the corrected label.

2. README never mentioned scripts/dump_per_drug_csvs.py, the script that
   regenerates the published per-drug CSVs from the combined database.
   Added a one-sentence pointer to it in the "Per-drug merged + deduped
   CSVs" subsection so reviewers know how to reproduce that step locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nces

Two related cleanups, no analytical change.

(1) Switch SQL from `post_date <= eod-of-day-before-publication` to
`post_date < midnight-utc-of-publication-date`. Same rows in/out (verified
bit-for-bit by re-running dump_per_drug_csvs.py — same n, same %pos, same
CIs, same p-values), but the predicate now matches the natural-language
"strictly before publication" much more transparently. Renames `cutoff` ->
`pub_date` and `effective_cutoff` -> `window_end_exclusive` throughout.
Adds `epoch_midnight()` (replacing `epoch_eod()`) and changes
`END_2022 = '2022-12-31'` to `END_2022_EXCLUSIVE = '2023-01-01'`. summary.csv
column rename re-uploaded to S3.

(2) Drop README references to the 5 optional smaller .db files. Verified
nothing in the build flow reads them — both _build_paper_figures.py and
scripts/dump_per_drug_csvs.py use only the combined DB. The smaller DBs
remain on S3 (cheap to keep) but are no longer documented as part of the
reproducibility package, removing ~40 lines of stale "optional / for
transparency" tables and trimming the Provenance section. Also updates
build_notebook.py docstring example to use the canonical DB name.

Verified: dump_per_drug_csvs.py and _build_paper_figures.py both produce
identical headline numbers (famotidine 232/77.2%, loratadine 90/81.1%,
naltrexone 154/65.6%, paxlovid 196/54.1%, colchicine 91/53.8%,
prednisone 343/48.7%). DB integrity check still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Airwhale Airwhale reopened this May 3, 2026
Shaun and others added 2 commits May 4, 2026 12:12
Two fixes addressing review feedback:

1. README post-count reconciliation. Reviewer noted the README's
   "47,434 top-level posts + 684,092 comments" line was inconsistent
   with verify.py / THREAD_AUDIT.md / the DB itself reporting
   47,442 submissions + 684,084 comments. Both numbers are correct
   but describe different stages: 47,434/684,092 is the JSON-level
   reality (actual posts vs comments), 47,442/684,084 is the DB-level
   reality after the import script's dangling-parent cleanup nulls
   parent_ids for 8 comments whose parents fell outside the scrape
   window. Same total. README now explains the 8-row delta inline
   so the apparent inconsistency is documented as a known
   scrape-window edge effect.

2. Drop internal audit codes from public-facing files. The "V1/V2/V3/
   V7/V10" prefixes in section headers, check names, and prose came
   from an internal audit framework that shouldn't be referenced in
   the public reproducibility package. Renamed:
     - V3_THREAD_AUDIT.md -> THREAD_AUDIT.md
     - V7_DEDUP_AUDIT.md -> DEDUP_AUDIT.md
     - scripts/v3_thread_audit.py -> scripts/thread_audit.py
     - scripts/v7_dedup_sample_audit.py -> scripts/dedup_sample_audit.py
   Plus stripped V-codes from verify.py check names ("V1: DB integrity"
   -> "DB integrity"), build-script cell headings ("## Window
   verification (V2)" -> "## Window verification"), build-script
   stdout ("V2 window verification: ..." -> "Window verification: ..."),
   DRUG_ALIASES.md prose, and a few other call sites. Section
   structure unchanged; only the audit-framework labels are gone.

Verified: build runs cleanly, all assertions still PASS, headline
numbers unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both small clarifications, addressing reviewer comments that pointed at
genuine ambiguity in the docs (not at any methodological issue):

1. README "Raw Reddit JSON" section — clarify scope of import_posts.py.
   The previous wording ("rebuild the SQLite DB from this JSON file
   using src/import_posts.py") implied that one script rebuilds the
   full analysis DB. It does not: import_posts.py rebuilds only the
   `users` and `posts` tables. The `treatment_reports` table (LLM
   classifications) is the frozen output of running
   src/run_sentiment_pipeline.py per target drug — Anthropic API
   credentials and ~2 hr wall time required. README now states this
   explicitly.

2. DRUG_ALIASES.md "Epistemic status" section — make the LLM-aliases
   methodology choice explicit. Previously the file noted that aliases
   are LLM-generated and not manually adjudicated, but didn't address
   the obvious reviewer question: is this a workaround, an
   unintentional limitation, or a deliberate choice? Added a short
   paragraph stating it's deliberate — evaluating the LLM
   canonicalization step is part of the method being demonstrated, so
   we treat alias generation as a pipeline output rather than a hand-
   authored input, and publish the generated aliases for auditability.
   Updated in scripts/dump_drug_aliases.py and regenerated DRUG_ALIASES.md.

No code logic changed; no numbers shifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

Airwhale commented May 4, 2026

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@Airwhale
Copy link
Copy Markdown
Collaborator Author

Airwhale commented May 4, 2026

Update: substantial methodology and transparency pass

Pushed a substantial round of changes in response to PR review feedback (Eli, Qodo, Gemini, plus a more recent doc review). Headline conclusion is unchanged — every trial-positive drug still cleanly above 50% with strong CIs, every null-trial drug still hugging 50% — but the path to those numbers is now documented, asserted, and reviewable end-to-end.

TL;DR

  • Larger corpus. Pulled the full r/covidlonghaulers history from Arctic Shift (2020-07-24 → 2022-12-31, ~2.5 years) and run every drug query against that single corpus instead of the per-drug short-window scrapes used earlier.
  • Three additional substantive methodology changes (single-DB consolidation, "most recent + signal-strength" dedup, strict-< pre-publication SQL).
  • One real bug found and fixed during review (parent_id loss on import → analysis DB rebuilt with prefix fix; published numbers unaffected).
  • Build-time assertions wired into the notebook execution: DB integrity, per-drug window, expected-output match, dedup audit. Build fails on drift.
  • Standalone reviewer artifacts: DRUG_ALIASES.md, THREAD_AUDIT.md, DEDUP_AUDIT.md, ARCTIC_SHIFT_PROVENANCE.md, output/provenance.json.
  • One-command reviewer check: python verify.py — runs every assertion, prints PASS/FAIL summary, exits 0/1.
  • Headline numbers stable across every re-run.

Methodology changes

Change Why Effect on numbers
Larger corpus pull from Arctic Shift. Earlier per-drug runs used short scrape windows (~50 days around each publication date, run separately per drug). Now: one corpus-wide pull (2020-07-24 → 2022-12-31, ~2.5 years) that every drug query reads from. Short windows around each pub date were enough to draw the original samples but threw away most of the available pre-publication discussion. The full-corpus pull lets every drug see whatever discussion exists in the 30-month corpus. Pre-publication sample sizes substantially larger; naltrexone now has ~21 months of pre-publication data (2020-10 → 2022-07, n=154); colchicine has ~28 months (2020-08 → 2022-12, n=91, capped at end-of-2022). See per-drug timeframe table below.
Single-DB consolidation (historical_validation_2020-07_to_2022-12.db, 311 MB, SHA c50fcacd…). Was 5 separate DBs being merged at analysis time. Cleaner reproducibility story: one file, one hash, no cross-DB hashing-convention concerns. Numbers stable; reviewers download 1 file instead of 5.
Dedup rule changed from peak-rank (SENT_RANK, SIG_RANK) to "most recent post wins; signal-strength breaks ties on same date." Direct response to Eli's asymmetry concern that a single positive could override any number of later non-positives under the old rule. New rule is symmetric on sentiment direction. Headline numbers shifted (e.g., famotidine 207 → 232); shifts are documented and consistent across _build_paper_figures.py and dump_per_drug_csvs.py.
SQL predicate changed from post_date <= eod-of-day-before-publication to post_date < midnight-utc-of-publication-date. Eli's request: cleaner natural-language mapping. The strict-< predicate maps directly to "strictly before publication." Bit-for-bit identical numbers (same rows in/out); semantically clearer.
Deleted-user exclusion added (AND p.user_id != 'deleted'). Reddit's [deleted]/[removed] author placeholder mapped to one pseudo-user, giving the entire deleted population one vote per drug. Documented policy now: exclude. Each drug's headline shifted ≤0.5pp and ≤1 unit of n.
Colchicine date corrected from 2025-12-01 (PubMed print issue) to 2025-10-20 (JAMA online-first). Caught during reviewer audit. End-of-2022 cap is binding for colchicine, so cap-bound numbers unchanged; the documented cutoff is now accurate.

Bug found and fixed during the audit pass

src/import_posts.py was inserting comment.parent_id verbatim from the JSON (t1_<id> / t3_<id> Reddit prefixes) but storing post_id bare. The cleanup query UPDATE parent_id = NULL WHERE parent_id NOT IN (SELECT post_id ...) then nulled every prefixed parent as "dangling." Result: in the analysis DB every comment had parent_id IS NULL and the thread structure was lost.

  • Impact on the published numbers: zero. Each pipeline run used --drug substring filter, so upstream-context inheritance was a no-op even before the fix.
  • Fix: 1-line strip_reddit_prefix() helper in import_posts.py. Analysis DB rebuilt via scripts/rebuild_analysis_db_with_parent_ids.py (fresh import + copy LLM classifications across); 0 user_id mismatches, treatment_reports count unchanged, headline numbers identical bit-for-bit. New SHA-256 published in README.
  • Cycle guard added to extract.py's upstream-traversal recursion. idx_posts_parent index added to schema.sql.

New transparency / reviewer artifacts

All under docs/RCT_historical_validation/:

File What it is
README.md Adds Window verification table, Drug aliases section, Dedup audit section, Verify section, Arctic Shift provenance pointer
verify.py Single-command reproducibility runner (PASS/FAIL summary, exit 0/1)
ARCTIC_SHIFT_PROVENANCE.md How the raw JSON was acquired (URL, scrape date, conversion process, byte sizes)
DRUG_ALIASES.md Every alias the pipeline substring-matched against (170 across 6 drugs, 0 cross-drug collisions, reviewer notes flagging entries worth a closer look). Includes an explicit Epistemic Status section: alias generation is a deliberate pipeline output, not a hand-authored input, because evaluating the LLM canonicalization step is part of the method this paper demonstrates.
THREAD_AUDIT.md Orphan parent count, cycle detection, 50 sampled reply chains with timestamp monotonicity
DEDUP_AUDIT.md 36 sampled multi-report (user, drug) pairs showing every candidate report with the rule's retained pick marked
output/provenance.json Build-time manifest: git commit + dirty flag, DB SHA-256, model env, full extraction_runs table

Plus standalone reproducible scripts (scripts/thread_audit.py, scripts/dedup_sample_audit.py, scripts/dump_drug_aliases.py, scripts/rebuild_analysis_db_with_parent_ids.py) — all deterministic, all idempotent.

Build now has hard gates that fire on every notebook execution

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Great results!

@@ -0,0 +1,128 @@
{
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great solution to documenting what was run and how !

comments" downloads, save the resulting `.jsonl` files locally.
3. Run `scripts/convert_jsonl_to_source.py` to filter to the analysis
window and produce `historical_validation_2020-07_to_2022-12.json`.
4. Verify SHA-256 matches `298d5bc7...abfe6a`.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect and reproducible!

guaranteed against the snapshot Arctic Shift had on 2026-05-02. Earlier
or later Arctic Shift snapshots may differ.

## Why the date range stops at 2022-12-31
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect explanation

the analysis SQLite DB. The lists are what every pipeline run
substring-matched posts and comments against during the extraction
step and what every canonicalization step normalized to.
Reviewers can audit these lists directly without running anything.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome visibility

Comment on lines +97 to +100
- `h2 blocker` *(multi-word)*
- `h2 antagonist` *(multi-word)*
- `acid reducer` *(multi-word)*
- `heartburn relief` *(multi-word)*
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if heartburn relief and acid reducer are correct.

Comment on lines +169 to +174
- `50mg naltrexone` *(multi-word)*
- `oral naltrexone` *(multi-word)*
- `injectable naltrexone` *(multi-word)*
- `extended release naltrexone` *(multi-word)*
- `naltrexone er` *(multi-word)*
- `naltrexone xr` *(multi-word)*
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are definitely not the same thing as LDN

  • 50mg naltrexone
  • injectable naltrexone
  • extended release naltrexone
  • naltrexone er
  • naltrexone xr
  • vivitrol
  • revia
  • depade

  1. the brand name naltrexones are all at 50mg
  2. The 50mg dose is only for opioid and alcohol use disorder
  3. While 50mg might be effective for long covid, it's essentially a different treatment
  4. The Injectable and XR formulations are definitely not the same thing

LDN works because it has different effects at a lower dosage.

- `famotidins` *(misspelling)*
- `h2 blocker` *(multi-word)*
- `h2 antagonist` *(multi-word)*
- `acid reducer` *(multi-word)*
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I would use acid reducer and heartburn relief as aliases to pepcid.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that category would include Tums

Comment on lines +273 to +275
- **loratadine** includes `loratab` — Lortab is a brand of
hydrocodone/acetaminophen, a different (opioid) drug; this is
likely an LLM error and should be reviewed.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

- `predni`
- `prednisona` *(misspelling)*
- `corticosteroid`
- `steroid`
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would probably remove steroid as an alias to prednisone lol

Comment on lines +286 to +291
To turn any of these into actual corrections, the path is: edit the
alias list, re-run canonicalization, re-run classification, regenerate
the analysis DB and figures. None of this changes the headline
conclusion (every drug's responder rate stays in its current bucket
when individual ambiguous aliases are removed) but it is the open
methodology task of manually reviewing every alias before publication.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing any canonicalization errors would strengthen our results


## Reproducibility

Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6` against the same DB produces an identical file. Total sampled (user, drug) pairs: **36**. No newline at end of file
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scripts/dedup_sample_audit.py ?

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Issue 1: the dedup audit reproducibility command is not runnable as written, and the “identical file” claim is too strong.

docs/RCT_historical_validation/DEDUP_AUDIT.md says:

Re-running scripts/v7_dedup_sample_audit.py --db <db> --out <out> --seed 42 --per-drug 6 against the same DB produces an identical file.

Two problems:

  1. scripts/v7_dedup_sample_audit.py does not exist in this repo. The actual script is scripts/dedup_sample_audit.py.
  2. Re-running the current script does not produce a byte-identical file because the Generated timestamp changes. It does reproduce the same sampled rows and still reports 36 sampled (user, drug) pairs.

Suggested fix: update the sentence to something like:

Re-running scripts/dedup_sample_audit.py --db docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db --out <out> --seed 42 --per-drug 6 against the same DB reproduces the same sampled rows, aside from generated metadata. Total sampled (user, drug) pairs: 36.

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Issue 2: the dedup rule wording says signal strength breaks ties on the “same date,” but the implementation uses the full post_date timestamp. In other words, the most recent post wins over Signal Strength.

In scripts/dedup_sample_audit.py, rows are sorted by:

(r["date"], SIG_RANK.get(r["sig"], 0))

with reverse=True. That means the most recent timestamp wins first, and signal strength only breaks ties when the full timestamp value is identical. Same-calendar-day rows are still ordered by time before signal strength.

This shows up in DEDUP_AUDIT.md for loratadine: a later moderate report at 2020-12-24 16:52 is retained over an earlier strong report at 2020-12-24 15:20.

Suggested fix: change the wording in both the script-generated intro and the checked-in audit from “ties on the same date” to “ties on the same timestamp” or “ties on the same post_date value.”

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Minor Issue 3: the DB path recorded in DEDUP_AUDIT.md is stale and does not match the current reproducibility package layout.

DEDUP_AUDIT.md currently records:

data\historical_validation\historical_validation_2020-07_to_2022-12.db

But the README now instructs reviewers to place the DB in the data/ subfolder under docs/RCT_historical_validation/, i.e.:

docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db

The current build/verification flow also uses that package-local path. This stale metadata will confuse reviewers trying to rerun the audit.

Suggested fix: regenerate DEDUP_AUDIT.md using the current DB location, or manually update the metadata and reproduction command so they point at docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db.

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Here is the same writeup from claude code with nicer fromatting, if this helps.


Code review

Found 3 issues (re-checking the 3 items from the previous changes-requested review):

  1. Wrong script name (and incorrect reproducibility claim) in DEDUP_AUDIT.md — the Reproducibility section says to run scripts/v7_dedup_sample_audit.py, but that file does not exist; the actual script is scripts/dedup_sample_audit.py. The bug is in the script's own output string, not just the committed markdown. Additionally, the claim "produces an identical file" is false: the script writes **Generated:** <current UTC time> on every run (line 77 of the script), so re-runs differ in the timestamp header.

lines.append("")
lines.append(
f"Re-running `scripts/v7_dedup_sample_audit.py --db <db> --out <out> "
f"--seed {args.seed} --per-drug {args.per_drug}` against the same DB "
f"produces an identical file. Total sampled (user, drug) pairs: "
f"**{total_sampled}**."
)

  1. Dedup rule wording says "same date" but implementation uses full timestamp — the docstring and generated audit text both say "signal_strength breaks ties on the same date," but the sort key is (r["date"], SIG_RANK) where r["date"] is a Unix epoch in seconds, not a calendar date. A later post on the same calendar day beats an earlier stronger post. The loratadine user 87b65a2f… in the committed DEDUP_AUDIT.md confirms this: moderate at 16:52 is retained over strong at 15:20, both on 2020-12-24. The wording in README.md ("For posts on the same date, the strongest signal wins") has the same issue.

Dedup rule under audit: "most recent post wins; signal_strength breaks
ties on the same date (strong > moderate > weak > n/a)."

### user `87b65a2f4371463c…` (2 reports)
| | post_date (UTC) | sentiment | signal | post_id |
|---|---|---|---|---|
| **RETAINED** | 2020-12-24 16:52 | positive | moderate | `ggwtglb` |
| | 2020-12-24 15:20 | mixed | strong | `ggwjynp` |

  1. Stale DB path in DEDUP_AUDIT.md — line 4 of the committed file records data\historical_validation\historical_validation_2020-07_to_2022-12.db (Windows backslashes; includes a historical_validation/ subdirectory). The README.md instructs reviewers to place the DB at data/historical_validation_2020-07_to_2022-12.db (flat inside docs/RCT_historical_validation/data/, no subdirectory). The path recorded in the audit file does not match the path the README tells reviewers to use.

**Generated:** 2026-05-04 19:10 UTC
**DB:** `data\historical_validation\historical_validation_2020-07_to_2022-12.db`
**Sample seed:** 42


Issues 4–6 from earlier comments are resolved: dump_per_drug_csvs.py uses Path(__file__).resolve().parents[1] (no hardcoded Windows paths), the post_date < window_end SQL predicate is present in both scripts, and the dedup logic correctly implements most-recent-timestamp-wins.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@Ely-S
Copy link
Copy Markdown
Owner

Ely-S commented May 5, 2026

Here are some of Sonnet's comments. I don't know that any of these are meaningful, but leave it up to you.


Code review (OPTIONAL deep pass — reproducibility, data quality, robustness)

Found 5 new issues:

  1. Committed output artifacts were built from a dirty working treeprovenance.json records "dirty": true, meaning figure1.png, paper_figures.html, and the executed notebook were generated from code not captured in any commit. A reviewer who checks out 384e0d0 and runs _build_paper_figures.py on a clean tree cannot expect to produce byte-identical outputs.

"git": {
"commit": "8a05a31a116f0206e78531dfa280337e6846acb1",
"dirty": true
},
"db": {

  1. The frozen DB was built at a different commit than the code under review — all six extraction_runs in provenance.json record "commit_hash": "16c2569b0273e793f7d5f8894c912073a0a85753" (commit title: "address Gemini review: alias resolution, BOM, encoding"), while the current HEAD is 384e0d0. The Gemini review that was being addressed raised alias-resolution concerns; it's unclear whether the alias logic used during the actual extraction run differs from the logic now in the PR. There is no diff between those commits in the documented provenance.

"extraction_runs": [
{
"run_id": 1,
"run_at": 1777759720,
"commit_hash": "16c2569b0273e793f7d5f8894c912073a0a85753",
"extraction_type": "treatment_sentiment",
"config": {
"models": {
"fast": "anthropic/claude-haiku-4.5",
"strong": "anthropic/claude-sonnet-4.6"
},
"limit": 0,
"reclassify": false,
"skip_canonicalize": false,
"output_dir": "data\\historical_validation\\master_gap\\pipeline_output",
"drug": "famotidine"
},
"run_at_iso": "2026-05-02T22:08:40+00:00"
},

  1. dump_per_drug_csvs.py default DB path does not match the path the README tells reviewers to use — the README's "Reproduce" section instructs downloading the DB to docs/RCT_historical_validation/data/historical_validation_2020-07_to_2022-12.db (no subdirectory, confirmed by provenance.json's db.path). But dump_per_drug_csvs.py's default resolves to <repo-root>/data/historical_validation/historical_validation_2020-07_to_2022-12.db — a different location entirely. Running the script without --db after following the README download instructions will fail with "DB not found." The README also internally describes the script as reading from data/historical_validation/... (a third path variant).

# Repo-relative defaults: this file lives at <repo>/scripts/dump_per_drug_csvs.py,
# so parents[1] is the repo root regardless of where the script is invoked from.
REPO_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_DB = REPO_ROOT / "data" / "historical_validation" / "historical_validation_2020-07_to_2022-12.db"
DEFAULT_OUT = REPO_ROOT / "data" / "historical_validation" / "merged"

  1. README claims "every comparison reaches p ≤ 0.0001" but naltrexone's p-value is 1.358e-4verify.py's EXPECTED_OUTPUTS records naltrexone at p: 1.358e-4 = 0.0001358, which is greater than 0.0001. The expected-output table in the README correctly shows naltrexone as 0.0001 (rounded), but the prose is factually wrong.

| prednisone | 343 | 48.7% | [43.4, 54.0] | 0.67 | null |
The pattern: every drug where the community clearly leaned positive (loratadine, famotidine, naltrexone) corresponds to a trial that found a positive result — every comparison reaches p ≤ 0.0001. Every drug where the community was roughly split (colchicine, paxlovid, prednisone) corresponds to a trial that found no significant effect — none reach significance against the 50% null. All 6 directional comparisons match the eventual trial outcome.

"famotidine": {"n": 232, "pos": 179, "pos_pct": 77.155, "p": 3.565e-17},
"loratadine": {"n": 90, "pos": 73, "pos_pct": 81.111, "p": 1.948e-9},
"prednisone": {"n": 343, "pos": 167, "pos_pct": 48.688, "p": 0.6658},
"naltrexone": {"n": 154, "pos": 101, "pos_pct": 65.584, "p": 1.358e-4},
"paxlovid": {"n": 196, "pos": 106, "pos_pct": 54.082, "p": 0.2839},
"colchicine": {"n": 91, "pos": 49, "pos_pct": 53.846, "p": 0.5296},

  1. provenance.json extraction-run paths use Windows backslashes and an undocumented master_gap directory — every extraction_runs[*].config.output_dir records a Windows-style path such as data\\historical_validation\\master_gap\\pipeline_output_loratadine. The master_gap subdirectory does not appear in the README, the documented data layout, or any script. Reviewers who inspect provenance.json to understand how the DB was built cannot trace these intermediate artifacts.

"limit": 0,
"reclassify": false,
"skip_canonicalize": false,
"output_dir": "data\\historical_validation\\master_gap\\pipeline_output",
"drug": "famotidine"
},
"run_at_iso": "2026-05-02T22:08:40+00:00"
},


Minor (not blocking): _build_paper_figures.py's module docstring (line 18) still says "Signal-strength is the tiebreaker for posts on the same date" — the same wording issue flagged in the previous review for DEDUP_AUDIT.md and README.md also appears here.

Per-(user, drug) deduplication rule:
1) Most recent report wins (post_date desc).
2) Signal-strength is the tiebreaker for posts on the same date
(strong > moderate > weak > n/a).

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Shaun and others added 5 commits May 7, 2026 09:51
PR #37 reviewer findings (Eli + Sonnet):
- Eli #1: dedup_sample_audit.py output template fixed — script name
  was 'v7_dedup_sample_audit.py' (file doesn't exist), now correctly
  references 'scripts/dedup_sample_audit.py'; "produces an identical
  file" claim softened to acknowledge the regenerated 'Generated'
  timestamp at the top of the file.
- Eli #2 (Option A — wording, not implementation): "same date"
  reconciled with the full-timestamp implementation across
  _build_paper_figures.py (4 spots), README, dedup_sample_audit.py,
  and the regenerated DEDUP_AUDIT.md. The rule in practice reduces
  to "most recent post wins (full UTC timestamp)" because exact-
  timestamp ties are rare; signal_strength acts as a near-vestigial
  tiebreaker on those rare ties. Implementation unchanged.
- Eli #3: stale Windows-backslash DB path in DEDUP_AUDIT.md
  ('data\historical_validation\...') replaced by the package-local
  posix path on regeneration.
- Sonnet #4 + #5: new README "Notes on output/provenance.json"
  subsection documents the dirty=true chicken-and-egg constraint and
  the extraction commit (16c2569) vs HEAD (this PR) diff —
  reformalized SQL predicate, deleted-user exclusion, colchicine date
  correction, audit/verify scripts, parent_id prefix fix; explicitly
  notes no alias-resolution or sentiment-classification logic was
  changed in this range.
- Sonnet #6: dump_per_drug_csvs.py default DB path mismatch with
  README — fixed by routing through paths.py (see refactor below).
- Sonnet #7: README "every comparison reaches p ≤ 0.0001" prose
  corrected to "p < 0.001" (naltrexone is 1.358e-4 = 0.0001358).
- Sonnet #8: provenance writer now normalizes Windows backslashes in
  extraction_runs.config; new README subsection explains what the
  'master_gap' working directory was.

Path resolution refactor (paths.py):
- New paths.py module with marker-based anchor walk-up + RCT_DB_PATH
  env var override. Single source of truth.
- All four entry points (_build_paper_figures.py, verify.py,
  scripts/dump_per_drug_csvs.py, the executed notebook) use the same
  resolver. The notebook embeds an inlined copy of the resolver in
  its first setup cell so reviewers can read exactly how the DB is
  found without chasing imports.
- build_notebook.py grew an optional db_path_block kwarg so callers
  can inject richer DB resolution (used by _build_paper_figures.py);
  default behavior unchanged for any other notebook builders.
- provenance.json db.path is now package-relative
  ("data/historical_validation_2020-07_to_2022-12.db") — no more
  leaky absolute Windows-user paths.
- verify.py header surfaces resolution mode ("package anchor
  (default)" vs "RCT_DB_PATH=..." vs "--db CLI flag") so a stale
  env var failure mode is obvious at a glance.

Bonus parallel-bug catches in adjacent audit scripts (same false
"produces an identical file" claim and Windows-backslash DB paths):
- dump_drug_aliases.py and thread_audit.py corrected; DRUG_ALIASES.md
  and THREAD_AUDIT.md regenerated with posix DB paths and accurate
  reproducibility wording.

Figure 1 redesign:
- Replaced the paired horizontal bar chart with a two-marker forest
  plot per drug (responder green circle + non-responder red square),
  trial outcome tagged in the row label, paper citation on line two
  of the y-tick label. Each marker has a matching-color 95% Wilson CI
  line; numeric values printed at the right end.

Verification:
- All headline numbers unchanged: famotidine 232, loratadine 90,
  naltrexone 154, paxlovid 196, prednisone 343, colchicine 91.
- DB SHA-256 unchanged: c50fcacd7ce366f397152f5fe4dbb59d5eaf64ba32627faef91dad86fbf6c6f4
- verify.py: ALL 5 CHECKS PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rl, soften alias claim

Three findings from a fresh-checkout reproduction pass:

P2 — provenance could record the wrong git commit
  _git_metadata() ran `git rev-parse HEAD` and `git status` without
  anchoring to the package root. If a reviewer extracted the package
  into an unrelated dirty repo checkout (Codex's exact scenario),
  output/provenance.json silently recorded that *other* repo's commit
  rather than the package's commit — silent misdirection that's worse
  than recording 'unknown'.

  Fix: run git with cwd=Path(__file__).resolve().parent, then verify
  _build_paper_figures.py is actually tracked in that worktree via
  `git ls-files --error-unmatch`. If git can't resolve HEAD or the
  script isn't tracked, record 'outside-git' (visibly broken) instead
  of an unrelated commit (silently misleading). 'unknown' is now
  reserved for the case where git itself isn't installed.

  Verified across three scenarios:
   - inside the package's own checkout → reports actual SHA
   - extracted to a plain directory      → 'outside-git' + warning
   - extracted into an unrelated repo    → 'outside-git' + warning
                                            (was: returned unrelated SHA)

P3 — README curl command silent on HTTP errors
  `curl -L -o data/...` exits 0 on 4xx, writes the error response body
  to the destination filename, and the next step crashes with a
  confusing `sqlite3.DatabaseError: file is not a database`. Changed
  to `curl -fL -o ...` with a comment explaining why, and added a
  `python verify.py` step right after the download so HTTP/transfer
  errors fail at the download step rather than mid-build.

P3 — alias sensitivity claim was unsupported
  DRUG_ALIASES.md previously said "None of this changes the headline
  conclusion" about alias edits — a quantitative sensitivity claim
  with no per-alias ablation table to back it. Softened the wording
  in scripts/dump_drug_aliases.py: explicitly state we have NOT run a
  quantitative sensitivity analysis, that this file documents
  inspectability rather than robustness, and that producing a
  per-drug ablation table is the open methodology task before
  publication. DRUG_ALIASES.md regenerated.

Also fixed (parallel hygiene):
  - dump_drug_aliases.py reproducibility command rendered Windows
    backslashes for args.db / args.out; now uses .as_posix().

Verification:
  - provenance.json `git.commit` correctly = 665e910… (HEAD before
    this commit), not the parent OneDrive repo's commit
  - All headline numbers unchanged (verify.py: ALL 5 CHECKS PASSED)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce lag

P3 — README DB table count was wrong
  README said "four tables" (users, posts, treatment, treatment_reports)
  but the DB actually has seven: those four plus extraction_runs (used
  by provenance), plus conditions and user_profiles (both empty,
  reserved for future extensions). Reviewers inspecting the DB
  directly would have been surprised. Replaced the prose count with a
  full inventory table showing every table, its row count at the
  published SHA, and its role.

P3 — committed provenance.json lagged HEAD
  The committed provenance.json recorded git.commit = 665e910 (the
  prior PR commit) while HEAD was 0e1f187 — an artifact of the
  chicken-and-egg constraint: the build runs at commit N, records
  git.commit = N, and gets committed as commit N+1. The committed
  artifact always points at its own parent commit.

  Two changes:
   - Regenerated outputs at current HEAD so the lag drops from 2
     commits to exactly 1 (parent). provenance.json now records
     0e1f187 (the immediate parent of this fix commit).
   - Added an explicit README subsection under Notes on
     output/provenance.json explaining the 1-commit lag, naming the
     reviewer-regenerated file as authoritative, and documenting
     git.commit = "outside-git" semantics introduced in the previous
     commit.

Verification:
  - All headline numbers unchanged
  - verify.py: ALL 5 CHECKS PASSED at HEAD before this commit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original paper's data-sources table included a Total-posts column
(corpus posts in each drug's window, the denominator the substring +
LLM extraction ran against). The current package's Table 2 dropped it.
Adding it back so reviewers can see the analysis denominator alongside
the per-drug user/report counts.

Definition (matches the per-drug analysis filters):
  COUNT(*) FROM posts
  WHERE post_date IS NOT NULL
    AND post_date < <window_end_exclusive>
    AND user_id != 'deleted'

Values for the published SHA c50fcacd…:
  famotidine  139,075   (window: 2020-07-24 → 2021-06-06)
  loratadine  139,075   (same window)
  prednisone  218,822   (2020-07-24 → 2021-10-25)
  naltrexone  448,114   (2020-07-24 → 2022-07-02)
  paxlovid    678,923   (2020-07-24 → 2022-12-31, end-2022 cap)
  colchicine  678,923   (same as paxlovid, capped)

These are larger than the original paper's narrower-window post counts
(famotidine 88,077, naltrexone 244,036, etc.) because this PR switched
to one corpus-wide pull — already documented in Provenance.

The added footnote explicitly spells out what Total posts counts and
why it differs from "raw post-row count" (the deleted-user and
NULL-post_date exclusions match the analysis filters, so the column
reads as the actual analysis denominator rather than a rawer count
that would mismatch downstream rows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Table 3 formatter used to bin p-values into one of three buckets:
  - p < 0.0001  → "<0.0001"   (lossy — discards the actual order
                                of magnitude; loratadine and
                                famotidine both rendered as the same
                                string despite differing by 8 orders
                                of magnitude)
  - p < 0.001   → 4 decimal places (e.g. "0.0001")
  - p ≥ 0.001   → 3 decimal places (e.g. "0.284")

New behavior:
  - p < 0.001   → 3-sig-fig scientific notation (e.g. "3.56e-17",
                  "1.95e-9", "1.36e-4")
  - p ≥ 0.001   → 3 decimal places (unchanged)

Single-digit exponents have their leading zero stripped (1.95e-9 not
1.95e-09) for readability. Both Table 3 in the executed notebook and
the README's Expected Output table now use the same formatter and
show the same values.

Verification:
  - verify.py: ALL 5 CHECKS PASSED (formatter change is
    presentation-only; numeric assertions unchanged)
  - Rendered Table 3 p-values:
      loratadine  1.95e-9
      famotidine  3.56e-17
      naltrexone  1.36e-4
      paxlovid    0.284
      colchicine  0.530
      prednisone  0.666

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

Airwhale commented May 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive reproducibility package for the RCT historical validation analysis, including documentation, audit logs, and automated build scripts. Key changes include fixing a parent_id prefix issue in the Reddit import logic, adding cycle detection to the extraction pipeline, and providing a suite of verification tools to ensure data integrity and numerical consistency. Feedback focuses on improving code maintainability by reusing existing database connections and constants, removing redundant implementations of statistical functions, and consolidating duplicated graph traversal logic into shared utilities.

Comment on lines +290 to +292
DB_DIR = Path(DB_PATH).parent
COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"
combined_conn = sqlite3.connect(COMBINED_DB.as_posix())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The generated notebook code creates a redundant database connection combined_conn. The setup block (injected via build_notebook) already establishes a global conn object using the same database file. It is better to reuse the existing conn to avoid confusion and resource overhead in the executed notebook.

Suggested change
DB_DIR = Path(DB_PATH).parent
COMBINED_DB = DB_DIR / "historical_validation_2020-07_to_2022-12.db"
combined_conn = sqlite3.connect(COMBINED_DB.as_posix())
# Reusing the 'conn' established in the setup block.
combined_conn = conn

Comment on lines +1166 to +1171
NOTEBOOK_DB_RESOLVER = '''import os
from pathlib import Path

_DB_FILENAME = "historical_validation_2020-07_to_2022-12.db"
_PACKAGE_MARKERS = ("_build_paper_figures.py", "verify.py")
_ENV_VAR = "RCT_DB_PATH"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The database filename and package markers are hardcoded here, duplicating constants already defined in paths.py. While inlining the resolver into the notebook is a good design for portability, the generation of this string should ideally use the constants from paths.py to ensure they stay in sync.

Comment on lines +64 to +72
def wilson_ci(k, n, z=1.96):
"""Wilson score confidence interval for a proportion."""
if n == 0:
return 0.0, 0.0
p = k / n
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denom
return max(0, center - margin), min(1, center + margin)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The wilson_ci function is manually implemented here, but the reproducibility package already depends on statsmodels (as seen in requirements.txt and _build_paper_figures.py), which provides a more robust implementation via proportion_confint. This manual implementation is redundant and potentially less accurate than the library version.

"""V10: re-derive the same per-drug stats the build does, compare to
EXPECTED_OUTPUTS. Same logic as the build's V10 cell, but standalone."""
from scipy.stats import binomtest
from statsmodels.stats.proportion import proportion_confint as wilson # noqa: F401
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The import of proportion_confint as wilson is unused in this function. Removing unused imports keeps the code clean and avoids unnecessary dependencies at runtime.

Comment thread src/pipeline/extract.py
Comment on lines +65 to +112
def _detect_parent_cycles(id_to_parent: dict) -> None:
"""Raise ValueError if id_to_parent contains any cycle.

The upstream-mentioned-drugs recursion below assumes id_to_parent is a
forest (each node has at most one parent and chains terminate). A cycle
in malformed imported data would cause the recursion to memoize on
(eid, remaining) and still never terminate (cache stores final values,
not in-progress visits). Detect once up front and fail loudly.

Iterative DFS with white/gray/black coloring; O(N).
"""
UNVISITED, VISITING, DONE = 0, 1, 2
color = {}
for start in id_to_parent:
if color.get(start, UNVISITED) != UNVISITED:
continue
stack = [start]
path = []
while stack:
node = stack[-1]
c = color.get(node, UNVISITED)
if c == UNVISITED:
color[node] = VISITING
path.append(node)
parent = id_to_parent.get(node)
if parent and parent in id_to_parent:
pcol = color.get(parent, UNVISITED)
if pcol == VISITING:
# back-edge into the active path -> cycle
i = path.index(parent)
raise ValueError(
"parent_id cycle detected: "
+ " -> ".join(path[i:] + [parent])
)
if pcol == UNVISITED:
stack.append(parent)
continue
# parent missing, NULL, or already DONE -> this node is finished
color[node] = DONE
if path and path[-1] == node:
path.pop()
stack.pop()
else:
if path and path[-1] == node:
path.pop()
stack.pop()
color[node] = DONE

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cycle detection logic in _detect_parent_cycles is implemented using an iterative DFS. While correct, this logic is duplicated in docs/RCT_historical_validation/verify.py. Consider moving this to a shared utility module (e.g., src/utilities/graph.py) to improve maintainability, especially since both locations are part of the same repository.

H4 (substantive correctness, classify.py alias-resolution gap)
  When the pipeline runs with --drug but without --drug-file, AND
  skip_canonicalize is set (or the DB hasn't been canonicalized yet),
  synonyms_for is empty. Aliased mentions like "LDN" or "low dose
  naltrexone" still get classified — they pass the target_aliases
  filter — but the system_prompt called for them with
  synonyms_for.get(alias) = None, so the prompt lacks the canonical-
  drug context. Less obvious aliases (e.g., "low dose 3mg pills")
  could be misclassified by a model that doesn't know the alias
  refers to naltrexone.

  Fix: when target_aliases is populated (i.e., --drug mode), use the
  resolved alias set (minus the matched drug itself) as the synonym
  hint for system_prompt instead of synonyms_for.get(drug). Falls
  back to synonyms_for.get(drug) when target_aliases is None
  (whole-corpus runs).

  Doesn't affect the published numbers — extraction was done with
  canonicalize ON, so synonyms_for was populated at run time. Future
  reviewers re-running cold benefit.

M3 (redundant DB connection in _build_paper_figures.py)
  combined_conn = sqlite3.connect(...) opened a second connection to
  the same DB the setup cell already opened as `conn`. Replaced with
  `combined_conn = conn`.

M5 (redundant wilson_ci implementation in build_notebook.py)
  Hand-rolled Wilson CI replaced with statsmodels.proportion_confint
  (already a transitive dep). Same numerical result; less code to
  defend.

M6 (unused import in verify.py)
  Dropped the dead `from statsmodels.stats.proportion import
  proportion_confint as wilson` and its `# noqa: F401` muzzle.

M2 (utf-8 encoding on read_text)
  run_sentiment_pipeline.py was already fixed in earlier commits.
  Two more read_text() calls without encoding remained (classify.py
  prefilter cache, utilities/__init__.py drug-aliases cache); added
  encoding="utf-8" to both for cross-platform robustness.

M1 (BOM in build_notebook.py)
  Already stripped by an earlier rewrite. Verified no BOM remains.

M4 (DRY: notebook resolver constants vs paths.py)
  The inlined notebook resolver in _build_paper_figures.py duplicated
  DB_FILENAME / PACKAGE_MARKERS / RCT_DB_PATH literals already
  defined in paths.py — drift risk. Now interpolates those constants
  from `paths.DB_FILENAME` etc. into the resolver template at build
  time. Single source of truth; verified the rebuilt notebook has
  the correct values inlined.

M7 (DRY: parent-cycle detection)
  src/pipeline/extract.py:_detect_parent_cycles and
  docs/RCT_historical_validation/verify.py:_detect_cycles were two
  copies of the same iterative DFS. Extracted the shared
  cycle-finding logic into a new src/utilities/graph.py:
  find_parent_cycles (returns all cycles); extract.py now wraps it
  for fail-fast raise. verify.py keeps its inline copy on purpose —
  the reproducibility package can't import from src/ — with a
  docstring noting the duplication is deliberate and instructing
  future maintainers to mirror changes.

False-positive "critical" findings dismissed (no fix)
  Three Gemini findings claimed `{{...}}` in build_notebook.py's
  SETUP_CODE template would create unhashable set-of-dict TypeErrors.
  These are intentional Python format-string escapes (the whole
  string is passed to .format(db_path_block=...)); the rendered
  notebook code has correct single-brace literals. No fix needed.

Verification
  - verify.py: ALL 5 CHECKS PASSED
  - Full build: clean, all numbers unchanged
  - Inlined notebook resolver constants confirmed pulled from paths.py:
    _DB_FILENAME = "historical_validation_2020-07_to_2022-12.db"
    _PACKAGE_MARKERS = ('_build_paper_figures.py', 'verify.py')
    _ENV_VAR = "RCT_DB_PATH"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Airwhale
Copy link
Copy Markdown
Collaborator Author

Airwhale commented May 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request establishes a reproducibility package for the RCT historical validation analysis, providing the necessary scripts, documentation, and audit trails to regenerate paper figures and tables. Significant improvements were made to the data import pipeline, including a fix for Reddit ID prefix handling to preserve thread structure and the addition of cycle detection. The reviewer feedback highlights the need for stricter determinism in data extraction and deduplication by adding explicit ordering to SQL queries and sort logic, alongside a recommendation to specify minimum versions for dependencies like Scipy to ensure environment compatibility.

Comment on lines +345 to +349
WHERE lower(t.canonical_name) = ?
AND p.post_date IS NOT NULL
AND p.post_date < ?
AND p.user_id != 'deleted'
''', (drug, window_end_exclusive_ts)).fetchall()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The SQL query lacks an ORDER BY clause. Since the deduplication logic in dedup_recent_then_strength (line 351) is order-dependent for ties (where both post_date and signal_strength are identical), the results may be non-deterministic depending on the database's internal row ordering. Adding a unique identifier like tr.post_id to the ordering ensures absolute reproducibility, which is critical for this package.

Suggested change
WHERE lower(t.canonical_name) = ?
AND p.post_date IS NOT NULL
AND p.post_date < ?
AND p.user_id != 'deleted'
''', (drug, window_end_exclusive_ts)).fetchall()
WHERE lower(t.canonical_name) = ?
AND p.post_date IS NOT NULL
AND p.post_date < ?
AND p.user_id != 'deleted'
ORDER BY tr.post_id
''', (drug, window_end_exclusive_ts)).fetchall()

Comment on lines +106 to +112
WHERE lower(t.canonical_name) = ?
AND p.post_date IS NOT NULL
AND p.post_date < ?
AND p.user_id != 'deleted'
""",
(drug, cutoff_ts),
).fetchall()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The SQL query lacks an ORDER BY clause, which can lead to non-deterministic results during deduplication if multiple reports share the same timestamp and signal strength. Adding ORDER BY tr.post_id ensures consistent behavior across different environments and database snapshots.

        WHERE lower(t.canonical_name) = ?
          AND p.post_date IS NOT NULL
          AND p.post_date < ?
          AND p.user_id != 'deleted'
        ORDER BY tr.post_id
        "",
        (drug, cutoff_ts),
    ).fetchall()

Comment on lines +105 to +106
df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
ascending=[True, True, False, False])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure absolute determinism in deduplication, post_id should be included as a final tie-breaker in the sort order. This prevents drop_duplicates from picking an arbitrary row when all other sorted columns are identical, ensuring the generated CSVs are bit-identical across runs.

Suggested change
df = df.sort_values(["user_id", "drug", "post_date", "sig_rank"],
ascending=[True, True, False, False])
df = df.sort_values(["user_id", "drug", "post_date", "sig_rank", "post_id"],
ascending=[True, True, False, False, False])

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good thinking, gemini

pandas
matplotlib
seaborn
scipy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The analysis scripts use scipy.stats.binomtest, which was introduced in Scipy 1.7.0. Specifying a minimum version ensures that the reproducibility package works correctly in fresh environments without users encountering an ImportError.

scipy>=1.7.0

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This file is missing package versions. That means this will fail to build in a few years as packages get updated.
  2. We should be using uv and pyproject.toml.

uv builds are more reproducible because uv.lock specifies the specific package versions actually installed, along with the python version desired.

Todo: take the requirements from this file and move them to pyproject.toml. run uv sync Then update the docs to remove mention of this file.

@Airwhale Airwhale requested a review from Ely-S May 7, 2026 22:27
Shaun and others added 2 commits May 10, 2026 15:06
…06-06

The Glynne et al. medRxiv preprint went live on 2021-06-06 (URL
medrxiv.org/content/10.1101/2021.06.06.21258272v1). The previous
cutoff 2021-06-07 (strict-<) was off-by-one: it INCLUDED posts made
on 2021-06-06 itself, after the preprint was public. That's a small
but real contamination window — the methodology paragraph promises
"strictly before midnight UTC of the comparator paper's publication
date," and for Glynne the first-public-availability date is the
preprint, not the day after.

Every other drug in the panel uses cutoff = pub_date (strict-< then
excludes the pub date itself). Glynne is now consistent with that
convention.

Empirical effect: 3 treatment_report rows on 2021-06-06 (1 famotidine,
2 loratadine) are now excluded.

  famotidine user 3cc039f2... had no other pre-preprint reports for
  the drug (10 lifetime mentions, the 2021-06-06 one was the first).
  Drops from analysis: n_famotidine 232 -> 231 (-1 positive user).

  loratadine user 336f9e181f85... had only one lifetime loratadine
  report, on 2021-06-06. Drops: n_loratadine 90 -> 89 (-1 positive user).

  loratadine user 93d34b804d58... had 3 earlier pre-preprint reports
  (Mar, Apr, May 2021). Stays in analysis under most-recent-rule
  fallback to May 28; no effect on n.

New numbers (verified by verify.py, EXPECTED_OUTPUTS updated in both
_build_paper_figures.py and verify.py):

  famotidine: n=231 pos=178 pos_pct=77.056% p=5.51e-17
              (was n=232 pos=179 pos_pct=77.155% p=3.57e-17)
  loratadine: n= 89 pos= 72 pos_pct=80.899% p=3.17e-9
              (was n= 90 pos= 73 pos_pct=81.111% p=1.95e-9)

Headline pattern unchanged; both still extreme-low p-values, CIs
still mutually exclusive from non-responder CIs, all 6 trial-outcome
comparisons still match. Tightening makes the methodology paragraph
self-consistent rather than off-by-one.

Files updated:
  - DRUG_CUTOFFS in _build_paper_figures.py, verify.py,
    scripts/dedup_sample_audit.py, scripts/dump_per_drug_csvs.py
  - EXPECTED_OUTPUTS in _build_paper_figures.py and verify.py
  - README: pre-publication-cutoffs table, window-verification
    table (MIN/MAX/reports for famotidine + loratadine refreshed
    against the new cutoff), Expected Output table
  - Regenerated DEDUP_AUDIT.md (now reflects fam/lor raw=692/188
    instead of 693/190), DRUG_ALIASES.md
  - Regenerated paper_figures.html, .ipynb, .executed.ipynb,
    figure0.png, figure1.png, provenance.json

Also corrected stale citation form: "Glynne et al. 2021" -> "Glynne
et al. 2022" everywhere (the journal version is 2022; the preprint
is 2021-06-06, but we cite the journal). Aligned with Tables 2/3/4
in the paper draft.

verify.py: ALL 5 CHECKS PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t plot) -> Figure 2

The script previously labelled the per-drug sentiment-breakdown
stacked-bar chart as "Figure 0" and the responder/non-responder
forest plot as "Figure 1". Publication convention is to number from
1; the existing "Figure 0" reads as a draft artifact. Renumbered so
that:

  - Figure 1 = per-drug full sentiment breakdown (stacked bar)
  - Figure 2 = forest plot (responders vs non-responders, 95% Wilson CIs)

Cascade:
  - All "Figure 0" -> "Figure 1" (build script titles, captions,
    intro markdown describing the notebook artifacts)
  - All "Figure 1" -> "Figure 2" (where Figure 1 was previously the
    forest plot; this includes body-text references in markdown
    cells, comments, and references in README)
  - PNG filenames: figure0.png -> figure1.png (stacked bar);
    figure1.png -> figure2.png (forest plot). The old figure0.png
    was removed from tracking.
  - README updates:
    * Intro: "Reproduces Figure 1, Table 2, and Table 3" -> "Figure 1,
      Figure 2, Table 2, and Table 3"
    * Output-files list: lists both figure1.png and figure2.png with
      what each shows
    * "What's reproduced" table: split the old Figure 1 row into
      Figure 1 (stacked bar) + Figure 2 (forest plot), with
      Figure 2 carrying the responder-vs-non-responder narrative
    * Methodology references to "Figure 1" updated to "Figure 2"
      where they mean the forest plot (error bars, n column,
      Wilson CI step, deleted-user-filter propagation)
  - Docstring of the build script updated to describe both figures.

Verification:
  - verify.py: ALL 5 CHECKS PASSED (renumber is presentation-only;
    numeric assertions unchanged)
  - Rebuild produces output/figure1.png (157 KB, stacked bar) and
    output/figure2.png (128 KB, forest plot); old figure0.png removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `loradatine` *(misspelling)*
- `loratadinе` *(misspelling)*
- `claritin redi-tabs` *(multi-word)*
- `loratab`
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.google.com/search?q=loratab

this looks like different drug

lines.append("# Drug aliases used in extraction\n")
lines.append(f"**Generated:** {now} from `{args.db.as_posix()}`")
lines.append("**Generator:** `scripts/dump_drug_aliases.py`")
lines.append("")
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add something like this, I bet Gemini will catch the errors:

lines.append("Reviewers: Check that all aliases are correct by ensuring they actually refer to the same treatment. ")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants