IRR pilot methodology, samples, and codebook (draft — do not merge yet) by Airwhale · Pull Request #39 · Ely-S/PatientPunk

Airwhale · 2026-05-04T21:23:17Z

⚠️ Not ready to merge

Holding open as a draft. Two reasons:

1. The 500-sample pilot has not been coded yet. Samples are drawn and ready, the codebook is finalized, but no human or AI coder has run against the 500 set. The IRR α numbers for that pilot don't exist yet.
2. Want Eli's input on whether the IRR artifacts should ship publicly in this form vs. some other arrangement.

If you have feedback on either, drop a review comment. Otherwise this stays in draft until the 500-pilot completes and Eli weighs in.

What's in this PR

Brings the inter-coder reliability (IRR) pilot artifacts and codebook into version control under docs/, separate from data/ (which stays gitignored — see "Privacy" below). Reviewers can see the samples coders received, the rules they coded against, and how each pilot was drawn — without needing S3 download links.

Two pilots, both included

| Pilot | Window | n | Status | Folder |
| --- | --- | --- | --- | --- |
| 300-sample (original) | 1-month r/covidlonghaulers (2026-03-11 → 2026-04-10) | 300 | Coded: 11 coders (2 human + 9 AI panel); α report kept private in `data/irr_pilot/` | `docs/irr_pilot/` |
| 500-sample (follow-up) | Nov–Dec 2021 (matches the historical-validation paper's analysis window) | 500 | NOT YET CODED: samples drawn + ready, no coder outputs yet | `docs/irr_pilot_500/` |

The 500-pilot intentionally aligns with the paper's analysis window so per-drug agreement statistics map onto the figures we publish.

File inventory

Each folder has a brief README.md. Common artifacts:

- SAMPLING_METHODOLOGY.md: how that pilot's sample was drawn (corpus, codable pool, stratification, seed).
- coding_input.csv: the sampled units distributed to coders. One row per sample.
- ai_labels.csv: analyst-only stratum labels + AI pipeline output. Not distributed to coders (would break blinding).
- coder_output_template.csv: the blank form coders fill in.

The 300-pilot folder has the codebook (CODING_INSTRUCTIONS.md + .pdf); the 500-pilot reuses it by reference rather than duplicating.

STOP rule

If personal_use = no, leave sentiment and signal_strength blank (or n/a). Confidence is always recorded. Non-personal-use rows contribute to IRR on the personal_use decision but don't enter per-drug sentiment analysis.
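A minimal sketch of how the STOP rule could be checked mechanically on a filled-in coder sheet. It assumes each row arrives as a dict of strings (for example from `csv.DictReader`) keyed by the template column names; the function name `stop_rule_violations` is illustrative and is not part of the repo.

```python
# Values treated as "left blank" under the STOP rule: empty or n/a.
BLANK_VALUES = {"", "n/a"}


def stop_rule_violations(row):
    """Return a list of STOP-rule problems for one coder-output row.

    `row` is a dict of strings keyed by template column names.
    This helper is a hypothetical illustration, not the project's code.
    """
    problems = []
    personal_use = row.get("personal_use", "").strip().lower()
    if personal_use == "no":
        # Non-personal-use rows must not carry sentiment or signal strength.
        for column in ("sentiment", "signal_strength"):
            value = row.get(column, "").strip().lower()
            if value not in BLANK_VALUES:
                problems.append(
                    f"{column} must be blank or n/a when personal_use = no"
                )
    # Confidence is recorded on every row, regardless of personal_use.
    if row.get("confidence", "").strip() == "":
        problems.append("confidence is always recorded")
    return problems
```

Running this over every row before computing α would surface sheets where a coder kept going past the STOP point.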

What still needs to happen before merge

- Run the 500-pilot study. Distribute coding_input.csv + coder_output_template.csv + the codebook to coders. Compute α. Decide whether the 500-pilot reproduces the per-drug reliability story we got from the 300-pilot.
- Eli's review of the IRR
- Probably a bunch of other things!
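For the "Compute α" step, here is a rough sketch under the assumption that α means Krippendorff's alpha for nominal data. The PR does not name the statistic, so treat that as a guess. Each unit is a list of the labels that coders assigned to it, with missing labels omitted. The function name is illustrative only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is an iterable of label lists, one list per coded unit,
    containing only the labels actually provided (missing ones omitted).
    Returns None when alpha is undefined (too little data or one category).
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label carries no agreement info
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return None

    # Disagreement mass: observed vs. expected by chance (both scaled by n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return None
    return 1 - observed / expected
```

For the `personal_use` decision, each unit would be the list of yes/no values the coders gave for one sample.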

🤖 Generated with Claude Code and edited by Shaun!

Two parallel sampling-methodology documents for the inter-coder reliability pilots that validate the AI sentiment-extraction pipeline. Both follow the same structure (source corpus → codable pool → stratification with enrichment → blinding → reproducibility) so the two pilots are easy to compare side-by-side.

Until now the IRR work has lived entirely outside git, distributed via S3 presigned URLs. The data folders (data/irr_pilot/, data/irr_pilot_500/) are gitignored and that is correct: they hold large CSVs, JSON, and DBs. But the methodology documents themselves are pure narrative and do belong in version control. Putting them under docs/ matches the existing pattern (docs/RCT_historical_validation/, docs/DATA_NOTES.md, docs/ldn_notes.md).

docs/irr_pilot/SAMPLING_METHODOLOGY.md (300-sample, original):
- Source: 1,081 posts + 16,342 comments from r/covidlonghaulers, 2026-03-11 -> 2026-04-10 (30-day window)
- Codable pool: 9,930 units after the 100-800 char filter
- Stratification: 50/30/20 -> 150 ai_found_drug + 90 keyword_match + 60 random; ai_found_drug stratum is ~1.8% of the codable pool
- Seed 42

docs/irr_pilot_500/SAMPLING_METHODOLOGY.md (500-sample, follow-up):
- Source: full r/covidlonghaulers history (Arctic Shift), restricted to Nov-Dec 2021 (2,739 posts + 39,559 comments)
- Codable pool: 22,915 units after the 100-800 char filter
- Stratification: 50/30/20 -> 250 ai_found_drug + 150 keyword_match + 100 random; ai_found_drug stratum is ~12% of the codable pool
- Seed 43 (deliberately differs from the 300-pilot's seed=42)
- Includes a tighter paper-ready prose version below the reference doc, which can be lifted into a Methods section verbatim

Both pilots use the same scripts/sample_for_coding.py, the same max_upstream_depth=2 cap on parent context, and the same blinding (stratum membership in analyst-only ai_labels.csv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
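To make the 50/30/20 allocation and the fixed seed concrete, here is a minimal sketch. It is not the contents of scripts/sample_for_coding.py. It assumes every codable unit has already been assigned to exactly one stratum pool, which may not match how the real script builds the random stratum. The names `allocate` and `draw_sample` are hypothetical.

```python
import random

# Stratum shares in percent, taken from the 50/30/20 split described above.
SHARES = (("ai_found_drug", 50), ("keyword_match", 30), ("random", 20))


def allocate(n_total, shares=SHARES):
    """Split n_total into per-stratum quotas using integer arithmetic."""
    return {name: n_total * pct // 100 for name, pct in shares}


def draw_sample(pool_by_stratum, n_total, seed):
    """Draw a seeded stratified sample of (unit_id, stratum) pairs.

    `pool_by_stratum` maps stratum name -> iterable of unit ids.
    Sorting each pool first keeps the draw reproducible even if the
    input order changes between runs.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, quota in allocate(n_total).items():
        candidates = sorted(pool_by_stratum[stratum])
        sample.extend((unit_id, stratum) for unit_id in rng.sample(candidates, quota))
    # Shuffle so row order in the coder sheet does not leak stratum membership.
    rng.shuffle(sample)
    return sample
```

With these shares, `allocate(300)` yields 150/90/60 and `allocate(500)` yields 250/150/100, matching the quotas listed for the two pilots.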

…1.4 schema

Brings the IRR sample artifacts and codebook into version control under docs/, separate from data/ which stays gitignored. Reviewers can see the samples coders received and the rules they coded against, without needing S3 download links.

What landed
-----------

1. **300-pilot artifacts** added under docs/irr_pilot/:
   - coding_input.csv (300 sampled units distributed to coders)
   - ai_labels.csv (analyst-only stratum labels + AI pipeline output)
   - coder_output_template.csv (blank form, v1.4 schema)
   - CODING_INSTRUCTIONS.md (codebook, updated to v1.4)
   - CODING_INSTRUCTIONS.pdf (regenerated from v1.4 md)
   - SAMPLING_METHODOLOGY.md (already existed; updated to link the new tracked artifacts)

2. **500-pilot artifacts** added under docs/irr_pilot_500/:
   - coding_input.csv (500 sampled units)
   - ai_labels.csv (analyst stratum labels)
   - coder_output_template.csv (blank form, v1.4 schema)
   - SAMPLING_METHODOLOGY.md (already existed; updated to link tracked artifacts and to point at the shared codebook in docs/irr_pilot/)

3. **Codebook bumped v1.3 -> v1.4.** Schema rationalization:
   - Removed: side_effects_reported, side_effects_description columns and their full Step-5 section. Per-drug side-effects coding turned out to add coder cognitive load without changing analytical conclusions.
   - Added: per-drug personal_use (yes/no) column. Replaces the "sentiment = neutral implies no personal use" implicit encoding with an explicit yes/no decision. Cleaner separation: personal_use is the inclusion flag; sentiment is only filled when included.
   - STOP rule: if personal_use=no, blank sentiment and signal_strength (still record confidence). Non-personal-use rows contribute to IRR on the personal_use decision but don't enter per-drug sentiment analysis.
   - Updated: multi-drug example, all 8 worked examples, decision tree.

4. **Both templates now have identical headers** (sample_id, coder_id, drug_mention_verbatim, personal_use, sentiment, signal_strength, confidence, notes). The 500-pilot template was already on this schema; the 300-pilot template was hand-regenerated from its sample_ids list to match.

5. **OneDrive nesting fix.** docs/irr_pilot_500/ had been folded under docs/irr_pilot/irr_pilot_500/ by OneDrive sync at some point; moved files back to the top-level docs/irr_pilot_500/ where git expected them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
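The "identical headers" claim is easy to verify in code. A small sketch, assuming the v1.4 header listed in item 4; the helper names are illustrative and not part of the repo.

```python
import csv
import io

# The v1.4 coder-output header as listed in the commit message.
EXPECTED_HEADER = [
    "sample_id", "coder_id", "drug_mention_verbatim", "personal_use",
    "sentiment", "signal_strength", "confidence", "notes",
]


def read_header(csv_file):
    """Return the first CSV row of an open text file, stripped of whitespace."""
    first_row = next(csv.reader(csv_file), [])
    return [cell.strip() for cell in first_row]


def headers_match(file_a, file_b, expected=EXPECTED_HEADER):
    """True when both files share the expected header, in the same order."""
    return read_header(file_a) == read_header(file_b) == list(expected)


# In-memory usage example; real use would open the two template files.
_demo = ",".join(EXPECTED_HEADER) + "\n"
print(headers_match(io.StringIO(_demo), io.StringIO(_demo)))
```

Against the repo, the two inputs would be `docs/irr_pilot/coder_output_template.csv` and `docs/irr_pilot_500/coder_output_template.csv`, each opened with `newline=""`.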

One README in each pilot folder explaining what's inside, the file list, and the run status:
- docs/irr_pilot/README.md: 300-sample pilot, marked CODED (11 coders done, α report in data/irr_pilot/ gitignored).
- docs/irr_pilot_500/README.md: 500-sample pilot, marked NOT YET CODED. Samples drawn and ready but no coder outputs yet.

Both READMEs cross-link and note that the codebook lives under docs/irr_pilot/ only (the 500-pilot reuses it rather than duplicating).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces documentation and data artifacts for two inter-coder reliability (IRR) pilots (300 and 500 samples) used to validate the PatientPunk drug-extraction and sentiment pipeline. The changes include coding instructions, sampling methodologies, and AI-generated labels for comparison. Feedback focuses on data integrity issues in the 300-sample pilot where 'weak' was incorrectly mapped to the sentiment column, inconsistencies in the coding instructions regarding the handling of 'n/a' values for non-personal use cases, and the use of absolute local file paths in the methodology documentation which hinders reproducibility.

gemini-code-assist · 2026-05-04T21:25:12Z

+irr-pilot-038,t1_oezyt0p,no_ai_drug_keyword_match,0,,,,yes
+irr-pilot-039,t1_o9ykeur,ai_found_drug,1,taurine,positive,weak,no
+irr-pilot-040,t1_oa3d8x6,ai_found_drug,1,taurine,positive,moderate,yes
+irr-pilot-041,t1_oa1m0ln,ai_found_drug,1,hydroxyzine,weak,weak,yes


The ai_sentiments column contains the invalid value weak in several rows (e.g., lines 42, 126, 179). According to the codebook (v1.4) defined in CODING_INSTRUCTIONS.md, valid sentiment values are positive, negative, mixed, or neutral. The value weak is a valid entry for signal_strength, not sentiment. This indicates a data mapping error in the AI pipeline output or the export process for the 300-sample pilot, which will invalidate IRR calculations for the sentiment dimension.
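A check along these lines would catch the mapping error before export. This is only a sketch: it assumes rows come from `csv.DictReader` and that a cell holding several drugs separates its values with `;`, which the PR does not confirm. The function name is illustrative.

```python
# Valid sentiment values per the v1.4 codebook, as cited in the review comment.
VALID_SENTIMENTS = {"positive", "negative", "mixed", "neutral"}


def invalid_sentiment_rows(rows, column="ai_sentiments", delimiter=";"):
    """Return (file_line_number, unexpected_values) for rows with bad sentiment.

    `rows` is an iterable of dicts (one per CSV data row). The delimiter
    for multi-drug cells is an assumption, not taken from the repo.
    """
    bad = []
    # Data rows start on file line 2 because line 1 is the header.
    for line_number, row in enumerate(rows, start=2):
        cell = row.get(column) or ""
        values = [v.strip().lower() for v in cell.split(delimiter) if v.strip()]
        unexpected = [v for v in values if v not in VALID_SENTIMENTS]
        if unexpected:
            bad.append((line_number, unexpected))
    return bad
```

Empty cells are treated as valid here, since rows with no AI-found drug carry no sentiment.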

gemini-code-assist · 2026-05-04T21:25:12Z

+We scraped one month of r/covidlonghaulers posting history into a single JSON
+file:
+
+- `~/OneDrive/Documents/Projects/PatientPunk_data/subreddit_posts_month_1081posts.json`


The methodology documentation references an absolute local file path (~/OneDrive/Documents/Projects/...) here and in the reproducibility script (line 176). This limits the reproducibility of the sampling process for other contributors. It is recommended to use a relative path from the repository root or a generic placeholder.
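One way to act on this suggestion is to resolve data files against a configurable root instead of a home-directory path. A minimal sketch; the environment variable name `PATIENTPUNK_DATA_DIR` and the helper name are hypothetical, not something the repo defines.

```python
import os
from pathlib import Path


def resolve_data_path(relative_path, env_var="PATIENTPUNK_DATA_DIR", default_root="data"):
    """Build a data-file path from a configurable root directory.

    The root comes from an environment variable when set (hypothetical
    name), otherwise from a directory relative to the repository root.
    """
    root = Path(os.environ.get(env_var, default_root))
    return root / relative_path


# Example: the scraped JSON named in the methodology doc.
print(resolve_data_path("subreddit_posts_month_1081posts.json"))
```

Each contributor can then point the variable at their own copy of the data without editing the documentation or the script.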

gemini-code-assist · 2026-05-04T21:25:12Z

+> **⚠️ STOP rule.** If `personal_use = no`, leave `sentiment` and
+> `signal_strength` blank (or set to `n/a`) and move on. The classifier and
+> the analysis only count personal-use reports, so non-personal-use rows
+> contribute to inter-coder reliability on the `personal_use` decision and
+> nothing further.


The instructions for the STOP rule are inconsistent with the column definitions and worked examples. Line 28 suggests leaving both sentiment and signal_strength blank or setting them to n/a. However, line 23 explicitly defines n/a as a valid value for signal_strength, while line 22 does not include n/a for sentiment. Worked examples 3 (line 146) and 8 (line 167) show sentiment as blank and signal_strength as n/a. To ensure high inter-coder reliability, the instructions should be standardized: sentiment should be left blank and signal_strength should be set to n/a when personal_use = no.
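Until the codebook wording is settled, sheets filled in under either reading can be brought to the convention proposed here (sentiment blank, signal_strength `n/a`) before agreement is computed. A sketch, assuming rows are dicts of strings; the function name is illustrative.

```python
# Markers a coder might have used to mean "not applicable".
EMPTY_MARKERS = {"", "n/a"}


def normalize_non_personal_use(row):
    """Return a copy of `row` with STOP-rule fields in one canonical form.

    Only rows with personal_use = no are touched, and only cells that are
    already blank or n/a are rewritten. Real values are left alone so a
    separate validation step can still flag them.
    """
    out = dict(row)
    if out.get("personal_use", "").strip().lower() != "no":
        return out
    if out.get("sentiment", "").strip().lower() in EMPTY_MARKERS:
        out["sentiment"] = ""
    if out.get("signal_strength", "").strip().lower() in EMPTY_MARKERS:
        out["signal_strength"] = "n/a"
    return out
```

Normalizing first keeps two coders who both stopped correctly from being scored as disagreeing just because one typed `n/a` and the other left the cell empty.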

v1.4 (in the prior commit on this branch) experimentally dropped the per-drug side_effects_reported / side_effects_description columns in favour of just adding personal_use. Reviewer feedback was to keep both: add personal_use AND keep side-effects.

v1.5 final schema (10 coder-filled columns):
sample_id, coder_id, drug_mention_verbatim, personal_use, sentiment, signal_strength, confidence, side_effects_reported, side_effects_description, notes

Changes from v1.4 -> v1.5:
- Restored Step 6 (side effects) section in CODING_INSTRUCTIONS.md (yes/no flag + free-text description, only filled when personal_use = yes).
- Restored both side_effects_* columns in both coder_output_template.csv files (300-pilot and 500-pilot; both regenerated with the v1.5 schema; both have identical headers).
- STOP rule updated: when personal_use = no, blank sentiment, signal_strength, AND both side_effects_* columns. Confidence still always filled.
- Worked examples (8 cases) updated to show side_effects_reported and side_effects_description on every row.
- Decision tree updated with the side-effects branch.
- Multi-drug example updated with side_effects columns.
- Changelog rewritten to describe the v1.3 -> v1.4 (draft) -> v1.5 transition honestly.
- PDF regenerated from v1.5 markdown.

Verified: both template headers identical:
sample_id,coder_id,drug_mention_verbatim,personal_use,sentiment,signal_strength,confidence,side_effects_reported,side_effects_description,notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…coded)

The v1.5 commit on this branch added side_effects_reported and side_effects_description columns "back" to the codebook + both templates, on the (incorrect) assumption that the v1.3 codebook reflected what was actually coded in the 300-pilot. It didn't.

Verified by inspecting every file in data/irr_pilot/:
- human_coder_polina.csv, human_coder_shaun.csv: 6 cols, no side_effects, has personal_use
- all 12 ai_coder_*.csv: 8 cols, no side_effects, has personal_use
- merged_long.csv: 8 cols + canonical_drug, no side_effects, has personal_use
- alpha_report.md and alpha_pairwise_*.csv: α computed on drug_extracted, personal_use, sentiment, signal_strength; no side_effects α reported
- coder_output_template.no_side_effects.csv.bak: original schema backup (personal_use, no side_effects)
- The ONLY file in data/irr_pilot/ with side_effects_* columns was the in-data/ blank template, never populated by any coder.

So side_effects_reported and side_effects_description appeared only in the v1.3 codebook + an unused template. The actual study data uses personal_use (yes/no) and no side_effects. v1.5 was adding columns to the codebook that don't exist in the data.

This commit reverts to v1.4 schema (personal_use, no side_effects) which matches every actual coder output. Changelog rewritten to honestly describe v1.4 as a drift correction rather than a new feature.

Also: per Shaun's request, README now explicitly states that per-coder outputs (human_coder_*, merged_long, alpha_report, etc.) are intentionally not redistributed in this repo. Coder identity paired with coding decisions is treated as private to the team. data/ is already gitignored at the repo level so this is policy + documentation, not a new technical safeguard.

Verified: both template headers identical (8 cols), PDF regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shaun and others added 3 commits May 3, 2026 17:31

gemini-code-assist Bot reviewed May 4, 2026


Shaun and others added 2 commits May 4, 2026 14:28


Airwhale wants to merge 5 commits into main from shaun/irr-methodology-docs
