IRR pilot methodology, samples, and codebook (draft — do not merge yet)#39
Airwhale wants to merge 5 commits into
Two parallel sampling-methodology documents for the inter-coder reliability
pilots that validate the AI sentiment-extraction pipeline. Both follow the
same structure (source corpus → codable pool → stratification with
enrichment → blinding → reproducibility) so the two pilots are easy to
compare side-by-side.
Until now the IRR work has lived entirely outside git, distributed via S3
presigned URLs. The data folders (data/irr_pilot/, data/irr_pilot_500/)
are gitignored and that is correct — they hold large CSVs, JSON, and DBs.
But the methodology documents themselves are pure narrative and do belong
in version control. Putting them under docs/ matches the existing pattern
(docs/RCT_historical_validation/, docs/DATA_NOTES.md, docs/ldn_notes.md).
docs/irr_pilot/SAMPLING_METHODOLOGY.md (300-sample, original):
- Source: 1,081 posts + 16,342 comments from r/covidlonghaulers,
2026-03-11 -> 2026-04-10 (30-day window)
- Codable pool: 9,930 units after the 100-800 char filter
- Stratification: 50/30/20 -> 150 ai_found_drug + 90 keyword_match + 60
random; ai_found_drug stratum is ~1.8% of the codable pool
- Seed 42
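The 50/30/20 allocation and seeded draw described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/sample_for_coding.py: the function names, the inclusive 100-800 bounds, and the idea that each stratum is drawn from its own candidate list are all assumptions.

```python
import random

# Stratum shares in whole percent, as stated in the methodology summary.
SHARES = {"ai_found_drug": 50, "keyword_match": 30, "random": 20}


def is_codable(text, min_chars=100, max_chars=800):
    """Length filter; treating both bounds as inclusive is an assumption."""
    return min_chars <= len(text) <= max_chars


def allocate(total, shares=SHARES):
    """Split a sample size across strata using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in shares.items()}


def draw_sample(pool_by_stratum, total, seed):
    """Draw each stratum's quota without replacement from a seeded RNG."""
    rng = random.Random(seed)
    sample = []
    for stratum, quota in allocate(total).items():
        # Sort first so the draw does not depend on input ordering.
        candidates = sorted(pool_by_stratum[stratum])
        sample.extend((stratum, unit) for unit in rng.sample(candidates, quota))
    return sample
```

With `total=300` this yields the 150/90/60 split, and with `total=500` the 250/150/100 split. Re-running with the same seed and the same pool returns the same units.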
docs/irr_pilot_500/SAMPLING_METHODOLOGY.md (500-sample, follow-up):
- Source: full r/covidlonghaulers history (Arctic Shift), restricted to
Nov-Dec 2021 (2,739 posts + 39,559 comments)
- Codable pool: 22,915 units after the 100-800 char filter
- Stratification: 50/30/20 -> 250 ai_found_drug + 150 keyword_match + 100
random; ai_found_drug stratum is ~12% of the codable pool
- Seed 43 (deliberately differs from the 300-pilot's seed=42)
- Includes a tighter paper-ready prose version below the reference doc,
which can be lifted into a Methods section verbatim
Both pilots use the same scripts/sample_for_coding.py, the same
max_upstream_depth=2 cap on parent context, and the same blinding
(stratum membership in analyst-only ai_labels.csv).
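One way to read the max_upstream_depth=2 cap is "include at most two ancestors of the sampled unit as context". The sketch below assumes that reading; the `parents` mapping (unit id to parent id) and the function name are hypothetical and not taken from the script.

```python
def upstream_context(unit_id, parents, max_upstream_depth=2):
    """Return up to max_upstream_depth ancestor ids, nearest parent first.

    `parents` maps a unit id to its parent id; top-level posts are absent
    from the mapping (hypothetical structure for illustration).
    """
    chain = []
    current = parents.get(unit_id)
    while current is not None and len(chain) < max_upstream_depth:
        chain.append(current)
        current = parents.get(current)
    return chain
```

For a comment nested three levels deep, only the two nearest ancestors are returned, so the coder never sees more than two levels of upstream context.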
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1.4 schema
Brings the IRR sample artifacts and codebook into version control under
docs/, separate from data/ which stays gitignored. Reviewers can see the
samples coders received and the rules they coded against, without needing
S3 download links.
What landed
-----------
1. **300-pilot artifacts** added under docs/irr_pilot/:
- coding_input.csv (300 sampled units distributed to coders)
- ai_labels.csv (analyst-only stratum labels + AI pipeline output)
- coder_output_template.csv (blank form, v1.4 schema)
- CODING_INSTRUCTIONS.md (codebook, updated to v1.4)
- CODING_INSTRUCTIONS.pdf (regenerated from v1.4 md)
- SAMPLING_METHODOLOGY.md (already existed; updated to link the new
tracked artifacts)
2. **500-pilot artifacts** added under docs/irr_pilot_500/:
- coding_input.csv (500 sampled units)
- ai_labels.csv (analyst stratum labels)
- coder_output_template.csv (blank form, v1.4 schema)
- SAMPLING_METHODOLOGY.md (already existed; updated to link tracked
artifacts and to point at the shared
codebook in docs/irr_pilot/)
3. **Codebook bumped v1.3 -> v1.4.** Schema rationalization:
- Removed: side_effects_reported, side_effects_description columns and
their full Step-5 section. Per-drug side-effects coding turned out
to add coder cognitive load without changing analytical conclusions.
- Added: per-drug personal_use (yes/no) column. Replaces the
"sentiment = neutral implies no personal use" implicit encoding
with an explicit yes/no decision. Cleaner separation: personal_use
is the inclusion flag; sentiment is only filled when included.
- STOP rule: if personal_use=no, blank sentiment and signal_strength
(still record confidence). Non-personal-use rows contribute to IRR
on the personal_use decision but don't enter per-drug sentiment
analysis.
- Updated: multi-drug example, all 8 worked examples, decision tree.
4. **Both templates now have identical headers** (sample_id, coder_id,
drug_mention_verbatim, personal_use, sentiment, signal_strength,
confidence, notes). The 500-pilot template was already on this
schema; the 300-pilot template was hand-regenerated from its
sample_ids list to match.
5. **OneDrive nesting fix.** docs/irr_pilot_500/ had been folded under
docs/irr_pilot/irr_pilot_500/ by OneDrive sync at some point; moved
files back to the top-level docs/irr_pilot_500/ where git expected
them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One README in each pilot folder explaining what's inside, the file
list, and the run status:
- docs/irr_pilot/README.md: 300-sample pilot, marked CODED (11 coders done;
  α report in data/irr_pilot/, gitignored).
- docs/irr_pilot_500/README.md: 500-sample pilot, marked NOT YET CODED.
  Samples drawn and ready but no coder outputs yet.
Both READMEs cross-link and note that the codebook lives under
docs/irr_pilot/ only (the 500-pilot reuses it rather than duplicating).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces documentation and data artifacts for two inter-coder reliability (IRR) pilots (300 and 500 samples) used to validate the PatientPunk drug-extraction and sentiment pipeline. The changes include coding instructions, sampling methodologies, and AI-generated labels for comparison. Feedback focuses on data integrity issues in the 300-sample pilot where 'weak' was incorrectly mapped to the sentiment column, inconsistencies in the coding instructions regarding the handling of 'n/a' values for non-personal use cases, and the use of absolute local file paths in the methodology documentation which hinders reproducibility.
irr-pilot-038,t1_oezyt0p,no_ai_drug_keyword_match,0,,,,yes
irr-pilot-039,t1_o9ykeur,ai_found_drug,1,taurine,positive,weak,no
irr-pilot-040,t1_oa3d8x6,ai_found_drug,1,taurine,positive,moderate,yes
irr-pilot-041,t1_oa1m0ln,ai_found_drug,1,hydroxyzine,weak,weak,yes
The ai_sentiments column contains the invalid value weak in several rows (e.g., lines 42, 126, 179). According to the codebook (v1.4) defined in CODING_INSTRUCTIONS.md, valid sentiment values are positive, negative, mixed, or neutral. The value weak is a valid entry for signal_strength, not sentiment. This indicates a data mapping error in the AI pipeline output or the export process for the 300-sample pilot, which will invalidate IRR calculations for the sentiment dimension.
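A check along these lines could catch the problem before IRR is computed. It is a sketch only: the column name `ai_sentiments` comes from the comment above, the valid set is the one the codebook is said to define, and the assumption that each cell holds a single value (and that no field spans multiple lines) is mine.

```python
import csv
import io

# Sentiment values the v1.4 codebook is described as allowing.
VALID_SENTIMENTS = {"positive", "negative", "mixed", "neutral"}


def invalid_sentiment_rows(csv_text, column="ai_sentiments"):
    """Return (file_line_number, value) for every non-blank invalid sentiment.

    Line numbers count the header as line 1, so the first data row is line 2.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for line_no, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if value and value not in VALID_SENTIMENTS:
            bad.append((line_no, value))
    return bad
```

Blank cells are skipped because rows with no AI-extracted drug legitimately carry no sentiment.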
We scraped one month of r/covidlonghaulers posting history into a single JSON
file:

- `~/OneDrive/Documents/Projects/PatientPunk_data/subreddit_posts_month_1081posts.json`
The methodology documentation references an absolute local file path (~/OneDrive/Documents/Projects/...) here and in the reproducibility script (line 176). This limits the reproducibility of the sampling process for other contributors. It is recommended to use a relative path from the repository root or a generic placeholder.
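One possible shape for that fix is to resolve data files against a configurable base directory instead of a hard-coded home path. The environment variable name `PATIENTPUNK_DATA_DIR` and the `data` default below are hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path


def resolve_data_path(relative_name, env_var="PATIENTPUNK_DATA_DIR", default_dir="data"):
    """Build a data path from an overridable base directory.

    Falls back to a repo-relative directory when the variable is unset.
    """
    base = Path(os.environ.get(env_var, default_dir))
    return base / relative_name
```

The methodology document could then name only the file, for example `subreddit_posts_month_1081posts.json`, and leave the base directory to each contributor's environment.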
> **⚠️ STOP rule.** If `personal_use = no`, leave `sentiment` and
> `signal_strength` blank (or set to `n/a`) and move on. The classifier and
> the analysis only count personal-use reports, so non-personal-use rows
> contribute to inter-coder reliability on the `personal_use` decision and
> nothing further.
The instructions for the STOP rule are inconsistent with the column definitions and worked examples. Line 28 suggests leaving both sentiment and signal_strength blank or setting them to n/a. However, line 23 explicitly defines n/a as a valid value for signal_strength, while line 22 does not include n/a for sentiment. Worked examples 3 (line 146) and 8 (line 167) show sentiment as blank and signal_strength as n/a. To ensure high inter-coder reliability, the instructions should be standardized: sentiment should be left blank and signal_strength should be set to n/a when personal_use = no.
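If the rule were standardized as suggested here (sentiment blank, signal_strength set to n/a), a per-row check might look like the sketch below. The function name is hypothetical, and the rule encoded is this comment's proposal rather than confirmed codebook text.

```python
def stop_rule_violations(row):
    """Return a list of problems for one coder-output row (a dict of strings)."""
    problems = []
    if row.get("personal_use", "").strip().lower() == "no":
        # Proposed standard: sentiment stays empty for non-personal-use rows.
        if row.get("sentiment", "").strip():
            problems.append("sentiment must be blank when personal_use = no")
        # Proposed standard: signal_strength is the literal n/a.
        if row.get("signal_strength", "").strip().lower() != "n/a":
            problems.append("signal_strength must be n/a when personal_use = no")
    # Confidence is recorded on every row regardless of personal_use.
    if not row.get("confidence", "").strip():
        problems.append("confidence must always be filled")
    return problems
```

Running this over each coder's output before computing agreement would surface rows where coders interpreted "blank or n/a" differently.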
v1.4 (in the prior commit on this branch) experimentally dropped the
per-drug side_effects_reported / side_effects_description columns in favour
of just adding personal_use. Reviewer feedback was to keep both: add
personal_use AND keep side-effects.
v1.5 final schema (10 coder-filled columns):
sample_id, coder_id, drug_mention_verbatim, personal_use, sentiment,
signal_strength, confidence, side_effects_reported,
side_effects_description, notes
Changes from v1.4 -> v1.5:
- Restored Step 6 (side effects) section in CODING_INSTRUCTIONS.md (yes/no
flag + free-text description, only filled when personal_use = yes).
- Restored both side_effects_* columns in both coder_output_template.csv
files (300-pilot and 500-pilot; both regenerated with the v1.5 schema,
both have identical headers).
- STOP rule updated: when personal_use = no, blank sentiment,
signal_strength, AND both side_effects_* columns. Confidence still always
filled.
- Worked examples (8 cases) updated to show side_effects_reported and
side_effects_description on every row.
- Decision tree updated with the side-effects branch.
- Multi-drug example updated with side_effects columns.
- Changelog rewritten to describe the v1.3 -> v1.4 (draft) -> v1.5
transition honestly.
- PDF regenerated from v1.5 markdown.
Verified: both template headers identical:
sample_id,coder_id,drug_mention_verbatim,personal_use,sentiment,
signal_strength,confidence,side_effects_reported,
side_effects_description,notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…coded)
The v1.5 commit on this branch added side_effects_reported and
side_effects_description columns "back" to the codebook + both templates,
on the (incorrect) assumption that the v1.3 codebook reflected what was
actually coded in the 300-pilot. It didn't.
Verified by inspecting every file in data/irr_pilot/:
- human_coder_polina.csv, human_coder_shaun.csv:
6 cols, no side_effects, has personal_use
- all 12 ai_coder_*.csv:
8 cols, no side_effects, has personal_use
- merged_long.csv:
8 cols + canonical_drug, no side_effects, has personal_use
- alpha_report.md and alpha_pairwise_*.csv:
α computed on drug_extracted, personal_use, sentiment,
signal_strength; no side_effects α reported
- coder_output_template.no_side_effects.csv.bak:
original schema backup (personal_use, no side_effects)
- The ONLY file in data/irr_pilot/ with side_effects_* columns was
the in-data/ blank template — never populated by any coder.
So side_effects_reported and side_effects_description appeared only in
the v1.3 codebook + an unused template. The actual study data uses
personal_use (yes/no) and no side_effects. v1.5 was adding columns to
the codebook that don't exist in the data.
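For readers unfamiliar with the α mentioned above: the source does not say which coefficient alpha_report.md uses, so the sketch below assumes Krippendorff's α for nominal data, which suits categorical fields such as personal_use. It is a textbook-style illustration, not the project's analysis code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels each coder gave one unit,
    with None marking a missing rating.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two ratings are not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category was ever used
    return 1 - observed / expected
```

For example, `krippendorff_alpha_nominal([["yes", "yes"], ["no", "no"]])` is 1.0 because the two coders never disagree.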
This commit reverts to v1.4 schema (personal_use, no side_effects) which
matches every actual coder output. Changelog rewritten to honestly
describe v1.4 as a drift correction rather than a new feature.
Also: per Shaun's request, README now explicitly states that per-coder
outputs (human_coder_*, merged_long, alpha_report, etc.) are
intentionally not redistributed in this repo — coder identity paired
with coding decisions is treated as private to the team. data/ is
already gitignored at the repo level so this is policy + documentation,
not a new technical safeguard.
Verified: both template headers identical (8 cols), PDF regenerated.
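The header check could be automated with something like the following sketch. The expected column list is the 8-column v1.4 schema quoted earlier in this PR; the function names are hypothetical.

```python
import csv
import io

# v1.4 coder-output schema as listed in this PR (8 columns).
EXPECTED_HEADER = [
    "sample_id", "coder_id", "drug_mention_verbatim", "personal_use",
    "sentiment", "signal_strength", "confidence", "notes",
]


def header_of(csv_text):
    """Return the first CSV row as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def headers_match(*csv_texts):
    """True when every template's header equals the expected schema."""
    return all(header_of(text) == EXPECTED_HEADER for text in csv_texts)
```

In practice the inputs would be the text of the two coder_output_template.csv files, for example read with `pathlib.Path(...).read_text()`.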
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Holding open as a draft. Two reasons:
If you have feedback on either, drop a review comment. Otherwise this stays in draft until the 500-pilot completes and Eli weighs in.
What's in this PR
Brings the inter-coder reliability (IRR) pilot artifacts and codebook into version control under `docs/`, separate from `data/` (which stays gitignored; see "Privacy" below). Reviewers can see the samples coders received, the rules they coded against, and how each pilot was drawn, without needing S3 download links.
Two pilots, both included
Folders: `data/irr_pilot/`, `docs/irr_pilot/`, `docs/irr_pilot_500/`
The 500-pilot intentionally aligns with the paper's analysis window so per-drug agreement statistics map onto the figures we publish.
File inventory
Each folder has a brief `README.md`. Common artifacts:
- `SAMPLING_METHODOLOGY.md`: how that pilot's sample was drawn (corpus, codable pool, stratification, seed).
- `coding_input.csv`: the sampled units distributed to coders. One row per sample.
- `ai_labels.csv`: analyst-only stratum labels + AI pipeline output. Not distributed to coders (would break blinding).
- `coder_output_template.csv`: the blank form coders fill in.
The 300-pilot folder has the codebook (`CODING_INSTRUCTIONS.md` + `.pdf`); the 500-pilot reuses it by reference rather than duplicating.
STOP rule
If `personal_use = no`, leave `sentiment` and `signal_strength` blank (or `n/a`). Confidence is always recorded. Non-personal-use rows contribute to IRR on the personal_use decision but don't enter per-drug sentiment analysis.
What still needs to happen before merge
`coding_input.csv` + `coder_output_template.csv` + the codebook to coders. Compute α. Decide whether the 500-pilot reproduces the per-drug reliability story we got from the 300-pilot.
🤖 Generated with Claude Code and edited by Shaun!