
IRR pilot methodology, samples, and codebook (draft — do not merge yet)#39

Draft
Airwhale wants to merge 5 commits into main from shaun/irr-methodology-docs


@Airwhale Airwhale commented May 4, 2026

⚠️ Not ready to merge

Holding open as a draft. Two reasons:

  1. The 500-sample pilot has not been coded yet. Samples are drawn and ready, the codebook is finalized, but no human or AI coder has run against the 500 set. The IRR α numbers for that pilot don't exist yet.
  2. Want Eli's input on whether the IRR artifacts should ship publicly in this form vs. some other arrangement.

If you have feedback on either, drop a review comment. Otherwise this stays in draft until the 500-pilot completes and Eli weighs in.


What's in this PR

Brings the inter-coder reliability (IRR) pilot artifacts and codebook into version control under docs/, separate from data/ (which stays gitignored — see "Privacy" below). Reviewers can see the samples coders received, the rules they coded against, and how each pilot was drawn — without needing S3 download links.

Two pilots, both included

| Pilot | Window | n | Status | Folder |
|---|---|---|---|---|
| 300-sample (original) | 1-month r/covidlonghaulers (2026-03-11 → 2026-04-10) | 300 | Coded — 11 coders (2 human + 9 AI panel); α report kept private in data/irr_pilot/ | docs/irr_pilot/ |
| 500-sample (follow-up) | Nov–Dec 2021 (matches the historical-validation paper's analysis window) | 500 | NOT YET CODED — samples drawn + ready, no coder outputs yet | docs/irr_pilot_500/ |

The 500-pilot intentionally aligns with the paper's analysis window so per-drug agreement statistics map onto the figures we publish.

File inventory

Each folder has a brief README.md. Common artifacts:

  • SAMPLING_METHODOLOGY.md — how that pilot's sample was drawn (corpus, codable pool, stratification, seed).
  • coding_input.csv — the sampled units distributed to coders. One row per sample.
  • ai_labels.csv — analyst-only stratum labels + AI pipeline output. Not distributed to coders (would break blinding).
  • coder_output_template.csv — the blank form coders fill in.

The 300-pilot folder has the codebook (CODING_INSTRUCTIONS.md + .pdf); the 500-pilot reuses it by reference rather than duplicating.

STOP rule

If personal_use = no, leave sentiment and signal_strength blank (or n/a). Confidence is always recorded. Non-personal-use rows contribute to IRR on the personal_use decision but don't enter per-drug sentiment analysis.
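
A rule like this can be checked mechanically before any α is computed. The following is a minimal sketch, not something that exists in the repo: the function name `stop_rule_violations` is hypothetical, the column names follow the v1.4 template headers, and it assumes blank and `n/a` are the only accepted "empty" markers.

```python
# Values treated as "left blank" under the STOP rule (assumption: blank or n/a).
EMPTY_MARKERS = {"", "n/a"}


def stop_rule_violations(row):
    """Return a list of human-readable problems for one coder-output row.

    `row` is a dict keyed by the template column names, for example one
    row produced by csv.DictReader.
    """
    problems = []
    personal_use = (row.get("personal_use") or "").strip().lower()
    if personal_use not in {"yes", "no"}:
        problems.append("personal_use must be yes or no")

    # Confidence is recorded on every row, personal use or not.
    if not (row.get("confidence") or "").strip():
        problems.append("confidence must always be recorded")

    # STOP rule: nothing downstream of personal_use is coded when it is "no".
    if personal_use == "no":
        for field in ("sentiment", "signal_strength"):
            value = (row.get(field) or "").strip().lower()
            if value not in EMPTY_MARKERS:
                problems.append(f"{field} must be blank or n/a when personal_use = no")
    return problems
```

Running this over every row of a returned coder file, and refusing to merge files with a non-empty result, would keep stray sentiment values on non-personal-use rows out of the per-drug analysis.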

What still needs to happen before merge

  1. Run the 500-pilot study. Distribute coding_input.csv + coder_output_template.csv + the codebook to coders. Compute α. Decide whether the 500-pilot reproduces the per-drug reliability story we got from the 300-pilot.
  2. Eli's review of the IRR artifacts.
  3. Probably a bunch of other things!
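
For reference on the "Compute α" step: the PR does not say which α is used, so the sketch below assumes Krippendorff's α for nominal labels, which tolerates missing values (such as sentiment cells left blank under the STOP rule). The function name and input shape are hypothetical, and this is not the script that produced the private 300-pilot report.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is an iterable of label sequences, one sequence per coded unit,
    holding one label per coder. None or "" marks a missing label.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v not in (None, "")]
        m = len(values)
        if m < 2:
            continue  # a unit coded by fewer than two coders carries no information
        # Every ordered pair of labels from different coders, weighted by 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if n <= 1 or expected == 0:
        raise ValueError("alpha is undefined: need pairable labels in at least two categories")
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected
```

Each variable (`personal_use`, `sentiment`, `signal_strength`) would get its own call, with one label sequence per sample and drug mention.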

🤖 Generated with Claude Code and edited by Shaun!

Shaun and others added 3 commits May 3, 2026 17:31
Two parallel sampling-methodology documents for the inter-coder reliability
pilots that validate the AI sentiment-extraction pipeline. Both follow the
same structure (source corpus → codable pool → stratification with
enrichment → blinding → reproducibility) so the two pilots are easy to
compare side-by-side.

Until now the IRR work has lived entirely outside git, distributed via S3
presigned URLs. The data folders (data/irr_pilot/, data/irr_pilot_500/)
are gitignored and that is correct — they hold large CSVs, JSON, and DBs.
But the methodology documents themselves are pure narrative and do belong
in version control. Putting them under docs/ matches the existing pattern
(docs/RCT_historical_validation/, docs/DATA_NOTES.md, docs/ldn_notes.md).

docs/irr_pilot/SAMPLING_METHODOLOGY.md (300-sample, original):
  - Source: 1,081 posts + 16,342 comments from r/covidlonghaulers,
    2026-03-11 -> 2026-04-10 (30-day window)
  - Codable pool: 9,930 units after the 100-800 char filter
  - Stratification: 50/30/20 -> 150 ai_found_drug + 90 keyword_match + 60
    random; ai_found_drug stratum is ~1.8% of the codable pool
  - Seed 42

docs/irr_pilot_500/SAMPLING_METHODOLOGY.md (500-sample, follow-up):
  - Source: full r/covidlonghaulers history (Arctic Shift), restricted to
    Nov-Dec 2021 (2,739 posts + 39,559 comments)
  - Codable pool: 22,915 units after the 100-800 char filter
  - Stratification: 50/30/20 -> 250 ai_found_drug + 150 keyword_match + 100
    random; ai_found_drug stratum is ~12% of the codable pool
  - Seed 43 (deliberately differs from the 300-pilot's seed=42)
  - Includes a tighter paper-ready prose version below the reference doc,
    which can be lifted into a Methods section verbatim

Both pilots use the same scripts/sample_for_coding.py, the same
max_upstream_depth=2 cap on parent context, and the same blinding
(stratum membership in analyst-only ai_labels.csv).
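
The shared recipe (length filter, 50/30/20 stratification, fixed seed) can be sketched as below. This is an illustration only and not the contents of scripts/sample_for_coding.py: the function name `draw_sample` and the `text` / `stratum` fields are hypothetical, the 100-800 character bounds are assumed to be inclusive, and each unit is assumed to already carry exactly one stratum label.

```python
import random

# Stratum shares in percent, from the 50/30/20 split described above.
SHARES = {"ai_found_drug": 50, "keyword_match": 30, "random": 20}


def draw_sample(units, n_total, seed, min_chars=100, max_chars=800):
    """Draw a stratified sample of `n_total` units from the codable pool."""
    rng = random.Random(seed)  # local RNG, so the same seed reproduces the same draw

    # Codable pool: units whose text length falls inside the character window.
    codable = [u for u in units if min_chars <= len(u["text"]) <= max_chars]

    sample = []
    for stratum, pct in SHARES.items():
        pool = [u for u in codable if u["stratum"] == stratum]
        quota = n_total * pct // 100
        if quota > len(pool):
            raise ValueError(f"stratum {stratum!r} has {len(pool)} units, need {quota}")
        sample.extend(rng.sample(pool, quota))

    # Shuffle so row order does not reveal stratum membership to coders.
    rng.shuffle(sample)
    return sample
```

With these shares, `n_total=300` yields 150 + 90 + 60 units and `n_total=500` yields 250 + 150 + 100, matching the counts listed for the two pilots.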

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1.4 schema

Brings the IRR sample artifacts and codebook into version control under
docs/, separate from data/ which stays gitignored. Reviewers can see the
samples coders received and the rules they coded against, without needing
S3 download links.

What landed
-----------

1. **300-pilot artifacts** added under docs/irr_pilot/:
   - coding_input.csv  (300 sampled units distributed to coders)
   - ai_labels.csv     (analyst-only stratum labels + AI pipeline output)
   - coder_output_template.csv  (blank form, v1.4 schema)
   - CODING_INSTRUCTIONS.md     (codebook, updated to v1.4)
   - CODING_INSTRUCTIONS.pdf    (regenerated from v1.4 md)
   - SAMPLING_METHODOLOGY.md    (already existed; updated to link the new
                                 tracked artifacts)

2. **500-pilot artifacts** added under docs/irr_pilot_500/:
   - coding_input.csv  (500 sampled units)
   - ai_labels.csv     (analyst stratum labels)
   - coder_output_template.csv  (blank form, v1.4 schema)
   - SAMPLING_METHODOLOGY.md    (already existed; updated to link tracked
                                 artifacts and to point at the shared
                                 codebook in docs/irr_pilot/)

3. **Codebook bumped v1.3 -> v1.4.** Schema rationalization:
   - Removed: side_effects_reported, side_effects_description columns and
     their full Step-5 section. Per-drug side-effects coding turned out
     to add coder cognitive load without changing analytical conclusions.
   - Added: per-drug personal_use (yes/no) column. Replaces the
     "sentiment = neutral implies no personal use" implicit encoding
     with an explicit yes/no decision. Cleaner separation: personal_use
     is the inclusion flag; sentiment is only filled when included.
   - STOP rule: if personal_use=no, blank sentiment and signal_strength
     (still record confidence). Non-personal-use rows contribute to IRR
     on the personal_use decision but don't enter per-drug sentiment
     analysis.
   - Updated: multi-drug example, all 8 worked examples, decision tree.

4. **Both templates now have identical headers** (sample_id, coder_id,
   drug_mention_verbatim, personal_use, sentiment, signal_strength,
   confidence, notes). The 500-pilot template was already on this
   schema; the 300-pilot template was hand-regenerated from its
   sample_ids list to match.

5. **OneDrive nesting fix.** docs/irr_pilot_500/ had been folded under
   docs/irr_pilot/irr_pilot_500/ by OneDrive sync at some point; moved
   files back to the top-level docs/irr_pilot_500/ where git expected
   them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One README in each pilot folder explaining what's inside, the file
list, and the run status:

- docs/irr_pilot/README.md         — 300-sample pilot, marked CODED
                                     (11 coders done, α report in
                                     data/irr_pilot/ gitignored).
- docs/irr_pilot_500/README.md     — 500-sample pilot, marked NOT YET
                                     CODED. Samples drawn and ready
                                     but no coder outputs yet.

Both READMEs cross-link and note that the codebook lives under
docs/irr_pilot/ only (the 500-pilot reuses it rather than duplicating).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces documentation and data artifacts for two inter-coder reliability (IRR) pilots (300 and 500 samples) used to validate the PatientPunk drug-extraction and sentiment pipeline. The changes include coding instructions, sampling methodologies, and AI-generated labels for comparison. Feedback focuses on data integrity issues in the 300-sample pilot where 'weak' was incorrectly mapped to the sentiment column, inconsistencies in the coding instructions regarding the handling of 'n/a' values for non-personal use cases, and the use of absolute local file paths in the methodology documentation which hinders reproducibility.

irr-pilot-038,t1_oezyt0p,no_ai_drug_keyword_match,0,,,,yes
irr-pilot-039,t1_o9ykeur,ai_found_drug,1,taurine,positive,weak,no
irr-pilot-040,t1_oa3d8x6,ai_found_drug,1,taurine,positive,moderate,yes
irr-pilot-041,t1_oa1m0ln,ai_found_drug,1,hydroxyzine,weak,weak,yes

high

The ai_sentiments column contains the invalid value weak in several rows (e.g., lines 42, 126, 179). According to the codebook (v1.4) defined in CODING_INSTRUCTIONS.md, valid sentiment values are positive, negative, mixed, or neutral. The value weak is a valid entry for signal_strength, not sentiment. This indicates a data mapping error in the AI pipeline output or the export process for the 300-sample pilot, which will invalidate IRR calculations for the sentiment dimension.
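
A check along these lines would surface the affected rows. It is a sketch under assumptions: only the `ai_sentiments` column name and the four valid values come from this review; the function name is hypothetical, and the `;` separator for multi-drug cells is a guess about the file layout.

```python
import csv
import io

# Valid sentiment values per the v1.4 codebook, as quoted in this review.
VALID_SENTIMENTS = {"positive", "negative", "mixed", "neutral"}


def find_invalid_sentiments(csv_text, column="ai_sentiments", sep=";"):
    """Return (line_number, value) pairs for sentiment values outside the codebook.

    Empty cells are skipped, since rows with no AI-detected drug carry no sentiment.
    """
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        cell = (row.get(column) or "").strip()
        for value in (part.strip() for part in cell.split(sep)):
            if value and value not in VALID_SENTIMENTS:
                # reader.line_num counts physical lines read, header included.
                bad.append((reader.line_num, value))
    return bad
```

Reading `ai_labels.csv` into a string and passing it to this function would list every offending line, which could then be traced back to the export step.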

We scraped one month of r/covidlonghaulers posting history into a single JSON
file:

- `~/OneDrive/Documents/Projects/PatientPunk_data/subreddit_posts_month_1081posts.json`

medium

The methodology documentation references an absolute local file path (~/OneDrive/Documents/Projects/...) here and in the reproducibility script (line 176). This limits the reproducibility of the sampling process for other contributors. It is recommended to use a relative path from the repository root or a generic placeholder.
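
One way to act on this suggestion is sketched below. The helper `data_path` and the environment variable `PATIENTPUNK_DATA_DIR` are hypothetical names, and the fallback to a `data/` folder under the repository root is an assumption based on the gitignored data directory described in this PR.

```python
import os
from pathlib import Path


def data_path(filename, repo_root, env=None):
    """Resolve a data file without hard-coding a contributor's home directory.

    Uses the directory named by PATIENTPUNK_DATA_DIR (hypothetical variable)
    when it is set, otherwise falls back to <repo_root>/data.
    """
    env = os.environ if env is None else env
    override = env.get("PATIENTPUNK_DATA_DIR")
    base = Path(override).expanduser() if override else Path(repo_root) / "data"
    return base / filename
```

The methodology doc could then name the file relative to the data directory and leave the machine-specific location to each contributor's environment.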

Comment on lines +27 to +31
> **⚠️ STOP rule.** If `personal_use = no`, leave `sentiment` and
> `signal_strength` blank (or set to `n/a`) and move on. The classifier and
> the analysis only count personal-use reports, so non-personal-use rows
> contribute to inter-coder reliability on the `personal_use` decision and
> nothing further.

medium

The instructions for the STOP rule are inconsistent with the column definitions and worked examples. Line 28 suggests leaving both sentiment and signal_strength blank or setting them to n/a. However, line 23 explicitly defines n/a as a valid value for signal_strength, while line 22 does not include n/a for sentiment. Worked examples 3 (line 146) and 8 (line 167) show sentiment as blank and signal_strength as n/a. To ensure high inter-coder reliability, the instructions should be standardized: sentiment should be left blank and signal_strength should be set to n/a when personal_use = no.

Shaun and others added 2 commits May 4, 2026 14:28
v1.4 (in the prior commit on this branch) experimentally dropped the
per-drug side_effects_reported / side_effects_description columns in
favour of just adding personal_use. Reviewer feedback was to keep both:
add personal_use AND keep side-effects.

v1.5 final schema (10 coder-filled columns):
  sample_id, coder_id, drug_mention_verbatim, personal_use,
  sentiment, signal_strength, confidence,
  side_effects_reported, side_effects_description, notes

Changes from v1.4 -> v1.5:
- Restored Step 6 (side effects) section in CODING_INSTRUCTIONS.md
  (yes/no flag + free-text description, only filled when
  personal_use = yes).
- Restored both side_effects_* columns in both
  coder_output_template.csv files (300-pilot and 500-pilot — both
  regenerated with the v1.5 schema; both have identical headers).
- STOP rule updated: when personal_use = no, blank sentiment,
  signal_strength, AND both side_effects_* columns. Confidence still
  always filled.
- Worked examples (8 cases) updated to show side_effects_reported and
  side_effects_description on every row.
- Decision tree updated with the side-effects branch.
- Multi-drug example updated with side_effects columns.
- Changelog rewritten to describe the v1.3 -> v1.4 (draft) -> v1.5
  transition honestly.
- PDF regenerated from v1.5 markdown.

Verified: both template headers identical:
  sample_id,coder_id,drug_mention_verbatim,personal_use,sentiment,
  signal_strength,confidence,side_effects_reported,
  side_effects_description,notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…coded)

The v1.5 commit on this branch added side_effects_reported and
side_effects_description columns "back" to the codebook + both templates,
on the (incorrect) assumption that the v1.3 codebook reflected what was
actually coded in the 300-pilot. It didn't.

Verified by inspecting every file in data/irr_pilot/:
  - human_coder_polina.csv, human_coder_shaun.csv:
      6 cols, no side_effects, has personal_use
  - all 12 ai_coder_*.csv:
      8 cols, no side_effects, has personal_use
  - merged_long.csv:
      8 cols + canonical_drug, no side_effects, has personal_use
  - alpha_report.md and alpha_pairwise_*.csv:
      α computed on drug_extracted, personal_use, sentiment,
      signal_strength; no side_effects α reported
  - coder_output_template.no_side_effects.csv.bak:
      original schema backup (personal_use, no side_effects)
  - The ONLY file in data/irr_pilot/ with side_effects_* columns was
    the in-data/ blank template — never populated by any coder.

So side_effects_reported and side_effects_description appeared only in
the v1.3 codebook + an unused template. The actual study data uses
personal_use (yes/no) and no side_effects. v1.5 was adding columns to
the codebook that don't exist in the data.

This commit reverts to v1.4 schema (personal_use, no side_effects) which
matches every actual coder output. Changelog rewritten to honestly
describe v1.4 as a drift correction rather than a new feature.

Also: per Shaun's request, README now explicitly states that per-coder
outputs (human_coder_*, merged_long, alpha_report, etc.) are
intentionally not redistributed in this repo — coder identity paired
with coding decisions is treated as private to the team. data/ is
already gitignored at the repo level so this is policy + documentation,
not a new technical safeguard.

Verified: both template headers identical (8 cols), PDF regenerated.
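
The header check could be made repeatable with something like the following. It is a sketch, not an existing script in the repo: the function names are hypothetical, and the expected header is the 8-column v1.4 list given earlier in this PR.

```python
import csv

# The 8-column v1.4 template header listed earlier in this PR.
EXPECTED_HEADER = [
    "sample_id", "coder_id", "drug_mention_verbatim", "personal_use",
    "sentiment", "signal_strength", "confidence", "notes",
]


def read_header(path):
    """Return the first CSV row of `path` as a list (empty list for an empty file)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def mismatched_templates(paths, expected=EXPECTED_HEADER):
    """Map each path whose header differs from `expected` to the header found."""
    mismatches = {}
    for path in paths:
        header = read_header(path)
        if header != expected:
            mismatches[path] = header
    return mismatches
```

Calling `mismatched_templates` on both `coder_output_template.csv` files and requiring an empty result would catch a repeat of the v1.5 schema drift.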

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>