Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
225 changes: 225 additions & 0 deletions docs/irr_pilot/CODING_INSTRUCTIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,225 @@
# Coding Instructions — PatientPunk IRR Pilot (v1.4)

**Workflow:**
1. Open `reading_packet.html` in a browser — this has all 300 posts rendered as cards, each labeled with its `sample_id`.
2. Open `coder_output_template.csv` alongside (in Excel, Sheets, or your editor of choice). The CSV has one row pre-filled per `sample_id`.
3. For each sample, look up the matching `sample_id` in the HTML packet, read the post, and:
- **If the post mentions one drug** → fill in the pre-filled row.
- **If the post mentions multiple drugs** → use the pre-filled row for the first drug, then **add additional rows with the same `sample_id`** for each additional drug. One row per drug.
- **If the post mentions no drug** → set `drug_mention_verbatim = NONE` and leave the rest of the annotation columns blank.

**One row per drug per sample.** Same `sample_id` repeats across rows when multiple drugs are mentioned. `sample_id` is what joins back to the post; `drug_mention_verbatim` is what distinguishes rows within a sample.

Rules below are adapted from the AI pipeline's prompts. Disagreement between coders should come from genuinely ambiguous posts, not from different interpretations of these rules.

## Columns you fill in

| Column | Values | Captures |
|---|---|---|
| `coder_id` | your name (`eli`, `tj`) | Who coded this row |
| `drug_mention_verbatim` | free text, or `NONE` | Drug/treatment as written or referenced in the post (don't canonicalize). Use `NONE` if no drug mentioned. |
| `personal_use` | `yes` / `no` | Did the author personally use this drug? `yes` only when the author describes their own experience with this specific drug. `no` for questions, hearsay, advice to others, hypothetical or planned use, or citing studies. |
| `sentiment` | `positive` / `negative` / `mixed` / `neutral` | Author's sentiment about THIS DRUG specifically. Only filled when `personal_use = yes`. Use `neutral` only when personal use is unclear or the author explicitly takes no position despite using the drug. |
| `signal_strength` | `strong` / `moderate` / `weak` / `n/a` | **About the post:** how emphatic/specific/quantified is the author's report on this drug? Use `n/a` when `personal_use = no`. |
| `confidence` | 1–5 | **About your coding:** how sure are you the labels you picked are right? Always filled, even when `personal_use = no`. |
| `notes` | free text | Reasoning, ambiguity flags, anything worth recording (optional) |

> **⚠️ STOP rule.** If `personal_use = no`, leave `sentiment` and
> `signal_strength` blank (or set to `n/a`) and move on. The classifier and
> the analysis only count personal-use reports, so non-personal-use rows
> contribute to inter-coder reliability on the `personal_use` decision and
> nothing further.

> **⚠️ Don't confuse `signal_strength` with `confidence`.** They are independent.
> `signal_strength` rates **the post's language** — a brief mention is `weak` even
> if it's trivial to code. `confidence` rates **your certainty in your own labels** —
> a very emphatic post can still be hard to code if the outcome is genuinely
> ambiguous.

---

## Multi-drug example (read this first)

Sample `irr-pilot-003` says: *"I'm on LDN and magnesium. About 40% better than a year ago."*

Two rows, both with `sample_id = irr-pilot-003`:

| sample_id | coder_id | drug_mention_verbatim | personal_use | sentiment | signal_strength | confidence | notes |
|---|---|---|---|---|---|---|---|
| irr-pilot-003 | eli | LDN | yes | positive | weak | 4 | in stack |
| irr-pilot-003 | eli | magnesium | yes | positive | weak | 4 | in stack |

Same sample_id, one row per drug, each drug gets its own per-drug
`personal_use` / `sentiment` / `signal_strength`.

---

## Step 1 — Identify each drug mention

**Include:** prescription drugs (LDN, gabapentin), OTC (ibuprofen, Tylenol), supplements (magnesium, B12), enzymes (DAO, nattokinase), drug *categories* (antihistamines, SSRIs), generic references ("an oral antibiotic"), non-drug treatments (PT, infrared sauna, compression, specific diets), devices (CPAP, IV saline).

**Exclude:** vague references ("medication", "something"), condition names (unless being treated), food (unless framed therapeutically).

**Record verbatim — written or referenced.** If the author wrote a name, use their exact wording ("LDN" stays "LDN" — don't expand to "low dose naltrexone"). If they only referenced indirectly ("the shot I got", "her prescription for the nerve pain"), record the best short phrase from the sample.

**One row per distinct drug.** Same drug mentioned multiple times in the same sample = ONE row. Different drugs = different rows. A drug class and a specific drug from it (e.g., "antihistamines, specifically Zyrtec") = TWO rows.

**Reply chains** — the reading packet shows upstream context if the sample is a reply. Use it only to resolve pronouns ("it" = whatever the parent was about). The signal must come from the reply itself, not the upstream.

---

## Step 2 — personal_use (per drug)

Decide first: did the author personally use this drug?

- **`yes`** — author describes their own experience with this drug (taking it, having taken it, side effects from it, results from it).
- **`no`** — questions to others, hearsay, advice, citing studies, hypothetical or planned use, mentions of someone else's experience, or a reply that doesn't itself express personal use even if the parent did.

If `personal_use = no`, **stop here for this drug-row**: leave `sentiment` and `signal_strength` blank (or `n/a`), still record `confidence` (how sure are you the personal-use call is right?), and move on. Non-personal-use rows are useful for IRR on the personal-use decision but don't enter the per-drug sentiment analysis.

## Step 3 — sentiment (per drug, only when personal_use = yes)

Pick one for each drug-row:

- **`positive`** — author personally used this drug and it helped them. Includes partial improvement ("helped but wasn't a miracle" = **positive, not mixed**).
- **`negative`** — author personally used this drug and it didn't help, made things worse, or they stopped because it wasn't working.
- **`mixed`** — genuinely conflicting outcomes ("helped pain but worsened sleep"), author explicitly can't decide. Use sparingly.
- **`neutral`** — author personally used this drug but explicitly takes no position on whether it helped (rare; usually means `personal_use = no` was the right call).

---

## Step 4 — signal_strength (per drug, only when personal_use = yes)

*A property of the **post**, not of your coding. Measures how informative the author's report on this drug is to a reader.*

- **`strong`** — any one of: quantified improvement, named specific symptom improving, clear temporal attribution, dramatic outcome, emphatic endorsement ("game changer", "wish I started sooner", "nothing else worked"), or emphatic effect language ("helps a lot", "did nothing"). **Hedging doesn't downgrade** — "I'm still sick but LDN changed my life" is `strong`.
- **`moderate`** — simple affirm/deny without emphasis ("it works for me", "yes", "it helps").
- **`weak`** — drug named in a stack without specific credit, slight or uncertain effect, or still using without complaint.
- **`n/a`** — when `personal_use = no` (no personal use to rate).

---

## Step 5 — confidence (1–5)

*A property of **your coding**, not of the post. Independent of signal strength. Always filled, including when `personal_use = no`.*

| Situation | `personal_use` | `signal_strength` | `confidence` |
|---|---|---|---|
| "LDN 4.5mg cut my fatigue 70% in 6 weeks" | yes | `strong` | 5 |
| "LDN helped some things but I honestly can't tell if it's the drug or pacing" | yes | `strong` | 2 |
| "I take LDN, magnesium, and H1 blockers. Doing okay." | yes | `weak` | 5 |
| "My doctor mentioned LDN but I'm not sure if I started yet" | no | `n/a` | 2 |

Use the full range — **5** unambiguous, **3** reasonable people could disagree, **1** very uncertain.

---

## Special cases

**Causal-context drugs** — author blames a treatment for causing their condition (e.g., "the Moderna shot is what gave me long COVID"): `personal_use = yes` (they did receive it), `sentiment = negative`, `signal_strength = weak`, `notes = causal-context`.

**Multi-drug stacks** — one row per drug. Each gets its own `personal_use` and (if yes) its own sentiment based on what the author says about that specific drug. If overall improvement is mentioned but no drug is specifically credited, default to `personal_use = yes` / `positive` / `weak` per drug.

**Questions / advice / hearsay** — `personal_use = no`, sentiment / signal_strength blank.

**No treatments mentioned at all** — single row with `drug_mention_verbatim = NONE`, `personal_use` blank, other annotation fields blank, `confidence = 5`.

**Irrelevant mentions** (drug mentioned only in a subreddit name, URL, signature) — don't code.

---

## Worked examples

**1 — personal use, positive, strong:**
*"I started LDN 4.5mg 6 months ago and my fatigue dropped by probably 70%."*
→ LDN / yes / positive / strong / 5 / *quantified improvement*

**2 — multi-drug stack (one row per drug):**
*"I'm on LDN, H1 blockers, and magnesium. Probably 40% better than a year ago."*
→ Three rows, same sample_id:
- LDN / yes / positive / weak / 4 / in stack
- H1 blockers / yes / positive / weak / 4 / in stack
- magnesium / yes / positive / weak / 4 / in stack

**3 — question, no personal use:**
*"Has anyone tried paxlovid late — like, more than 5 days after onset?"*
→ paxlovid / no / (blank) / n/a / 5 / *question to others*

**4 — mixed:**
*"LDN definitely helped my pain — but tanked my sleep for the first 2 months. Still worth it."*
→ LDN / yes / mixed / strong / 5 / *helped pain, hurt sleep*

**5 — causal-context:**
*"I was fine until my second Pfizer shot. That's when everything started."*
→ Pfizer / yes / negative / weak / 4 / *causal-context*

**6 — negative:**
*"Took Paxlovid for 5 days. Didn't notice any improvement."*
→ Paxlovid / yes / negative / strong / 5 / *no perceived effect*

**7 — no drugs:**
*"Today was really rough. Spent most of the day in bed."*
→ NONE / (blank) / (blank) / (blank) / 5 / *no drug mentioned*

**8 — reply without personal use:**
Parent: "LDN has been a game-changer for my PEM."
Reply: "How did you get your doctor to prescribe it?"
→ LDN / no / (blank) / n/a / 5 / *reply doesn't express personal use*

---

## Process reminders

Code blind — don't look at AI coder outputs, don't discuss with other coders until everyone's done. Code in one sitting if possible. When stuck, code your best guess with `confidence = 1 or 2` plus a note explaining the ambiguity.

## Decision tree per drug-row

```
For each drug in the post:

personal_use = ?
├── no (question / hearsay / advice / hypothetical / → no / sentiment blank
│ reply doesn't express personal use) / signal_strength = n/a
│ / record confidence
└── yes (author describes own experience) → yes / fill all below

sentiment = ?
├── this drug helped (even partially) → positive
├── this drug didn't help or made worse → negative
├── opposing effects from this drug → mixed
└── personal use but no position taken → neutral (rare)

signal_strength = ?
├── quantified / named symptom / dramatic /
│ emphatic endorsement / emphatic effect → strong
├── simple affirm or deny, no detail → moderate
└── in a stack, no specific credit → weak

confidence = 1..5
(always — including for personal_use = no)
```

---

## v1.4 — drift correction

v1.4 is a correction, not a redesign. The earlier v1.3 codebook drifted away from the schema actually used to code the 300-pilot study. Specifically, v1.3 described `side_effects_reported` and `side_effects_description` as coder-filled columns, and dropped `personal_use` from the columns table. But the human and AI coder outputs (`human_coder_*.csv`, `ai_coder_*.csv` in `data/irr_pilot/`) and the published Krippendorff's α report all use a different schema:

```
sample_id, coder_id, drug_mention_verbatim, personal_use,
sentiment, signal_strength, confidence, notes
```

- `personal_use` (yes/no) was actually collected.
- `side_effects_reported` and `side_effects_description` were never populated by any coder (human or AI). They lived only in the v1.3 codebook + an unused `coder_output_template.csv` file.

v1.4 brings the codebook back into sync with the data:

- **Removed**: `side_effects_reported` and `side_effects_description` columns and the corresponding step. They were not actually coded; describing them in the codebook was misleading.
- **Restored**: `personal_use` (yes/no), with explicit semantics. Was previously implicit (encoded as `sentiment = neutral` in v1.3 prose), but the actual coded data has the explicit column.
- **Tightened**: `sentiment` and `signal_strength` are now described as conditional on `personal_use = yes`, matching how coders actually treated them.
- **Single-schema templates.** Both the 300-pilot and 500-pilot `coder_output_template.csv` files share this schema, matching what was actually coded in the 300-pilot.
- Updated worked examples and decision tree.

*v1.4 · matches `coder_output_template.csv` schema in both `docs/irr_pilot/` and `docs/irr_pilot_500/`, and matches the schema of the existing `human_coder_*.csv` and `ai_coder_*.csv` outputs the α report was computed from.*
Binary file added docs/irr_pilot/CODING_INSTRUCTIONS.pdf
Binary file not shown.
41 changes: 41 additions & 0 deletions docs/irr_pilot/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# IRR Pilot — 300 samples

The original inter-coder reliability (IRR) sample pack used to compare
the PatientPunk pipeline's drug-extraction and sentiment classifications
against human coders. 300 stratified samples drawn from a 1-month
r/covidlonghaulers corpus (2026-03-11 → 2026-04-10), distributed to
11 coders (2 human, 9 AI models).

## Files

- [`SAMPLING_METHODOLOGY.md`](./SAMPLING_METHODOLOGY.md) — how the 300
samples were drawn (corpus, codable pool, stratification, seed)
- [`CODING_INSTRUCTIONS.md`](./CODING_INSTRUCTIONS.md) /
[`.pdf`](./CODING_INSTRUCTIONS.pdf) — codebook (v1.4) distributed to
coders. Defines the `personal_use` / `sentiment` / `signal_strength` /
`confidence` schema and the per-drug-row decision tree.
- [`coding_input.csv`](./coding_input.csv) — the 300 sampled units
(sample_id, subreddit, post_date, unit_type, title, parent_context,
post_text). One row per sample.
- [`ai_labels.csv`](./ai_labels.csv) — analyst-only file with stratum
labels and AI-pipeline output for each sample. Not distributed to
coders (would break blinding).
- [`coder_output_template.csv`](./coder_output_template.csv) — blank
form coders fill in (one row per sample, one row added per additional
drug for multi-drug samples).

## Status

**Coded.** All 11 coders (2 human + 9 AI panel) returned outputs;
Krippendorff's α was computed on `personal_use`, `sentiment`,
`signal_strength`, and `drug_extracted`. The α report and per-coder
output CSVs (`human_coder_*.csv`, `ai_coder_*.csv`, `merged_long.csv`,
`alpha_report.md`, `alpha_pairwise_*.csv`) live outside this docs folder
in `data/irr_pilot/`, which is gitignored at the repo level.
**Per-coder outputs are intentionally not redistributed** in this
repo — coder identity is paired with coding decisions in those files,
and we treat that pairing as private to the team.

For the more recent 500-sample IRR pilot — drawn from the Nov–Dec 2021
window matching the historical-validation paper's analysis window —
see [`../irr_pilot_500/`](../irr_pilot_500/).
Loading
Loading