ohdearquant · May 31, 2026
diff --git a/docs/effort-calibration.md b/docs/effort-calibration.md
@@ -0,0 +1,92 @@
+# Effort Calibration Table
+
+**Calibration date**: 2026-05-31
+**Data source**: `gh pr list --repo ohdearquant/khive --state merged --limit 100 --json number,title,additions,deletions,changedFiles,mergedAt,labels`
+**Sample size**: 100 merged PRs
+
+---
+
+## S / M / L / XL Calibration Table
+
+| Bucket | Assignment rule                                             | PR count | Changed files (range, median) | Changed lines (range, median) | Example PRs                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Wall-clock estimate                                 |
+| ------ | ----------------------------------------------------------- | :------: | ----------------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- |
+| **S**  | Lines ≤ 536 **and** files ≤ 5                               |    35    | 1–5, median 3                 | 5–528, median 242             | #610 fix(knowledge): relax flaky rerank test for CI — 10 lines, 1 file<br>#543 feat(gtd): add namespace column to lifecycle audit table — 57 lines, 2 files<br>#541 feat(storage): VectorStore::batch_exists + reindex filter_unembedded — 411 lines, 3 files<br>#448 docs(packs): populate HandlerDef params for comm + schedule — 528 lines, 3 files                                                                                                                              | Not derivable from available data (see Methodology) |
+| **M**  | Lines > 536 **or** files > 5; lines ≤ 1 276, files ≤ 12     |    42    | 2–12, median 7                | 157–1 276, median 725         | #525 refactor(gate): make RegoGate a proper opt-in reference backend — 157 lines, 6 files<br>#601 feat(knowledge): normalize scores, default rerank=true, FTS5 hardening, search bench — 926 lines, 10 files<br>#562 feat(knowledge): suggest/compose verbs + FTS5 escaping + real embedding_coverage — 1 002 lines, 7 files<br>#467 fix(runtime,mcp): Wave 4 — OSS actor config default namespace — 1 276 lines, 8 files                                                           | Not derivable from available data (see Methodology) |
+| **L**  | Lines > 1 276 **or** files > 12; lines ≤ 2 266, files ≤ 22  |    17    | 3–22, median 14               | 284–2 181, median 1 634       | #608 docs(adr): ADR-049 acceptance, brain/GTD catch-up, retrieval stack guide — 648 lines, 14 files<br>#473 fix(knowledge): Wave 5 — topic shape + domain filter + doc accuracy — 713 lines, 21 files<br>#510 feat(brain): section posteriors with Thompson Sampling + persistence (ADR-048 Phase 1) — 1 510 lines, 9 files<br>#504 feat(vamana): implement khive-vamana ANN crate (ADR-048) — 2 181 lines, 15 files                                                                | Not derivable from available data (see Methodology) |
+| **XL** | Lines > 2 266 **or** files > 22                             |    6     | 4–59, median 32               | 998–3 245, median 2 463       | #611 release: v0.2.3 — 998 lines, 59 files<br>#472 refactor: eliminate importance, use salience throughout (ADR-021 §2 rewrite) — 1 987 lines, 48 files<br>#547 feat(tests+marketplace): smoke tests for 4 packs + comm/schedule plugins — 2 661 lines, 16 files<br>#423 test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning — 3 014 lines, 4 files<br>#470 feat(knowledge): port lore retrieval (atoms + domains + TF-IDF + fold) — 3 245 lines, 13 files | Not derivable from available data (see Methodology) |
+
+---
+
+## Methodology
+
+### Data collection
+
+The calibration uses the last 100 merged PRs pulled directly from the `ohdearquant/khive` repository via `gh pr list`. Per-PR total changed lines equals additions plus deletions. Category labels are inferred from title prefix or labels: `fix`/`security` → bug fix; `feat` → feature; `docs` → docs; `refactor` → refactor; and `chore`/`test`/`style`/`release`/unknown → chore.
+
+### Bucket boundaries
+
+Bucket thresholds were derived with the Jenks natural-breaks method (k = 4), computed separately on the total-changed-lines and changed-files distributions. Each PR is assigned the larger of its two resulting size classes, because either a large patch or a broad file spread is enough on its own to increase planning and review scope.
+
+The key observed gaps that define each boundary are:
+
+| Axis          | S/M boundary                        | M/L boundary                            | L/XL boundary                           |
+| ------------- | ----------------------------------- | --------------------------------------- | --------------------------------------- |
+| Changed lines | #526 (536 lines) → #453 (579 lines) | #467 (1 276 lines) → #510 (1 510 lines) | #471 (2 266 lines) → #547 (2 661 lines) |
+| Changed files | 5 files → 6 files                   | 12 files → 13 files                     | #477/#501 (22 files) → #472 (48 files)  |
+
+### Wall-clock estimates
+
+Per-PR wall-clock duration cannot be derived from the collected `gh pr list` output, whose only timestamp field is `mergedAt`. A reliable per-PR effort estimate would require at minimum the PR creation timestamp, first-commit timestamp, first ready-for-review timestamp, review round count, and CI retry count. All wall-clock estimate cells are therefore marked "Not derivable from available data" rather than filled with misleading numbers.
+
+### Confidence
+
+- **High** for size ranges and medians: computed directly from the 100-row JSON.
+- **Medium** for bucket assignment: Jenks breaks are data-derived, but changed lines and files are proxies for effort. Generated code, large test data files, and mechanical renames require human adjustment.
+- **Low** for wall-clock estimates: not computed.
+
+---
+
+## How to Use During Play Planning
+
+Apply the following checklist for each `li play` proposal.
+
+1. **Predict both axes**: estimate the expected changed-files count and expected total changed-lines count before assigning an effort label.
+2. **Assign the larger class**: map each axis to the natural-break class (S / M / L / XL) and take the larger result.
+3. **Check for outlier type**: if one axis is two or more size classes above the other, annotate the play with an outlier reason — `mechanical`, `docs-heavy`, `test-data-heavy`, or `deep-local` — and decide whether to adjust or split.
+4. **Split XL unless mechanical**: any play predicted above the L/XL boundary (> 2 266 lines or > 22 files) should be split into subplays by crate, pack, docs section, or migration stage, unless the change is clearly mechanical and has an explicit verification plan.
+5. **Route reviewers early for broad plays**: if the predicted file count exceeds 12, schedule L or higher and assign reviewers by subsystem or docs owner before work begins.
+6. **Record the estimate and rationale**: note the assigned bucket, both predicted metrics, and any override reason in the play plan so the estimate is auditable.
+
+**Bucket thresholds for play planning**:
+
+| Bucket | Assign when                                                                           |
+| ------ | ------------------------------------------------------------------------------------- |
+| S      | Predicted lines ≤ 536 **and** predicted files ≤ 5                                     |
+| M      | Predicted lines > 536 **or** predicted files > 5, and lines ≤ 1 276 and files ≤ 12    |
+| L      | Predicted lines > 1 276 **or** predicted files > 12, and lines ≤ 2 266 and files ≤ 22 |
+| XL     | Predicted lines > 2 266 **or** predicted files > 22                                   |
+
+If a prediction falls in an unobserved gap (e.g., 23–47 files or 2 267–2 660 lines), treat it as XL or split the play.
+
+**Verification budget by bucket**:
+
+- **S**: targeted test or formatting check.
+- **M**: focused crate or package tests.
+- **L**: subsystem integration or cross-pack tests.
+- **XL**: workspace-level tests plus migration and backward-compatibility checks; docs-heavy XL also requires citation and link checks.
+
+---
+
+## Outliers
+
+The following PRs fall into different size classes on the two axes. Each is assigned the bucket of its larger axis; the planning interpretation notes where the usual complexity inference for that bucket should be adjusted.
+
+|   PR | Observed mismatch                    | Why it does not fit cleanly                                                                                                                                 | Planning interpretation                                                                                                                      |
+| ---: | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| #611 | M by lines (998), XL by files (59)   | Release PR spreads metadata across many files; file count overstates implementation complexity but still affects review and check surface.                  | Plan as XL for coordination scope. Evaluate whether release automation makes the per-file effort mechanical.                                 |
+| #477 | S by lines (302), L by files (22)    | Bulk marketplace documentation refresh; broad file coverage with modest code-change depth.                                                                  | Plan as docs-heavy L for review routing, not as equivalent to a cross-crate feature.                                                         |
+| #423 | XL by lines (3 014), S by files (4)  | Evaluation corpus and metric changes are concentrated in few files; line count captures data and test volume rather than integration breadth.               | Plan as XL for validation burden. Coordination cost is lower than a typical XL; split by eval scenario rather than by module.                |
+| #461 | L by lines (1 818), S by files (3)   | Pack safety hardening is deep and highly localized; low file count understates reasoning and regression risk.                                               | Plan as L because line volume and behavioral risk dominate. Do not reduce to M.                                                              |
+| #471 | L by lines (2 266), XL by files (58) | Repository-wide verb namespace migration; broad file touch pattern is the primary effort signal.                                                            | Plan as XL and split by subsystem if possible.                                                                                               |
+| #472 | L by lines (1 987), XL by files (48) | Repository-wide terminology refactor from `importance` to `salience`; broad mechanical and semantic blast radius.                                           | Plan as XL. If verified as fully mechanical and covered by generated tooling, the coordination cost may be lower than a standard XL feature. |
+| #608 | M by lines (648), L by files (14)    | ADR and retrieval-stack documentation spans many source files; runtime risk is likely lower, but source-traceability and citation-checking burden is broad. | Plan as L for document review and citation checks.                                                                                           |
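
The threshold table under "Bucket thresholds for play planning" and checklist steps 2 and 3 can be sketched in Python. This is a minimal illustration and not part of the diff: the names `size_class`, `assign_bucket`, and `Estimate` are hypothetical, and the outlier flag assumes that "two or more size classes above the other" means a class-index difference of at least 2 in either direction.

```python
from dataclasses import dataclass

BUCKETS = ["S", "M", "L", "XL"]

# Inclusive upper bounds for S, M, and L, copied from the calibration table.
# Anything above the last bound is XL.
LINE_BOUNDS = [536, 1276, 2266]
FILE_BOUNDS = [5, 12, 22]


def size_class(value: int, bounds: list[int]) -> int:
    """Return the index into BUCKETS for a single axis."""
    for idx, upper in enumerate(bounds):
        if value <= upper:
            return idx
    return len(bounds)  # above every bound: XL


@dataclass
class Estimate:
    bucket: str        # final bucket: the larger of the two axis classes
    lines_class: str   # class implied by changed lines alone
    files_class: str   # class implied by changed files alone
    is_outlier: bool   # True when the axes differ by two or more classes


def assign_bucket(lines: int, files: int) -> Estimate:
    """Map predicted lines and files to a bucket (hypothetical helper)."""
    li = size_class(lines, LINE_BOUNDS)
    fi = size_class(files, FILE_BOUNDS)
    return Estimate(
        bucket=BUCKETS[max(li, fi)],
        lines_class=BUCKETS[li],
        files_class=BUCKETS[fi],
        is_outlier=abs(li - fi) >= 2,
    )
```

Under these assumptions, `assign_bucket(998, 59)` reproduces the #611 row of the Outliers table (M by lines, XL by files, planned as XL and flagged as an outlier), while `assign_bucket(648, 14)` gives L for #608 without setting the flag, because its axes are only one class apart.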