92 changes: 92 additions & 0 deletions docs/effort-calibration.md
@@ -0,0 +1,92 @@
# Effort Calibration Table

**Calibration date**: 2026-05-31
**Data source**: `gh pr list --repo ohdearquant/khive --state merged --limit 100 --json number,title,additions,deletions,changedFiles,mergedAt,labels`
**Sample size**: 100 merged PRs

---

## S / M / L / XL Calibration Table

| Bucket | Assignment rule | PR count | Changed files (range, median) | Changed lines (range, median) | Example PRs | Wall-clock estimate |
| ------ | ----------------------------------------------------------- | :------: | ----------------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- |
| **S**  | Lines ≤ 536 **and** files ≤ 5                               |    35    | 1–5, median 3                 | 5–528, median 242             | #610 fix(knowledge): relax flaky rerank test for CI (10 lines, 1 file)<br>#543 feat(gtd): add namespace column to lifecycle audit table (57 lines, 2 files)<br>#541 feat(storage): VectorStore::batch_exists + reindex filter_unembedded (411 lines, 3 files)<br>#448 docs(packs): populate HandlerDef params for comm + schedule (528 lines, 3 files) | Not derivable from available data (see Methodology) |
| **M**  | Lines > 536 **or** files > 5, up to 1 276 lines and 12 files |    42    | 2–12, median 7                | 157–1 276, median 725         | #525 refactor(gate): make RegoGate a proper opt-in reference backend (157 lines, 6 files)<br>#601 feat(knowledge): normalize scores, default rerank=true, FTS5 hardening, search bench (926 lines, 10 files)<br>#562 feat(knowledge): suggest/compose verbs + FTS5 escaping + real embedding_coverage (1 002 lines, 7 files)<br>#467 fix(runtime,mcp): Wave 4 - OSS actor config default namespace (1 276 lines, 8 files) | Not derivable from available data (see Methodology) |
| **L**  | Lines > 1 276 **or** files > 12, up to 2 266 lines and 22 files |    17    | 3–22, median 14               | 284–2 181, median 1 634       | #608 docs(adr): ADR-049 acceptance, brain/GTD catch-up, retrieval stack guide (648 lines, 14 files)<br>#473 fix(knowledge): Wave 5 - topic shape + domain filter + doc accuracy (713 lines, 21 files)<br>#510 feat(brain): section posteriors with Thompson Sampling + persistence (ADR-048 Phase 1) (1 510 lines, 9 files)<br>#504 feat(vamana): implement khive-vamana ANN crate (ADR-048) (2 181 lines, 15 files) | Not derivable from available data (see Methodology) |
| **XL** | Lines > 2 266 **or** files > 22                             |    6     | 4–59, median 32               | 998–3 245, median 2 463       | #611 release: v0.2.3 (998 lines, 59 files)<br>#472 refactor: eliminate importance, use salience throughout (ADR-021 §2 rewrite) (1 987 lines, 48 files)<br>#547 feat(tests+marketplace): smoke tests for 4 packs + comm/schedule plugins (2 661 lines, 16 files)<br>#423 test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning (3 014 lines, 4 files)<br>#470 feat(knowledge): port lore retrieval (atoms + domains + TF-IDF + fold) (3 245 lines, 13 files) | Not derivable from available data (see Methodology) |

---

## Methodology

### Data collection

The calibration uses the 100 most recently merged PRs, pulled directly from the `ohdearquant/khive` repository via `gh pr list`. Each PR's total changed lines is its additions plus its deletions. Category labels are inferred from the title prefix or the PR labels: `fix`/`security` → bug fix; `feat` → feature; `docs` → docs; `refactor` → refactor; and `chore`/`test`/`style`/`release`/unknown → chore.
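As a minimal sketch of the derivation described above, the following Python shows how one record from the `gh pr list --json ...` output could be reduced to a changed-lines total and a category. The helper names (`infer_category`, `summarize`) are hypothetical, and two details are assumptions rather than documented behavior: the title prefix is checked before the labels, and label names are assumed to match the same keywords as title prefixes.

```python
# Keyword -> category, following the mapping stated in the text.
# Anything not listed here (chore, test, style, release, unknown) falls back to "chore".
PREFIX_TO_CATEGORY = {
    "fix": "bug fix",
    "security": "bug fix",
    "feat": "feature",
    "docs": "docs",
    "refactor": "refactor",
}


def infer_category(title: str, labels: list) -> str:
    """Infer a category from the title prefix, then from label names (assumed order)."""
    # Conventional-commit style prefix: the text before the first ":" or "(".
    prefix = title.split(":", 1)[0].split("(", 1)[0].strip().lower()
    if prefix in PREFIX_TO_CATEGORY:
        return PREFIX_TO_CATEGORY[prefix]
    for label in labels:
        # gh emits each label as an object with a "name" field.
        name = label.get("name", "").lower()
        if name in PREFIX_TO_CATEGORY:
            return PREFIX_TO_CATEGORY[name]
    return "chore"


def summarize(pr: dict) -> dict:
    """Reduce one PR record to the metrics used in the calibration table."""
    return {
        "number": pr["number"],
        "changed_lines": pr["additions"] + pr["deletions"],
        "changed_files": pr["changedFiles"],
        "category": infer_category(pr.get("title", ""), pr.get("labels", [])),
    }


if __name__ == "__main__":
    # Title, file count, and the 10-line total come from the table above;
    # the 6/4 additions/deletions split is illustrative only.
    sample = {
        "number": 610,
        "title": "fix(knowledge): relax flaky rerank test for CI",
        "additions": 6,
        "deletions": 4,
        "changedFiles": 1,
        "labels": [],
    }
    print(summarize(sample))
```

To process the full export, load it with `json.load` and map `summarize` over the resulting list.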

### Bucket boundaries

Bucket thresholds were derived using the Jenks natural-breaks method (k = 4) computed separately on the total-changed-lines and changed-files distributions. A PR is assigned to the larger of the two resulting size classes, because a large patch or a broad file spread independently increases planning and review scope.

The key observed gaps that define each boundary are:

| Axis          | S/M boundary                      | M/L boundary                          | L/XL boundary                          |
| ------------- | --------------------------------- | ------------------------------------- | -------------------------------------- |
| Changed lines | #526 (536 lines) → #453 (579 lines) | #467 (1 276 lines) → #510 (1 510 lines) | #471 (2 266 lines) → #547 (2 661 lines) |
| Changed files | 5 files → 6 files                 | 12 files → 13 files                   | #477/#501 (22 files) → #472 (48 files) |
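For readers unfamiliar with natural breaks, the sketch below shows one way to compute them for a single axis. The source does not say which Jenks implementation was used, so this is an assumption: a plain dynamic-programming version of the Fisher-Jenks optimisation, which picks the class boundaries that minimise the total within-class sum of squared deviations. The function name `jenks_upper_bounds` and the toy input are illustrative, not repository data.

```python
def jenks_upper_bounds(values, k):
    """Return the largest value in each of the first k-1 classes (ascending).

    Classes are contiguous runs of the sorted values, chosen to minimise the
    total within-class sum of squared deviations (SSD). Runs in O(k * n^2).
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums let us compute the SSD of any slice xs[i:j] in O(1).
    ps, ps2 = [0.0], [0.0]
    for x in xs:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def ssd(i, j):
        total = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: minimal total SSD when xs[:j] is split into c classes.
    # cut[c][j]: start index of the last class in that optimal split.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    cut[c][j] = i

    # Walk back from the last class to recover each boundary.
    bounds = []
    j = n
    for c in range(k, 1, -1):
        i = cut[c][j]
        bounds.append(xs[i - 1])  # largest value of the class below
        j = i
    return bounds[::-1]


if __name__ == "__main__":
    # Toy values with four obvious clusters (not taken from the PR sample).
    toy = [1, 2, 3, 10, 11, 12, 50, 51, 52, 100, 101]
    print(jenks_upper_bounds(toy, 4))
```

Run separately on the changed-lines and changed-files columns with `k = 4`, a routine of this kind yields three upper bounds per axis, which is the shape of the thresholds in the table above. When many values are tied, as file counts often are, a boundary can fall between equal values, so the bounds may need a manual check.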

### Wall-clock estimates

Per-PR wall-clock duration cannot be derived from the available `gh pr list` output, whose only timestamp is `mergedAt`. A reliable per-PR effort figure would require, at minimum, the PR creation timestamp, first-commit timestamp, first ready-for-review timestamp, review round count, and CI retry count. All wall-clock estimate cells are therefore marked "Not derivable from available data" to avoid introducing misleading numbers. This limitation is intentional.
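To illustrate the first of the missing inputs, the sketch below computes open-to-merge elapsed time for one PR. It assumes the export is rerun with `createdAt` added to the `--json` field list (a field the `gh` CLI exposes for pull requests) and that timestamps arrive as ISO 8601 strings ending in `Z`. The result is calendar time between opening and merging, not effort: it says nothing about first-commit time, review rounds, or CI retries. The function name `open_to_merge_hours` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def parse_gh_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2026-05-31T13:30:00Z'."""
    # Older Python versions do not accept a trailing "Z", so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def open_to_merge_hours(pr: dict) -> float:
    """Elapsed calendar hours between PR creation and merge."""
    created = parse_gh_timestamp(pr["createdAt"])
    merged = parse_gh_timestamp(pr["mergedAt"])
    return (merged - created).total_seconds() / 3600.0


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    sample = {"createdAt": "2026-05-30T10:00:00Z", "mergedAt": "2026-05-31T13:30:00Z"}
    print(open_to_merge_hours(sample))
```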

### Confidence

- **High** for size ranges and medians: computed directly from the 100-row JSON.
- **Medium** for bucket assignment: Jenks breaks are data-derived, but changed lines and files are proxies for effort. Generated code, large test data files, and mechanical renames require human adjustment.
- **Low** for wall-clock estimates: not computed.

---

## How to Use During Play Planning

Apply the following checklist for each `li play` proposal.

1. **Predict both axes**: estimate the expected changed-files count and expected total changed-lines count before assigning an effort label.
2. **Assign the larger class**: map each axis to the natural-break class (S / M / L / XL) and take the larger result.
3. **Check for outlier type**: if one axis is two or more size classes above the other, annotate the play with an outlier reason (`mechanical`, `docs-heavy`, `test-data-heavy`, or `deep-local`) and decide whether to adjust or split.
4. **Split XL unless mechanical**: any play predicted above the XL threshold (> 2 266 lines or > 22 files) should be split into subplays by crate, pack, docs section, or migration stage unless the change is clearly mechanical and has an explicit verification plan.
5. **Route reviewers early for broad plays**: if the predicted file count exceeds 12, schedule L or higher and assign reviewers by subsystem or docs owner before work begins.
6. **Record the estimate and rationale**: note the assigned bucket, both predicted metrics, and any override reason in the play plan so the estimate is auditable.
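The checklist above can be sketched as a small helper. This is an illustration under stated assumptions, not part of the `li play` tooling: the function and field names are hypothetical, and the thresholds are the per-axis upper bounds from the calibration table (536 / 1 276 / 2 266 lines and 5 / 12 / 22 files).

```python
BUCKETS = ["S", "M", "L", "XL"]
LINE_UPPER = [536, 1276, 2266]  # inclusive upper bounds for S, M, L by changed lines
FILE_UPPER = [5, 12, 22]        # inclusive upper bounds for S, M, L by changed files


def axis_class(value: int, upper_bounds: list) -> int:
    """Return the size-class index for one axis (0 = S, 3 = XL)."""
    for index, bound in enumerate(upper_bounds):
        if value <= bound:
            return index
    return len(upper_bounds)


def plan_play(pred_lines: int, pred_files: int,
              mechanical_with_verification: bool = False) -> dict:
    """Apply steps 1-5 of the checklist to a pair of predictions."""
    by_lines = axis_class(pred_lines, LINE_UPPER)
    by_files = axis_class(pred_files, FILE_UPPER)
    bucket = max(by_lines, by_files)  # step 2: take the larger class
    return {
        "bucket": BUCKETS[bucket],
        "by_lines": BUCKETS[by_lines],
        "by_files": BUCKETS[by_files],
        # Step 3: axes two or more classes apart need an outlier reason.
        "needs_outlier_reason": abs(by_lines - by_files) >= 2,
        # Step 4: split XL unless mechanical with an explicit verification plan.
        "should_split": bucket == 3 and not mechanical_with_verification,
        # Step 5: broad plays get reviewers assigned before work begins.
        "route_reviewers_early": pred_files > 12,
    }


if __name__ == "__main__":
    # Values taken from PRs listed in this document: #423 and #611.
    print(plan_play(3014, 4))
    print(plan_play(998, 59))
```

The returned dictionary covers step 6 as well if it is stored with the play plan, alongside any override reason chosen by the planner.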

**Bucket thresholds for play planning**:

| Bucket | Assign when |
| ------ | ------------------------------------------------------------------------------------- |
| S      | Predicted lines ≤ 536 **and** predicted files ≤ 5                                     |
| M      | Predicted lines > 536 **or** predicted files > 5, and lines ≤ 1 276 and files ≤ 12    |
| L      | Predicted lines > 1 276 **or** predicted files > 12, and lines ≤ 2 266 and files ≤ 22 |
| XL     | Predicted lines > 2 266 **or** predicted files > 22                                   |

If a prediction falls in an unobserved gap (e.g., 23–47 files or 2 267–2 660 lines), treat it as XL or split.

**Verification budget by bucket**:

- **S**: targeted test or formatting check.
- **M**: focused crate or package tests.
- **L**: subsystem integration or cross-pack tests.
- **XL**: workspace-level tests plus migration and backward-compatibility checks; docs-heavy XL also requires citation and link checks.

---

## Outliers

The following PRs do not fit their buckets cleanly. The assigned bucket reflects the larger axis; the planning interpretation notes when the standard complexity inference from that bucket should be adjusted.

| PR | Observed mismatch | Why it does not fit cleanly | Planning interpretation |
| ---: | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| #611 | M by lines (998), XL by files (59) | Release PR spreads metadata across many files; file count overstates implementation complexity but still affects review and check surface. | Plan as XL for coordination scope. Evaluate whether release automation makes the per-file effort mechanical. |
| #477 | S by lines (302), L by files (22) | Bulk marketplace documentation refresh; broad file coverage with modest code-change depth. | Plan as docs-heavy L for review routing, not as equivalent to a cross-crate feature. |
| #423 | XL by lines (3 014), S by files (4) | Evaluation corpus and metric changes are concentrated in few files; line count captures data and test volume rather than integration breadth. | Plan as XL for validation burden. Coordination cost is lower than a typical XL; split by eval scenario rather than by module. |
| #461 | L by lines (1 818), S by files (3) | Pack safety hardening is deep and highly localized; low file count understates reasoning and regression risk. | Plan as L because line volume and behavioral risk dominate. Do not reduce to M. |
| #471 | L by lines (2 266), XL by files (58) | Repository-wide verb namespace migration; broad file touch pattern is the primary effort signal. | Plan as XL and split by subsystem if possible. |
| #472 | L by lines (1 987), XL by files (48) | Repository-wide terminology refactor from `importance` to `salience`; broad mechanical and semantic blast radius. | Plan as XL. If verified as fully mechanical and covered by generated tooling, the coordination cost may be lower than a standard XL feature. |
| #608 | M by lines (648), L by files (14) | ADR and retrieval-stack documentation spans many source files; runtime risk is likely lower, but source-traceability and citation-checking burden is broad. | Plan as L for document review and citation checks. |