From 0148b1aba40663d286bf863d5e60fb5eda4fb36b Mon Sep 17 00:00:00 2001
From: OceanLi <122793010+ohdearquant@users.noreply.github.com>
Date: Sun, 31 May 2026 13:42:53 -0400
Subject: [PATCH 1/2] docs: add effort calibration table from historical PR data (#596)

---
 docs/effort-calibration.md | 92 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 docs/effort-calibration.md

diff --git a/docs/effort-calibration.md b/docs/effort-calibration.md
new file mode 100644
index 00000000..a41b3104
--- /dev/null
+++ b/docs/effort-calibration.md
@@ -0,0 +1,92 @@
+# Effort Calibration Table
+
+**Calibration date**: 2026-05-31
+**Data source**: `gh pr list --repo ohdearquant/khive --state merged --limit 100 --json number,title,additions,deletions,changedFiles,mergedAt,labels`
+**Sample size**: 100 merged PRs
+
+---
+
+## S / M / L / XL Calibration Table
+
+| Bucket | Assignment rule | PR count | Changed files (range, median) | Changed lines (range, median) | Example PRs | Wall-clock estimate |
+|--------|----------------|:--------:|-------------------------------|-------------------------------|-------------|---------------------|
+| **S** | Lines ≤ 536 **and** files ≤ 5 | 35 | 1–5, median 3 | 5–528, median 242 | #610 fix(knowledge): relax flaky rerank test for CI — 10 lines, 1 file
#543 feat(gtd): add namespace column to lifecycle audit table — 57 lines, 2 files
#541 feat(storage): VectorStore::batch_exists + reindex filter_unembedded — 411 lines, 3 files
#448 docs(packs): populate HandlerDef params for comm + schedule — 528 lines, 3 files | Not derivable from available data (see Methodology) |
+| **M** | Lines > S **or** files > 5, up to 1 276 lines and 12 files | 42 | 2–12, median 7 | 157–1 276, median 725 | #525 refactor(gate): make RegoGate a proper opt-in reference backend — 157 lines, 6 files
#601 feat(knowledge): normalize scores, default rerank=true, FTS5 hardening, search bench — 926 lines, 10 files
#562 feat(knowledge): suggest/compose verbs + FTS5 escaping + real embedding_coverage — 1 002 lines, 7 files
#467 fix(runtime,mcp): Wave 4 — OSS actor config default namespace — 1 276 lines, 8 files | Not derivable from available data (see Methodology) |
+| **L** | Lines > M **or** files > 12, up to 2 266 lines and 22 files | 17 | 3–22, median 14 | 284–2 181, median 1 634 | #608 docs(adr): ADR-049 acceptance, brain/GTD catch-up, retrieval stack guide — 648 lines, 14 files
#473 fix(knowledge): Wave 5 — topic shape + domain filter + doc accuracy — 713 lines, 21 files
#510 feat(brain): section posteriors with Thompson Sampling + persistence (ADR-048 Phase 1) — 1 510 lines, 9 files
#504 feat(vamana): implement khive-vamana ANN crate (ADR-048) — 2 181 lines, 15 files | Not derivable from available data (see Methodology) |
+| **XL** | Lines > 2 266 **or** files > 22 | 6 | 4–59, median 32 | 998–3 245, median 2 463 | #611 release: v0.2.3 — 998 lines, 59 files
#472 refactor: eliminate importance, use salience throughout (ADR-021 §2 rewrite) — 1 987 lines, 48 files
#547 feat(tests+marketplace): smoke tests for 4 packs + comm/schedule plugins — 2 661 lines, 16 files
#423 test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning — 3 014 lines, 4 files
#470 feat(knowledge): port lore retrieval (atoms + domains + TF-IDF + fold) — 3 245 lines, 13 files | Not derivable from available data (see Methodology) |
+
+---
+
+## Methodology
+
+### Data collection
+
+The calibration uses the last 100 merged PRs pulled directly from the `ohdearquant/khive` repository via `gh pr list`. Per-PR total changed lines equals additions plus deletions. Category labels are inferred from title prefix or labels: `fix`/`security` → bug fix; `feat` → feature; `docs` → docs; `refactor` → refactor; and `chore`/`test`/`style`/`release`/unknown → chore.
+
+### Bucket boundaries
+
+Bucket thresholds were derived using the Jenks natural-breaks method (k = 4) computed separately on the total-changed-lines and changed-files distributions. A PR is assigned to the larger of the two resulting size classes, because a large patch or a broad file spread independently increases planning and review scope.
+
+The key observed gaps that define each boundary are:
+
+| Axis | S/M boundary | M/L boundary | L/XL boundary |
+|------|-------------|-------------|--------------|
+| Changed lines | #526 (536 lines) → #453 (579 lines) | #467 (1 276 lines) → #510 (1 510 lines) | #471 (2 266 lines) → #547 (2 661 lines) |
+| Changed files | 5 files → 6 files | 12 files → 13 files | #477/#501 (22 files) → #472 (48 files) |
+
+### Wall-clock estimates
+
+Wall-clock per-PR duration cannot be derived from the available `gh pr list` output, whose only timestamp field is `mergedAt`. Computing reliable per-PR effort would require at minimum the PR creation timestamp, first-commit timestamp, first ready-for-review timestamp, review round count, and CI retry count. All wall-clock estimate cells are marked "Not derivable" to avoid introducing misleading numbers; the omission is deliberate.
+
+### Confidence
+
+- **High** for size ranges and medians: computed directly from the 100-row JSON.
+- **Medium** for bucket assignment: Jenks breaks are data-derived, but changed lines and files are proxies for effort. Generated code, large test data files, and mechanical renames require human adjustment.
+- **Low** for wall-clock estimates: not computed.
+
+---
+
+## How to Use During Play Planning
+
+Apply the following checklist for each `li play` proposal.
+
+1. **Predict both axes**: estimate the expected changed-files count and expected total changed-lines count before assigning an effort label.
+2. **Assign the larger class**: map each axis to the natural-break class (S / M / L / XL) and take the larger result.
+3. **Check for outlier type**: if one axis is two or more size classes above the other, annotate the play with an outlier reason — `mechanical`, `docs-heavy`, `test-data-heavy`, or `deep-local` — and decide whether to adjust or split.
+4. **Split XL unless mechanical**: any play predicted above the XL threshold (> 2 266 lines or > 22 files) should be split into subplays by crate, pack, docs section, or migration stage unless the change is clearly mechanical and has an explicit verification plan.
+5. **Route reviewers early for broad plays**: if the predicted file count exceeds 12, schedule L or higher and assign reviewers by subsystem or docs owner before work begins.
+6. **Record the estimate and rationale**: note the assigned bucket, both predicted metrics, and any override reason in the play plan so the estimate is auditable.
+
+**Bucket thresholds for play planning**:
+
+| Bucket | Assign when |
+|--------|------------|
+| S | Predicted lines ≤ 536 **and** predicted files ≤ 5 |
+| M | Predicted lines > 536 **or** predicted files > 5, and lines ≤ 1 276 and files ≤ 12 |
+| L | Predicted lines > 1 276 **or** predicted files > 12, and lines ≤ 2 266 and files ≤ 22 |
+| XL | Predicted lines > 2 266 **or** predicted files > 22 |
+
+If a prediction falls in an unobserved gap (e.g., 23–47 files or 2 267–2 660 lines), treat it as XL or split.
+
+**Verification budget by bucket**:
+
+- **S**: targeted test or formatting check.
+- **M**: focused crate or package tests.
+- **L**: subsystem integration or cross-pack tests.
+- **XL**: workspace-level tests plus migration and backward-compatibility checks; docs-heavy XL also requires citation and link checks.
+
+---
+
+## Outliers
+
+The following PRs do not fit their buckets cleanly. The assigned bucket reflects the larger axis; the planning interpretation notes when the standard complexity inference from that bucket should be adjusted.
+
+| PR | Observed mismatch | Why it does not fit cleanly | Planning interpretation |
+|---:|-------------------|----------------------------|------------------------|
+| #611 | M by lines (998), XL by files (59) | Release PR spreads metadata across many files; file count overstates implementation complexity but still affects review and check surface. | Plan as XL for coordination scope. Evaluate whether release automation makes the per-file effort mechanical. |
+| #477 | S by lines (302), L by files (22) | Bulk marketplace documentation refresh; broad file coverage with modest code-change depth. | Plan as docs-heavy L for review routing, not as equivalent to a cross-crate feature. |
+| #423 | XL by lines (3 014), S by files (4) | Evaluation corpus and metric changes are concentrated in few files; line count captures data and test volume rather than integration breadth. | Plan as XL for validation burden. Coordination cost is lower than a typical XL; split by eval scenario rather than by module. |
+| #461 | L by lines (1 818), S by files (3) | Pack safety hardening is deep and highly localized; low file count understates reasoning and regression risk. | Plan as L because line volume and behavioral risk dominate. Do not reduce to M. |
+| #471 | L by lines (2 266), XL by files (58) | Repository-wide verb namespace migration; broad file touch pattern is the primary effort signal. | Plan as XL and split by subsystem if possible. |
+| #472 | L by lines (1 987), XL by files (48) | Repository-wide terminology refactor from `importance` to `salience`; broad mechanical and semantic blast radius. | Plan as XL. If verified as fully mechanical and covered by generated tooling, the coordination cost may be lower than a standard XL feature. |
+| #608 | M by lines (648), L by files (14) | ADR and retrieval-stack documentation spans many source files; runtime risk is likely lower, but source-traceability and citation-checking burden is broad. | Plan as L for document review and citation checks. |

From c0c0a5ff1f77b3c04c124f024da17fc4702d8968 Mon Sep 17 00:00:00 2001
From: OceanLi <122793010+ohdearquant@users.noreply.github.com>
Date: Sun, 31 May 2026 13:54:37 -0400
Subject: [PATCH 2/2] style: deno fmt docs/effort-calibration.md

---
 docs/effort-calibration.md | 48 +++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/docs/effort-calibration.md b/docs/effort-calibration.md
index a41b3104..7aba9693 100644
--- a/docs/effort-calibration.md
+++ b/docs/effort-calibration.md
@@ -8,12 +8,12 @@
 
 ## S / M / L / XL Calibration Table
 
-| Bucket | Assignment rule | PR count | Changed files (range, median) | Changed lines (range, median) | Example PRs | Wall-clock estimate |
-|--------|----------------|:--------:|-------------------------------|-------------------------------|-------------|---------------------|
-| **S** | Lines ≤ 536 **and** files ≤ 5 | 35 | 1–5, median 3 | 5–528, median 242 | #610 fix(knowledge): relax flaky rerank test for CI — 10 lines, 1 file
#543 feat(gtd): add namespace column to lifecycle audit table — 57 lines, 2 files
#541 feat(storage): VectorStore::batch_exists + reindex filter_unembedded — 411 lines, 3 files
#448 docs(packs): populate HandlerDef params for comm + schedule — 528 lines, 3 files | Not derivable from available data (see Methodology) |
-| **M** | Lines > S **or** files > 5, up to 1 276 lines and 12 files | 42 | 2–12, median 7 | 157–1 276, median 725 | #525 refactor(gate): make RegoGate a proper opt-in reference backend — 157 lines, 6 files
#601 feat(knowledge): normalize scores, default rerank=true, FTS5 hardening, search bench — 926 lines, 10 files
#562 feat(knowledge): suggest/compose verbs + FTS5 escaping + real embedding_coverage — 1 002 lines, 7 files
#467 fix(runtime,mcp): Wave 4 — OSS actor config default namespace — 1 276 lines, 8 files | Not derivable from available data (see Methodology) |
-| **L** | Lines > M **or** files > 12, up to 2 266 lines and 22 files | 17 | 3–22, median 14 | 284–2 181, median 1 634 | #608 docs(adr): ADR-049 acceptance, brain/GTD catch-up, retrieval stack guide — 648 lines, 14 files
#473 fix(knowledge): Wave 5 — topic shape + domain filter + doc accuracy — 713 lines, 21 files
#510 feat(brain): section posteriors with Thompson Sampling + persistence (ADR-048 Phase 1) — 1 510 lines, 9 files
#504 feat(vamana): implement khive-vamana ANN crate (ADR-048) — 2 181 lines, 15 files | Not derivable from available data (see Methodology) |
-| **XL** | Lines > 2 266 **or** files > 22 | 6 | 4–59, median 32 | 998–3 245, median 2 463 | #611 release: v0.2.3 — 998 lines, 59 files
#472 refactor: eliminate importance, use salience throughout (ADR-021 §2 rewrite) — 1 987 lines, 48 files
#547 feat(tests+marketplace): smoke tests for 4 packs + comm/schedule plugins — 2 661 lines, 16 files
#423 test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning — 3 014 lines, 4 files
#470 feat(knowledge): port lore retrieval (atoms + domains + TF-IDF + fold) — 3 245 lines, 13 files | Not derivable from available data (see Methodology) | +| Bucket | Assignment rule | PR count | Changed files (range, median) | Changed lines (range, median) | Example PRs | Wall-clock estimate | +| ------ | ----------------------------------------------------------- | :------: | ----------------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- | +| **S** | Lines ≤ 536 **and** files ≤ 5 | 35 | 1–5, median 3 | 5–528, median 242 | #610 fix(knowledge): relax flaky rerank test for CI — 10 lines, 1 file
#543 feat(gtd): add namespace column to lifecycle audit table — 57 lines, 2 files
#541 feat(storage): VectorStore::batch_exists + reindex filter_unembedded — 411 lines, 3 files
#448 docs(packs): populate HandlerDef params for comm + schedule — 528 lines, 3 files | Not derivable from available data (see Methodology) |
+| **M** | Lines > S **or** files > 5, up to 1 276 lines and 12 files | 42 | 2–12, median 7 | 157–1 276, median 725 | #525 refactor(gate): make RegoGate a proper opt-in reference backend — 157 lines, 6 files
#601 feat(knowledge): normalize scores, default rerank=true, FTS5 hardening, search bench — 926 lines, 10 files
#562 feat(knowledge): suggest/compose verbs + FTS5 escaping + real embedding_coverage — 1 002 lines, 7 files
#467 fix(runtime,mcp): Wave 4 — OSS actor config default namespace — 1 276 lines, 8 files | Not derivable from available data (see Methodology) |
+| **L** | Lines > M **or** files > 12, up to 2 266 lines and 22 files | 17 | 3–22, median 14 | 284–2 181, median 1 634 | #608 docs(adr): ADR-049 acceptance, brain/GTD catch-up, retrieval stack guide — 648 lines, 14 files
#473 fix(knowledge): Wave 5 — topic shape + domain filter + doc accuracy — 713 lines, 21 files
#510 feat(brain): section posteriors with Thompson Sampling + persistence (ADR-048 Phase 1) — 1 510 lines, 9 files
#504 feat(vamana): implement khive-vamana ANN crate (ADR-048) — 2 181 lines, 15 files | Not derivable from available data (see Methodology) |
+| **XL** | Lines > 2 266 **or** files > 22 | 6 | 4–59, median 32 | 998–3 245, median 2 463 | #611 release: v0.2.3 — 998 lines, 59 files
#472 refactor: eliminate importance, use salience throughout (ADR-021 §2 rewrite) — 1 987 lines, 48 files
#547 feat(tests+marketplace): smoke tests for 4 packs + comm/schedule plugins — 2 661 lines, 16 files
#423 test(eval): harder corpus + MRR/P@k discriminating metrics for recall tuning — 3 014 lines, 4 files
#470 feat(knowledge): port lore retrieval (atoms + domains + TF-IDF + fold) — 3 245 lines, 13 files | Not derivable from available data (see Methodology) |
 
 ---
 
@@ -29,10 +29,10 @@ Bucket thresholds were derived using the Jenks natural-breaks method (k = 4) com
 
 The key observed gaps that define each boundary are:
 
-| Axis | S/M boundary | M/L boundary | L/XL boundary |
-|------|-------------|-------------|--------------|
+| Axis | S/M boundary | M/L boundary | L/XL boundary |
+| ------------- | ----------------------------------- | --------------------------------------- | --------------------------------------- |
 | Changed lines | #526 (536 lines) → #453 (579 lines) | #467 (1 276 lines) → #510 (1 510 lines) | #471 (2 266 lines) → #547 (2 661 lines) |
-| Changed files | 5 files → 6 files | 12 files → 13 files | #477/#501 (22 files) → #472 (48 files) |
+| Changed files | 5 files → 6 files | 12 files → 13 files | #477/#501 (22 files) → #472 (48 files) |
 
 ### Wall-clock estimates
 
@@ -59,12 +59,12 @@ Apply the following checklist for each `li play` proposal.
 
 **Bucket thresholds for play planning**:
 
-| Bucket | Assign when |
-|--------|------------|
-| S | Predicted lines ≤ 536 **and** predicted files ≤ 5 |
-| M | Predicted lines > 536 **or** predicted files > 5, and lines ≤ 1 276 and files ≤ 12 |
-| L | Predicted lines > 1 276 **or** predicted files > 12, and lines ≤ 2 266 and files ≤ 22 |
-| XL | Predicted lines > 2 266 **or** predicted files > 22 |
+| Bucket | Assign when |
+| ------ | ------------------------------------------------------------------------------------- |
+| S | Predicted lines ≤ 536 **and** predicted files ≤ 5 |
+| M | Predicted lines > 536 **or** predicted files > 5, and lines ≤ 1 276 and files ≤ 12 |
+| L | Predicted lines > 1 276 **or** predicted files > 12, and lines ≤ 2 266 and files ≤ 22 |
+| XL | Predicted lines > 2 266 **or** predicted files > 22 |
 
 If a prediction falls in an unobserved gap (e.g., 23–47 files or 2 267–2 660 lines), treat it as XL or split.
 
@@ -81,12 +81,12 @@ If a prediction falls in an unobserved gap (e.g., 23–47 files or 2 267–2 660
 
 The following PRs do not fit their buckets cleanly. The assigned bucket reflects the larger axis; the planning interpretation notes when the standard complexity inference from that bucket should be adjusted.
 
-| PR | Observed mismatch | Why it does not fit cleanly | Planning interpretation |
-|---:|-------------------|----------------------------|------------------------|
-| #611 | M by lines (998), XL by files (59) | Release PR spreads metadata across many files; file count overstates implementation complexity but still affects review and check surface. | Plan as XL for coordination scope. Evaluate whether release automation makes the per-file effort mechanical. |
-| #477 | S by lines (302), L by files (22) | Bulk marketplace documentation refresh; broad file coverage with modest code-change depth. | Plan as docs-heavy L for review routing, not as equivalent to a cross-crate feature. |
-| #423 | XL by lines (3 014), S by files (4) | Evaluation corpus and metric changes are concentrated in few files; line count captures data and test volume rather than integration breadth. | Plan as XL for validation burden. Coordination cost is lower than a typical XL; split by eval scenario rather than by module. |
-| #461 | L by lines (1 818), S by files (3) | Pack safety hardening is deep and highly localized; low file count understates reasoning and regression risk. | Plan as L because line volume and behavioral risk dominate. Do not reduce to M. |
-| #471 | L by lines (2 266), XL by files (58) | Repository-wide verb namespace migration; broad file touch pattern is the primary effort signal. | Plan as XL and split by subsystem if possible. |
-| #472 | L by lines (1 987), XL by files (48) | Repository-wide terminology refactor from `importance` to `salience`; broad mechanical and semantic blast radius. | Plan as XL. If verified as fully mechanical and covered by generated tooling, the coordination cost may be lower than a standard XL feature. |
-| #608 | M by lines (648), L by files (14) | ADR and retrieval-stack documentation spans many source files; runtime risk is likely lower, but source-traceability and citation-checking burden is broad. | Plan as L for document review and citation checks. |
+| PR | Observed mismatch | Why it does not fit cleanly | Planning interpretation |
+| ---: | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| #611 | M by lines (998), XL by files (59) | Release PR spreads metadata across many files; file count overstates implementation complexity but still affects review and check surface. | Plan as XL for coordination scope. Evaluate whether release automation makes the per-file effort mechanical. |
+| #477 | S by lines (302), L by files (22) | Bulk marketplace documentation refresh; broad file coverage with modest code-change depth. | Plan as docs-heavy L for review routing, not as equivalent to a cross-crate feature. |
+| #423 | XL by lines (3 014), S by files (4) | Evaluation corpus and metric changes are concentrated in few files; line count captures data and test volume rather than integration breadth. | Plan as XL for validation burden. Coordination cost is lower than a typical XL; split by eval scenario rather than by module. |
+| #461 | L by lines (1 818), S by files (3) | Pack safety hardening is deep and highly localized; low file count understates reasoning and regression risk. | Plan as L because line volume and behavioral risk dominate. Do not reduce to M. |
+| #471 | L by lines (2 266), XL by files (58) | Repository-wide verb namespace migration; broad file touch pattern is the primary effort signal. | Plan as XL and split by subsystem if possible. |
+| #472 | L by lines (1 987), XL by files (48) | Repository-wide terminology refactor from `importance` to `salience`; broad mechanical and semantic blast radius. | Plan as XL. If verified as fully mechanical and covered by generated tooling, the coordination cost may be lower than a standard XL feature. |
+| #608 | M by lines (648), L by files (14) | ADR and retrieval-stack documentation spans many source files; runtime risk is likely lower, but source-traceability and citation-checking burden is broad. | Plan as L for document review and citation checks. |
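
Reviewer note, not part of either commit: the bucket rule and the two-class outlier check from the planning checklist can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The thresholds are copied from the calibration table, but every name here (`assign_bucket`, `axis_class`, `changed_lines`, the constants) is hypothetical and does not exist in khive. The `additions` and `deletions` keys are the JSON fields requested in the documented `gh pr list` command.

```python
# Illustrative sketch only. Thresholds come from the calibration table;
# all function and constant names are hypothetical, not khive APIs.
BUCKETS = ("S", "M", "L", "XL")
LINE_UPPER_BOUNDS = (536, 1276, 2266)  # inclusive upper bounds for S, M, L
FILE_UPPER_BOUNDS = (5, 12, 22)        # inclusive upper bounds for S, M, L


def axis_class(value, upper_bounds):
    """Return the size-class index (0 = S ... 3 = XL) for one axis."""
    for index, bound in enumerate(upper_bounds):
        if value <= bound:
            return index
    return len(upper_bounds)  # above every bound, so XL


def assign_bucket(changed_lines_count, changed_files_count):
    """Take the larger of the two axis classes and flag wide mismatches."""
    by_lines = axis_class(changed_lines_count, LINE_UPPER_BOUNDS)
    by_files = axis_class(changed_files_count, FILE_UPPER_BOUNDS)
    return {
        "bucket": BUCKETS[max(by_lines, by_files)],
        "by_lines": BUCKETS[by_lines],
        "by_files": BUCKETS[by_files],
        # Checklist step 3: annotate when the axes differ by two or more classes.
        "needs_outlier_reason": abs(by_lines - by_files) >= 2,
    }


def changed_lines(pr):
    """Total changed lines for one PR record: additions plus deletions."""
    return pr["additions"] + pr["deletions"]
```

For instance, `assign_bucket(998, 59)` (the figures listed for #611) classifies the PR as M by lines and XL by files, assigns XL, and flags it for an outlier reason. Predictions in the unobserved gaps (23–47 files, 2 267–2 660 lines) fall through to XL, which matches the guidance to treat them as XL or split.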