Skip to content
This repository was archived by the owner on Mar 6, 2026. It is now read-only.

Latest commit

 

History

History
1919 lines (1538 loc) · 82.4 KB

File metadata and controls

1919 lines (1538 loc) · 82.4 KB

Data Science Session Record

Append-only log of session activity. Updated on each ss (snapshot).


Snapshot 1 — 2026-02-28T12:20Z

Context at session start

  • Picked up from prior session that had built the full repo from scratch
  • Prior session created: kit (3 conditions, 4 agent files), harness (run.sh, audit.js, 2 mock services), cost model, paper scaffold, README
  • Prior session ran 7 batches of test runs across various prefix experiments (ds-, apex-, lore-)
  • All prior runs are INVALID — see data_science_context.md "Test Run History"

Work done this session

  1. Reviewed current state of all condition files and agent files

    • baseline.md still had apex- prefix — updated to lore-
    • enforcement.md and full.md already had lore- prefix
    • Agent files were lore-*.md but missing YAML frontmatter
  2. Discovered agent registration mechanism

    • Checked ~/Github/lore/.claude/agents/ — the production Lore harness
    • Found that Claude Code requires YAML frontmatter (name, description, model) in .claude/agents/*.md for agent registration
    • Without frontmatter, claude agents list only shows 4 built-in agents
    • This was the root cause of ALL prior test failures — not prompt content
  3. Added YAML frontmatter to all 4 agent files

    • lore-default.md: name=lore-default, model=sonnet
    • lore-explore.md: name=lore-explore, model=haiku
    • lore-fast.md: name=lore-fast, model=haiku
    • lore-power.md: name=lore-power, model=opus
    • Verified: claude agents list from temp dir shows all 4 project agents
  4. Ran 1 validation (full condition, lore- prefix, WITH frontmatter)

    • Result: agent registration error gone — model correctly used lore-default
    • BUT: second bug revealed — sandbox permissions block curl in non-interactive mode
    • Workers hit "This command requires approval" and couldn't execute
    • Orchestrator then fell back to direct execution (also blocked)
    • Need --dangerously-skip-permissions in run.sh
  5. Principles-based rollback of condition files

    • User flagged concern about too many variables introduced
    • Re-read prompt engineering principles document (all 183 lines)
    • Identified 3 additions that violated P2/P7/P8:
      • Role identity line ("You are the Lore orchestrator...")
      • Routing Examples section (6 lines)
      • Escalation on Bail-Out section (11 lines)
    • Stripped all 3 from enforcement.md and full.md
    • Final state: baseline=23 lines, enforcement=34 lines, full=49 lines
    • Clean one-variable-per-step differential confirmed
  6. Created data_science_context.md — full knowledge dump for agent continuity

Decisions made

  • Use lore- as constant anchor across all conditions (not a variable)
  • Strip to minimal, measure, add back only what data justifies (per P2, P7, P8)
  • All prior .runs/ data is invalid and should not be used

Open items (in priority order)

  1. Fix run.sh: add --dangerously-skip-permissions to claude -p
  2. Run 1 validation of full condition with both fixes applied
  3. If valid: N=3 confirmation batch across all 3 conditions
  4. If confirmed: N=10+ scale run

Snapshot 2 — 2026-02-28T17:30Z

Work done this session

  1. Assessed data gaps — bare N=3 complete (localhost), baseline only T1-T2 (localhost), full N=3 missing entirely.

  2. Wrote thorough comms to CHAPPiE — full 54-job battery spec with matrix definition, env vars, source files, output path structure, verification steps, auth method. Sent to ~/Desktop/ds-to-chappie.txt.

  3. Set up bidirectional polling — monitored chappie-to-ds.txt for turn-based comms. CHAPPiE confirmed receipt, deployed in 6 waves of 9 jobs.

  4. While waiting: audited existing local data

    • Bare N=3 (localhost): 11.1% delegation, 0% violation-free — consistent pattern
    • Baseline T1-T2 (localhost): T1 direct, T2 2/3 delegated
  5. CHAPPiE completed 54-job K3s battery (2026-02-28T17:12Z)

    • 54/54 succeeded, zero failures
    • 270/270 files collected and verified
    • 6 waves, ~20 min total wall time
    • T6 (SKU surplus analysis) took longest at 73s-4m02s
  6. Audited all 3 conditions with audit-battery.js:

    • bare: 16.7% delegation (T2 only, via built-in Explore), 0% violation-free
    • baseline: 83.3% delegation (all except T1), 55.6% violation-free
    • full: 100% delegation, 94.4% violation-free, 100% schema compliance
  7. Ran cross-condition analysis with analyze-all.js:

    • Staircase pattern confirmed: bare < baseline < full on all metrics
    • T1 cutoff confirmed: bare/baseline direct, full delegates to lore-fast
    • Schema: bare n/a, baseline 0/24, full 34/34
    • Tier routing: baseline uses lore-default everywhere, full uses lore-fast for T1/T4/T6
  8. Acknowledged CHAPPiE, closed comms loop. No further K3s work needed now.

  9. Updated data_science_context.md with all K3s battery results and findings.

Key results table

Metric bare (N=18) baseline (N=18) full (N=18)
Delegation rate 16.7% 83.3% 100%
Violation-free 0.0% 55.6% 94.4%
Schema compliance n/a 0/24 (0%) 34/34 (100%)
Mean cost/run $0.17 $0.34 $0.39

Open items (in priority order)

  1. Run statistical tests (Fisher's exact) on N=3 data
  2. Identify cells needing more N for significance
  3. Targeted scale-up via CHAPPiE if needed
  4. Revise paper with 3-layer model and real data
  5. Studies 2 and 3

Snapshot 3 — 2026-02-28T18:30Z

Work done this session

  1. Built stats.js — Fisher's exact test, Cramér's V effect sizes, and power analysis for all pairwise condition comparisons. Pure Node.js, no dependencies.

  2. Ran stats on N=3 data — identified 2 comparisons not significant:

    • baseline vs full delegation (p=0.2286) — needs N=34 per group
    • T1 baseline vs full delegation (p=0.10) — needs N=4 per task
  3. Designed targeted scale-up — 35 additional jobs (not blanket N=10):

    • baseline T1, T2, T5, T6: each from N=3 → N=10 (28 jobs)
    • full T1: N=3 → N=10 (7 jobs)
    • Rationale: spend tokens only where statistical power is insufficient
  4. Sent targeted spec to CHAPPiE — clear job matrix, env vars, output paths. CHAPPiE deployed immediately (had already read the targeted spec before a clarification message crossed in the mail).

  5. CHAPPiE completed 35-job scale-up (2026-02-28T17:50Z)

    • 35/35 succeeded, zero failures, 175/175 files
    • Timestamp: 20260228-174500
    • ~10 min wall time
  6. Audited scale-up data — consistent with N=3 patterns:

    • baseline T1: still 0% delegation (0/7 new runs)
    • baseline T2: 100% delegation, 71% violation-free (5/7)
    • baseline T5: 100% delegation, 100% violation-free (7/7)
    • baseline T6: 100% delegation, 29% violation-free (2/7)
    • full T1: 100% delegation, 86% violation-free (6/7)
  7. Re-ran stats with combined data — ALL key comparisons now significant:

    • baseline vs full delegation: p=0.0115 (was 0.2286)
    • T1 baseline vs full delegation: p<0.0001 (was 0.10)
    • T1 baseline vs full viol-free: p=0.0007 (was 0.40)
    • baseline vs full viol-free: p=0.0006 (was 0.018, now stronger)
  8. Updated stats.js to merge across timestamps automatically (was hardcoded to single K3s timestamp).

  9. Acknowledged CHAPPiE — Study 1 data collection complete. No more K3s runs needed. Infrastructure warm for Studies 2/3.

  10. Updated data_science_context.md with final sample sizes, statistical results, and scale-up data inventory.

  11. Comprehensive paper revision — rewrote paper/paper.md from scratch:

    • New Section 3: Prompt Engineering Framework with full complaint-to-constraint synthesis methodology (6-stage pipeline, empirical complaint set, 9 principles, mapping to study interventions, how principles prevented errors during study design)
    • Revised Introduction for 3-layer model
    • Revised Background Section 2.2 for layered delegation problem
    • New Hypotheses (H1-H4) matching layer model
    • Complete Method (Section 6) with task battery, conditions, agent defs, mixed-N design with power analysis rationale
    • Complete Results (Section 7) with 8 tables of real data and p-values
    • Discussion (Section 8) interpreting each finding
    • Context Growth Cost Model (Section 9) with formal derivation
    • Limitations, Future Work, Conclusion
    • Appendices with full text of all condition files and agent definitions

Final data inventory

Batch Source Jobs Files Timestamp
N=3 battery K3s 54/54 270 20260228-170000
Scale-up K3s 35/35 175 20260228-174500
Total 89 445

Final sample sizes

Cell bare baseline full
T1 3 10 10
T2 3 10 3
T3 3 3 3
T4 3 3 3
T5 3 10 3
T6 3 10 3
Total 18 46 25

Final statistical results

Comparison Delegation p Viol-free p
bare vs baseline <0.0001 <0.0001
bare vs full <0.0001 <0.0001
baseline vs full 0.0115 0.0006
T1 baseline vs full (deleg) <0.0001
T1 baseline vs full (viol) 0.0007

Harness files created this session

File Purpose
harness/stats.js Fisher's exact, Cramér's V, power analysis, scale-up recommendations

Open items (in priority order)

  1. Final review pass on paper coherence
  2. Study 2 (Worker Cost Profiles) — designed, not executed
  3. Study 3 (Context Growth Economics) — designed, not executed
  4. Fill cost model numerical table with computed values
  5. Scale to N=10 across all cells if reviewers request it

Snapshot 4 — 2026-02-28T18:45Z

Work done

  1. Final coherence review — read paper end-to-end, cross-checked all numbers against actual audit/stats output. Found and fixed 5 issues:

    • Schema count: baseline 0/56 → 0/59 (3 tables + discussion)
    • Cost medians: baseline $0.35 → $0.26, full $0.23 → $0.20
    • Abstract violation-free: 94.4%/55.6% → 92.0%/52.2% (mixed-N aggregates)
    • Conclusion typo: "Layers 1" → "Layer 1"
    • Added limitation: Layers 1 and 2 confounded (full adds both enforcement and template; enforcement-only condition excluded from battery)
  2. Corrected Section 3.4 mapping table — intervention effects now accurately note the combined nature of the full condition rather than attributing effects to individual interventions.

  3. Updated data_science_context.md — Study 1 marked COMPLETE, harness file inventory added, next studies listed.

Study 1 — COMPLETE

  • 89 sessions (54 N=3 + 35 targeted scale-up)
  • All key comparisons p < 0.05
  • Paper written with real data, 1054 lines
  • Complaint-to-constraint synthesis methodology documented
  • All supporting scripts working (audit, analyze, stats)
  • K3s infrastructure warm for future studies

Open items

  1. Study 2 (Worker Cost Profiles)
  2. Study 3 (Context Growth Economics)
  3. Fill cost model numerical table
  4. Warm-start delegation experiments

Snapshot 5 — 2026-02-28T19:15Z

Work done

  1. Created GitHub repolorehq/delegation-study (private), pushed via drewswiredin account. 3 commits:

    • be254e4 — Study 1 complete (42 files, 14,517 lines)
    • cbe71d0 — Meta-methodology notes (meta/study-orchestration-notes.md)
    • ef1576b — Reviewer prompt (meta/reviewer-prompt.md)
  2. Wrote meta/study-orchestration-notes.md — documents the novel research process for future write-up:

    • Multi-agent research orchestration (Opus 4.6 data science + GPT-5.3 Codex infrastructure)
    • Inter-agent async comms via plain text files
    • Separation of concerns (neither agent can do the other's work)
    • Maker-in-the-loop approval gates
    • Cross-model collaboration on methodology/framework
    • Iterative study evolution driven by empirical findings
    • Rationale: keep published paper clean, save meta-methodology for separate publication when asked
  3. Wrote meta/reviewer-prompt.md — structured peer review prompt applying the 9-principle framework:

    • P1: pass/fail criteria for the review itself
    • P4: exact output contract (errors/weaknesses/suggestions structure)
    • P5: cite evidence, flag uncertainty as "unverified"
    • P6: 6 staged review passes (stats → pipeline → design → claims → framework → structure)
    • P9: explicit sections
    • Anti-sycophancy framing ("do not validate — scrutinize")
  4. Scoping decision — Studies 2 and 3 are separate follow-up publications, not prerequisites for Study 1 paper. Paper is ready for peer review and publication as-is (References section excepted).

Repository state

Commit Description
be254e4 Study 1 complete — paper, harness, conditions, agents, K8s, model
cbe71d0 Meta-methodology notes for future write-up
ef1576b Structured reviewer prompt

Remote: git@github.com:lorehq/delegation-study.git (private)

Current status

  • Paper: under peer review (reviewers launched by maker)
  • Data: 89 sessions, 445 files, all audited, all stats significant
  • Infrastructure: K3s warm, CHAPPiE idle, ready for Studies 2/3
  • Standing by for reviewer feedback

Open items

  1. Address reviewer feedback on paper
  2. Populate References section
  3. Study 2 (Worker Cost Profiles) — separate publication
  4. Study 3 (Context Growth Economics) — separate publication
  5. Meta-methodology paper — when asked about process

Snapshot 6 — 2026-02-28T~21:00Z

Context

  • Two independent peer reviews received (Opus and GPT cold reviews)
  • 10 errors, 8 weaknesses, and ~10 missing citations identified
  • Previous session (Snapshots 5→6 gap) applied most error/weakness fixes
  • This session: remaining items — verify stats.js, fill cost table, add citations, final coherence pass, commit and push

Work done this session

  1. Verified stats.js — runs cleanly after dynamic rate changes. All pairwise comparisons significant, output matches paper figures. Power analysis correctly renamed to "minimum-N heuristic" throughout.

  2. Filled cost model numerical table (model/context-growth-cost.md)

    • Computed values for T=1,3,5,10,20,50 using the formal model
    • Direct cost: $0.07 (T=1) → $95.19 (T=50), O(T²) growth
    • Delegated cost: $0.05 (T=1) → $4.19 (T=50), ~O(T) growth
    • Savings: 29.1% (T=1) → 95.6% (T=50) — confirms quadratic savings
    • Added explanatory paragraph after table
  3. Added 10 references to the paper (paper/paper.md References section):

    • [1] AutoGen (Wu et al., 2023)
    • [2] CrewAI (Moura, 2024)
    • [3] LangGraph (LangChain, 2024)
    • [4] FrugalGPT (Chen et al., 2023)
    • [5] RouteLLM (Ong et al., 2024)
    • [6] Claude Code docs (Anthropic, 2025)
    • [7] Claude API pricing (Anthropic, 2025)
    • [8] Categorical Data Analysis (Agresti, 2002) — Fisher's, Cramér's V
    • [9] OpenAI Assistants API (2024)
    • [10] LLM Alignment Survey (Shen et al., 2024)
    • Inline citations added at all 10 flagged locations
  4. Final coherence pass — agent-assisted full read found 5 issues:

    • Table 1 bare schema compliance: n/a0% (0/3 prompts)
    • Section cross-references: 3.2 → 3.3 for P2/P7/P8 definitions ✓
    • "eliminates" → "effectively overrides" in Section 8.3 ✓
    • Appendix F "power analysis" → "minimum-N estimation" ✓
    • Added borderline V=0.298 footnote on Table 2 ✓
  5. Clarified complaint-to-constraint synthesis originality — changed "we call" to "we introduce here" (Section 3.1)

  6. Committed and pushedc1fc221 to lorehq/delegation-study

Review fix scorecard (all items from both reviewers)

Category Total Fixed Remaining
Errors 10 10 0
Weaknesses 8 8 0
Missing citations 10 10 0
Suggestions (nice-to-have) 9 0 9 (deferred)

Repository state

Commit Description
be254e4 Study 1 complete
cbe71d0 Meta-methodology notes
ef1576b Structured reviewer prompt
3dc239a Snapshot 5: repo published
c1fc221 Apply peer review fixes: citations, cost model, coherence

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: all must-fix and should-fix review items resolved. 9 nice-to-have suggestions remain (bar chart, CIs, commit trail, formalize opaque test, refine cost model rates, etc.) — deferred.
  • Data: 89 sessions, 445 files, all audited, all stats significant
  • Cost model: numerical table complete, breakeven analysis confirmed
  • References: 10 citations populated, all inline markers placed
  • Paper is publishable — no blocking items remain

Open items

  1. Nice-to-have suggestions from reviewers (9 items, deferred)
  2. Study 2 (Worker Cost Profiles) — separate publication
  3. Study 3 (Context Growth Economics) — separate publication
  4. Meta-methodology paper — when asked about process
  5. Clean up paper/review.md duplicate (copy of Opus review in wrong dir)

Snapshot 7 — 2026-02-28T~22:00Z

Context

  • Round 2 peer reviews received (fresh Opus + GPT sessions, same prompt)
  • Round 1 fixes all landed cleanly — no stale figures or terminology issues
  • Round 2 found 5 new errors, 10 weaknesses, 8 new citation needs
  • Most substantive finding: multiple comparisons correction needed

Work done this session

  1. Fixed 5 errors:

    • Intro "executes directly 100%" → acknowledges bare T2 Explore delegation
    • Abstract p=0.012 → p=0.0115 (consistent with Table 2)
    • Violation detection scope: documented gap between enforcement text ("curl, wget, http") and audit regex (requires URL syntax)
    • "Clean causal attribution" removed from Section 3.4 mapping table
    • "Cost-aware routing absent" → "sporadic" in baseline, "systematic" in full
  2. Implemented Holm-Bonferroni multiple comparisons correction:

    • Added Section 5 to stats.js with full Holm-Bonferroni output
    • All 6 primary comparisons survive correction (critical finding)
    • baseline→full delegation (p=0.0115) would fail Bonferroni (α/6=0.0083) but passes Holm-Bonferroni (rank 6, threshold=0.0500)
    • Added correction results and Bonferroni note to paper Section 7.2
    • Added multiple comparisons limitation to Section 10
  3. Addressed 10 weaknesses:

    • Abstract/Conclusion Layer 0 hedged as "bundle" (registration + guidance)
    • Min-N heuristic balanced-group assumption documented in Section 6.8
    • Cost model parameter provenance added (manual JSONL inspection)
    • Section 3.5 evidence trail strengthened: "consistent with" not causal
    • Derivation method scope adaptation (coding → prompt-engineering) noted
    • Delegation metric split: Table 1 now has both "any" and "custom-worker"
    • Delegation conflation added as explicit limitation in Section 10
    • Framework derivation reproducibility limitation added to Section 3.1
    • Section 8.6 generalizability hedged as hypothesis pending replication
    • Opaque-name test: already adequately hedged (N=1, no artifact)
  4. Added 3 new references:

    • [10] Meyer (1992) — Design by Contract (replaced LLM alignment survey)
    • [11] Anthropic prompt caching docs
    • [12] Bai et al. (2022) — Constitutional AI
    • [13] Stamatis (2003) — FMEA methodology Total references: 13
  5. Committed and pushed: 6014baf

Round 2 review fix scorecard

Category Total Fixed Remaining
Errors 5 5 0
Weaknesses 10 10 0
Missing citations 8 8 0
Suggestions (nice-to-have) 9 0 9 (deferred)

Cumulative review scorecard (rounds 1+2)

Category R1 R2 Total fixed
Errors 10 5 15
Weaknesses 8 10 18
Citations 10 8 18 (13 unique refs)

Repository state

Commit Description
be254e4 Study 1 complete
cbe71d0 Meta-methodology notes
ef1576b Structured reviewer prompt
3dc239a Snapshot 5: repo published
c1fc221 Round 1 review fixes
4c38653 Snapshot 6
6014baf Round 2 review fixes: Holm-Bonferroni, metric separation, hedging

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: all must-fix and should-fix items from both review rounds resolved
  • Stats: Holm-Bonferroni implemented, all 6 primary comparisons survive
  • References: 13 citations, all inline markers placed
  • Paper is publishable — no blocking items remain
  • Nice-to-have suggestions deferred (bar charts, CIs, 4-condition design, reproducibility script, etc.)

Open items

  1. Nice-to-have suggestions from reviewers (~15 total across both rounds)
  2. Study 2 (Worker Cost Profiles) — separate publication
  3. Study 3 (Context Growth Economics) — separate publication
  4. Meta-methodology paper — when asked about process

Snapshot 8 — 2026-02-28T~23:00Z

Context

  • Round 3 peer reviews received (fresh sessions, same prompt)
  • Opus verdict upgraded to "needs minor revision" (from "needs revision")
  • GPT still says "major issues" but primarily about reproducibility artifacts
  • Round 3 found 6 new errors, 7 weaknesses, 8 new citation needs

Work done this session

  1. Fixed 6 errors:

    • Cost model single-rate simplification: disclosed Haiku is 17x cheaper than Opus, meaning savings estimates are conservative
    • Abstract now uses custom-worker delegation (0%→78.3%) not inclusive metric (16.7%→78.3%) — makes the Layer 0 finding stronger
    • Data availability note added: raw JSONL/output.json available on request, regen commands documented
    • "single-variable" language removed from Section 3.5 P8 paragraph
    • Abstract/Conclusion causal claims reframed as "larger observed increase" not "drives more than"
    • Breakeven criterion inconsistency resolved between model file and paper (single-task derivation, multi-task note)
  2. Addressed 7 weaknesses:

    • Cramér's V ceiling effect: explicit confound note in Section 8.1 ("V is mechanically compressed near ceilings")
    • Section 3.5 active-agent language → passive ("consistent with", "aligns with") to match the post-hoc rationalization hedge
    • Full T1 violation pattern discussed: 2/10 sessions delegate AND violate — enforcement less effective on trivial supplementary calls
    • Holm-Bonferroni step-down bug fixed: running max for adjusted p-values + rejection propagation. Results unchanged (bug was latent)
    • "additive chain" → "cumulative layers" throughout Intro/Methods
    • Demand characteristics rewritten as explicit confound (not symmetric framing) — notes success-criteria redefinition changes the DV
    • Table 7 quantified with exact worker-type counts per condition (baseline: 57/2/0, full: 25/20/1)
  3. Added 3 new references:

    • [14] Cohen (1988) — effect size benchmarks
    • [15] White et al. (2023) — prompt pattern catalog
    • [16] Glaser & Strauss (1967) — grounded theory Total references: 16
  4. Section 2.2 residual causal claim fixed: "drives delegation more than" → "produces a larger observed increase in delegation rate than"

  5. Committed and pushed: b1ed7b3

Round 3 review fix scorecard

Category Total Fixed Remaining
Errors 6 6 0
Weaknesses 7 7 0
Missing citations 8 5 3 (already cited or handled)
Suggestions (nice-to-have) 9 0 9 (deferred)

Cumulative review scorecard (rounds 1+2+3)

Category R1 R2 R3 Total fixed
Errors 10 5 6 21
Weaknesses 8 10 7 25
Citations 10 8 5 23 (16 unique refs)

Repository state

Commit Description
be254e4 Study 1 complete
cbe71d0 Meta-methodology notes
ef1576b Structured reviewer prompt
3dc239a Snapshot 5
c1fc221 Round 1 review fixes
4c38653 Snapshot 6
6014baf Round 2 review fixes
0654bb2 Snapshot 7
b1ed7b3 Round 3 review fixes

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: 3 rounds of review, 21 errors / 25 weaknesses / 16 refs fixed
  • Opus verdict: "needs minor revision" — converging toward acceptance
  • Stats: Holm-Bonferroni correct (step-down fixed), all comparisons hold
  • Claims: fully hedged — observational language, ceiling effects noted, demand characteristics explicit, causal overreach eliminated
  • Table 7: quantified with exact routing counts
  • Cost model: single-rate limitation disclosed, breakeven consistent

Convergence assessment

The reviews are converging. Round 1→2→3 error severity is declining:

  • R1: stale figures, wrong stats, missing variables
  • R2: precision, scope gaps, multiple comparisons
  • R3: metric conflation, causal hedging, code correctness

Remaining nice-to-have items are genuinely optional (CIs, bar charts, 4-condition design, preregistration, data release). The paper is publishable.

Open items

  1. Nice-to-have suggestions (~20 total across 3 rounds, all deferred)
  2. Publish raw data release (frozen bundle with checksums)
  3. Study 2 (Worker Cost Profiles) — separate publication
  4. Study 3 (Context Growth Economics) — separate publication
  5. Meta-methodology paper — when asked about process

Snapshot 9 — 2026-02-28T~23:45Z

Context

  • 3 rounds of peer review complete, all errors/weaknesses fixed
  • Maker asked: "are our tests fundamentally flawed?"
  • Honest assessment: not flawed, but bundled variables limit causal claims
  • Decision: redesign to 5-condition hierarchy for proper variable isolation

Work done this session

  1. Design critique and redesign:

    • Identified that bare→baseline bundles 4 changes (registration + soft language + tier table + tool inventory change)
    • Identified that baseline→full bundles 5 changes (success criteria redefinition + enforcement language + violation naming + template + success criteria adds "cost-appropriate")
    • Identified that enforcement-only condition existed but was never run
    • Identified that a registration-only condition was needed to isolate the pure tool-visibility effect
  2. Reframed the study design:

    • bare demoted from "condition" to "infrastructure validation check"
    • registration-only = true behavioral baseline (agents visible, zero instructions)
    • baseline = Treatment 1 (soft prompting)
    • enforcement = Treatment 2 (enforcement without template)
    • full = Treatment 3 (enforcement + template)
    • enforcement→full is now the cleanest single-variable transition
  3. Created kit/conditions/registration-only.md:

    • Identical to bare.md (6 lines, no delegation language)
    • The only difference: agents directory is present when this runs
    • Tests pure tool visibility effect
  4. Updated harness/run-battery.sh:

    • Handles all 5 conditions
    • Only bare excludes agents; registration-only gets agents
  5. Updated harness/stats.js:

    • Loads registration-only and enforcement data when available
    • Pairwise comparisons auto-expand for available conditions
    • Existing 3-condition analysis unchanged
  6. Updated CHAPPiE's KB:

    • ~/Github/lore-CHAPPiE/docs/workflow/in-flight/items/delegation-study-batch-runner/index.md
    • 3 conditions → 5 conditions table
    • Phase 3 documented (36 jobs, N=3 exploratory)
    • Agent injection logic clarified (only bare excludes agents)
  7. Sent CHAPPiE Phase 3 request:

    • ~/Desktop/ds-to-chappie.txt — 36-job battery (2 conditions × 6 tasks × N=3)
    • Estimated ~$11, ~10 min wall time
  8. Rewrote data_science_context.md — full knowledge dump reflecting 5-condition design, Phase 3 status, all findings with caveats

  9. Committed and pushed: 714d6b4

N estimation for new conditions

  • registration-only vs bare: N=5 sufficient if rate ≥30%, N=3 if ≥50%
  • registration-only vs baseline: N=6-10 depending on observed rate
  • enforcement vs baseline violation-free: N=10 for ~90% vs 52%
  • Strategy: N=3 exploratory first, then scale up (same as Phase 1)

Key unknowns Phase 3 will resolve

  1. registration-only delegation rate — the most important unknown. If 0%: soft prompting drives delegation, not tool visibility. If 30-50%: tool visibility helps but prompting doubles it. If 70%+: tool visibility alone is nearly sufficient.

  2. enforcement delegation rate — expected ~100% (like full).

  3. enforcement violation-free rate — expected ~90% (like full).

  4. enforcement schema compliance — expected 0% (no template). If confirmed, this cleanly isolates the template effect.

Repository state

Commit Description
be254e4 Study 1 complete
... (6 intermediate commits)
b1ed7b3 Round 3 review fixes
47a45b0 Snapshot 8
714d6b4 Add registration-only condition, 5-condition design

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Phase 3 battery: CHAPPiE message sent, KB updated, awaiting run
  • Paper: publishable with 3-condition data, will be restructured after Phase 3 data arrives
  • Stats code: ready to ingest new conditions automatically
  • data_science_context.md: fully rewritten for 5-condition design

Open items

  1. Await Phase 3 data from CHAPPiE
  2. Analyze Phase 3 results, determine scale-up targets
  3. Restructure paper for 5-condition hierarchy
  4. Re-run peer review after paper restructure
  5. Study 2 (Worker Cost Profiles) — separate publication
  6. Study 3 (Context Growth Economics) — separate publication

Snapshot 10 — 2026-02-28T~17:00Z

Context

  • Phase 3 data arrived from CHAPPiE (36/36 clean, 00:15Z)
  • Phase 3 audit confirmed: registration-only = 0% custom delegation (identical to bare), enforcement = 100% delegation / 94.4% violation-free
  • LANDMARK: registration alone does nothing — the model ignores agents without prompting
  • Operator challenged naming: "baseline" contained prompt engineering, which is not a baseline. OOTB Claude Code with agents IS the baseline.

Work done this session

  1. Audited Phase 3 data:

    • registration-only: 0% custom delegation, 16.7% any (Explore only), identical to bare on all metrics (p=1.000)
    • enforcement: 100% delegation, 94.4% violation-free, 0% schema compliance
    • Both confirm predictions from Snapshot 9
  2. Ran full statistical analysis (stats.js):

    • All pairwise comparisons computed for 5 conditions
    • Holm-Bonferroni with 18-test family: soft-guidance→enforcement delegation (p=0.0504) and soft-guidance→enforcement-contract delegation (p=0.0115) do NOT survive correction
    • Min-N heuristic: need N=21/group for soft-guidance→enforcement delegation
  3. Major naming correction — operator-driven:

    • "registration-only" renamed to "baseline" (OOTB Claude Code IS the baseline)
    • "baseline" renamed to "soft-guidance" (it contains prompt engineering)
    • "full" renamed to "enforcement-contract" (consistent naming scheme)
    • Applied across: condition files, .runs/battery/ directories, run-battery.sh, stats.js, data_science_context.md
  4. Expanded to 7-condition design — operator-driven:

    • Orch-worker contract (template) becomes independent variable crossed with each prompt level
    • New conditions: baseline-contract, soft-guidance-contract
    • Rationale: (a) test whether template suppresses delegation affinity due to perceived overhead, (b) pre-emptively address reviewer concerns about accuracy degradation from hand-offs
    • Created kit/conditions/baseline-contract.md — OOTB + template only
    • Created kit/conditions/soft-guidance-contract.md — soft + tier table + template
  5. Updated all harness code for 7 conditions:

    • run-battery.sh — 7 condition names in usage
    • stats.js — dynamic condition loading, 2-axis pair comparisons (prompt chain + contract effect), all conditions optional
    • analyze-all.js — already dynamic, no changes needed
    • audit-battery.js — no condition name references, no changes needed
  6. Designed scale-up plan:

    • Variable N: N=8/task for N50 conditions, N=5/task for N30 conditions
    • bare: 18 (done)
    • baseline: 30 (+12 new)
    • baseline-contract: 30 (+30 new)
    • soft-guidance: 48 (+10 new, T3/T4 only)
    • soft-guidance-contract: 48 (+48 new)
    • enforcement: 48 (+30 new)
    • enforcement-contract: 48 (+25 new, T1 already at 10)
    • Total: 155 new sessions, ~$39, ~45-60 min wall time
  7. Sent Phase 4 request to CHAPPiE:

    • ~/Desktop/ds-to-chappie.txt — 155-job battery spec
    • Updated CHAPPiE's KB (index.md) for 7-condition design
    • Operator forwarded to CHAPPiE
  8. Updated data_science_context.md — full rewrite for 7-condition design

The 7 Conditions (final design)

# Condition Agents Prompt level Contract Target N
1 bare NO none NO 18 (done)
2 baseline YES none NO 30
3 baseline-contract YES none YES 30
4 soft-guidance YES soft NO 48
5 soft-guidance-contract YES soft YES 48
6 enforcement YES hard NO 48
7 enforcement-contract YES hard YES 48

Two analysis axes:

  • Axis 1 (Prompt level): bare → baseline → soft-guidance → enforcement
  • Axis 2 (Contract effect): with vs without at each prompt level

Key findings confirmed from Phase 3

Metric bare baseline p
Custom-worker delegation 0% (0/18) 0% (0/18) 1.000
Any delegation (incl. Explore) 16.7% (3/18) 16.7% (3/18) 1.000
Violation-free 0% (0/18) 0% (0/18) 1.000
Metric enforcement enforcement-contract p
Delegation 100% (18/18) 100% (25/25) 1.000
Violation-free 94.4% (17/18) 92.0% (23/25) 1.000
Schema compliance 0% (0/40) 100% (46/46)

Decisions made

  • "baseline" must mean OOTB Claude Code with agents — prompt engineering is NOT a baseline
  • Contract (template) tested as independent variable, not just final layer
  • Variable N per condition: N=50 where effects are close, N=30 elsewhere
  • Balanced per-task cells: N=8/task for 50-target, N=5/task for 30-target
  • Communication to CHAPPiE via ~/Desktop/ds-to-chappie.txt

Open items

  1. Await Phase 4 data from CHAPPiE (155 sessions)
  2. Full statistical analysis across all 7 conditions
  3. Restructure paper for 7-condition design with 2-axis analysis
  4. Re-run peer review after paper restructure
  5. Push repo to remote (DNS issues last attempt)
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 11 — 2026-02-28T~17:30Z

Context

  • Phase 4 delivered by CHAPPiE: 155/155 jobs, 775/775 files
  • Audit revealed ALL 155 sessions invalid: OAuth token expired (401)
  • claude -p exits 0 on auth failure, so CHAPPiE's file count check passed
  • Zero API calls, zero cost, zero valid data from Phase 4
  • Notified CHAPPiE: token refresh + re-run needed
  • While waiting: operator proposed context growth projections for the paper

Work done this session

  1. Audited Phase 4 data — found total failure:

    • All 155 output.json files contain auth error, not results
    • "is_error": true, "total_cost_usd": 0 across the board
    • Root cause: Max subscription OAuth token expired between Phase 3 and 4
    • Existing audit script doesn't check for auth errors (bug)
    • Sent urgent message to CHAPPiE with diagnosis and re-run request
    • Recommended canary job before full re-run
  2. Discussed study findings with operator:

    • Confirmed baseline (OOTB + agents) produces 0% custom delegation
    • Findings are strong: clear staircase from baseline→soft-guidance→enforcement
    • Caveats: single model (opus), single task domain (API queries), single framework (Claude Code)
  3. Planned context growth projection (not yet implemented):

    • Operator proposed including projected context growth graphs in paper
    • Rationale: delegated workflows keep orchestrator context clean — raw execution noise (HTTP responses, curl output, error traces) lives in worker contexts, never touches orchestrator
    • This potentially extends effective session length (more work before compaction) and improves orchestrator reasoning accuracy (context purity)
    • See design sketch below

Context Growth Projection — Design Sketch

What we can measure from existing data (125 valid sessions)

From session.jsonl files we can extract per-session:

  • Orchestrator total token usage (input + output)
  • Number of orchestrator-direct tool calls vs delegated tool calls
  • Context growth rate (tokens per turn)
  • Whether compaction events occurred
  • Ratio of "reasoning tokens" to "execution noise tokens"

Key comparisons:

  • bare/baseline (all direct execution) vs enforcement/enforcement-contract (all delegated) — should show dramatically different orchestrator context growth rates

What we'd project (modeled, not measured)

Extrapolation to longer sessions (50-100 turns, realistic production):

  • Direct execution growth curve: steep, frequent compaction (sawtooth)
  • Delegated growth curve: shallow slope, long periods between compaction
  • Breakeven point: how many turns of useful work before first compaction

Proposed figures for paper

Figure 1: Empirical staircase — bar charts of delegation rate, violation-free rate, schema compliance across all 7 conditions. The data we already have.

Figure 2: Context growth model — two curves:

  • Red (direct execution): steep token growth, sawtooth from frequent compaction events. Based on bare/baseline token data extrapolated.
  • Blue (delegated execution): slow token growth, much longer between compactions. Based on enforcement token data extrapolated.
  • X-axis: orchestrator turns. Y-axis: orchestrator context size (tokens).
  • Compaction threshold marked as horizontal line.
  • Key insight: delegated workflow gets N× more useful turns before compaction.

Figure 3: Context purity — stacked bar or area chart showing orchestrator context composition:

  • "Planning/reasoning" tokens vs "execution noise" tokens
  • Direct execution: mostly noise (raw API responses in context)
  • Delegated: mostly reasoning (only worker summaries in context)
  • Connects to accuracy argument: cleaner context → better high-level reasoning

Where this fits

  • Study 1 paper: include ONE projection figure in "Practical Implications" section, clearly labeled as modeled. Gives practitioners the "so what."
  • Study 3 (Context Growth Economics): rigorous measurement with longer sessions, actual compaction tracking, formal cost model. Separate pub.

Implementation plan (when ready)

  1. Write harness/token-analysis.js — extracts per-session token data from session.jsonl files
  2. Compare orchestrator token usage: direct vs delegated conditions
  3. Build projection model with assumptions documented
  4. Generate figures (likely Python matplotlib or similar)
  5. Add to paper as Figure N in Practical Implications section

Decisions made

  • Phase 4 data is invalid, needs full re-run after token refresh
  • Context growth projections will be included in Study 1 paper (one modeled figure, clearly labeled) — implementation deferred until Phase 4 data arrives
  • Audit script needs auth validation check (TODO)

Notes for follow-up: Market impact and publication strategy

Cost impact framing (for paper):

  • We have per-session cost data across conditions but cannot responsibly estimate total market impact without external data on deployment volumes
  • Per-session framing: enforcement-contract routes 43.5% to haiku (10-25x cheaper than opus). Cost savings are workload-dependent.
  • Paper should present per-session economics with explicit assumptions, not market-wide projections

Expected reactions on publication:

  1. Immediate practical adoption — finding is too simple/actionable to ignore. "Add 23 lines, go from 0%→78% delegation." Zero barrier to replication.
  2. Framework maintainers (LangChain, CrewAI, AutoGen) may start shipping default delegation prompts instead of leaving prompt layer to users.
  3. Anthropic product interest — empirical data about their model's behavior in their own framework. Could influence Claude Code default system prompt.
  4. Replication across models (GPT, Gemini, Llama) — the obvious next study. If pattern generalizes, much bigger story.
  5. Context purity argument gets picked up — reframes delegation from "cost optimization" to "accuracy preservation." More compelling for enterprise adoption than cost alone.
  6. Counterarguments: trivial tasks, single model, moderate N. All already acknowledged in paper. Study's strength is being first controlled measurement, not last word.

Follow-up publications to consider:

  • Cross-model replication (GPT-4, Gemini, open-source models)
  • Longer session studies measuring actual compaction events
  • Task complexity spectrum (our T1-T6 are all API queries)
  • Production deployment case study with real workload metrics
  • Market cost impact analysis with industry deployment data

Who will likely reach out:

  • Agent framework developers (LangChain, CrewAI, AutoGen) — prompt layer gap affects their product. Licensing or collaboration interest.
  • Enterprise teams running multi-agent deployments — "does this apply to our stack?" Applied guidance requests.
  • Anthropic — product implications for Claude Code defaults.
  • Researchers — replication and extension proposals.
  • AI media / newsletters — "LLMs ignore their tools" is a headline.

Pre-publication prep:

  • Have clear position on prompt pattern licensing (condition files are in public repo, people will copy regardless)
  • Prepare concise summary for social/blog amplification
  • Consider arxiv preprint + Twitter thread for maximum visibility

Open items

  1. Await Phase 4 re-run from CHAPPiE (token refresh needed first)
  2. Add auth validation to audit script
  3. Full statistical analysis across all 7 conditions (after valid Phase 4)
  4. Extract token data from 125 valid sessions for context growth model
  5. Build context growth projection figures
  6. Restructure paper for 7-condition design with 2-axis analysis
  7. Re-run peer review after paper restructure
  8. Study 2 (Worker Cost Profiles) — separate publication
  9. Study 3 (Context Growth Economics) — separate publication

Snapshot 12 — 2026-02-28T~18:00Z

Context

  • Phase 4 first attempt failed (OAuth 401, all 155 sessions invalid)
  • CHAPPiE notified, awaiting token refresh + re-run
  • Operator discussion: market implications, publication strategy

Work done

  • Diagnosed Phase 4 failure: all output.json contain auth error, zero API calls, zero cost. Root cause: expired OAuth token.
  • Sent CHAPPiE urgent re-run request with canary job recommendation
  • Discussed study findings confidence with operator:
    • Strong: registration does nothing, soft guidance is primary driver, enforcement improves compliance, template is schema-only intervention
    • Needs Phase 4: contract effect at each level, soft→enforcement at proper N
  • Discussed market cost impact — cannot responsibly estimate total market without external deployment data. Paper will use per-session framing.
  • Discussed expected publication reactions — noted in session log above
  • Discussed context growth projections — delegated workflows preserve orchestrator context purity, extend session length, improve reasoning. One modeled figure planned for Study 1, rigorous measurement in Study 3.
  • No code changes this snapshot — preserving context for Phase 4 analysis

Current state

  • 125 valid sessions (Phases 1-3)
  • 155 invalid sessions (Phase 4, auth failure) — awaiting re-run
  • 7 condition files ready, all harness code updated
  • Repo pushed to GitHub at 36d1cfd
  • Blocked on CHAPPiE token refresh

Paper writing principles:

  1. The data tells the story. No superlatives, no overselling. The numbers are striking enough without editorial amplification.
  2. Observational, not causal. Every finding stated as "we observed X under conditions Y" — never "X is true" or "X causes Y."
  3. Scope-bound every claim. This model (claude-opus-4-6), this framework (Claude Code), these tasks (API queries), these sample sizes. We did not prove generalization.
  4. Registration is a prerequisite, not a null. Never say "registration does nothing." Say "registration without prompting produced 0% custom delegation in our conditions." It enables the capability — prompting activates it.
  5. Avoid strong categorical language. Not "the template doesn't affect delegation" — say "we observed no significant difference in delegation rate between conditions with and without the template at these sample sizes." Leave room for effects we didn't detect.
  6. Modest wording protects credibility. If findings are robust, others will make the strong claims when they replicate. That's how it should work. Our job is to measure carefully and report honestly.
  7. Acknowledge what we didn't test. One model, one task domain, single-prompt sessions, moderate N. These are real limitations, not throat-clearing — state them as such.
  8. Let readers draw conclusions. Present the data, describe the conditions, report the statistics. The implications are obvious to anyone reading — we don't need to spell them out in bold.

Publication and distribution plan:

  • Research repo (lorehq/delegation-study) goes public — paper, data, harness, stats. Citation target for researchers.
  • Drop-in repo (new, e.g. lorehq/lore-delegation-kit) — minimal: one CLAUDE.md, four agent files, README linking to paper. Adoption target for practitioners. Gets starred/forked/shared.
  • Publish as Andrew [personal] with HSD-INC / Lore org affiliation. Standard "Author, Org" academic format.
  • Agent naming for drop-in kit: TBD. Must be distinctive and memorable without being ambiguous. Will test naming as a variable — note that baseline-opaque.md exists from an early one-off test with opaque agent names that was never followed up on. Naming could be a separate small study or an appendix finding.

Open items

  1. Await Phase 4 re-run from CHAPPiE (token refresh needed first)
  2. Add auth validation to audit script
  3. Full statistical analysis across all 7 conditions (after valid Phase 4)
  4. Extract token data for context growth model
  5. Build context growth projection figures
  6. Restructure paper for 7-condition 2-axis design
  7. Re-run peer review after paper restructure
  8. Agent naming test for drop-in kit
  9. Create drop-in repo after paper is final
  10. Study 2 (Worker Cost Profiles) — separate publication
  11. Study 3 (Context Growth Economics) — separate publication

Snapshot 13 — 2026-02-28T~20:00Z

Context

  • New conversation. Picked up from Snapshot 12 blocked state.
  • Phase 4 re-run data had arrived from CHAPPiE (155/155 valid sessions).
  • Data integrity investigation was in progress — Phase 4 merged rates looked dramatically different from Phase 3 preliminary data.

Work done this session

  1. Resolved data integrity crisis — root cause found and fixed:

    • Phase 4 enforcement showed 72.7% delegation (expected 100%)
    • Root cause: 133 empty directory stubs from the failed OAuth batch survived CHAPPiE's cleanup. Harness created 288 directories for the failed attempt, CHAPPiE cleaned file contents but left empty dirs. Re-run filled only 155 of them (the actual scale-up jobs).
    • Audit script counted empty dirs as "runs that didn't delegate"
    • Fix: deleted 133 empty run directories (rmdir)
    • Verified: 155 valid session.jsonl files exactly match CHAPPiE's report (baseline=12, baseline-contract=30, soft-guidance=10, soft-guidance-contract=48, enforcement=30, enforcement-contract=25)
  2. Re-ran all audits and analysis with clean data:

    • audit-battery.js for all 6 Phase 4 conditions
    • analyze-all.js cross-condition comparison
    • stats.js full statistical analysis with Holm-Bonferroni
  3. Merged research paper writing principles:

    • Read Opus fused file (498 lines, 9 principles, extensive citations)
    • Read GPT fused file (210 lines, 10 principles, compressed)
    • Merged into unified 9-principle set at meta/research-paper-writing-principles.md
    • Key merge decisions: GPT's sharper naming for P2, absorbed GPT's P9/P10 as directive+standalone, tightened checklist to 16 items
  4. Wrote new paper from scratch — 7-condition design:

    • Zero content from old 3-condition paper
    • Written purely from clean data governed by:
      • 9 research paper writing principles (merged)
      • 8 operator paper writing principles (Snapshot 12)
    • Structure: Abstract, Introduction (gap+contribution+scope), Method (6 subsections), Results (7 subsections with 9 tables), Discussion (observations + 9 limitations + 5 implications), Conclusion, Data Availability, References
    • Self-audited against 16-item principles checklist: 15/16 pass, 1 partial (no CIs on proportions)
    • Committed and pushed: 8e0e0f5
  5. Added task correctness measurement:

    • Built harness/score-correctness.js — regex-based ground-truth checker for all 6 tasks against deterministic mock service answers
    • Ground truth: T1=5 orders, T2=5 orders (discovery), T3=WIDGET-A stock 50, T4=both, T5=all sufficient, T6=WIDGET-B below 20% surplus, reorder 5
    • Results: 278/280 correct (99.3%)
    • Only 2 failures, both in enforcement conditions:
      • enforcement/T5/run-06: worker missed X-Warehouse header hint
      • enforcement-contract/T1/run-03: infrastructure failure (service not ready)
    • Non-enforcement conditions: 100% correct (212/212)
  6. Integrated correctness throughout the paper:

    • Not added as a separate section — woven into every relevant part
    • Updated: Abstract, Scope, Contribution, Measures (now 3 binary measures), Table 1 (added correctness column), prompt-level axis results, contract-template results, new per-task Table 8, Discussion observations, Limitations (reframed from "no correctness" to "coarse correctness"), Practical Implications (new item 5), Conclusion, Data Availability
    • Committed and pushed: be8301e

Clean data — final aggregate rates

Condition N Delegation Violation-free Correctness Cost
bare 18 16.7% 0.0% 100.0% $0.17
baseline 30 20.0% 0.0% 100.0% $0.17
baseline-contract 30 46.7% 16.7% 100.0% $0.24
soft-guidance 56 82.1% 60.7% 100.0% $0.38
soft-guidance-contract 48 89.6% 72.9% 100.0% $0.31
enforcement 48 100.0% 97.9% 97.9% $0.41
enforcement-contract 50 100.0% 96.0% 98.0% $0.37

Statistical significance (Holm-Bonferroni corrected)

  • 12/20 pairwise comparisons survive correction
  • All prompt-level axis comparisons survive (except bare vs baseline)
  • No contract-template comparisons survive
  • baseline→soft-guidance delegation: p<0.001, V=0.61 (large)
  • soft-guidance→enforcement delegation: p=0.002, V=0.30 (medium)
  • baseline→enforcement delegation: p<0.001, V=0.84 (large)

Repository state

Commit Description
44db74c Snapshot 12
8e0e0f5 New 7-condition paper + merged writing principles
be8301e Integrate correctness throughout paper + scoring script

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: new clean paper written from scratch, correctness integrated, pushed at be8301e. Ready for peer review.
  • Data: 280 valid sessions across 7 conditions, all clean
  • Stats: Holm-Bonferroni corrected, 12/20 comparisons significant
  • Correctness: 278/280 (99.3%), 2 failures traced to worker/infra errors
  • Writing principles: merged Opus+GPT into unified 9-principle set
  • Awaiting: operator review feedback

Open items

  1. Address peer review feedback on new paper
  2. Context growth projection figures
  3. Add auth validation to audit script
  4. Create drop-in repo (lorehq/lore-delegation-kit)
  5. Agent naming decision for drop-in kit
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 14 — 2026-02-28T~21:00Z

Context

New conversation session. Operator asked to continue with next steps. No peer review feedback provided yet. Proceeded with Wilson CIs (flagged gap from self-audit) and duration data.

Work done

  1. Wilson score 95% confidence intervals — stats.js

    • Added wilsonCI(successes, n, z) function
    • Added formatCI() helper for display
    • New Section 0 output: Wilson 95% CIs for delegation, violation-free, and correctness across all 7 conditions
    • All computed values verified against manual cross-check
  2. Wilson CIs integrated throughout paper.md

    • Abstract: CIs on endpoint delegation and violation-free rates
    • Section 2.6: Added paragraph describing Wilson score method with rationale (better coverage than Wald for small N / boundary proportions)
    • Table 1: Added 95% CI columns for delegation and violation-free
    • Tables 2-3: CIs on individual proportions in prompt-level comparisons
    • Tables 4-5: CIs on individual proportions in contract-template comparisons
    • Section 3.1 prose: CIs on key cited rates
    • Section 3.2 prose: CIs on correctness rates with overlap commentary
    • Section 3.3 prose: CI overlap discussion supporting non-significance
    • Section 5 (Conclusion): CIs on endpoint rates
    • References: Added Wilson (1927) citation
    • Fixed stale "Task correctness is not measured" in Section 4.2
  3. Duration data extraction — stats.js

    • Added loadDurations(condition) — reads output.json files for duration_ms and duration_api_ms
    • Added durationStats() — mean, median, SD, min, max
    • New Section 0b output: duration summary table
    • Confirmed no stdin wait contamination: all sessions were claude -p with stdin from /dev/null, so duration_ms = pure execution time
  4. Duration integrated into paper.md (aggregate only)

    • Section 2.5: Added "Session duration" measure description
    • Table 1: Added median duration (s) and SD (s) columns
    • Section 3.1: Added descriptive paragraph noting range (40.1s–65.2s median), higher variance in delegating conditions, explicitly flagged as descriptive (not pre-specified, not statistically tested)

Key duration observations

Condition Median (s) SD (s)
bare 46.4 45.5
baseline 40.1 16.6
baseline-contract 47.0 40.8
soft-guidance 59.6 86.6
soft-guidance-contract 50.3 43.2
enforcement 65.2 64.9
enforcement-contract 62.7 66.6

Duration not tested for significance — confounded with task complexity and delegation behavior, not pre-specified.

Paper writing principles compliance

Wilson CIs close the remaining gap from Snapshot 13 self-audit:

  • Checklist item "Effect sizes + CIs accompany every inferential test" — NOW SATISFIED
  • All 16 checklist items now pass

Current status

  • Paper: Wilson CIs + durations added, all checklist items satisfied
  • Awaiting: peer review feedback from operator
  • Not yet committed — changes are local only

Open items

  1. Address peer review feedback on paper (NEXT — awaiting operator)
  2. Context growth projection figures
  3. Add auth validation to audit script
  4. Create drop-in repo (lorehq/lore-delegation-kit)
  5. Agent naming decision for drop-in kit
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 15 — 2026-02-28T~22:00Z

Context

Operator provided two peer reviews of the paper. Review 1 focused on scientific critiques (causal language, binary measures, regex detection, temporal confound, compliance-vs-benefit gap, cost data). Review 2 was an adversarial defense review (12 points) written from the perspective of someone rebutting misuse of the paper's findings.

Both reviews were written before Wilson CIs, durations, and correctness were added — some items were already addressed by Snapshot 14 work.

Work done — peer review revisions

  1. Softened causal language throughout paper

    • Replaced "produced" with "was associated with" / "we observed" in Sections 3.2, 4.1, 4.4
    • Added explicit confound acknowledgment in 4.1: conditions differ in multiple respects (text, length, constraint specificity), so attribution to any single component is limited
  2. New Section 4.2: "Compliance Is Not Benefit"

    • 3 paragraphs distinguishing policy compliance from engineering quality
    • Explicit cost comparison: $0.17 (baseline) vs $0.41 (enforcement) — enforced delegation was more expensive in this task domain
    • T1 alternative interpretation: orchestrator's refusal to delegate a trivial fetch may reflect sound engineering judgment
    • Clear statement that delegation compliance and cost efficiency are separate questions
  3. Strengthened limitations (Section 4.3)

    • Binary measures: added specific examples (one exploratory curl = same coding as full direct execution), noted raw data contains counts, suggested graded delegation measure
    • Temporal confound: added phase ordering, non-interleaving, non-randomization, and specific bare-only-in-Phase-1 example
    • New limitation: regex-based violation detection with false-positive and false-negative examples
    • Section renumbered (4.2→4.3 Limitations, 4.3→4.4 Practical Implications) to accommodate new section
  4. Phase 4 integrity statement (Section 2.4)

    • Explicitly states no data from failed batch was inspected or used to inform re-run
    • States same job matrix re-executed with no changes
  5. Fixed [repository URL] placeholder

    • Now: https://github.com/lorehq/delegation-study
  6. Cost model caveats (model/context-growth-cost.md)

    • 6 numbered caveats with impact assessments:
      1. Prompt caching (could reduce savings 50-80%)
      2. Handoff cost underestimate (real h = 5-8k, shifts break-even)
      3. Worker failure rate (1% noise, 10% material)
      4. T=50 unrealistic (lead with T=3-10)
      5. Compaction as competing strategy (biggest unknown)
      6. Model-tier arbitrage separate from context savings
    • Revised expectation: 40-70% savings (down from 77-96%)
    • Expanded Study 3 validation approach: 4 → 8 items (adds compaction condition, cache-hit measurement, savings decomposition, worker failure tracking)

Review items mapped

Review Point Action Status
R1 Soften causal language Associational framing throughout Done
R1 Separate compliance/benefit New Section 4.2 Done
R1 Violation regex limits New limitation entry Done
R1 Binary coding intensity Strengthened limitation Done
R1 Fix repository URL Replaced placeholder Done
R1 Token waste/cost gap Section 4.2 + cost comparison Done
R1 Temporal confound Strengthened with phase details Done
R2 Cost undermines waste narrative Section 4.2 cost comparison Done
R2 T1 intelligent behavior Section 4.2 T1 paragraph Done
R2 Phase 4 integrity Section 2.4 explicit statement Done
R2 Binary measures nuance Strengthened limitation Done
R2 Regex detection New limitation entry Done
R1 CIs in headline tables Already done (Snapshot 14) Done

Repository state

Commit Description
be8301e Integrate correctness + scoring script
d06f949 Wilson CIs, durations, peer review revisions, cost model caveats

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: All review items addressed, pushed at d06f949
  • Cost model: 6 caveats added, revised savings expectation
  • Awaiting: additional review from operator
  • Paper line count: 422 lines (up from ~380 pre-review)
  • Limitations count: 10 (was 9, added regex-based violation detection)

Open items

  1. Receive and address further review feedback
  2. Context growth projection figures
  3. Add auth validation to audit script
  4. Create drop-in repo (lorehq/lore-delegation-kit)
  5. Agent naming decision for drop-in kit
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 16 — 2026-02-28T~23:00Z

Context

Operator provided reviews 3 and 4. Review 3 was a balanced peer review (method critiques with rebuttals, business-relevance critiques, writing critiques). Review 4 was a thorough adversarial defense review (12 points defending against "negligent waste of subscription tokens" narrative).

Work done — reviews 3 and 4

  1. Review 3 (2 edits)

    • Abstract: Added opening sentence — "This is a behavioral compliance study, not an optimization or efficiency study."
    • Section 2.1: Added terminology note — "enforcement" and "violation" are operational labels for prompt-behavior correspondence, not normative judgments about system quality.
    • All other points confirmed as already addressed by Snapshots 14-15.
  2. Review 4 (2 edits)

    • Section 4.2: Sharpened cost-regime caveat — explicitly notes tasks are trivial enough that coordination overhead dominates; names the regime where delegation costs would invert (multi-file code gen, large refactoring, research synthesis).
    • Section 4.2: New paragraph listing missing comparison metrics (total token consumption, cost per correct answer, latency-adjusted cost, context window preservation) and noting these are outside stated scope, not oversights.
    • All other points (8 of 8) confirmed as already addressed.
  3. Reviewer prompt rewrite

    • meta/reviewer-prompt.md rewritten from 160-line staged template to 12-line simple prompt: "read the paper, provide defensible rebuttals, scientific critiques, and writing critiques."
  4. Cost model caveats (model/context-growth-cost.md)

    • 6 caveats added in Snapshot 15, unchanged this snapshot.
    • Revised savings expectation: 40-70% (down from 77-96%).
  5. Prompt engineering principles loaded

    • 9-principle framework at /home/andrew/Desktop/PromptEngineeringCollab/prompt-engineering/prompt-engineering-principles.md
    • Key principles active: P2 (Economy), P7 (Match Technique to Task), P8 (Evaluate, Version, Iterate)

Review disposition summary (all 4 reviews)

Review Points Already addressed New edits Total addressed
R1 7 critiques 0 7 7/7
R2 12 points 6 6 12/12
R3 11 critiques 9 2 11/11
R4 8 points 6 2 8/8

Paper changes across all reviews: ~17 distinct edits to paper.md, resulting in new Section 4.2, strengthened limitations (10 total), associational language, terminology note, abstract framing, cost-regime caveats, and missing-metrics paragraph.

Repository state

Commit Description
d06f949 Wilson CIs, durations, R1+R2 revisions, cost model caveats
5b5d1d0 Review 3: terminology note + abstract framing
81bb5ec Review 4: cost-regime caveat + missing-metrics paragraph
525b4d5 Rewrite reviewer prompt

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: 4 reviews addressed, all pushed. ~430 lines.
  • Cost model: 6 caveats, revised savings expectation (40-70%)
  • Reviewer prompt: rewritten, simple 3-part structure
  • Principles: 9-principle PE framework loaded in context
  • All 16 writing principles checklist items pass

Open items

  1. Context growth projection figures
  2. Add auth validation to audit script
  3. Create drop-in repo (lorehq/lore-delegation-kit)
  4. Agent naming decision for drop-in kit
  5. Study 2 (Worker Cost Profiles) — separate publication
  6. Study 3 (Context Growth Economics) — separate publication

Snapshot 17 — 2026-03-01T~02:42Z

Context

New conversation session. Picked up from Snapshot 16 context dump and the "IN PROGRESS — UNCOMMITTED CHANGES FROM REVIEW 5" section, which documented three local-only fixes from the most thorough peer review (R5) that had not yet been committed or pushed.

Work done — commit and push Review 5 fixes

  1. Verified uncommitted changes — 3 files modified:

    • paper/paper.md (58 lines changed)
    • harness/audit-battery.js (8 lines changed)
    • harness/score-correctness.js (16 lines changed)
  2. Reviewed full diff — all changes consistent with Review 5 fixes:

    • Model identity correction: claude-sonnet-4-20250514claude-opus-4-6 throughout paper (abstract, scope, threats, agent table). This was the most significant factual error in the paper — the orchestrator was always opus, verified from modelUsage field in all 280 output.json files.
    • Violation regex fix: audit-battery.js now excludes which/type/command -v prefixes from violation detection. The enforcement T4 "violation" was a false positive — which curl wget python3... is a tool-availability check, not a network request. Post-fix: enforcement 100%/100%, enforcement-contract 100%/100% (was 97.9%/96.0%).
    • T6 scorer exclusion: score-correctness.js now checks for false positives where WIDGET-A or GADGET-X are incorrectly flagged as below threshold (tight-window regex, {0,40} chars). No correctness numbers changed (no session actually had this error), but the scorer is now more robust.
    • Paper updates: violation-free rates corrected in Tables 1, 5, 7; Cramer's V values updated; wording improvements throughout (observational language, scope-bound claims, cached-state caveat, all-text aggregation caveat, column abbreviation note in Table 6).
  3. Committed and pushed:

    • Commit d892a72: "Review 5: fix model identity (opus not sonnet), violation regex false positive, T6 scorer exclusion"
    • Pushed to origin/master, verified up to date.

Corrected aggregate rates (final, post-Review 5)

Condition N Delegation Violation-free Correctness Cost
bare 18 16.7% 0.0% 100.0% $0.17
baseline 30 20.0% 0.0% 100.0% $0.17
baseline-contract 30 46.7% 16.7% 100.0% $0.24
soft-guidance 56 82.1% 60.7% 100.0% $0.38
soft-guidance-contract 48 89.6% 72.9% 100.0% $0.31
enforcement 48 100.0% 100.0% 97.9% $0.41
enforcement-contract 50 100.0% 100.0% 98.0% $0.37

Review disposition summary (all 5 reviews)

Review Points Key findings Status
R1 7 Causal language, compliance≠benefit, temporal confound All addressed
R2 12 Adversarial defense, cost data, T1 judgment All addressed
R3 11 Terminology framing, abstract clarity All addressed
R4 8 Cost regime, missing metrics All addressed
R5 12 Model identity wrong, violation false positive, scorer bug, multiplicity framing All addressed

Repository state

Commit Description
d06f949 Wilson CIs, durations, R1+R2 revisions, cost model caveats
5b5d1d0 Review 3: terminology note + abstract framing
81bb5ec Review 4: cost-regime caveat + missing-metrics paragraph
525b4d5 Rewrite reviewer prompt
27cb20a Snapshot 16 + context dump
d892a72 Review 5: model identity, violation regex, T6 scorer

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Current status

  • Paper: 5 reviews addressed, all pushed. All Review 5 fixes committed.
  • Enforcement conditions: Now 100%/100% delegation and violation-free (was 97.9%/96.0% before false positive fix)
  • All 3 critical data/code fixes from R5 resolved and pushed

Open items

  1. Update data_science_context.md with Review 5 outcomes
  2. Context growth projection figures for paper
  3. Add auth validation to audit script
  4. Create drop-in repo (lorehq/lore-delegation-kit)
  5. Agent naming decision for drop-in kit
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 18 — 2026-03-01T~03:01Z

Context

Same conversation session as Snapshot 17. Operator requested "one more review" — conducted Review 6 (self-review) and then "fix everything." This was the most data-driven review: audited all 611 Bash calls across the dataset to verify regex coverage, discovered node http.get() calls missed by the violation regex, and quantified the enforcement conditions' near-zero Bash activity.

Review 6 findings

6 scientific critiques:

  1. R6-C1: Violation regex coverage gap — The regex misses node -e "http.get(...)" calls (9-14% of actual network calls in non-enforcement conditions). Bare had 10 missed, baseline had 26 missed. However, enforcement conditions had ZERO missed because they had only 5 total Bash calls across 98 sessions — all tool-availability checks (which curl, ls /usr/bin/). No actual network calls in enforcement.

  2. R6-C2: Violation-free is vacuous in enforcement — Enforcement averaged <0.1 Bash calls/session. The 100% violation-free rate is a consequence of the absence of Bash activity, not an independent measure of constraint adherence. Paper now discloses this in Section 2.5.

  3. R6-C3: T2 inflates baseline delegation — T2 was delegated under ALL conditions (built-in Explore agent). Excluding T2: bare 0.0% (was 16.7%), baseline 4.0% (was 20.0%). Paper now notes this in Section 3.4.

  4. R6-C4: T5 scorer regex fragility — Greedy .* pattern matches across full text, not per-sentence. Could produce false negatives if affirmative and negative conclusions co-occur. No actual errors found. Noted in coarse correctness limitation.

  5. R6-C5: Mean delegation count missing — Paper now reports mean worker delegations per session: bare 0.17, baseline 0.20, baseline-contract 0.73, soft-guidance 1.32, soft-guidance-contract 1.38, enforcement 2.04, enforcement-contract 1.92. Added to Section 3.1.

  6. R6-C6: Decision timeline unspecified — Pre-registration limitation now specifies: Phase 1 had bare/soft-guidance/enforcement-contract; baseline/enforcement added Phase 3; baseline-contract/soft-guidance- contract added Phase 4. Metrics defined before Phase 1. Both Section 2.6 and Section 4.3 updated.

6 writing critiques:

  1. R6-W1: Abstract reordered — Gap/context first, "behavioral compliance study" framing moved after first result sentence.

  2. R6-W2: Full Holm-Bonferroni table — Was 9 "selected rows," now shows all 20 rows with corrected p-values from stats.js output.

  3. R6-W3: Condition descriptions trimmed — Bare, baseline, baseline-contract compressed from 3 paragraphs to 1 compact paragraph.

  4. R6-W4: "Observed" synonyms — 6 instances varied to "measured," "showed," "appeared," "changed" to reduce rhythmic monotony.

  5. R6-W5: Holm-Bonferroni summary — Added: "In short: prompt-level effects are robust to multiplicity correction; contract-template effects are not."

  6. R6-W6: Prompt header confound — Enforcement conditions use "Lore Operating Instructions" header; all others use "Project Instructions." Noted in Section 2.1 and added as new limitation in Section 4.3.

Key data discovery

Full Bash call audit across all 280 sessions:

Condition Total Bash Network (caught) Network (missed) Tool checks
bare 132 101 10 ~21
baseline 221 165 26 ~30
baseline-contract 135 94 12 ~29
soft-guidance 74 55 4 ~15
soft-guidance-contract 44 31 4 ~9
enforcement 2 0 0 2
enforcement-contract 3 0 0 3

The enforcement conditions' near-zero Bash activity (5 calls total, all which/ls tool checks) is the strongest evidence that the enforcement prompt effectively eliminated direct execution. The violation-free metric is a consequence of this, not an independent finding.

Repository state

Commit Description
56de4af Snapshot 17 + context dump
357eecb Review 6: 12 fixes (regex audit, vacuous violation-free, etc.)

Remote: git@github.com:lorehq/delegation-study.git (private, up to date)

Review disposition summary (all 6 reviews)

Review Points Key findings Status
R1 7 Causal language, compliance≠benefit, temporal confound All addressed
R2 12 Adversarial defense, cost data, T1 judgment All addressed
R3 11 Terminology framing, abstract clarity All addressed
R4 8 Cost regime, missing metrics All addressed
R5 12 Model identity wrong, violation false positive, scorer bug All addressed
R6 12 Regex coverage, vacuous violation-free, T2 inflation, header confound All addressed

Current status

  • Paper: 6 reviews addressed, all pushed. 437 lines.
  • Limitations: Now 11 specific threats (added prompt header confound)
  • New data added: Mean delegation counts, T2-exclusion rates, Bash call audit

Open items

  1. Update data_science_context.md with Review 6 outcomes
  2. Context growth projection figures for paper
  3. Add auth validation to audit script
  4. Create drop-in repo (lorehq/lore-delegation-kit)
  5. Agent naming decision for drop-in kit
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication

Snapshot 19 — 2026-03-01T~03:17Z

Context

Same conversation session as Snapshots 17-18. Operator provided Review 7 (21 items across 3 sections). Triaged against prior reviews: 7 items already addressed by R5/R6, 14 new items requiring fixes.

Triage — already addressed by prior reviews

R7 item Prior fix Notes
R7-2.4 (enforcement multi-element) R6-W6 Header confound noted
R7-2.5 (bare T2 inflation) R6-C3 T2 exclusion rates added
R7-2.7 (violation non-Bash) R6-C1 611 Bash call audit, 9-14% gap
R7-2.8 (no correctness) Already exists Correctness scoring built, reported throughout
R7-3.2 (violation term) R5 Terminology note at line 67
R7-3.3 (table abbreviations) R5 Legend added to Table 6 caption

Work done — 14 new fixes

R7-2.1: Raw session data not in repo

R7-2.2: Unreported conditions (baseline-opaque, cedar/flint/marble)

  • Confirmed: baseline-opaque.md exists in kit/conditions/, cedar.md/ flint.md/marble.md exist in kit/.claude/agents/. No data was collected.
  • Disclosed in Section 2.1: designed for future study on agent-name descriptiveness, retained for transparency, not part of current analysis.

R7-2.3: Prompt length confound

  • Added prompt length discussion to Section 4.1: conditions range 6-49 lines, length increases monotonically with specificity.
  • Merged with header confound in Section 4.3 as single "Prompt length and header confounds" limitation. Noted bare-vs-baseline (both 6 lines, no delegation difference) as partial evidence against length-only effect.

R7-2.6: Violation-free applied anachronistically

  • Expanded terminology note: violation-free is anachronistic for non-enforcement conditions, where direct network calls are expected default behavior. Clarified interpretation guidance.

R7-2.9: Per-cell N=3 fragility

  • Added explicit caution note to Section 3.4 with Wilson CI example (0/3 → [0.0, 56.2]). Noted per-task patterns should not be interpreted as individually significant.

R7-2.10: Temporal confound unanalyzed

  • Ran cross-phase delegation rate analysis. Results: enforcement 100% in both Phase 3 (30/30) and Phase 4 (18/18); soft-guidance 75-100% across three batches; baseline 17-25% across two batches.
  • Added these rates to the temporal confound limitation as "partial reassurance" with note that formal phase-effect analysis was not done.

R7-2.11: Missing bare-contract cell

  • Explained in Section 2.1: omitted because the template references worker agents that don't exist in the bare condition.

R7-2.12: Min-N heuristic misleading

  • Removed "suggesting effect is small" inference from Section 3.3. Replaced with explicit caveat that these are deterministic estimates, not formal power calculations.

R7-3.1: Abstract V=0.84 mismatch with rate range

  • Fixed: rate range 16.7%→100% spans bare→enforcement, so now cites V=0.89 (bare-vs-enforcement) instead of V=0.84 (baseline-vs-enforcement). Computed and verified: V=0.89 for the 18×48 table.

R7-3.4: Section 3.6 vague quantities → exact counts

  • Replaced all qualitative descriptions with exact worker type counts from full audit. Example: "lore-default 75/98 (76.5%) under enforcement."

R7-3.5: README stale

  • Complete rewrite. Old: 3 conditions, N=14, wrong file paths, referenced Study 2/3 structure. New: 7 conditions, N=280, correct file paths, links to GitHub Release for session data.

R7-3.6: Gap claim unsupported

  • Softened to "to our knowledge" framing. Added explicit disclosure: "We did not conduct a formal systematic literature review."

R7-3.7: Cost column unexplained

  • Added session cost definition to Section 2.5: extracted from output.json total_cost_usd field, includes all orchestrator and worker token costs at respective model-tier rates.

R7-3.8: --dangerously-skip-permissions prominence

  • Elevated to Section 1.3 Scope. Explicitly states: fully-permissioned environment, delegation is behavioral judgment not access-restricted, real deployment permission systems might force/constrain delegation.

Repository state

Commit Description
357eecb Review 6: 12 fixes
1a13805 Snapshot 18 + context dump
0fd1aa9 Review 7: 14 new fixes + session data release + README rewrite

GitHub Release: v1.0 (session-data-v1.tar.gz, 280 sessions)

Review disposition summary (all 7 reviews)

Review Points Key findings Status
R1 7 Causal language, compliance≠benefit, temporal confound All addressed
R2 12 Adversarial defense, cost data, T1 judgment All addressed
R3 11 Terminology framing, abstract clarity All addressed
R4 8 Cost regime, missing metrics All addressed
R5 12 Model identity wrong, violation false positive, scorer bug All addressed
R6 12 Regex coverage, vacuous violation-free, T2 inflation, header confound All addressed
R7 21 Session data release, unreported conditions, README, abstract V, 14 more All addressed

Current status

  • Paper: 7 reviews addressed, all pushed. 444 lines.
  • Limitations: Now 12 specific threats (prompt length + header merged)
  • Session data: Released as GitHub Release v1.0
  • README: Fully rewritten to match current study
  • Unreported conditions: Disclosed in paper

Open items

  1. Context growth projection figures for paper
  2. Add auth validation to audit script
  3. Create drop-in repo (lorehq/lore-delegation-kit)
  4. Agent naming decision for drop-in kit
  5. Study 2 (Worker Cost Profiles) — separate publication
  6. Study 3 (Context Growth Economics) — separate publication

Snapshot 20 — 2026-03-01T~03:22Z

Context

Same session. Operator provided Review 8 (18 items). Triaged: 17 items already addressed by R5-R7, 1 genuinely new fix.

Triage

R8 Item Status Prior fix
1.3 Model identity Already R5 Paper says opus, not sonnet
2.1 Prompt length confound Already R7-2.3 Discussion + limitation
2.2 Bare contaminated by Explore Already R6-C3/R7-2.5 T2 exclusion rates
2.3 Brittle violation regex Already R6-C1 611-call audit
2.4 No task correctness Factual error Correctness IS measured (Tables 1, 8)
2.5 Temporal batching Already R7-2.10 Cross-phase rates added
3.1 Violation-free terminology Already R5/R7-2.6 Terminology note expanded
3.2 T1 prose/table mismatch Reviewer misread Paper correctly distinguishes soft-guidance from soft-guidance-contract
3.3 Null speculation NEW — fixed Removed mechanism theory for non-significant template effect
3.4 Enforcement ambiguity Already R5/R7-3.8 Terminology note + Scope section

Work done

R8-3.3: Removed speculative mechanism ("template indirectly signals workers exist — functioning as implicit delegation cue") for the non-significant contract template result. Replaced with: "At current sample sizes, the template is not a statistically detectable control lever for delegation behavior."

Notes on R8 quality

This review was based on an older version of the paper — it references claude-sonnet-4-20250514 as the orchestrator (corrected in R5) and claims the paper doesn't measure correctness (it does, since the scorer was built and integrated before the paper was written). The T1 prose/table "inconsistency" was a misread: the paper says "Under bare, baseline, baseline-contract, and soft-guidance, T1 was never delegated" and separately notes "soft-guidance-contract produced only 3/8." These are different conditions.

Review disposition summary (all 8 reviews)

Review Points New fixes Status
R1 7 7 All addressed
R2 12 12 All addressed
R3 11 11 All addressed
R4 8 8 All addressed
R5 12 12 All addressed
R6 12 12 All addressed
R7 21 14 (7 prior) All addressed
R8 18 1 (17 prior/invalid) All addressed

Total: 101 review items, 77 unique fixes applied.

Repository state

3fdeaf6 — Review 8: remove null-result speculation

Open items

  1. Context growth projection figures for paper
  2. Add auth validation to audit script
  3. Create drop-in repo (lorehq/lore-delegation-kit)
  4. Agent naming decision for drop-in kit
  5. Study 2 (Worker Cost Profiles) — separate publication
  6. Study 3 (Context Growth Economics) — separate publication