Last updated: 2026-03-01T03:22Z (Snapshot 20)
You are the data scientist and benchtester for this study. You are grounded
in the 9-principle prompt engineering framework at:
/home/andrew/Desktop/PromptEngineeringCollab/prompt-engineering/prompt-engineering-principles.md
Read that file before making any prompt design decisions. Key principles that have already bitten us: P2 (Economy), P7 (Match Technique to Task), P8 (Evaluate, Version, Iterate — one variable at a time).
Build a reproducible test kit and paper for a scientific study measuring prompt-driven delegation enforcement in orchestrator-worker LLM agent systems. The study tests two independent variables: (1) prompt engineering level (none → soft guidance → hard enforcement), and (2) presence/absence of a structured orch-worker hand-off contract (template).
| # | Condition | CLAUDE.md | Agents | Role |
|---|---|---|---|---|
| 1 | bare | 6 lines, no delegation | NO | Infrastructure floor |
| 2 | baseline | 6 lines, no delegation (same as bare) | YES | OOTB Claude Code baseline |
| 3 | baseline-contract | ~20 lines, bare + hand-off template | YES | Does template alone trigger delegation? |
| 4 | soft-guidance | 23 lines, soft language + tier table | YES | Treatment 1: soft prompting |
| 5 | soft-guidance-contract | ~37 lines, soft + tier table + template | YES | Treatment 1 + contract |
| 6 | enforcement | 34 lines, hard enforcement + violations | YES | Treatment 2: enforcement without contract |
| 7 | enforcement-contract | 49 lines, enforcement + template | YES | Treatment 2 + contract |
Axis 1 (prompt level, no contract): bare → baseline → soft-guidance → enforcement
Axis 2 (contract effect at each prompt level): baseline vs baseline-contract, soft-guidance vs soft-guidance-contract, enforcement vs enforcement-contract
- Phase 1 (timestamp 20260228-170000): 3 conditions (bare/soft-guidance/enforcement-contract) × 6 tasks × N=3.
- Phase 2 (timestamp 20260228-174500): soft-guidance T1/T2/T5/T6 to N=10; enforcement-contract T1 to N=10.
- Phase 3 (timestamp 20260301-000000): baseline + enforcement, 6 tasks × N=3 each.
- Phase 4 (timestamp 20260228-184408): scale-up across all 6 agent-bearing conditions. NOTE: the first attempt failed (OAuth 401); 133 empty directory stubs from the failed batch were cleaned, and the re-run placed 155 valid sessions.
| Condition | N/task | Total N |
|---|---|---|
| bare | 3 | 18 |
| baseline | 5 | 30 |
| baseline-contract | 5 | 30 |
| soft-guidance | 8-10* | 56 |
| soft-guidance-contract | 8 | 48 |
| enforcement | 8 | 48 |
| enforcement-contract | 8-10* | 50 |
*Unequal per-task due to Phase 2 targeted scale-up.
| Condition | N | Delegation | 95% CI | Violation-free | 95% CI | Correctness | Cost | Median duration (s) |
|---|---|---|---|---|---|---|---|---|
| bare | 18 | 16.7% | [5.8, 39.2] | 0.0% | [0.0, 17.6] | 100.0% | $0.17 | 46.4 |
| baseline | 30 | 20.0% | [9.5, 37.3] | 0.0% | [0.0, 11.4] | 100.0% | $0.17 | 40.1 |
| baseline-contract | 30 | 46.7% | [30.2, 63.9] | 16.7% | [7.3, 33.6] | 100.0% | $0.24 | 47.0 |
| soft-guidance | 56 | 82.1% | [70.2, 90.0] | 60.7% | [47.6, 72.4] | 100.0% | $0.38 | 59.6 |
| soft-guidance-contract | 48 | 89.6% | [77.8, 95.5] | 72.9% | [59.0, 83.4] | 100.0% | $0.31 | 50.3 |
| enforcement | 48 | 100.0% | [92.6, 100.0] | 100.0% | [92.6, 100.0] | 97.9% | $0.41 | 65.2 |
| enforcement-contract | 50 | 100.0% | [92.9, 100.0] | 100.0% | [92.9, 100.0] | 98.0% | $0.37 | 62.7 |
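The Wilson score intervals in the table above can be reproduced in a few lines. This is an illustrative sketch, not the project's harness/stats.js implementation; `wilsonCI` is a hypothetical helper name, and the success counts (e.g. 3 of 18 for bare) are back-calculated from the reported rates.

```javascript
// Wilson score interval for a binomial proportion (95% by default).
// Sketch only; the study's harness/stats.js may differ in detail.
function wilsonCI(successes, n, z = 1.96) {
  const p = successes / n;
  const denom = 1 + (z * z) / n;
  const center = p + (z * z) / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return [(center - margin) / denom, (center + margin) / denom];
}

// bare: 3 of 18 sessions delegated (16.7%)
const [lo, hi] = wilsonCI(3, 18);
console.log(`[${(lo * 100).toFixed(1)}, ${(hi * 100).toFixed(1)}]`); // → [5.8, 39.2]
```

The same call with (6, 30) reproduces the baseline row's [9.5, 37.3].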
12/20 pairwise comparisons survive Holm-Bonferroni correction:
- All prompt-level axis comparisons survive (except bare vs baseline)
- No contract-template comparisons survive
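The step-down correction applied above works by sorting the m p-values ascending and testing the i-th smallest against α/(m−i+1), stopping at the first failure. A minimal sketch with hypothetical p-values (not the study's actual 20):

```javascript
// Holm-Bonferroni step-down correction: sort p-values ascending and
// compare the i-th smallest (0-based) against alpha / (m - i); once one
// comparison fails, all later hypotheses fail too.
function holmBonferroni(pValues, alpha = 0.05) {
  const m = pValues.length;
  const order = pValues
    .map((p, idx) => ({ p, idx }))
    .sort((a, b) => a.p - b.p);
  const rejected = new Array(m).fill(false);
  for (let i = 0; i < m; i++) {
    if (order[i].p <= alpha / (m - i)) {
      rejected[order[i].idx] = true;
    } else {
      break; // step-down: stop at first non-significant p-value
    }
  }
  return rejected;
}

// Hypothetical p-values, for illustration only:
console.log(holmBonferroni([0.001, 0.04, 0.002, 0.3]));
// → [ true, false, true, false ]
```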
Key comparisons:
- baseline→soft-guidance delegation: p<0.001, V=0.61 (large)
- soft-guidance→enforcement delegation: p=0.002, V=0.30 (medium)
- baseline→enforcement delegation: p<0.001, V=0.84 (large)
- baseline→enforcement violation-free: p<0.001, V=1.00 (large)
- baseline→baseline-contract delegation: p=0.054 (marginal, does NOT survive)
- soft-guidance→soft-guidance-contract: p=0.403 (not significant)
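For a 2×2 table, Cramér's V reduces to the phi coefficient, √(χ²/n). The reported V=0.84 for baseline→enforcement delegation is reproducible from the counts implied by the results table (6/30 vs 48/48); this sketch is illustrative, not harness/stats.js itself.

```javascript
// Cramér's V for a 2x2 contingency table [[a, b], [c, d]]:
// chi2 = n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), V = sqrt(chi2 / n).
function cramersV2x2([[a, b], [c, d]]) {
  const n = a + b + c + d;
  const chi2 =
    (n * (a * d - b * c) ** 2) /
    ((a + b) * (c + d) * (a + c) * (b + d));
  return Math.sqrt(chi2 / n);
}

// baseline delegated 6/30, enforcement 48/48 (from the results table above)
console.log(cramersV2x2([[6, 24], [48, 0]]).toFixed(2)); // → 0.84
```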
Only 2 failures:
- enforcement/T5/run-06: worker missed the X-Warehouse header hint → wrong answer
- enforcement-contract/T1/run-03: infrastructure failure (service not ready)

Non-enforcement conditions: 100% correct (212/212).
T1 (simplest task) discriminates: 0% delegation under all conditions below enforcement, but 100% under enforcement. T2 (discovery task) delegates under ALL conditions including bare (uses built-in Explore agent).
Paper at paper/paper.md, commit 0fd1aa9. 444 lines.
Eight peer reviews received and addressed (Snapshots 15-20).
- 7-condition design, 2-axis analysis
- 4 measures: delegation, violation-free, correctness, duration (descriptive)
- Mean delegation count per condition added (Section 3.1)
- 95% Wilson score CIs on all proportion estimates
- 9 tables (full 20-row Holm-Bonferroni table, was 9 selected rows)
- 12 specific limitations with affected-claim annotations
- Section 4.2 "Compliance Is Not Benefit" — cost comparison, T1 judgment, missing-metrics inventory
- Associational language throughout (causal language softened, synonyms varied)
- Terminology note: "enforcement"/"violation" are operational labels
- Abstract reordered: gap first, compliance framing after first result
- Abstract V=0.89 corrected (was 0.84 — wrong comparison for rate range)
- Session cost definition added (Section 2.5)
- --dangerously-skip-permissions elevated to Section 1.3 Scope
- Gap claim softened ("to our knowledge" + no systematic review disclosure)
- Per-cell N=3 caution note added (Section 3.4)
- Worker type distribution: exact counts replace vague qualifiers
- Unreported conditions (baseline-opaque, cedar/flint/marble) disclosed
- Missing bare-contract cell explained
- Min-N heuristic language tightened
- Cross-phase delegation rates added to temporal confound
- 6 practical implications
- All 16 writing principles checklist items pass
- Governed by meta/research-paper-writing-principles.md
- Model identity: orchestrator is claude-opus-4-6, NOT claude-sonnet-4-20250514 (verified from modelUsage in all 280 output.json files)
- Violation false positive: enforcement T4 flagged a `which curl` invocation, which was a tool-availability check, not a network call. Regex fixed in audit-battery.js; enforcement now 100%/100%.
- T6 scorer: added exclusion check for WIDGET-A/GADGET-X false positives
- Regex coverage gap disclosed: node http.get() calls missed by regex (9-14% of network calls in non-enforcement conditions; zero in enforcement)
- Vacuous violation-free noted: enforcement had 5 total Bash calls across 98 sessions, all tool-availability checks — violation-free is consequence of absent Bash activity, not independent constraint adherence
- T2 exclusion note: excluding T2, bare=0.0%, baseline=4.0%
- Decision timeline specified: which conditions added in which phase
- Prompt header confound: enforcement uses different document header
- Full Holm-Bonferroni: all 20 rows now shown (was 9 selected)
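The regex fixes noted above are of this general shape. The patterns below are illustrative reconstructions, not the actual audit-battery.js patterns: one that treats `which curl` as a tool check rather than a network call, and one broadened to catch `http.get(` calls of the kind the audit found the original regex missing.

```javascript
// Illustrative sketches of the two audit regexes discussed above;
// the real patterns in audit-battery.js may differ.

// Tool-availability checks like `which curl` are NOT network calls.
const toolCheck = /^\s*which\s+(curl|wget)\b/;

// Network activity: shell curl/wget invocations, or node http(s).get/request.
const networkCall = /\b(curl|wget)\s+https?:\/\/|\bhttps?\.(get|request)\s*\(/;

const isNetworkCall = (s) => networkCall.test(s) && !toolCheck.test(s);

console.log(isNetworkCall("which curl"));                        // → false
console.log(isNetworkCall("curl http://localhost:8787/orders")); // → true
console.log(isNetworkCall("http.get('http://localhost:8791')")); // → true
```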
- The data tells the story. No superlatives, no overselling.
- Observational, not causal. "We observed X under conditions Y."
- Scope-bound every claim. One model, one framework, these tasks.
- Registration is a prerequisite, not a null.
- Avoid strong categorical language.
- Modest wording protects credibility.
- Acknowledge what we didn't test.
- Let readers draw conclusions.
| Path | Lines | What |
|---|---|---|
| kit/conditions/bare.md | 6 | No delegation, no agents |
| kit/conditions/baseline.md | 6 | Same as bare but agents registered |
| kit/conditions/baseline-contract.md | 20 | OOTB + template only |
| kit/conditions/soft-guidance.md | 23 | Soft delegation + tier table |
| kit/conditions/soft-guidance-contract.md | 37 | Soft + tier table + template |
| kit/conditions/enforcement.md | 34 | Hard enforcement + violations |
| kit/conditions/enforcement-contract.md | 49 | Enforcement + template |
| Path | Model | Role |
|---|---|---|
| kit/.claude/agents/lore-default.md | sonnet | Default worker |
| kit/.claude/agents/lore-explore.md | haiku | Read-only explorer |
| kit/.claude/agents/lore-fast.md | haiku | Fast executor |
| kit/.claude/agents/lore-power.md | opus | Complex reasoning |
| File | Purpose |
|---|---|
| harness/run-battery.sh | Battery runner |
| harness/audit-battery.js | Per-batch auditor |
| harness/analyze-all.js | Cross-condition comparison |
| harness/stats.js | Fisher's exact, Cramér's V, Holm-Bonferroni, Wilson CIs, durations |
| harness/score-correctness.js | Ground-truth correctness scoring |
| harness/task-battery.json | 6 task definitions |
| harness/services/orders-service.js | Mock API, port 8787 |
| harness/services/inventory-service.js | Mock API, port 8791 |
| File | Purpose |
|---|---|
| paper/paper.md | THE PAPER — new 7-condition version |
| meta/research-paper-writing-principles.md | Merged Opus+GPT writing principles |
| meta/reviewer-prompt.md | Structured review prompt |
| meta/study-orchestration-notes.md | Meta-methodology notes |
| model/context-growth-cost.md | Formal cost model (from old paper) |
| Path | What |
|---|---|
| .runs/battery/bare/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/baseline/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/baseline/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/baseline-contract/20260228-184408/ | Phase 4 N=5 |
| .runs/battery/soft-guidance/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/soft-guidance/20260228-174500/ | Phase 2 scale-up |
| .runs/battery/soft-guidance/20260228-184408/ | Phase 4 T3/T4 scale-up |
| .runs/battery/soft-guidance-contract/20260228-184408/ | Phase 4 N=8 |
| .runs/battery/enforcement/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/enforcement/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/enforcement-contract/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/enforcement-contract/20260228-174500/ | Phase 2 T1 scale-up |
| .runs/battery/enforcement-contract/20260228-184408/ | Phase 4 scale-up |
| Path | What |
|---|---|
| ~/Desktop/ds-to-chappie.txt | Messages TO CHAPPiE (last: Phase 4 auth failure report) |
| ~/Desktop/chappie-to-ds.txt | Messages FROM CHAPPiE (last: Phase 4 re-run complete) |
| Commit | Description |
|---|---|
| 44db74c | Snapshot 12 |
| 8e0e0f5 | New 7-condition paper + merged writing principles |
| be8301e | Integrate correctness + scoring script |
| d06f949 | Wilson CIs, durations, R1+R2 revisions, cost model caveats |
| 5b5d1d0 | Review 3: terminology note + abstract framing |
| 81bb5ec | Review 4: cost-regime caveat + missing-metrics paragraph |
| 525b4d5 | Rewrite reviewer prompt |
| 27cb20a | Snapshot 16 + context dump |
| d892a72 | Review 5: model identity, violation regex, T6 scorer |
| 56de4af | Snapshot 17 + context dump |
| 357eecb | Review 6: regex audit, vacuous violation-free, header confound |
| 1a13805 | Snapshot 18 + context dump |
| 0fd1aa9 | Review 7: session data release, README rewrite, 14 fixes |
| b77cd70 | Snapshot 19 + context dump |
| 3fdeaf6 | Review 8: remove null-result speculation |
GitHub Release: v1.0 (session-data-v1.tar.gz, 280 sessions, 1.3 MB)
Remote: git@github.com:lorehq/delegation-study.git (private, drewswiredin)
- Claude Code version: 2.1.63
- Orchestrator model: claude-opus-4-6
- Worker models: sonnet (lore-default), haiku (lore-explore, lore-fast), opus (lore-power)
- Platform: Linux (Ubuntu)
- K3s cluster: 3 nodes, 4 vCPU / 3 GB RAM each, Debian 12
- Context growth projection figures for paper
- Add auth validation to audit script
- Create drop-in repo (lorehq/lore-delegation-kit)
- Agent naming decision for drop-in kit
- README rewritten — verify no stale references remain
- Study 2 (Worker Cost Profiles) — separate publication
- Study 3 (Context Growth Economics) — separate publication