This repository was archived by the owner on Mar 6, 2026. It is now read-only.

Data Science Context

Last updated: 2026-03-01T03:22Z (Snapshot 20)

Role

You are the data scientist and benchtester for this study, grounded in the 9-principle prompt engineering framework at: /home/andrew/Desktop/PromptEngineeringCollab/prompt-engineering/prompt-engineering-principles.md

Read that file before making any prompt design decisions. Key principles that have already bitten us: P2 (Economy), P7 (Match Technique to Task), P8 (Evaluate, Version, Iterate — one variable at a time).

Study Goal

Build a reproducible test kit and paper for a scientific study measuring prompt-driven delegation enforcement in orchestrator-worker LLM agent systems. The study tests two independent variables: (1) prompt engineering level (none → soft guidance → hard enforcement), and (2) presence/absence of a structured orch-worker hand-off contract (template).

Current Study Design: 7-Condition Matrix

The 7 Conditions

| # | Condition | CLAUDE.md | Agents | Role |
| --- | --- | --- | --- | --- |
| 1 | bare | 6 lines, no delegation | NO | Infrastructure floor |
| 2 | baseline | 6 lines, no delegation (same as bare) | YES | OOTB Claude Code baseline |
| 3 | baseline-contract | ~20 lines, bare + hand-off template | YES | Does template alone trigger delegation? |
| 4 | soft-guidance | 23 lines, soft language + tier table | YES | Treatment 1: soft prompting |
| 5 | soft-guidance-contract | ~37 lines, soft + tier table + template | YES | Treatment 1 + contract |
| 6 | enforcement | 34 lines, hard enforcement + violations | YES | Treatment 2: enforcement without contract |
| 7 | enforcement-contract | 49 lines, enforcement + template | YES | Treatment 2 + contract |

Two Analysis Axes

Axis 1 — Prompt level (no contract): bare → baseline → soft-guidance → enforcement

Axis 2 — Contract effect at each prompt level: baseline vs baseline-contract; soft-guidance vs soft-guidance-contract; enforcement vs enforcement-contract

Data Status — ALL PHASES COMPLETE

Phase 1: N=3 battery (54 sessions) — COMPLETE

3 conditions (bare/soft-guidance/enforcement-contract) × 6 tasks × N=3. Timestamp: 20260228-170000

Phase 2: Targeted scale-up (35 sessions) — COMPLETE

soft-guidance T1/T2/T5/T6 to N=10, enforcement-contract T1 to N=10. Timestamp: 20260228-174500

Phase 3: New conditions (36 sessions) — COMPLETE

baseline + enforcement, 6 tasks × N=3 each. Timestamp: 20260301-000000

Phase 4: 7-condition scale-up (155 sessions) — COMPLETE

Scale-up across all 6 agent-bearing conditions. Timestamp: 20260228-184408. NOTE: the first attempt failed (OAuth 401); 133 empty directory stubs from the failed batch were cleaned, and the re-run produced 155 valid sessions.

Final Sample Sizes (280 sessions total)

| Condition | N/task | Total N |
| --- | --- | --- |
| bare | 3 | 18 |
| baseline | 5 | 30 |
| baseline-contract | 5 | 30 |
| soft-guidance | 8-10* | 56 |
| soft-guidance-contract | 8 | 48 |
| enforcement | 8 | 48 |
| enforcement-contract | 8-10* | 50 |

*Unequal per-task due to Phase 2 targeted scale-up.

Key Empirical Findings (280 sessions, all 7 conditions)

Aggregate Rates

| Condition | N | Delegation | 95% CI | Violation-free | 95% CI | Correctness | Cost | Duration med (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bare | 18 | 16.7% | [5.8, 39.2] | 0.0% | [0.0, 17.6] | 100.0% | $0.17 | 46.4 |
| baseline | 30 | 20.0% | [9.5, 37.3] | 0.0% | [0.0, 11.4] | 100.0% | $0.17 | 40.1 |
| baseline-contract | 30 | 46.7% | [30.2, 63.9] | 16.7% | [7.3, 33.6] | 100.0% | $0.24 | 47.0 |
| soft-guidance | 56 | 82.1% | [70.2, 90.0] | 60.7% | [47.6, 72.4] | 100.0% | $0.38 | 59.6 |
| soft-guidance-contract | 48 | 89.6% | [77.8, 95.5] | 72.9% | [59.0, 83.4] | 100.0% | $0.31 | 50.3 |
| enforcement | 48 | 100.0% | [92.6, 100.0] | 100.0% | [92.6, 100.0] | 97.9% | $0.41 | 65.2 |
| enforcement-contract | 50 | 100.0% | [92.9, 100.0] | 100.0% | [92.9, 100.0] | 98.0% | $0.37 | 62.7 |
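The 95% CIs above are Wilson score intervals. A minimal sketch of the computation (illustrative only; the actual implementation lives in harness/stats.js and may differ — the session counts in the usage line are inferred from the reported rates):

```javascript
// Wilson score interval for a binomial proportion.
// Unlike the normal approximation, it stays inside [0, 1] and behaves
// sensibly at 0% and 100%, which matters for the enforcement rows above.
function wilsonCI(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [(center - margin) / denom, (center + margin) / denom];
}

// bare: 3 delegating sessions out of 18 -> 16.7% [5.8, 39.2]
const [lo, hi] = wilsonCI(3, 18);
console.log((100 * lo).toFixed(1), (100 * hi).toFixed(1)); // 5.8 39.2
```

The same function reproduces the one-sided-looking enforcement interval: wilsonCI(48, 48) gives [0.926, 1.0], i.e. [92.6, 100.0].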

Statistical Significance (Holm-Bonferroni corrected, 20-test family)

12/20 pairwise comparisons survive correction:

  • All prompt-level axis comparisons survive (except bare vs baseline)
  • No contract-template comparisons survive

Key comparisons:

  • baseline→soft-guidance delegation: p<0.001, V=0.61 (large)
  • soft-guidance→enforcement delegation: p=0.002, V=0.30 (medium)
  • baseline→enforcement delegation: p<0.001, V=0.84 (large)
  • baseline→enforcement violation-free: p<0.001, V=1.00 (large)
  • baseline→baseline-contract delegation: p=0.054 (marginal, does NOT survive)
  • soft-guidance→soft-guidance-contract: p=0.403 (not significant)

Task Correctness (278/280 = 99.3%)

Only 2 failures:

  • enforcement/T5/run-06: worker missed X-Warehouse header hint → wrong answer
  • enforcement-contract/T1/run-03: infrastructure failure (service not ready)

Non-enforcement conditions: 100% correct (212/212)

Key Per-Task Finding

T1 (simplest task) discriminates: 0% delegation under all conditions below enforcement, but 100% under enforcement. T2 (discovery task) delegates under ALL conditions including bare (uses built-in Explore agent).

Paper Status

Paper at paper/paper.md, commit 0fd1aa9. 444 lines. Eight peer reviews received and addressed (Snapshots 15-20).

  • 7-condition design, 2-axis analysis
  • 4 measures: delegation, violation-free, correctness, duration (descriptive)
  • Mean delegation count per condition added (Section 3.1)
  • 95% Wilson score CIs on all proportion estimates
  • 9 tables (full 20-row Holm-Bonferroni table, was 9 selected rows)
  • 12 specific limitations with affected-claim annotations
  • Section 4.2 "Compliance Is Not Benefit" — cost comparison, T1 judgment, missing-metrics inventory
  • Associational language throughout (causal language softened, synonyms varied)
  • Terminology note: "enforcement"/"violation" are operational labels
  • Abstract reordered: gap first, compliance framing after first result
  • Abstract V=0.89 corrected (was 0.84 — wrong comparison for rate range)
  • Session cost definition added (Section 2.5)
  • --dangerously-skip-permissions elevated to Section 1.3 Scope
  • Gap claim softened ("to our knowledge" + no systematic review disclosure)
  • Per-cell N=3 caution note added (Section 3.4)
  • Worker type distribution: exact counts replace vague qualifiers
  • Unreported conditions (baseline-opaque, cedar/flint/marble) disclosed
  • Missing bare-contract cell explained
  • Min-N heuristic language tightened
  • Cross-phase delegation rates added to temporal confound
  • 6 practical implications
  • All 16 writing principles checklist items pass
  • Governed by meta/research-paper-writing-principles.md

Critical fixes from Review 5 (Snapshot 17)

  • Model identity: orchestrator is claude-opus-4-6, NOT claude-sonnet-4-20250514 (verified from modelUsage in all 280 output.json files)
  • Violation false positive: an enforcement T4 `which curl` call was a tool-availability check, not a network call. Regex fixed in audit-battery.js; enforcement is now 100%/100%.
  • T6 scorer: added exclusion check for WIDGET-A/GADGET-X false positives

Key additions from Review 6 (Snapshot 18)

  • Regex coverage gap disclosed: node http.get() calls missed by regex (9-14% of network calls in non-enforcement conditions; zero in enforcement)
  • Vacuous violation-free noted: enforcement had 5 total Bash calls across 98 sessions, all tool-availability checks — violation-free is a consequence of absent Bash activity, not independent constraint adherence
  • T2 exclusion note: excluding T2, bare=0.0%, baseline=4.0%
  • Decision timeline specified: which conditions added in which phase
  • Prompt header confound: enforcement uses different document header
  • Full Holm-Bonferroni: all 20 rows now shown (was 9 selected)

Paper Writing Principles (operator-mandated, Snapshot 12)

  1. The data tells the story. No superlatives, no overselling.
  2. Observational, not causal. "We observed X under conditions Y."
  3. Scope-bound every claim. One model, one framework, these tasks.
  4. Registration is a prerequisite, not a null.
  5. Avoid strong categorical language.
  6. Modest wording protects credibility.
  7. Acknowledge what we didn't test.
  8. Let readers draw conclusions.

Repo: /home/andrew/Desktop/delegation-study/

Conditions

| Path | Lines | What |
| --- | --- | --- |
| kit/conditions/bare.md | 6 | No delegation, no agents |
| kit/conditions/baseline.md | 6 | Same as bare but agents registered |
| kit/conditions/baseline-contract.md | 20 | OOTB + template only |
| kit/conditions/soft-guidance.md | 23 | Soft delegation + tier table |
| kit/conditions/soft-guidance-contract.md | 37 | Soft + tier table + template |
| kit/conditions/enforcement.md | 34 | Hard enforcement + violations |
| kit/conditions/enforcement-contract.md | 49 | Enforcement + template |

Agents

| Path | Model | Role |
| --- | --- | --- |
| kit/.claude/agents/lore-default.md | sonnet | Default worker |
| kit/.claude/agents/lore-explore.md | haiku | Read-only explorer |
| kit/.claude/agents/lore-fast.md | haiku | Fast executor |
| kit/.claude/agents/lore-power.md | opus | Complex reasoning |

Harness

| File | Purpose |
| --- | --- |
| harness/run-battery.sh | Battery runner |
| harness/audit-battery.js | Per-batch auditor |
| harness/analyze-all.js | Cross-condition comparison |
| harness/stats.js | Fisher's exact, Cramér's V, Holm-Bonferroni, Wilson CIs, durations |
| harness/score-correctness.js | Ground-truth correctness scoring |
| harness/task-battery.json | 6 task definitions |
| harness/services/orders-service.js | Mock API, port 8787 |
| harness/services/inventory-service.js | Mock API, port 8791 |
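Among the stats.js routines, Fisher's exact test is the one driving the significance table. As a reference point, a self-contained sketch of the two-sided test (illustrative; the actual harness code may differ):

```javascript
// Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
// sum the hypergeometric probabilities of every table with the same
// marginals that is no more likely than the observed one.
function logFact(n) {
  let s = 0;
  for (let i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

function hyperProb(a, b, c, d) {
  const n = a + b + c + d;
  return Math.exp(
    logFact(a + b) + logFact(c + d) + logFact(a + c) + logFact(b + d) -
    logFact(n) - logFact(a) - logFact(b) - logFact(c) - logFact(d)
  );
}

function fisherExact(a, b, c, d) {
  const r1 = a + b, c1 = a + c, n = a + b + c + d;
  const pObs = hyperProb(a, b, c, d);
  const kMin = Math.max(0, c1 - (n - r1));
  const kMax = Math.min(r1, c1);
  let p = 0;
  for (let k = kMin; k <= kMax; k++) {
    const pk = hyperProb(k, r1 - k, c1 - k, n - r1 - c1 + k);
    if (pk <= pObs * (1 + 1e-7)) p += pk; // tolerance for float ties
  }
  return p;
}
```

Applied to the baseline vs soft-guidance delegation counts (6/30 vs 46/56, inferred from the reported rates), fisherExact(6, 24, 46, 10) yields p far below 0.001, consistent with the table above.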

Paper & Meta

| File | Purpose |
| --- | --- |
| paper/paper.md | THE PAPER — new 7-condition version |
| meta/research-paper-writing-principles.md | Merged Opus+GPT writing principles |
| meta/reviewer-prompt.md | Structured review prompt |
| meta/study-orchestration-notes.md | Meta-methodology notes |
| model/context-growth-cost.md | Formal cost model (from old paper) |

Data

| Path | What |
| --- | --- |
| .runs/battery/bare/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/baseline/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/baseline/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/baseline-contract/20260228-184408/ | Phase 4 N=5 |
| .runs/battery/soft-guidance/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/soft-guidance/20260228-174500/ | Phase 2 scale-up |
| .runs/battery/soft-guidance/20260228-184408/ | Phase 4 T3/T4 scale-up |
| .runs/battery/soft-guidance-contract/20260228-184408/ | Phase 4 N=8 |
| .runs/battery/enforcement/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/enforcement/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/enforcement-contract/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/enforcement-contract/20260228-174500/ | Phase 2 T1 scale-up |
| .runs/battery/enforcement-contract/20260228-184408/ | Phase 4 scale-up |

Inter-Agent Communication

| Path | What |
| --- | --- |
| ~/Desktop/ds-to-chappie.txt | Messages TO CHAPPiE (last: Phase 4 auth failure report) |
| ~/Desktop/chappie-to-ds.txt | Messages FROM CHAPPiE (last: Phase 4 re-run complete) |

Git History

| Commit | Description |
| --- | --- |
| 44db74c | Snapshot 12 |
| 8e0e0f5 | New 7-condition paper + merged writing principles |
| be8301e | Integrate correctness + scoring script |
| d06f949 | Wilson CIs, durations, R1+R2 revisions, cost model caveats |
| 5b5d1d0 | Review 3: terminology note + abstract framing |
| 81bb5ec | Review 4: cost-regime caveat + missing-metrics paragraph |
| 525b4d5 | Rewrite reviewer prompt |
| 27cb20a | Snapshot 16 + context dump |
| d892a72 | Review 5: model identity, violation regex, T6 scorer |
| 56de4af | Snapshot 17 + context dump |
| 357eecb | Review 6: regex audit, vacuous violation-free, header confound |
| 1a13805 | Snapshot 18 + context dump |
| 0fd1aa9 | Review 7: session data release, README rewrite, 14 fixes |
| b77cd70 | Snapshot 19 + context dump |
| 3fdeaf6 | Review 8: remove null-result speculation |

GitHub Release: v1.0 (session-data-v1.tar.gz, 280 sessions, 1.3 MB)

Remote: git@github.com:lorehq/delegation-study.git (private, drewswiredin)

Environment

  • Claude Code version: 2.1.63
  • Orchestrator model: claude-opus-4-6
  • Worker models: sonnet (lore-default), haiku (lore-explore, lore-fast), opus (lore-power)
  • Platform: Linux (Ubuntu)
  • K3s cluster: 3 nodes, 4 vCPU / 3 GB RAM each, Debian 12

Next Steps

  1. Context growth projection figures for paper
  2. Add auth validation to audit script
  3. Create drop-in repo (lorehq/lore-delegation-kit)
  4. Agent naming decision for drop-in kit
  5. README rewritten — verify no stale references remain
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication