This repository was archived by the owner on Mar 6, 2026. It is now read-only.

Data Science Context

Last updated: 2026-03-01T03:22Z (Snapshot 20)

Role

You are the data scientist and benchtester for this study, grounded in the 9-principle prompt engineering framework at: /home/andrew/Desktop/PromptEngineeringCollab/prompt-engineering/prompt-engineering-principles.md

Read that file before making any prompt design decisions. Key principles that have already bitten us: P2 (Economy), P7 (Match Technique to Task), P8 (Evaluate, Version, Iterate — one variable at a time).

Study Goal

Build a reproducible test kit and paper for a scientific study measuring prompt-driven delegation enforcement in orchestrator-worker LLM agent systems. The study tests two independent variables: (1) prompt engineering level (none → soft guidance → hard enforcement), and (2) presence/absence of a structured orch-worker hand-off contract (template).

Current Study Design: 7-Condition Matrix

The 7 Conditions

| # | Condition | CLAUDE.md | Agents | Role |
| --- | --- | --- | --- | --- |
| 1 | bare | 6 lines, no delegation | NO | Infrastructure floor |
| 2 | baseline | 6 lines, no delegation (same as bare) | YES | OOTB Claude Code baseline |
| 3 | baseline-contract | ~20 lines, bare + hand-off template | YES | Does template alone trigger delegation? |
| 4 | soft-guidance | 23 lines, soft language + tier table | YES | Treatment 1: soft prompting |
| 5 | soft-guidance-contract | ~37 lines, soft + tier table + template | YES | Treatment 1 + contract |
| 6 | enforcement | 34 lines, hard enforcement + violations | YES | Treatment 2: enforcement without contract |
| 7 | enforcement-contract | 49 lines, enforcement + template | YES | Treatment 2 + contract |

Two Analysis Axes

Axis 1 — Prompt level (no contract): bare → baseline → soft-guidance → enforcement

Axis 2 — Contract effect at each prompt level: baseline vs baseline-contract; soft-guidance vs soft-guidance-contract; enforcement vs enforcement-contract

Data Status — ALL PHASES COMPLETE

Phase 1: N=3 battery (54 sessions) — COMPLETE

3 conditions (bare/soft-guidance/enforcement-contract) × 6 tasks × N=3. Timestamp: 20260228-170000

Phase 2: Targeted scale-up (35 sessions) — COMPLETE

soft-guidance T1/T2/T5/T6 to N=10, enforcement-contract T1 to N=10. Timestamp: 20260228-174500

Phase 3: New conditions (36 sessions) — COMPLETE

baseline + enforcement, 6 tasks × N=3 each. Timestamp: 20260301-000000

Phase 4: 7-condition scale-up (155 sessions) — COMPLETE

Scale-up across all 6 agent-bearing conditions. Timestamp: 20260228-184408. NOTE: the first attempt failed (OAuth 401); 133 empty directory stubs from the failed batch were cleaned, and the re-run produced 155 valid sessions.

Final Sample Sizes (280 sessions total)

| Condition | N/task | Total N |
| --- | --- | --- |
| bare | 3 | 18 |
| baseline | 5 | 30 |
| baseline-contract | 5 | 30 |
| soft-guidance | 8-10* | 56 |
| soft-guidance-contract | 8 | 48 |
| enforcement | 8 | 48 |
| enforcement-contract | 8-10* | 50 |

*Unequal per-task due to Phase 2 targeted scale-up.

Key Empirical Findings (280 sessions, all 7 conditions)

Aggregate Rates

| Condition | N | Delegation | 95% CI | Violation-free | 95% CI | Correctness | Cost | Duration med (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bare | 18 | 16.7% | [5.8, 39.2] | 0.0% | [0.0, 17.6] | 100.0% | $0.17 | 46.4 |
| baseline | 30 | 20.0% | [9.5, 37.3] | 0.0% | [0.0, 11.4] | 100.0% | $0.17 | 40.1 |
| baseline-contract | 30 | 46.7% | [30.2, 63.9] | 16.7% | [7.3, 33.6] | 100.0% | $0.24 | 47.0 |
| soft-guidance | 56 | 82.1% | [70.2, 90.0] | 60.7% | [47.6, 72.4] | 100.0% | $0.38 | 59.6 |
| soft-guidance-contract | 48 | 89.6% | [77.8, 95.5] | 72.9% | [59.0, 83.4] | 100.0% | $0.31 | 50.3 |
| enforcement | 48 | 100.0% | [92.6, 100.0] | 100.0% | [92.6, 100.0] | 97.9% | $0.41 | 65.2 |
| enforcement-contract | 50 | 100.0% | [92.9, 100.0] | 100.0% | [92.9, 100.0] | 98.0% | $0.37 | 62.7 |
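The 95% CIs above are Wilson score intervals. A minimal sketch of the computation (illustrative only; the actual implementation lives in harness/stats.js and may differ — the session counts in the usage line are inferred from the reported rates):

```javascript
// Wilson score interval for a binomial proportion.
// Unlike the normal approximation, it stays inside [0, 1] and behaves
// sensibly at 0% and 100%, which matters for the enforcement rows above.
function wilsonCI(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [(center - margin) / denom, (center + margin) / denom];
}

// bare: 3 delegating sessions out of 18 -> 16.7% [5.8, 39.2]
const [lo, hi] = wilsonCI(3, 18);
console.log((100 * lo).toFixed(1), (100 * hi).toFixed(1)); // 5.8 39.2
```

The same function reproduces the one-sided-looking enforcement interval: wilsonCI(48, 48) gives [0.926, 1.0], i.e. [92.6, 100.0].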

Statistical Significance (Holm-Bonferroni corrected, 20-test family)

12/20 pairwise comparisons survive correction:

  • All prompt-level axis comparisons survive (except bare vs baseline)
  • No contract-template comparisons survive

Key comparisons:

  • baseline→soft-guidance delegation: p<0.001, V=0.61 (large)
  • soft-guidance→enforcement delegation: p=0.002, V=0.30 (medium)
  • baseline→enforcement delegation: p<0.001, V=0.84 (large)
  • baseline→enforcement violation-free: p<0.001, V=1.00 (large)
  • baseline→baseline-contract delegation: p=0.054 (marginal, does NOT survive)
  • soft-guidance→soft-guidance-contract: p=0.403 (not significant)

Task Correctness (278/280 = 99.3%)

Only 2 failures:

  • enforcement/T5/run-06: worker missed X-Warehouse header hint → wrong answer
  • enforcement-contract/T1/run-03: infrastructure failure (service not ready)

Non-enforcement conditions: 100% correct (212/212)

Key Per-Task Finding

T1 (simplest task) discriminates: 0% delegation under all conditions below enforcement, but 100% under enforcement. T2 (discovery task) delegates under ALL conditions including bare (uses built-in Explore agent).

Paper Status

Paper at paper/paper.md, commit 0fd1aa9. 444 lines. Eight peer reviews received and addressed (Snapshots 15-20).

  • 7-condition design, 2-axis analysis
  • 4 measures: delegation, violation-free, correctness, duration (descriptive)
  • Mean delegation count per condition added (Section 3.1)
  • 95% Wilson score CIs on all proportion estimates
  • 9 tables (full 20-row Holm-Bonferroni table, was 9 selected rows)
  • 12 specific limitations with affected-claim annotations
  • Section 4.2 "Compliance Is Not Benefit" — cost comparison, T1 judgment, missing-metrics inventory
  • Associational language throughout (causal language softened, synonyms varied)
  • Terminology note: "enforcement"/"violation" are operational labels
  • Abstract reordered: gap first, compliance framing after first result
  • Abstract V=0.89 corrected (was 0.84 — wrong comparison for rate range)
  • Session cost definition added (Section 2.5)
  • --dangerously-skip-permissions elevated to Section 1.3 Scope
  • Gap claim softened ("to our knowledge" + no systematic review disclosure)
  • Per-cell N=3 caution note added (Section 3.4)
  • Worker type distribution: exact counts replace vague qualifiers
  • Unreported conditions (baseline-opaque, cedar/flint/marble) disclosed
  • Missing bare-contract cell explained
  • Min-N heuristic language tightened
  • Cross-phase delegation rates added to temporal confound
  • 6 practical implications
  • All 16 writing principles checklist items pass
  • Governed by meta/research-paper-writing-principles.md

Critical fixes from Review 5 (Snapshot 17)

  • Model identity: orchestrator is claude-opus-4-6, NOT claude-sonnet-4-20250514 (verified from modelUsage in all 280 output.json files)
  • Violation false positive: an enforcement T4 `which curl` call was a tool-availability check, not a network call. Regex fixed in audit-battery.js; enforcement is now 100%/100%.
  • T6 scorer: added exclusion check for WIDGET-A/GADGET-X false positives

Key additions from Review 6 (Snapshot 18)

  • Regex coverage gap disclosed: node http.get() calls missed by regex (9-14% of network calls in non-enforcement conditions; zero in enforcement)
  • Vacuous violation-free noted: enforcement had 5 total Bash calls across 98 sessions, all tool-availability checks — violation-free is a consequence of absent Bash activity, not independent constraint adherence
  • T2 exclusion note: excluding T2, bare=0.0%, baseline=4.0%
  • Decision timeline specified: which conditions added in which phase
  • Prompt header confound: enforcement uses different document header
  • Full Holm-Bonferroni: all 20 rows now shown (was 9 selected)

Paper Writing Principles (operator-mandated, Snapshot 12)

  1. The data tells the story. No superlatives, no overselling.
  2. Observational, not causal. "We observed X under conditions Y."
  3. Scope-bound every claim. One model, one framework, these tasks.
  4. Registration is a prerequisite, not a null.
  5. Avoid strong categorical language.
  6. Modest wording protects credibility.
  7. Acknowledge what we didn't test.
  8. Let readers draw conclusions.

Repo: /home/andrew/Desktop/delegation-study/

Conditions

| Path | Lines | What |
| --- | --- | --- |
| kit/conditions/bare.md | 6 | No delegation, no agents |
| kit/conditions/baseline.md | 6 | Same as bare but agents registered |
| kit/conditions/baseline-contract.md | 20 | OOTB + template only |
| kit/conditions/soft-guidance.md | 23 | Soft delegation + tier table |
| kit/conditions/soft-guidance-contract.md | 37 | Soft + tier table + template |
| kit/conditions/enforcement.md | 34 | Hard enforcement + violations |
| kit/conditions/enforcement-contract.md | 49 | Enforcement + template |

Agents

| Path | Model | Role |
| --- | --- | --- |
| kit/.claude/agents/lore-default.md | sonnet | Default worker |
| kit/.claude/agents/lore-explore.md | haiku | Read-only explorer |
| kit/.claude/agents/lore-fast.md | haiku | Fast executor |
| kit/.claude/agents/lore-power.md | opus | Complex reasoning |

Harness

| File | Purpose |
| --- | --- |
| harness/run-battery.sh | Battery runner |
| harness/audit-battery.js | Per-batch auditor |
| harness/analyze-all.js | Cross-condition comparison |
| harness/stats.js | Fisher's exact, Cramér's V, Holm-Bonferroni, Wilson CIs, durations |
| harness/score-correctness.js | Ground-truth correctness scoring |
| harness/task-battery.json | 6 task definitions |
| harness/services/orders-service.js | Mock API, port 8787 |
| harness/services/inventory-service.js | Mock API, port 8791 |
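Among the stats.js routines, Fisher's exact test is the one driving the significance table. As a reference point, a self-contained sketch of the two-sided test (illustrative; the actual harness code may differ):

```javascript
// Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
// sum the hypergeometric probabilities of every table with the same
// marginals that is no more likely than the observed one.
function logFact(n) {
  let s = 0;
  for (let i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

function hyperProb(a, b, c, d) {
  const n = a + b + c + d;
  return Math.exp(
    logFact(a + b) + logFact(c + d) + logFact(a + c) + logFact(b + d) -
    logFact(n) - logFact(a) - logFact(b) - logFact(c) - logFact(d)
  );
}

function fisherExact(a, b, c, d) {
  const r1 = a + b, c1 = a + c, n = a + b + c + d;
  const pObs = hyperProb(a, b, c, d);
  const kMin = Math.max(0, c1 - (n - r1));
  const kMax = Math.min(r1, c1);
  let p = 0;
  for (let k = kMin; k <= kMax; k++) {
    const pk = hyperProb(k, r1 - k, c1 - k, n - r1 - c1 + k);
    if (pk <= pObs * (1 + 1e-7)) p += pk; // tolerance for float ties
  }
  return p;
}
```

Applied to the baseline vs soft-guidance delegation counts (6/30 vs 46/56, inferred from the reported rates), fisherExact(6, 24, 46, 10) yields p far below 0.001, consistent with the table above.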

Paper & Meta

| File | Purpose |
| --- | --- |
| paper/paper.md | THE PAPER — new 7-condition version |
| meta/research-paper-writing-principles.md | Merged Opus+GPT writing principles |
| meta/reviewer-prompt.md | Structured review prompt |
| meta/study-orchestration-notes.md | Meta-methodology notes |
| model/context-growth-cost.md | Formal cost model (from old paper) |

Data

| Path | What |
| --- | --- |
| .runs/battery/bare/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/baseline/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/baseline/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/baseline-contract/20260228-184408/ | Phase 4 N=5 |
| .runs/battery/soft-guidance/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/soft-guidance/20260228-174500/ | Phase 2 scale-up |
| .runs/battery/soft-guidance/20260228-184408/ | Phase 4 T3/T4 scale-up |
| .runs/battery/soft-guidance-contract/20260228-184408/ | Phase 4 N=8 |
| .runs/battery/enforcement/20260301-000000/ | Phase 3 N=3 |
| .runs/battery/enforcement/20260228-184408/ | Phase 4 scale-up |
| .runs/battery/enforcement-contract/20260228-170000/ | Phase 1 N=3 |
| .runs/battery/enforcement-contract/20260228-174500/ | Phase 2 T1 scale-up |
| .runs/battery/enforcement-contract/20260228-184408/ | Phase 4 scale-up |

Inter-Agent Communication

| Path | What |
| --- | --- |
| ~/Desktop/ds-to-chappie.txt | Messages TO CHAPPiE (last: Phase 4 auth failure report) |
| ~/Desktop/chappie-to-ds.txt | Messages FROM CHAPPiE (last: Phase 4 re-run complete) |

Git History

| Commit | Description |
| --- | --- |
| 44db74c | Snapshot 12 |
| 8e0e0f5 | New 7-condition paper + merged writing principles |
| be8301e | Integrate correctness + scoring script |
| d06f949 | Wilson CIs, durations, R1+R2 revisions, cost model caveats |
| 5b5d1d0 | Review 3: terminology note + abstract framing |
| 81bb5ec | Review 4: cost-regime caveat + missing-metrics paragraph |
| 525b4d5 | Rewrite reviewer prompt |
| 27cb20a | Snapshot 16 + context dump |
| d892a72 | Review 5: model identity, violation regex, T6 scorer |
| 56de4af | Snapshot 17 + context dump |
| 357eecb | Review 6: regex audit, vacuous violation-free, header confound |
| 1a13805 | Snapshot 18 + context dump |
| 0fd1aa9 | Review 7: session data release, README rewrite, 14 fixes |
| b77cd70 | Snapshot 19 + context dump |
| 3fdeaf6 | Review 8: remove null-result speculation |

GitHub Release: v1.0 (session-data-v1.tar.gz, 280 sessions, 1.3 MB)

Remote: git@github.com:lorehq/delegation-study.git (private, drewswiredin)

Environment

  • Claude Code version: 2.1.63
  • Orchestrator model: claude-opus-4-6
  • Worker models: sonnet (lore-default), haiku (lore-explore, lore-fast), opus (lore-power)
  • Platform: Linux (Ubuntu)
  • K3s cluster: 3 nodes, 4 vCPU / 3 GB RAM each, Debian 12

Next Steps

  1. Context growth projection figures for paper
  2. Add auth validation to audit script
  3. Create drop-in repo (lorehq/lore-delegation-kit)
  4. Agent naming decision for drop-in kit
  5. README rewritten — verify no stale references remain
  6. Study 2 (Worker Cost Profiles) — separate publication
  7. Study 3 (Context Growth Economics) — separate publication