juxt · panayotovk · May 17, 2026 · May 17, 2026 · May 18, 2026 · May 18, 2026
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
-.DS_Store
+.DS_Store
+eval/results/**/node_modules/
diff --git a/eval/README.md b/eval/README.md
@@ -0,0 +1,141 @@
+# A/B eval harness
+
+Diagnostic harness for the distill skill. Compares two plugin variants
+(`allium-baseline` and `allium-experimental`) against
+`fixtures/insurance-claims/` and produces a Markdown variance report.
+
+> **Note on the new architecture.** As of the plugin-self-containment
+> refactor, the `allium-experimental` plugin produces a deterministic
+> consensus spec on its own (no harness required) — see
+> `plugins/experimental/skills/distill/SKILL.md`. A normal user invokes
+> `allium-experimental:distill` and gets back `./allium-distilled/spec.allium`.
+>
+> This harness still exists for diagnostic work: measuring how much the
+> LLM's inventories vary from sample to sample, A/B-ing the baseline (LLM
+> writes spec) against the experimental (LLM writes inventory, pipeline
+> writes spec) approach, and exercising the deterministic pipeline
+> independently of the orchestrator skill.
+
+## The pipeline (canonical location: inside the plugin)
+
+The four-stage pipeline that produces a deterministic spec:
+
+```
+LLM × K → K  inventory.json
+            ↓ (canonicalize-inventory.mjs per sample)
+            K  inventory.canonical.json
+            ↓ (merge-inventories.mjs across all K)
+            1  inventory.merged.json
+            ↓ (inventory-to-spec.mjs)
+            1  spec.allium      ← deterministic deliverable
+```
+
+In the **plugin** (user-facing):
+- `plugins/experimental/skills/distill/SKILL.md` orchestrates everything by spawning K subagents.
+- `plugins/experimental/scripts/` holds the three pipeline scripts.
+- `plugins/experimental/skills/distill/references/inventory-schema.md` is the contract subagents follow.
+
+In **this harness** (diagnostic):
+- `eval/run.mjs` runs the LLM directly K times per variant and applies the same canonicalize/merge/translate steps afterwards.
+- `eval/canonicalize-inventory.mjs`, `eval/merge-inventories.mjs`, `eval/inventory-to-spec.mjs` are the same scripts (kept in `eval/` for the harness's standalone use; the plugin's copies are the canonical ones).
+
+**Same K canonical inventories → byte-identical spec, every time** — true for both the plugin and the harness paths.
+
+## Prereqs
+
+- `claude` on `PATH`
+- `allium` CLI (defaults to `/opt/homebrew/bin/allium`; override with
+  `ALLIUM_BIN=...` for `compare.mjs`)
+- both plugins resolvable from the repo's marketplace
+  (`.claude-plugin/marketplace.json`)
+
+## Usage
+
+```sh
+# Generate K samples per variant and the consensus spec for each.
+node eval/run.mjs --samples 6 --parallel
+
+# Open the consensus spec(s).
+open eval/results/<timestamp>/experimental/spec.consensus.allium
+
+# Inspect the per-sample variance and structural metrics.
+node eval/compare.mjs eval/results/<timestamp>
+open eval/results/<timestamp>/report.md
+```
+
+`run.mjs` prints the results dir on the first stderr line and the
+`compare.mjs` command on the last — copy/paste, don't retype.
+
+## Useful flags
+
+`run.mjs`:
+
+- `--samples N` — samples per variant (default 3; recommend ≥4 for consensus)
+- `--variants baseline,experimental` — restrict which variants to run
+- `--model haiku` — pin a specific Claude model for reproducibility
+- `--timeout 900000` — per-invocation timeout in ms (default 15 min)
+- `--parallel` — run all samples concurrently within a variant
+- `--fixture PATH` — point at a different fixture
+- `--out DIR` — override `eval/results/`
+
+`compare.mjs` takes one positional argument: the timestamped results dir.
+
+## Output layout
+
+```
+eval/results/<timestamp>/
+├── run-config.json                    # snapshot of CLI args + prompt template
+├── report.md                          # generated by compare.mjs
+├── baseline/                          # one dir per variant
+│   ├── sample-1/
+│   │   ├── inventory.json             # raw LLM output (the deliverable from distill)
+│   │   ├── inventory.canonical.json   # normalised by canonicalize-inventory.mjs
+│   │   ├── spec.allium                # translator output for this sample (debugging)
+│   │   ├── spec.llm.allium            # if the LLM also wrote a spec, kept for forensics
+│   │   ├── stdout.raw.txt
+│   │   ├── stderr.txt
+│   │   └── meta.json                  # invocation metadata + timing
+│   ├── sample-2/...
+│   ├── inventory.merged.json          # consensus across all samples in this variant
+│   └── spec.consensus.allium          ← THE deterministic deliverable for this variant
+└── experimental/
+    └── ...
+```
+
+`eval/results/` is gitignored.
+
+## What the report covers
+
+- Per-variant: `allium check` pass rate, median entity/rule/field counts,
+  per-sample diagnostics.
+- Intra-variant determinism: pairwise unified-diff line counts, Jaccard
+  similarity of entity-name sets and rule-name sets across samples. With
+  the consensus pipeline, these sample-to-sample diff numbers mostly
+  diagnose *inventory* drift, not spec drift; the consensus spec is
+  deterministic for a given set of canonical inventories.
+- Inter-variant: structural set-diff of entities/rules and a unified text
+  diff between sample-1 of each variant.
+
+For semantic coverage scoring, hand-mark the consensus spec against
+`reference/feature-coverage.md`.
+
+## Standalone use of pipeline components
+
+Each script is invokable on its own:
+
+```sh
+# Canonicalize one LLM inventory.
+node eval/canonicalize-inventory.mjs in.json out.json
+
+# Merge multiple canonical inventories into a consensus.
+node eval/merge-inventories.mjs out.json in1.json in2.json in3.json …
+
+# Translate any inventory (raw, canonical, or merged) to a spec.
+node eval/inventory-to-spec.mjs in.json out.allium
+```
+
+Reproducibility property: given the same canonical inventories,
+`merge-inventories.mjs` always produces byte-identical output;
+`inventory-to-spec.mjs` always produces byte-identical output from any given
+inventory. The only source of non-determinism is the LLM stage that produces
+the raw inventories.
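
The byte-identical guarantee described in the README depends on serialization that ignores key insertion order. A minimal sketch of that idea follows. It assumes keys are sorted by UTF-16 code unit. The name `stableStringify` mirrors the helper in `eval/canonicalize-inventory.mjs`, but this copy and its sample objects are illustrative only.

```javascript
// Illustrative stand-in for the pipeline's serializer: sort object keys by
// UTF-16 code unit at every level so insertion order cannot leak into output.
function stableStringify(value) {
  const byCodeUnit = ([a], [b]) => (a < b ? -1 : a > b ? 1 : 0);
  const replacer = (_key, v) =>
    v && typeof v === "object" && !Array.isArray(v)
      ? Object.fromEntries(Object.entries(v).sort(byCodeUnit))
      : v;
  return JSON.stringify(value, replacer, 2) + "\n";
}

// Two hypothetical inventories with the same content but different key order.
const a = { name: "Claim", fields: [{ name: "id", type_hint: "String" }] };
const b = { fields: [{ type_hint: "String", name: "id" }], name: "Claim" };

console.log(stableStringify(a) === stableStringify(b));
```

Array order is left untouched here, so sorting arrays of named records remains a separate canonicalization step.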
diff --git a/eval/canonicalize-inventory.mjs b/eval/canonicalize-inventory.mjs
@@ -0,0 +1,230 @@
+#!/usr/bin/env node
+// Inventory canonicalizer.
+//
+// Reads an LLM-produced inventory.json and writes a normalized form
+// (inventory.canonical.json). The normalization is deterministic and
+// idempotent: two inventories that differ only in convention (nullability
+// encoding, array order, guidance whitespace) collapse to the same canonical
+// JSON.
+//
+// What we normalize:
+//   - Recursive alphabetical sort of every array of named records (by `name`;
+//     webhooks by `path`; routes by `method`, then `path`).
+//   - Field nullability: prefer `type_hint: "T?"` and drop the `nullable: true`
+//     flag. Equivalent forms collapse to one canonical form.
+//   - Enum values: alphabetical.
+//   - String normalization: trim leading/trailing whitespace; collapse internal
+//     runs of whitespace to a single space; drop a trailing period for short
+//     prose (so "X." and "X" canonicalize the same way).
+//   - Guidance fields: same string normalization as above. NOT dropped (the
+//     user wants full feature coverage), but normalized.
+//   - JSON output: 2-space indent, sorted keys at every level — so two
+//     canonical inventories with the same content are byte-identical.
+//
+// What we DO NOT normalize:
+//   - Set membership (e.g., whether a derived property is present in one
+//     inventory but not another). That's model-choice variance, not
+//     convention drift. Tighten SKILL.md to address it, or build a consensus
+//     across samples with merge-inventories.mjs.
+//
+// Usage:
+//   node eval/canonicalize-inventory.mjs <inventory.json> [<output.json>]
+
+import { readFileSync, writeFileSync } from "fs";
+
+function normString(s) {
+  if (typeof s !== "string") return s;
+  let v = s.trim().replace(/\s+/g, " ");
+  // Drop a single trailing period — short prose like "X." vs "X" should
+  // collapse. Don't strip from longer multi-sentence text (heuristic: only
+  // strip when there's no other period in the string).
+  if (v.endsWith(".") && v.indexOf(".") === v.length - 1) v = v.slice(0, -1).trimEnd();
+  return v;
+}
+
+function normField(field) {
+  const out = { ...field };
+  // Nullability convention: prefer suffix `?` on type_hint, drop nullable.
+  if (typeof out.type_hint === "string") {
+    const t = out.type_hint.trim();
+    const isNullable = out.nullable === true || t.endsWith("?");
+    const baseType = (t.endsWith("?") ? t.slice(0, -1) : t).trim();
+    out.type_hint = isNullable ? `${baseType}?` : baseType;
+    if ("nullable" in out) delete out.nullable;
+  }
+  return out;
+}
+
+function sortByKey(arr, ...keys) {
+  return [...arr].sort((a, b) => {
+    for (const k of keys) {
+      const av = String(a?.[k] ?? "");
+      const bv = String(b?.[k] ?? "");
+      const cmp = av < bv ? -1 : av > bv ? 1 : 0; // code-unit order, independent of host locale
+      if (cmp !== 0) return cmp;
+    }
+    return 0;
+  });
+}
+
+function canonEntity(e) {
+  const out = { ...e };
+  if (Array.isArray(out.fields)) out.fields = sortByKey(out.fields.map(normField), "name");
+  if (out.status_enum?.values) {
+    out.status_enum = {
+      ...out.status_enum,
+      values: [...out.status_enum.values].sort(),
+    };
+  }
+  if (Array.isArray(out.relationships)) out.relationships = sortByKey(out.relationships, "name");
+  if (Array.isArray(out.derived_properties)) {
+    out.derived_properties = sortByKey(out.derived_properties.map((d) => ({
+      ...d,
+      expression: typeof d.expression === "string" ? d.expression.trim() : d.expression,
+    })), "name");
+  }
+  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
+  return out;
+}
+
+function canonTransition(t) {
+  const out = { ...t };
+  if (Array.isArray(out.called_from)) out.called_from = [...out.called_from].sort();
+  if (out.body) {
+    const body = { ...out.body };
+    if (Array.isArray(body.params)) body.params = sortByKey(body.params.map(normField), "name");
+    if (Array.isArray(body.requires)) body.requires = [...body.requires].map((s) => String(s).trim()).sort();
+    if (Array.isArray(body.lets)) body.lets = sortByKey(body.lets.map((l) => ({
+      ...l,
+      expression: typeof l.expression === "string" ? l.expression.trim() : l.expression,
+    })), "name");
+    if (Array.isArray(body.ensures)) {
+      // Ensures are an ordered list semantically (assigns can depend on prior
+      // ones). Keep code order EXCEPT canonicalize within each item.
+      body.ensures = body.ensures.map(canonEnsuresItem);
+    }
+    out.body = body;
+  }
+  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
+  return out;
+}
+
+function canonEnsuresItem(it) {
+  const out = { ...it };
+  if (out.kind === "create" && out.fields && typeof out.fields === "object") {
+    out.fields = Object.fromEntries(
+      Object.entries(out.fields).sort(([a], [b]) => a.localeCompare(b)),
+    );
+  }
+  if (out.kind === "invoke" && out.args && typeof out.args === "object") {
+    out.args = Object.fromEntries(
+      Object.entries(out.args).sort(([a], [b]) => a.localeCompare(b)),
+    );
+  }
+  return out;
+}
+
+function canonScheduledJob(j) {
+  const out = { ...j };
+  if (out.body) {
+    const body = { ...out.body };
+    if (typeof body.when === "string") body.when = body.when.trim();
+    if (Array.isArray(body.requires)) body.requires = [...body.requires].map((s) => String(s).trim()).sort();
+    if (Array.isArray(body.ensures)) body.ensures = body.ensures.map(canonEnsuresItem);
+    out.body = body;
+  }
+  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
+  return out;
+}
+
+function canonIntegration(i) {
+  const out = { ...i };
+  if (Array.isArray(out.operations)) {
+    out.operations = sortByKey(out.operations.map((op) => ({
+      ...op,
+      params: Array.isArray(op.params) ? sortByKey(op.params.map(normField), "name") : op.params,
+      preconditions: Array.isArray(op.preconditions)
+        ? [...op.preconditions].map((s) => String(s).trim()).sort()
+        : op.preconditions,
+      raises: Array.isArray(op.raises) ? [...op.raises].sort() : op.raises,
+    })), "name");
+  }
+  return out;
+}
+
+function canonValueType(v) {
+  const out = { ...v };
+  if (Array.isArray(out.fields)) out.fields = sortByKey(out.fields.map(normField), "name");
+  return out;
+}
+
+function canonAuxEnum(e) {
+  return { ...e, values: [...(e.values ?? [])].sort() };
+}
+
+function canonInvariant(inv) {
+  return {
+    ...inv,
+    expression: typeof inv.expression === "string" ? inv.expression.trim() : inv.expression,
+    enforced_by: Array.isArray(inv.enforced_by) ? [...inv.enforced_by].sort() : inv.enforced_by,
+  };
+}
+
+function canonConfig(c) {
+  return { ...c, value: typeof c.value === "string" ? c.value.trim() : c.value };
+}
+
+function canonRoute(r) {
+  return { ...r };
+}
+
+function canonWebhook(w) {
+  return {
+    ...w,
+    linking_rule: typeof w.linking_rule === "string" ? normString(w.linking_rule) : w.linking_rule,
+  };
+}
+
+function canonInventory(inv) {
+  return {
+    header: inv.header ?? null,
+    entities: sortByKey((inv.entities ?? []).map(canonEntity), "name"),
+    value_types: sortByKey((inv.value_types ?? []).map(canonValueType), "name"),
+    auxiliary_enumerations: sortByKey((inv.auxiliary_enumerations ?? []).map(canonAuxEnum), "name"),
+    integrations: sortByKey((inv.integrations ?? []).map(canonIntegration), "name"),
+    config: sortByKey((inv.config ?? []).map(canonConfig), "name"),
+    transitions: sortByKey((inv.transitions ?? []).map(canonTransition), "name"),
+    scheduled_jobs: sortByKey((inv.scheduled_jobs ?? []).map(canonScheduledJob), "name"),
+    invariants: sortByKey((inv.invariants ?? []).map(canonInvariant), "name"),
+    routes: sortByKey((inv.routes ?? []).map(canonRoute), "method", "path"),
+    webhooks: sortByKey((inv.webhooks ?? []).map(canonWebhook), "path"),
+  };
+}
+
+// Stable JSON serialization: 2-space indent + sorted keys at every level.
+function stableStringify(value) {
+  return JSON.stringify(value, sortReplacer, 2) + "\n";
+}
+function sortReplacer(_key, value) {
+  if (value && typeof value === "object" && !Array.isArray(value)) {
+    return Object.fromEntries(
+      Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)),
+    );
+  }
+  return value;
+}
+
+function main() {
+  const [, , inputPath, outputPath] = process.argv;
+  if (!inputPath) {
+    console.error("usage: node eval/canonicalize-inventory.mjs <inventory.json> [<output.json>]");
+    process.exit(2);
+  }
+  const inv = JSON.parse(readFileSync(inputPath, "utf-8"));
+  const canon = canonInventory(inv);
+  const out = stableStringify(canon);
+  if (outputPath) writeFileSync(outputPath, out);
+  else process.stdout.write(out);
+}
+
+main();
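
The header comment of `canonicalize-inventory.mjs` claims the normalization is idempotent, and a small check can exercise that claim. The sketch below assumes a hypothetical helper, `idempotenceFailures`, which is not part of this PR. It also uses a standalone copy of the string rule that trims again after dropping the period, so input such as `"X ."` settles in a single pass. The sample strings are made up for illustration.

```javascript
// Hypothetical helper (not part of the PR): returns the inputs for which
// applying `fn` twice gives a different result than applying it once.
function idempotenceFailures(fn, inputs) {
  return inputs.filter((x) => {
    const once = fn(x);
    return JSON.stringify(fn(once)) !== JSON.stringify(once);
  });
}

// Standalone copy of the string rule, for illustration only.
function normString(s) {
  if (typeof s !== "string") return s;
  let v = s.trim().replace(/\s+/g, " ");
  // Drop a lone trailing period, then trim any space it leaves behind.
  if (v.endsWith(".") && v.indexOf(".") === v.length - 1) v = v.slice(0, -1).trimEnd();
  return v;
}

const samples = ["X.", "  X  ", "X .", "a.  b.", "Claims   are reviewed."];
const failures = idempotenceFailures(normString, samples);
console.log(failures.length === 0 ? "idempotent on all samples" : failures);
```

The same helper could be pointed at a whole-inventory canonicalizer, since it compares results through `JSON.stringify` and does not assume strings.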