3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
.DS_Store
.DS_Store
eval/results/**/node_modules/
141 changes: 141 additions & 0 deletions eval/README.md
@@ -0,0 +1,141 @@
# A/B eval harness

Diagnostic harness for the distill skill. Compares two plugin variants
(`allium-baseline` and `allium-experimental`) against
`fixtures/insurance-claims/` and produces a Markdown variance report.

> **Note on the new architecture.** As of the plugin-self-containment
> refactor, the `allium-experimental` plugin produces a deterministic
> consensus spec on its own (no harness required) — see
> `plugins/experimental/skills/distill/SKILL.md`. A normal user invokes
> `allium-experimental:distill` and gets back `./allium-distilled/spec.allium`.
>
> This harness still exists for diagnostic work: measuring how much
> per-sample inventory variance the LLM has, A/B-ing the baseline (LLM
> writes spec) vs the experimental (LLM writes inventory + pipeline
> writes spec) approaches, and exercising the deterministic pipeline
> independently of the orchestrator skill.

## The pipeline (canonical location: inside the plugin)

The four-stage pipeline that produces a deterministic spec:

```
LLM × K → K inventory.json
↓ (canonicalize-inventory.mjs per sample)
K inventory.canonical.json
↓ (merge-inventories.mjs across all K)
1 inventory.merged.json
↓ (inventory-to-spec.mjs)
1 spec.allium ← deterministic deliverable
```

In the **plugin** (user-facing):
- `plugins/experimental/skills/distill/SKILL.md` orchestrates everything by spawning K subagents.
- `plugins/experimental/scripts/` holds the three pipeline scripts.
- `plugins/experimental/skills/distill/references/inventory-schema.md` is the contract subagents follow.

In **this harness** (diagnostic):
- `eval/run.mjs` runs the LLM directly K times per variant and applies the same canonicalize/merge/translate steps afterwards.
- `eval/canonicalize-inventory.mjs`, `eval/merge-inventories.mjs`, `eval/inventory-to-spec.mjs` are the same scripts (kept in `eval/` for the harness's standalone use; the plugin's copies are the canonical ones).

**Same K canonical inventories → byte-identical spec, every time** — true for both the plugin and the harness paths.
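
To make "canonical" concrete, here is a minimal sketch of how two samples that differ only in convention can collapse to the same bytes. It is a simplified stand-in, not the real `canonicalize-inventory.mjs`: it handles only field order, key order and nullability encoding, and the field names and type hints (`paid_at`, `Timestamp`, `Money`) are made up for illustration. `canonField`, `byName` and `canonFields` are hypothetical helpers.

```javascript
// Simplified sketch only; assumes fields look like { name, type_hint, nullable? }.
function canonField(field) {
  const { nullable, ...rest } = field;
  const hint = String(rest.type_hint).trim();
  const base = hint.endsWith("?") ? hint.slice(0, -1) : hint;
  const isNullable = nullable === true || hint.endsWith("?");
  // One canonical encoding: a `?` suffix on type_hint, no separate flag.
  return { ...rest, type_hint: isNullable ? `${base}?` : base };
}

// Plain code-unit comparison, so the order does not depend on the locale.
const compare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const byName = (a, b) => compare(a.name, b.name);

// Serialize with keys sorted at every level: equal content gives equal bytes.
function stableStringify(value) {
  const sortKeys = (_key, v) =>
    v && typeof v === "object" && !Array.isArray(v)
      ? Object.fromEntries(Object.entries(v).sort(([a], [b]) => compare(a, b)))
      : v;
  return JSON.stringify(value, sortKeys, 2) + "\n";
}

function canonFields(fields) {
  return stableStringify(fields.map(canonField).sort(byName));
}

// Two hypothetical samples: different array order, key order and nullability style.
const sampleA = [
  { name: "paid_at", type_hint: "Timestamp", nullable: true },
  { name: "amount", type_hint: "Money" },
];
const sampleB = [
  { type_hint: "Money", name: "amount" },
  { type_hint: "Timestamp?", name: "paid_at" },
];

console.log(canonFields(sampleA) === canonFields(sampleB)); // true
```

Anything that survives this kind of normalization (a field present in one sample but not the other, say) is genuine model variance, which is what the merge step and the report are for.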

## Prereqs

- `claude` on `PATH`
- `allium` CLI (defaults to `/opt/homebrew/bin/allium`; override with
`ALLIUM_BIN=...` for `compare.mjs`)
- both plugins resolvable from the repo's marketplace
(`.claude-plugin/marketplace.json`)

## Usage

```sh
# Generate K samples per variant and the consensus spec for each.
node eval/run.mjs --samples 6 --parallel

# Open the consensus spec(s).
open eval/results/<timestamp>/experimental/spec.consensus.allium

# Inspect the per-sample variance and structural metrics.
node eval/compare.mjs eval/results/<timestamp>
open eval/results/<timestamp>/report.md
```

`run.mjs` prints the results dir on the first stderr line and the
`compare.mjs` command on the last — copy/paste, don't retype.

## Useful flags

`run.mjs`:

- `--samples N` — samples per variant (default 3; recommend ≥4 for consensus)
- `--variants baseline,experimental` — restrict which variants to run
- `--model haiku` — pin a specific Claude model for reproducibility
- `--timeout 900000` — per-invocation timeout in ms (default 15 min)
- `--parallel` — run all samples concurrently within a variant
- `--fixture PATH` — point at a different fixture
- `--out DIR` — override `eval/results/`

`compare.mjs` takes one positional argument: the timestamped results dir.

## Output layout

```
eval/results/<timestamp>/
├── run-config.json                     # snapshot of CLI args + prompt template
├── report.md                           # generated by compare.mjs
├── baseline/                           # one dir per variant
│   ├── sample-1/
│   │   ├── inventory.json              # raw LLM output (the deliverable from distill)
│   │   ├── inventory.canonical.json    # normalised by canonicalize-inventory.mjs
│   │   ├── spec.allium                 # translator output for this sample (debugging)
│   │   ├── spec.llm.allium             # if the LLM also wrote a spec, kept for forensics
│   │   ├── stdout.raw.txt
│   │   ├── stderr.txt
│   │   └── meta.json                   # invocation metadata + timing
│   ├── sample-2/...
│   ├── inventory.merged.json           # consensus across all samples in this variant
│   └── spec.consensus.allium           ← THE deterministic deliverable for this variant
└── experimental/
    └── ...
```

`eval/results/` is gitignored.

## What the report covers

- Per-variant: `allium check` pass rate, median entity/rule/field counts,
per-sample diagnostics.
- Intra-variant determinism: pairwise unified-diff line counts, Jaccard
similarity of entity-name sets and rule-name sets across samples. With
the consensus pipeline, these sample-to-sample diff numbers are now mostly
a diagnostic of *inventory* drift, not spec drift; the consensus spec is
deterministic for a given set of canonical inventories.
- Inter-variant: structural set-diff of entities/rules and a unified text
diff between sample-1 of each variant.
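
For readers unfamiliar with the metric, the Jaccard similarity of two name sets is the size of their intersection divided by the size of their union. The sketch below shows one plausible way to compute it pairwise across samples; it is not taken from `compare.mjs`, the entity names are invented, and `jaccard` and `meanPairwiseJaccard` are hypothetical helpers.

```javascript
// Jaccard similarity |A ∩ B| / |A ∪ B| of two name lists, treated as sets.
function jaccard(namesA, namesB) {
  const a = new Set(namesA);
  const b = new Set(namesB);
  if (a.size === 0 && b.size === 0) return 1; // convention: two empty sets count as identical
  let shared = 0;
  for (const name of a) if (b.has(name)) shared += 1;
  return shared / (a.size + b.size - shared);
}

// Mean similarity over every unordered pair of samples.
function meanPairwiseJaccard(samples) {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < samples.length; i++) {
    for (let j = i + 1; j < samples.length; j++) {
      total += jaccard(samples[i], samples[j]);
      pairs += 1;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}

// Hypothetical entity-name sets from three samples of one variant.
const entityNames = [
  ["Claim", "Policy", "Payment"],
  ["Claim", "Policy", "Adjuster"],
  ["Claim", "Policy", "Payment"],
];

console.log(jaccard(entityNames[0], entityNames[1])); // 0.5
console.log(meanPairwiseJaccard(entityNames));
```

A value of 1 means every sample named exactly the same entities; lower values point at set-membership drift that canonicalization cannot remove.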

For semantic coverage scoring, hand-mark the consensus spec against
`reference/feature-coverage.md`.

## Standalone use of pipeline components

Each script is invokable on its own:

```sh
# Canonicalize one LLM inventory.
node eval/canonicalize-inventory.mjs in.json out.json

# Merge multiple canonical inventories into a consensus.
node eval/merge-inventories.mjs out.json in1.json in2.json in3.json …

# Translate any inventory (raw, canonical, or merged) to a spec.
node eval/inventory-to-spec.mjs in.json out.allium
```

Reproducibility property: given the same canonical inventories,
`merge-inventories.mjs` always produces byte-identical output;
`inventory-to-spec.mjs` always produces byte-identical output from any given
inventory. The only source of non-determinism is the LLM stage that produces
the raw inventories.
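
If two runs over the same canonical inventories ever disagree, it helps to see where. A minimal sketch, assuming both outputs have already been read into strings; `firstDifference` is a hypothetical helper, not part of the harness.

```javascript
// Return null when the two texts are byte-for-byte equal as strings,
// otherwise the 1-based line number and both versions of the first differing line.
function firstDifference(a, b) {
  if (a === b) return null;
  const linesA = a.split("\n");
  const linesB = b.split("\n");
  const n = Math.max(linesA.length, linesB.length);
  for (let i = 0; i < n; i++) {
    if (linesA[i] !== linesB[i]) {
      // A missing line (one text is shorter) is reported as null.
      return { line: i + 1, a: linesA[i] ?? null, b: linesB[i] ?? null };
    }
  }
  return null; // not reached when a !== b
}

// Placeholder text standing in for two pipeline outputs.
const run1 = "alpha\nbeta\n";
const run2 = "alpha\ngamma\n";

console.log(firstDifference(run1, run1)); // null
console.log(firstDifference(run1, run2)); // { line: 2, a: 'beta', b: 'gamma' }
```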
230 changes: 230 additions & 0 deletions eval/canonicalize-inventory.mjs
@@ -0,0 +1,230 @@
#!/usr/bin/env node
// Inventory canonicalizer.
//
// Reads an LLM-produced inventory.json and writes a normalized form
// (inventory.canonical.json). The normalization is deterministic and
// idempotent: two inventories that differ only in convention (nullability
// encoding, array order, guidance whitespace) collapse to the same canonical
// JSON.
//
// What we normalize:
//   - Array order: every array of named records is sorted alphabetically,
//     including nested ones (by `name`; routes by `method` then `path`;
//     webhooks by `path`). `ensures` lists keep their code order.
//   - Field nullability: prefer `type_hint: "T?"` and drop the `nullable`
//     flag. Equivalent forms collapse to one canonical form.
//   - Enum values: alphabetical.
//   - String normalization (guidance, webhook `linking_rule`): trim
//     leading/trailing whitespace; collapse internal runs of whitespace to a
//     single space; drop a single trailing period when it is the only period
//     in the string (so "X." and "X" canonicalize the same way).
//     Expressions are only trimmed.
//   - Guidance fields: same string normalization as above. NOT dropped (the
//     user wants full feature coverage), but normalized.
//   - JSON output: 2-space indent, sorted keys at every level, so two
//     canonical inventories with the same content are byte-identical.
//
// What we DO NOT normalize:
//   - Set membership (e.g., whether a derived property is present in one
//     inventory but not another). That's model-choice variance, not
//     convention drift. Use the SKILL.md tightenings to address it, or
//     run consensus-voting in a separate tool.
//
// Usage:
//   node eval/canonicalize-inventory.mjs <inventory.json> [<output.json>]

import { readFileSync, writeFileSync } from "fs";

function normString(s) {
  if (typeof s !== "string") return s;
  let v = s.trim().replace(/\s+/g, " ");
  // Drop a single trailing period: short prose like "X." vs "X" should
  // collapse. Don't strip from longer multi-sentence text (heuristic: only
  // strip when there's no other period in the string).
  // trimEnd() keeps the result idempotent for inputs like "X ." where
  // removing the period would otherwise leave a trailing space.
  if (v.endsWith(".") && v.indexOf(".") === v.length - 1) v = v.slice(0, -1).trimEnd();
  return v;
}

function normField(field) {
  const out = { ...field };
  // Nullability convention: prefer suffix `?` on type_hint, drop nullable.
  if (typeof out.type_hint === "string") {
    const t = out.type_hint.trim();
    const isNullable = out.nullable === true || t.endsWith("?");
    const baseType = t.endsWith("?") ? t.slice(0, -1) : t;
    out.type_hint = isNullable ? `${baseType}?` : baseType;
    if ("nullable" in out) delete out.nullable;
  }
  return out;
}

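// Return a sorted copy of `arr`, comparing records by each key in turn.
// Missing keys compare as the empty string.
// Note: localeCompare uses the runtime's default locale, so the resulting
// order can differ between environments with different locale/ICU settings.
function sortByKey(arr, ...keys) {
  return [...arr].sort((a, b) => {
    for (const k of keys) {
      const av = String(a?.[k] ?? "");
      const bv = String(b?.[k] ?? "");
      const cmp = av.localeCompare(bv);
      if (cmp !== 0) return cmp;
    }
    return 0;
  });
}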

function canonEntity(e) {
  const out = { ...e };
  if (Array.isArray(out.fields)) out.fields = sortByKey(out.fields.map(normField), "name");
  if (out.status_enum?.values) {
    out.status_enum = {
      ...out.status_enum,
      values: [...out.status_enum.values].sort(),
    };
  }
  if (Array.isArray(out.relationships)) out.relationships = sortByKey(out.relationships, "name");
  if (Array.isArray(out.derived_properties)) {
    out.derived_properties = sortByKey(out.derived_properties.map((d) => ({
      ...d,
      expression: typeof d.expression === "string" ? d.expression.trim() : d.expression,
    })), "name");
  }
  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
  return out;
}

function canonTransition(t) {
  const out = { ...t };
  if (Array.isArray(out.called_from)) out.called_from = [...out.called_from].sort();
  if (out.body) {
    const body = { ...out.body };
    if (Array.isArray(body.params)) body.params = sortByKey(body.params.map(normField), "name");
    if (Array.isArray(body.requires)) body.requires = [...body.requires].map((s) => String(s).trim()).sort();
    if (Array.isArray(body.lets)) body.lets = sortByKey(body.lets.map((l) => ({
      ...l,
      expression: typeof l.expression === "string" ? l.expression.trim() : l.expression,
    })), "name");
    if (Array.isArray(body.ensures)) {
      // Ensures are an ordered list semantically (assigns can depend on prior
      // ones). Keep code order EXCEPT canonicalize within each item.
      body.ensures = body.ensures.map(canonEnsuresItem);
    }
    out.body = body;
  }
  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
  return out;
}

function canonEnsuresItem(it) {
  const out = { ...it };
  if (out.kind === "create" && out.fields && typeof out.fields === "object") {
    out.fields = Object.fromEntries(
      Object.entries(out.fields).sort(([a], [b]) => a.localeCompare(b)),
    );
  }
  if (out.kind === "invoke" && out.args && typeof out.args === "object") {
    out.args = Object.fromEntries(
      Object.entries(out.args).sort(([a], [b]) => a.localeCompare(b)),
    );
  }
  return out;
}

function canonScheduledJob(j) {
  const out = { ...j };
  if (out.body) {
    const body = { ...out.body };
    if (typeof body.when === "string") body.when = body.when.trim();
    if (Array.isArray(body.requires)) body.requires = [...body.requires].map((s) => String(s).trim()).sort();
    if (Array.isArray(body.ensures)) body.ensures = body.ensures.map(canonEnsuresItem);
    out.body = body;
  }
  if (typeof out.guidance === "string") out.guidance = normString(out.guidance);
  return out;
}

function canonIntegration(i) {
  const out = { ...i };
  if (Array.isArray(out.operations)) {
    out.operations = sortByKey(out.operations.map((op) => ({
      ...op,
      params: Array.isArray(op.params) ? sortByKey(op.params.map(normField), "name") : op.params,
      preconditions: Array.isArray(op.preconditions)
        ? [...op.preconditions].map((s) => String(s).trim()).sort()
        : op.preconditions,
      raises: Array.isArray(op.raises) ? [...op.raises].sort() : op.raises,
    })), "name");
  }
  return out;
}

function canonValueType(v) {
  const out = { ...v };
  if (Array.isArray(out.fields)) out.fields = sortByKey(out.fields.map(normField), "name");
  return out;
}

function canonAuxEnum(e) {
  return { ...e, values: [...(e.values ?? [])].sort() };
}

function canonInvariant(inv) {
  return {
    ...inv,
    expression: typeof inv.expression === "string" ? inv.expression.trim() : inv.expression,
    enforced_by: Array.isArray(inv.enforced_by) ? [...inv.enforced_by].sort() : inv.enforced_by,
  };
}

function canonConfig(c) {
  return { ...c, value: typeof c.value === "string" ? c.value.trim() : c.value };
}

function canonRoute(r) {
  return { ...r };
}

function canonWebhook(w) {
  return {
    ...w,
    linking_rule: typeof w.linking_rule === "string" ? normString(w.linking_rule) : w.linking_rule,
  };
}

function canonInventory(inv) {
  return {
    header: inv.header ?? null,
    entities: sortByKey((inv.entities ?? []).map(canonEntity), "name"),
    value_types: sortByKey((inv.value_types ?? []).map(canonValueType), "name"),
    auxiliary_enumerations: sortByKey((inv.auxiliary_enumerations ?? []).map(canonAuxEnum), "name"),
    integrations: sortByKey((inv.integrations ?? []).map(canonIntegration), "name"),
    config: sortByKey((inv.config ?? []).map(canonConfig), "name"),
    transitions: sortByKey((inv.transitions ?? []).map(canonTransition), "name"),
    scheduled_jobs: sortByKey((inv.scheduled_jobs ?? []).map(canonScheduledJob), "name"),
    invariants: sortByKey((inv.invariants ?? []).map(canonInvariant), "name"),
    routes: sortByKey((inv.routes ?? []).map(canonRoute), "method", "path"),
    webhooks: sortByKey((inv.webhooks ?? []).map(canonWebhook), "path"),
  };
}

// Stable JSON serialization: 2-space indent + sorted keys at every level.
function stableStringify(value) {
  return JSON.stringify(value, sortReplacer, 2) + "\n";
}
function sortReplacer(_key, value) {
  if (value && typeof value === "object" && !Array.isArray(value)) {
    return Object.fromEntries(
      Object.entries(value).sort(([a], [b]) => a.localeCompare(b)),
    );
  }
  return value;
}

function main() {
  const [, , inputPath, outputPath] = process.argv;
  if (!inputPath) {
    console.error("usage: node eval/canonicalize-inventory.mjs <inventory.json> [<output.json>]");
    process.exit(2);
  }
  const inv = JSON.parse(readFileSync(inputPath, "utf-8"));
  const canon = canonInventory(inv);
  const out = stableStringify(canon);
  if (outputPath) writeFileSync(outputPath, out);
  else process.stdout.write(out);
}

main();