feat(llm-challenge): rebuild evidence collector #1225

Merged: toiroakr merged 45 commits into main from docs/llm-challenge-redesign-reset on Jun 1, 2026
dqn (Contributor) commented May 23, 2026

Summary

Rebuild llm-challenge as a small evidence collector for SDK affordance work, based on the new rebuild brief. The tool now records reproducible Codex runs and artifacts without grading, scoring, reference solutions, or trend analysis.

Main changes

  • Replace the previous challenge/evaluator pipeline with a single pnpm -C llm-challenge challenge run command that discovers problems, packs SDK refs, applies profiles, prepares workspaces, runs Codex in Podman, and writes report.json.
  • Add the 19 prompt/scaffold-only problems across the sdk-api and cli groups.
  • Add focused tests for argument parsing, problem discovery and filtering, no-docs profile filtering, report and artifact paths, and workspace preparation.
  • Harden the runner and reports around verification evidence, artifact summaries, no-docs solves, pnpm store reuse, and cleanup.
  • Add llm-challenge agent skill documentation for creating problems, API proposal reporting, and A/B testing workflows.

Notes

  • No changeset is included because this is development tooling for evidence collection, not SDK package behavior.

changeset-bot (Bot) commented May 23, 2026

⚠️ No Changeset found

Latest commit: a82e233

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

github-actions (Bot) commented May 23, 2026

⚡ pkg.pr.new

@tailor-platform/sdk

pnpm add https://pkg.pr.new/@tailor-platform/sdk@a82e233
pnpm dlx https://pkg.pr.new/@tailor-platform/sdk@a82e233 --help

@tailor-platform/create-sdk

pnpm add https://pkg.pr.new/@tailor-platform/create-sdk@a82e233
pnpm dlx https://pkg.pr.new/@tailor-platform/create-sdk@a82e233 my-app

commit: a82e233

dqn added 2 commits May 25, 2026 15:25
Each run was creating its own pnpm store under
results/<runId>/.shared/pnpm-store and never deleting the per-problem
node_modules unless --prune-workspace-deps was passed, leaving tens of
GB on disk after a handful of runs.

- Persist the pnpm store at llm-challenge/.cache/pnpm-store so packages
  are hardlinked across runs instead of being re-downloaded per run.
- Wrap each problem's task body in try/finally so pruneWorkspaceDeps
  runs even when the solver fails or is interrupted; prune errors are
  logged so they can't mask the original failure.
- Flip the default to on and rename the flag to --no-prune-workspace-deps
  so debugging is opt-in.
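
The cleanup behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `runProblem`, `solve`, `prune`, and `log` are hypothetical names, and only the try/finally shape and the logged prune error come from the commit message.

```typescript
// Hypothetical sketch: names and signatures are assumptions, not the runner's API.
type Log = (message: string) => void;

async function runProblem<T>(
  solve: () => Promise<T>,
  prune: () => Promise<void>,
  log: Log,
): Promise<T> {
  try {
    // The solver may throw or be interrupted.
    return await solve();
  } finally {
    try {
      // Cleanup runs on success and on failure alike.
      await prune();
    } catch (error) {
      // Log instead of rethrowing so the solver's own error stays visible.
      log(`prune failed: ${String(error)}`);
    }
  }
}
```

Because the inner `catch` swallows the prune error after logging it, a caller awaiting `runProblem` still sees the solver's original rejection.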

dqn (Contributor, Author) commented May 31, 2026

llm-challenge artifact analysis: SDK/API improvement signals

Scope: this comment lists API affordance candidates, not SDK usage mistakes. The observations are based on trace/work evidence from the latest llm-challenge artifacts, but local artifact paths are intentionally omitted. In the examples, Before means current correct SDK/API usage or workaround; After means an illustrative public SDK/API affordance candidate, not an implementation contract.

| Area | Observed signal | Improvement proposal |
|------|-----------------|----------------------|
| Resolver Context normalization | The resolver-context solution had to add manual helpers for caller IDs, anonymous users, invoker presence, and runtime string values before the output matched the declared enum/string shape. Typecheck also caught raw string values being returned where narrower output enum values were expected. | Add a resolver context normalization helper for common caller/invoker/env summaries. |
| Resolver output schema builder discoverability | The structured-result solutions had to discover t.object, t.enum, .typeName(), and t.output<typeof schema>. Final code varied between bare output object literals and named t.object(...).typeName(...) schemas. | Add a more discoverable resolver output builder that keeps schema naming and result-type inference together. |
| TailorDB required/unique/enum field intent | The field-options solutions relied on SDK defaults and chained calls such as db.string().unique() and db.enum(...). The task required human-visible required, unique, and constrained-choice intent, but required-ness is implicit unless the author already knows the default. | Add explicit field intent helpers or aliases so required/unique/enum intent is visible in code. |
| TailorDB relation naming | The relation-naming solutions used the low-level relation object with type, toward.type, toward.as, and backward to express a common belongs-to / has-many relation. | Add relation helpers for common cardinalities that make forward and backward names first-class. |
| Workflow job chaining | The workflow solutions expressed linear dependency by hand with an orchestrator job or nested trigger() calls between jobs. The ordering was correct, but the dependency graph lived inside arbitrary job bodies. | Add a workflow pipeline/sequence helper for simple linear job chains. |
| Workflow wait point resolution | The approval workflow solutions used defineWaitPoint, wait, and resolve, then hand-wired a resolver or executor with duplicate input parsing and output shape. | Add a wait-point resume/resolver helper that derives the resolver/executor shape from the wait point payload/result types. |

Resolver Context normalization

Before:

type CallerType = "anonymous" | "user" | "machine_user";
type InvokerType = "none" | "user" | "machine_user";

function presentId(id: string | undefined): string | null {
  return id && id !== anonymousId ? id : null;
}

function callerType(type: string): CallerType {
  return type === "user" || type === "machine_user" ? type : "anonymous";
}

function invokerType(type: string | undefined): InvokerType {
  return type === "user" || type === "machine_user" ? type : "none";
}

body: ({ user, invoker, env }) => ({
  caller: {
    id: presentId(user.id),
    type: callerType(user.type),
    workspaceId: presentId(user.workspaceId),
  },
  request: {
    hasInvoker: invoker != null,
    invokerId: invoker?.id ?? null,
    invokerType: invokerType(invoker?.type),
  },
  environment: {
    summaryLabel: String(env.SUMMARY_LABEL ?? "unset"),
  },
})

After:

body: ({ user, invoker, env }) => {
  const context = resolverContext({ user, invoker, env });

  return {
    caller: context.callerSummary(),
    request: context.invokerSummary({ noneType: "none" }),
    environment: {
      summaryLabel: context.env.string("SUMMARY_LABEL", "unset"),
    },
  };
}

Resolver output schema builder discoverability

Before:

const inventoryDashboardOutput = t.object({
  summary: t
    .object({
      totalItems: t.int(),
      lowStockItems: t.int(),
      reorderNeeded: t.bool(),
    })
    .typeName("InventoryDashboardSummary"),
  items: t
    .object(
      {
        itemId: t.string(),
        sku: t.string(),
        stockStatus: t.enum(["inStock", "lowStock", "outOfStock"]),
      },
      { array: true },
    )
    .typeName("InventoryDashboardItemRow"),
}).typeName("InventoryDashboardResult");

export type InventoryDashboardResult = t.output<typeof inventoryDashboardOutput>;

After:

const inventoryDashboardOutput = resolverOutput("InventoryDashboardResult", (schema) => ({
  summary: schema.object("InventoryDashboardSummary", {
    totalItems: schema.int(),
    lowStockItems: schema.int(),
    reorderNeeded: schema.bool(),
  }),
  items: schema.arrayOf("InventoryDashboardItemRow", {
    itemId: schema.string(),
    sku: schema.string(),
    stockStatus: schema.enum("InventoryStockStatus", [
      "inStock",
      "lowStock",
      "outOfStock",
    ]),
  }),
}));

export type InventoryDashboardResult = typeof inventoryDashboardOutput.Output;

TailorDB required/unique/enum field intent

Before:

export const CustomerAccount = db.type("CustomerAccount", {
  customerId: db.uuid().unique(),
  contactEmail: db.string().unique(),
  displayName: db.string(),
  accountTier: db.enum([
    { value: "free", description: "Free account" },
    { value: "pro", description: "Professional account" },
    { value: "enterprise", description: "Enterprise account" },
  ]),
  signupTime: db.datetime(),
});

After:

export const CustomerAccount = db.type("CustomerAccount", {
  customerId: db.required.uuid().unique(),
  contactEmail: db.required.email().unique(),
  displayName: db.required.string(),
  accountTier: db.required.enum("AccountTier", [
    ["free", "Free account"],
    ["pro", "Professional account"],
    ["enterprise", "Enterprise account"],
  ]),
  signupTime: db.required.datetime(),
});

TailorDB relation naming

Before:

export const SalesOrder = db.type("SalesOrder", {
  orderNumber: db.string().unique(),
  customerId: db.uuid().relation({
    type: "manyToOne",
    toward: {
      type: Customer,
      as: "customer",
    },
    backward: "salesOrders",
  }),
});

After:

export const SalesOrder = db.type("SalesOrder", {
  orderNumber: db.required.string().unique(),
  customer: db.belongsTo(Customer, {
    foreignKey: "customerId",
    backref: "salesOrders",
  }),
});

Workflow job chaining

Before:

export const fulfillOrder = createWorkflowJob({
  name: "fulfill-order",
  body: async (order: OrderRequest): Promise<ConfirmationReceipt> => {
    const validatedOrder = await validateOrder.trigger(order);
    const reservation = await reserveInventory.trigger(validatedOrder);
    return await sendConfirmation.trigger(reservation);
  },
});

export default createWorkflow({
  name: "order-fulfillment",
  mainJob: fulfillOrder,
});

After:

export default createWorkflowPipeline({
  name: "order-fulfillment",
  input: orderRequest,
  steps: [
    validateOrder,
    reserveInventory,
    sendConfirmation,
  ],
});

Workflow wait point resolution

Before:

export const approvalGate = defineWaitPoint<ApprovalRequest, ApprovalDecision>(
  "approval-gate",
);

export const requestApproval = createWorkflowJob({
  name: "request-approval",
  body: async (input: ApprovalRequest) => {
    const decision = await approvalGate.wait(input);
    return await sendFinalNotification.trigger({ ...input, ...decision });
  },
});

export const submitApprovalDecision = createResolver({
  operation: "mutation",
  name: "submitApprovalDecision",
  input: {
    executionId: t.string(),
    approved: t.bool(),
    reviewer: t.string(),
    notes: t.string({ optional: true }),
  },
  body: async ({ input }) => {
    await approvalGate.resolve(input.executionId, () => ({
      approved: input.approved,
      reviewer: input.reviewer,
      notes: input.notes ?? null,
    }));
    return { resumed: true };
  },
});

After:

export const approvalGate = defineWaitPoint({
  name: "approval-gate",
  payload: approvalRequest,
  result: approvalDecision,
});

export const requestApproval = createWorkflowJob({
  name: "request-approval",
  body: async (input: ApprovalRequest) => {
    const decision = await approvalGate.wait(input);
    return await sendFinalNotification.trigger({ ...input, ...decision });
  },
});

export const submitApprovalDecision = approvalGate.createResolver({
  name: "submitApprovalDecision",
  operation: "mutation",
  mapInput: ({ approved, reviewer, notes }) => ({ approved, reviewer, notes }),
  output: { resumed: t.bool() },
});

dqn (Contributor, Author) commented May 31, 2026

A/B tested the SDK API affordance candidates with llm-challenge.

These are experiment results only; this PR now keeps SDK implementation changes out of the branch and contains only llm-challenge changes. I am omitting local-only refs and paths from the PR comment.

success is the challenge success count. duration is average run duration. steps counts trace.jsonl item.completed events. Usage-limit zero-step rows were excluded and affected variants were rerun.
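
As a rough illustration of the steps metric, the sketch below counts `item.completed` events in JSONL text. It assumes each line of trace.jsonl is a JSON object whose event name is stored in a `type` field; that field name and the helper `countCompletedItems` are assumptions made for this example, not details confirmed by the trace format.

```typescript
// Assumption: each JSONL line is an object with the event name in `type`.
function countCompletedItems(jsonl: string): number {
  let count = 0;
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    try {
      const event: unknown = JSON.parse(trimmed);
      if (
        typeof event === "object" &&
        event !== null &&
        (event as { type?: unknown }).type === "item.completed"
      ) {
        count += 1;
      }
    } catch {
      // Ignore malformed lines, e.g. a truncated final line after an interrupt.
    }
  }
  return count;
}
```

A run whose count is zero would correspond to the usage-limit zero-step rows that were excluded above.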

| Area | Baseline | After | Delta |
|------|----------|-------|-------|
| Resolver Context normalization | 3/3, 270.4s, 69.0 steps | 3/3, 212.0s, 58.3 steps | -58.4s, -10.7 steps |
| resolverOutput | 3/3, 290.5s, 76.0 steps | 3/3, 272.8s, 74.7 steps | -17.7s, -1.3 steps |
| TailorDB field helpers | 3/3, 271.8s, 68.7 steps | 3/3, 274.6s, 76.7 steps | +2.8s, +8.0 steps |
| db.belongsTo | 3/3, 286.3s, 81.7 steps | 3/3, 306.4s, 76.0 steps | +20.1s, -5.7 steps |
| createWorkflowPipeline | 2/3, 228.2s, 58.7 steps | 3/3, 226.2s, 53.0 steps | +1 success, -2.0s, -5.7 steps |
| wait point resolver helper | 3/3, 429.9s, 88.3 steps | 3/3, 375.4s, 88.3 steps | -54.5s, ±0 steps |

Caveat: the After artifacts showed full adoption of the candidate API for Resolver Context normalization, resolverOutput, TailorDB field helpers, db.belongsTo, and createWorkflowPipeline; the wait point createResolver helper was adopted in only 1 of the 3 After runs.
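
The Delta column is the After value minus the Baseline value for each metric. A small sketch of that arithmetic, using a hypothetical `VariantSummary` shape and rounding to one decimal place to match the reported precision:

```typescript
// Hypothetical summary shape; the field names are assumptions for this example.
interface VariantSummary {
  successes: number;
  durationSeconds: number;
  steps: number;
}

// Round to one decimal place so floating-point noise does not leak into the output.
const round1 = (value: number): number => Math.round(value * 10) / 10;

function delta(baseline: VariantSummary, after: VariantSummary): VariantSummary {
  return {
    successes: after.successes - baseline.successes,
    durationSeconds: round1(after.durationSeconds - baseline.durationSeconds),
    steps: round1(after.steps - baseline.steps),
  };
}

// Values taken from the createWorkflowPipeline row.
const pipelineDelta = delta(
  { successes: 2, durationSeconds: 228.2, steps: 58.7 },
  { successes: 3, durationSeconds: 226.2, steps: 53.0 },
);
// pipelineDelta is { successes: 1, durationSeconds: -2, steps: -5.7 }
```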

Note: these SDK/API affordance candidates were prototyped only in a throwaway A/B worktree — none of them are included in this PR. Whether to adopt any of them is a separate decision still to be made.

…design-reset

# Conflicts:
#	llm-challenge/package.json
#	pnpm-lock.yaml

dqn added 5 commits May 31, 2026 19:02
Extract the helpers that were copy-pasted across the runner modules into
shared modules so they cannot drift:

- src/utils.ts: toPosix, isObject, tailText, pathExists, pathExistsSync
- src/workspace-files.ts: the workspace file walker and its exclude sets

Fold runner.ts's private spawn-to-buffer helper into process.ts via a
rejectOnNonZero option, drop the dead Problem.absolutePath field, collapse
the stripDeclarationJsDoc pass-through wrapper, derive the artifact summary's
command lists from a single terminal-command pass, hoist the glob regex out
of the verification match loops, and reuse the rerun-runs report type instead
of redeclaring it.
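
For readers unfamiliar with the helpers named above, here is one plausible shape for three of them. The commit lists only the names; the signatures and behavior below are assumptions for illustration, not the contents of src/utils.ts.

```typescript
// Hypothetical implementations; only the function names come from the commit message.

/** Convert Windows-style separators to forward slashes. */
function toPosix(path: string): string {
  return path.replace(/\\/g, "/");
}

/** Narrow an unknown value to a plain (non-null, non-array) object. */
function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Keep only the last `maxChars` characters, e.g. for log excerpts. */
function tailText(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(text.length - maxChars);
}
```

Keeping such helpers in one module means a fix to, say, the separator handling applies to every runner module at once.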

Repacking the SDK tarball with macOS `tar` stores xattrs/AppleDouble
(`._*`, LIBARCHIVE.xattr) entries, which leak into the no-docs profile and
add measurement noise when solvers inspect the package. Set COPYFILE_DISABLE
for the extract and repack so the host tar omits them; GNU tar ignores it.
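
A sketch of how the environment for the tar child process might be prepared. `COPYFILE_DISABLE` is the variable named in the commit; the helper name `tarEnv` and the value `"1"` are assumptions for this example.

```typescript
type Env = Record<string, string | undefined>;

// Return a copy of the environment with COPYFILE_DISABLE set, leaving the
// caller's object untouched. Per the commit, macOS tar then omits the
// AppleDouble/xattr entries, and GNU tar ignores the variable.
function tarEnv(base: Env): Env {
  return { ...base, COPYFILE_DISABLE: "1" };
}
```

The result would be passed as the `env` option when spawning tar for both the extract and the repack step, for example `tarEnv(process.env)`.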

The minMatches field counts glob-matching files that contain the pattern,
not regex occurrences, but the name and the "matches:" observation label
read as occurrence counts. Document the file-counting semantics on the field
and relabel the observation to "matchedFiles:" so spec authors are not misled.
No behavior change: every shipped verify.json uses the default of 1, where
file-count and occurrence-count are equivalent.
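
The distinction between file counts and occurrence counts can be shown with a small sketch. The function and input shape are hypothetical; it assumes the files have already been filtered by the glob and are available as path/content pairs.

```typescript
// Hypothetical sketch of the file-counting semantics behind minMatches.
function countMatchedFiles(files: Map<string, string>, pattern: RegExp): number {
  let matchedFiles = 0;
  files.forEach((content) => {
    // A file counts once, no matter how many times the pattern occurs in it.
    if (content.search(pattern) !== -1) matchedFiles += 1;
  });
  return matchedFiles;
}

const exampleFiles = new Map<string, string>([
  ["src/a.ts", "db.type(...); db.type(...);"], // two occurrences, one file
  ["src/b.ts", "createResolver(...)"], // no occurrence
]);

const matchedFileCount = countMatchedFiles(exampleFiles, /db\.type/);
// matchedFileCount is 1: minMatches of 1 would be satisfied, minMatches of 2 would not.
```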

The A/B workflow recorded `success` as "the challenge report run result" and
contemplated solverExitCode=0 yet success=false, but report.json has no
`success` field, so the value could not be derived mechanically. Spell out the
computation from existing artifacts — solver completion in report.json plus
no unsatisfied/error check in the run's verification-summary.json — so the
recorded value matches reality without changing the report schema.
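
The computation described here could look roughly like the sketch below. `solverExitCode` is the name used in the commit message; `timedOut`, `checks`, and `status` are assumed field names, and the real report.json and verification-summary.json shapes may differ.

```typescript
// Field names other than solverExitCode are assumptions for illustration.
interface RunReport {
  solverExitCode: number | null;
  timedOut: boolean;
}

type CheckStatus = "satisfied" | "unsatisfied" | "error";

interface VerificationSummary {
  checks: { name: string; status: CheckStatus }[];
}

function deriveSuccess(run: RunReport, verification: VerificationSummary): boolean {
  // Solver completion comes from the run report.
  const solverCompleted = run.solverExitCode === 0 && !run.timedOut;
  // Verification passes only if no check is unsatisfied or errored.
  const allChecksPassed = verification.checks.every(
    (check) => check.status !== "unsatisfied" && check.status !== "error",
  );
  return solverCompleted && allChecksPassed;
}
```

This covers the case the commit calls out: a run with solverExitCode=0 is still recorded as unsuccessful when any verification check is unsatisfied.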

With concurrency >= 2 each worker wrote the shared report snapshot
independently, so a slow older write could land after a newer one and leave
report.json missing completed runs if the process was interrupted before the
final reconciling write — breaking later analysis and --rerun-nonzero-from.
Chain the writes through a single promise; since report only grows, the
on-disk file stays monotonic and never regresses to a stale snapshot.
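
A minimal sketch of chaining writes through a single promise. The `createSerializedWriter` name and the `write` callback are hypothetical stand-ins for whatever persists report.json; the sketch only illustrates the ordering guarantee the commit describes.

```typescript
// Hypothetical sketch; `write` stands in for whatever persists report.json.
function createSerializedWriter<T>(write: (snapshot: T) => Promise<void>) {
  let tail: Promise<void> = Promise.resolve();

  return (snapshot: T): Promise<void> => {
    // Each write starts only after the previous one settles, so writes land
    // in call order and an older snapshot cannot overwrite a newer one.
    const next = tail.then(() => write(snapshot));
    // Keep the chain usable even if this write fails.
    tail = next.catch(() => undefined);
    return next;
  };
}
```

Each worker would call the returned function with the current report snapshot; because the report only grows, the last write to land is always the most complete one.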

…design-reset

Resolve the pnpm-lock.yaml conflict by aligning llm-challenge's oxlint and
oxlint-tsgolint to the workspace versions main bumped them to (1.66.0 /
0.23.0), then regenerate the lockfile so the hoisted linker resolves a single
oxlint version. llm-challenge lint/typecheck/test pass on oxlint 1.66.0.
dqn marked this pull request as ready for review June 1, 2026 04:46
dqn requested review from remiposo and toiroakr as code owners June 1, 2026 04:46
github-actions (Bot) commented Jun 1, 2026

Code Metrics Report (packages/sdk)

|                    | main (aa19bc1) | #1225 (4882758) | +/-  |
|--------------------|----------------|-----------------|------|
| Coverage           |          64.2% |           64.2% | 0.0% |
| Code to Test Ratio |          1:0.4 |           1:0.4 |  0.0 |
Details
  |                    | main (aa19bc1) | #1225 (4882758) | +/-  |
  |--------------------|----------------|-----------------|------|
  | Coverage           |          64.2% |           64.2% | 0.0% |
  |   Files            |            377 |             377 |    0 |
  |   Lines            |          13114 |           13114 |    0 |
  |   Covered          |           8428 |            8428 |    0 |
  | Code to Test Ratio |          1:0.4 |           1:0.4 |  0.0 |
  |   Code             |          87299 |           87299 |    0 |
  |   Test             |          37390 |           37390 |    0 |

SDK Configure Bundle Size

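|                        | main (aa19bc1) | #1225 (4882758) | +/- |
|------------------------|----------------|-----------------|-----|
| configure-index-size   |           18KB |            18KB | 0KB |
| dependency-chunks-size |        33.52KB |         33.52KB | 0KB |
| total-bundle-size      |        51.51KB |         51.51KB | 0KB |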

Runtime Performance

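|                    | main (aa19bc1) | #1225 (4882758) | +/-   |
|--------------------|----------------|-----------------|-------|
| Generate Median    |        2,809ms |         2,802ms |  -7ms |
| Generate Max       |        2,913ms |         2,840ms | -73ms |
| Apply Build Median |        2,835ms |         2,839ms |   4ms |
| Apply Build Max    |        2,905ms |         2,891ms | -14ms |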

Type Performance (instantiations)

|                             | main (aa19bc1) | #1225 (4882758) | +/- |
|-----------------------------|----------------|-----------------|-----|
| tailordb-basic              |         35,133 |          35,133 |   0 |
| tailordb-optional           |          3,841 |           3,841 |   0 |
| tailordb-relation           |          7,428 |           7,428 |   0 |
| tailordb-validate           |          2,566 |           2,566 |   0 |
| tailordb-hooks              |          5,767 |           5,767 |   0 |
| tailordb-object             |         12,136 |          12,136 |   0 |
| tailordb-enum               |          2,462 |           2,462 |   0 |
| resolver-basic              |          9,424 |           9,424 |   0 |
| resolver-nested             |         26,111 |          26,111 |   0 |
| resolver-array              |         18,187 |          18,187 |   0 |
| executor-schedule           |          4,234 |           4,234 |   0 |
| executor-webhook            |            873 |             873 |   0 |
| executor-record             |          8,166 |           8,166 |   0 |
| executor-resolver           |          4,369 |           4,369 |   0 |
| executor-operation-function |            868 |             868 |   0 |
| executor-operation-gql      |            869 |             869 |   0 |
| executor-operation-webhook  |            888 |             888 |   0 |
| executor-operation-workflow |          1,714 |           1,714 |   0 |

Reported by octocov

toiroakr (Contributor) left a comment

Let's put it in for now.

toiroakr merged commit 8a67af2 into main Jun 1, 2026
55 checks passed
toiroakr deleted the docs/llm-challenge-redesign-reset branch June 1, 2026 06:09
dqn (Contributor, Author) commented Jun 3, 2026

LLM challenge measurement: TailorDB fluent API vs descriptor API

I added paired TailorDB schema problems to compare the current fluent API with the PR #905 descriptor/createTable API and ran 3 attempts for each style.

Context:

  • SDK ref: PR feat(tailordb,resolver)!: object-literal descriptor API and record-level hooks/validate #905 head (origin/feat/object-literal-descriptor-api, 532df7f)
  • Problems: same catalog-schema task, with prompts constrained to each API style
  • Runner note: the local Podman preflight failed, so this was measured with the host runner path while reusing the llm-challenge workspace setup, verification, report schema, and trace artifacts
  • Success means: solver exit code 0, not timed out, and no unsatisfied/error verification checks

| API style | Runs | Successful | Avg duration | Avg steps | Avg output tokens | Main API-specific observation |
|-----------|------|------------|--------------|-----------|-------------------|-------------------------------|
| Fluent API (db.type, db.*) | 3 | 3/3 | 262.2s | 67.0 | 9,397 | No API-specific misconception observed. Runs converged on db.type, db.decimal({ scale: 2 }), db.uuid().relation(...), and db.fields.timestamps(). |
| Descriptor API (createTable, field descriptors) | 3 | 3/3 | 287.8s | 67.0 | 10,148 | The main friction was relation-field shape discovery. Solvers looked for/considered a relation descriptor shape, but the correct descriptor usage is a UUID field with a relation property, e.g. { kind: "uuid", relation: ... }. |

Observed delta, descriptor minus fluent:

| Metric | Delta |
|--------|-------|
| Successful runs | +0 |
| Avg duration | +25.6s |
| Avg steps | +0.0 |
| Avg output tokens | +751 |

Notes:

  • Both API styles produced final artifacts that satisfied the checks.
  • The descriptor-specific issue was an intermediate discovery cost rather than a remaining incorrect final answer.
  • Non-SDK noise seen in traces included docs lookup misses and CLI-name exploration; I did not count those as API-specific mistakes.
