feat(llm-challenge): rebuild evidence collector#1225
Each run was creating its own pnpm store under results/<runId>/.shared/pnpm-store and never deleting the per-problem node_modules unless --prune-workspace-deps was passed, leaving tens of GB on disk after a handful of runs.
- Persist the pnpm store at llm-challenge/.cache/pnpm-store so packages are hardlinked across runs instead of being re-downloaded per run.
- Wrap each problem's task body in try/finally so pruneWorkspaceDeps runs even when the solver fails or is interrupted; prune errors are logged so they can't mask the original failure.
- Flip the default to on and rename the flag to --no-prune-workspace-deps so debugging is opt-in.
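The try/finally behavior described in this commit can be sketched as follows. This is a minimal illustration, not the runner's actual code: `runWithPrune`, `Task`, `Prune`, and the `log` parameter are hypothetical names, and the real `pruneWorkspaceDeps` signature is not shown here.

```typescript
// Hypothetical shapes: the real task and prune signatures are not shown in this PR.
type Task<T> = () => Promise<T>;
type Prune = () => Promise<void>;

// Runs the task and always prunes afterwards. A prune failure is logged and
// swallowed, so it cannot replace the error raised by the task itself.
async function runWithPrune<T>(
  task: Task<T>,
  prune: Prune,
  log: (message: string) => void = console.error,
): Promise<T> {
  try {
    return await task();
  } finally {
    try {
      await prune();
    } catch (pruneError) {
      log(`prune failed: ${String(pruneError)}`);
    }
  }
}
```

Catching inside the `finally` block is the design point: an exception thrown from `finally` would otherwise override the solver's original error.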
llm-challenge artifact analysis: SDK/API improvement signals
Scope: this comment lists API affordance candidates, not SDK usage mistakes. The observations are based on trace/work evidence from the latest llm-challenge artifacts, but local artifact paths are intentionally omitted. In the examples, Before means current correct SDK/API usage or workaround; After means an illustrative public SDK/API affordance candidate, not an implementation contract.
Resolver Context normalization
Before:
type CallerType = "anonymous" | "user" | "machine_user";
type InvokerType = "none" | "user" | "machine_user";
function presentId(id: string | undefined): string | null {
return id && id !== anonymousId ? id : null;
}
function callerType(type: string): CallerType {
return type === "user" || type === "machine_user" ? type : "anonymous";
}
function invokerType(type: string | undefined): InvokerType {
return type === "user" || type === "machine_user" ? type : "none";
}
body: ({ user, invoker, env }) => ({
caller: {
id: presentId(user.id),
type: callerType(user.type),
workspaceId: presentId(user.workspaceId),
},
request: {
hasInvoker: invoker != null,
invokerId: invoker?.id ?? null,
invokerType: invokerType(invoker?.type),
},
environment: {
summaryLabel: String(env.SUMMARY_LABEL ?? "unset"),
},
})
After:
body: ({ user, invoker, env }) => {
const context = resolverContext({ user, invoker, env });
return {
caller: context.callerSummary(),
request: context.invokerSummary({ noneType: "none" }),
environment: {
summaryLabel: context.env.string("SUMMARY_LABEL", "unset"),
},
};
}
Resolver output schema builder discoverability
Before:
const inventoryDashboardOutput = t.object({
summary: t
.object({
totalItems: t.int(),
lowStockItems: t.int(),
reorderNeeded: t.bool(),
})
.typeName("InventoryDashboardSummary"),
items: t
.object(
{
itemId: t.string(),
sku: t.string(),
stockStatus: t.enum(["inStock", "lowStock", "outOfStock"]),
},
{ array: true },
)
.typeName("InventoryDashboardItemRow"),
}).typeName("InventoryDashboardResult");
export type InventoryDashboardResult = t.output<typeof inventoryDashboardOutput>;
After:
const inventoryDashboardOutput = resolverOutput("InventoryDashboardResult", (schema) => ({
summary: schema.object("InventoryDashboardSummary", {
totalItems: schema.int(),
lowStockItems: schema.int(),
reorderNeeded: schema.bool(),
}),
items: schema.arrayOf("InventoryDashboardItemRow", {
itemId: schema.string(),
sku: schema.string(),
stockStatus: schema.enum("InventoryStockStatus", [
"inStock",
"lowStock",
"outOfStock",
]),
}),
}));
export type InventoryDashboardResult = typeof inventoryDashboardOutput.Output;
TailorDB required/unique/enum field intent
Before:
export const CustomerAccount = db.type("CustomerAccount", {
customerId: db.uuid().unique(),
contactEmail: db.string().unique(),
displayName: db.string(),
accountTier: db.enum([
{ value: "free", description: "Free account" },
{ value: "pro", description: "Professional account" },
{ value: "enterprise", description: "Enterprise account" },
]),
signupTime: db.datetime(),
});
After:
export const CustomerAccount = db.type("CustomerAccount", {
customerId: db.required.uuid().unique(),
contactEmail: db.required.email().unique(),
displayName: db.required.string(),
accountTier: db.required.enum("AccountTier", [
["free", "Free account"],
["pro", "Professional account"],
["enterprise", "Enterprise account"],
]),
signupTime: db.required.datetime(),
});
TailorDB relation naming
Before:
export const SalesOrder = db.type("SalesOrder", {
orderNumber: db.string().unique(),
customerId: db.uuid().relation({
type: "manyToOne",
toward: {
type: Customer,
as: "customer",
},
backward: "salesOrders",
}),
});
After:
export const SalesOrder = db.type("SalesOrder", {
orderNumber: db.required.string().unique(),
customer: db.belongsTo(Customer, {
foreignKey: "customerId",
backref: "salesOrders",
}),
});
Workflow job chaining
Before:
export const fulfillOrder = createWorkflowJob({
name: "fulfill-order",
body: async (order: OrderRequest): Promise<ConfirmationReceipt> => {
const validatedOrder = await validateOrder.trigger(order);
const reservation = await reserveInventory.trigger(validatedOrder);
return await sendConfirmation.trigger(reservation);
},
});
export default createWorkflow({
name: "order-fulfillment",
mainJob: fulfillOrder,
});
After:
export default createWorkflowPipeline({
name: "order-fulfillment",
input: orderRequest,
steps: [
validateOrder,
reserveInventory,
sendConfirmation,
],
});
Workflow wait point resolution
Before:
export const approvalGate = defineWaitPoint<ApprovalRequest, ApprovalDecision>(
"approval-gate",
);
export const requestApproval = createWorkflowJob({
name: "request-approval",
body: async (input: ApprovalRequest) => {
const decision = await approvalGate.wait(input);
return await sendFinalNotification.trigger({ ...input, ...decision });
},
});
export const submitApprovalDecision = createResolver({
operation: "mutation",
name: "submitApprovalDecision",
input: {
executionId: t.string(),
approved: t.bool(),
reviewer: t.string(),
notes: t.string({ optional: true }),
},
body: async ({ input }) => {
await approvalGate.resolve(input.executionId, () => ({
approved: input.approved,
reviewer: input.reviewer,
notes: input.notes ?? null,
}));
return { resumed: true };
},
});
After:
export const approvalGate = defineWaitPoint({
name: "approval-gate",
payload: approvalRequest,
result: approvalDecision,
});
export const requestApproval = createWorkflowJob({
name: "request-approval",
body: async (input: ApprovalRequest) => {
const decision = await approvalGate.wait(input);
return await sendFinalNotification.trigger({ ...input, ...decision });
},
});
export const submitApprovalDecision = approvalGate.createResolver({
name: "submitApprovalDecision",
operation: "mutation",
mapInput: ({ approved, reviewer, notes }) => ({ approved, reviewer, notes }),
output: { resumed: t.bool() },
});
A/B tested the SDK API affordance candidates with These are experiment results only; this PR now keeps SDK implementation changes out of the branch and contains
Caveat: after artifacts showed full adoption for Resolver Context normalization,
Note: these SDK/API affordance candidates were prototyped only in a throwaway A/B worktree; none of them are included in this PR. Whether to adopt any of them is a separate decision still to be made.
…design-reset # Conflicts: # llm-challenge/package.json # pnpm-lock.yaml
Extract the helpers that were copy-pasted across the runner modules into shared modules so they cannot drift:
- src/utils.ts: toPosix, isObject, tailText, pathExists, pathExistsSync
- src/workspace-files.ts: the workspace file walker and its exclude sets
Fold runner.ts's private spawn-to-buffer helper into process.ts via a rejectOnNonZero option, drop the dead Problem.absolutePath field, collapse the stripDeclarationJsDoc pass-through wrapper, derive the artifact summary's command lists from a single terminal-command pass, hoist the glob regex out of the verification match loops, and reuse the rerun-runs report type instead of redeclaring it.
Repacking the SDK tarball with macOS `tar` stores xattrs/AppleDouble (`._*`, LIBARCHIVE.xattr) entries, which leak into the no-docs profile and add measurement noise when solvers inspect the package. Set COPYFILE_DISABLE for the extract and repack so the host tar omits them; GNU tar ignores it.
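A minimal sketch of the environment change, assuming the runner spawns the host `tar` as a child process. The helper name `tarEnv` and the `Env` type are hypothetical; only the `COPYFILE_DISABLE` variable comes from the commit message.

```typescript
// Hypothetical helper: builds the environment used for both tar invocations.
type Env = Record<string, string | undefined>;

function tarEnv(baseEnv: Env): Env {
  // COPYFILE_DISABLE=1 asks the macOS tar not to store AppleDouble `._*`
  // entries or xattrs; per the commit message, GNU tar ignores the variable.
  // The input object is copied, not mutated.
  return { ...baseEnv, COPYFILE_DISABLE: "1" };
}
```

The result would be passed as the `env` option of the spawn call for both the extract and the repack step, for example `spawn("tar", args, { env: tarEnv(process.env) })`.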
The minMatches field counts glob-matching files that contain the pattern, not regex occurrences, but the name and the "matches:" observation label read as occurrence counts. Document the file-counting semantics on the field and relabel the observation to "matchedFiles:" so spec authors are not misled. No behavior change: every shipped verify.json uses the default of 1, where file-count and occurrence-count are equivalent.
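The file-counting semantics can be illustrated with a small sketch. The input shape (`MatchedFiles`, a map from glob-matched paths to file content) and both function names are hypothetical; the actual verifier code is not shown in this PR text.

```typescript
// Hypothetical input shape: glob-matched file paths mapped to their text content.
type MatchedFiles = Record<string, string>;

// Counts files that contain the pattern at least once, not total occurrences.
function countMatchedFiles(files: MatchedFiles, pattern: RegExp): number {
  // Drop the `g` and `y` flags so `test` keeps no lastIndex state between files.
  const probe = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  return Object.values(files).filter((content) => probe.test(content)).length;
}

// A check is satisfied when enough files match; the default of 1 mirrors the commit message.
function satisfiesMinMatches(files: MatchedFiles, pattern: RegExp, minMatches = 1): boolean {
  return countMatchedFiles(files, pattern) >= minMatches;
}
```

With this reading, a single file containing the pattern three times counts as one match, so `minMatches: 3` would not be satisfied by it.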
The A/B workflow recorded `success` as "the challenge report run result" and contemplated solverExitCode=0 yet success=false, but report.json has no `success` field, so the value could not be derived mechanically. Spell out the computation from existing artifacts — solver completion in report.json plus no unsatisfied/error check in the run's verification-summary.json — so the recorded value matches reality without changing the report schema.
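One way to read the computation described above, as a sketch only. The field names (`solverExitCode` on the run entry, `checks` and `status` on the summary) and the assumption that "solver completion" means exit code 0 are illustrative; the real report.json and verification-summary.json schemas are not reproduced here.

```typescript
// Hypothetical field names; the real artifact schemas may differ.
type CheckStatus = "satisfied" | "unsatisfied" | "error";

interface RunReportEntry {
  solverExitCode: number | null; // null if the solver never finished
}

interface VerificationSummary {
  checks: { id: string; status: CheckStatus }[];
}

// success = solver completed AND no check is unsatisfied or errored.
function deriveSuccess(run: RunReportEntry, summary: VerificationSummary): boolean {
  const solverCompleted = run.solverExitCode === 0; // assumption: completion means exit code 0
  const allChecksPass = summary.checks.every((check) => check.status === "satisfied");
  return solverCompleted && allChecksPass;
}
```

Under these assumptions, a run with `solverExitCode` 0 and one unsatisfied check yields `success = false`, which is the case the commit message calls out.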
With concurrency >= 2 each worker wrote the shared report snapshot independently, so a slow older write could land after a newer one and leave report.json missing completed runs if the process was interrupted before the final reconciling write — breaking later analysis and --rerun-nonzero-from. Chain the writes through a single promise; since report only grows, the on-disk file stays monotonic and never regresses to a stale snapshot.
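The promise-chaining approach can be sketched like this. `createSerializedWriter` and `WriteSnapshot` are hypothetical names, and the injected `write` callback stands in for whatever actually writes report.json.

```typescript
// Hypothetical writer signature; in the runner this would persist report.json.
type WriteSnapshot = (snapshot: string) => Promise<void>;

// Returns an enqueue function that runs writes strictly one after another,
// in the order they were requested.
function createSerializedWriter(write: WriteSnapshot): (snapshot: string) => Promise<void> {
  let tail: Promise<void> = Promise.resolve();
  return (snapshot) => {
    // `tail` never rejects (see below), so this starts only after the previous write settles.
    const next = tail.then(() => write(snapshot));
    // Swallow the error on the chain so one failed write does not block later ones;
    // the caller still observes the failure through `next`.
    tail = next.catch(() => undefined);
    return next;
  };
}
```

Because each snapshot is a superset of the previous one and writes land in request order, the last write to finish is always the newest snapshot.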
…design-reset Resolve the pnpm-lock.yaml conflict by aligning llm-challenge's oxlint and oxlint-tsgolint to the workspace versions main bumped them to (1.66.0 / 0.23.0), then regenerate the lockfile so the hoisted linker resolves a single oxlint version. llm-challenge lint/typecheck/test pass on oxlint 1.66.0.
Code Metrics Report (packages/sdk)
| | main (aa19bc1) | #1225 (4882758) | +/- |
|--------------------|----------------|-----------------|------|
| Coverage | 64.2% | 64.2% | 0.0% |
| Files | 377 | 377 | 0 |
| Lines | 13114 | 13114 | 0 |
| Covered | 8428 | 8428 | 0 |
| Code to Test Ratio | 1:0.4 | 1:0.4 | 0.0 |
| Code | 87299 | 87299 | 0 |
| Test | 37390 | 37390 | 0 |
SDK Configure Bundle Size
Runtime Performance
Type Performance (instantiations)
Reported by octocov
toiroakr
left a comment
Let's put it in for now.
LLM challenge measurement: TailorDB fluent API vs descriptor API
I added paired TailorDB schema problems to compare the current fluent API with the PR #905 descriptor/createTable API and ran 3 attempts for each style.
Context:
Observed delta, descriptor minus fluent:
Notes:
Summary
Rebuild `llm-challenge` as a small evidence collector for SDK affordance work, based on the new rebuild brief. The tool now records reproducible Codex runs and artifacts without grading, scoring, reference solutions, or trend analysis.
Main changes
- `pnpm -C llm-challenge challenge run` command that discovers problems, packs SDK refs, applies profiles, prepares workspaces, runs Codex in Podman, and writes `report.json`.
- `sdk-api` and `cli` groups.
Notes