v2.0.0 by bucurdavid · Pull Request #43 · fastxyz/skill-optimizer

bucurdavid · 2026-05-04T20:25:02Z

No description provided.

* chore: ignore .worktrees/ directory

* feat: stable task IDs + optimizer loop diagram in README

  - fix(tasks): derive task IDs from sha1(action names) instead of the LLM-supplied id field, which changed on every regeneration and broke --task filters. Action names come from the discovered surface and are stable across runs when the surface hasn't changed. Duplicate action-name sets get a -1/-2 numeric suffix.
  - docs: add horizontal optimizer-loop SVG diagram to README top, showing the full init → baseline → iterate (analyze/mutate/re-benchmark/accept-reject) → output flow at a glance.

  Closes #17

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add PNG version of optimizer loop diagram

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: use SVG in README (PNG kept as companion file)

  SVG renders natively on GitHub and scales without pixelation. PNG is included alongside as a companion for external use cases (email, Office docs, tools that don't render SVG).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Copilot review on PR #24

  - Delete SAFE_TASK_ID / isSafeTaskId from src/tasks/generate.ts, dead after the stable-ID refactor (still used in src/benchmark/config.ts for external task-file validation, so that copy stays).
  - Extend stable task IDs to the prompt surface: fall back to a SHA-1 hash of the prompt text when expected_actions is empty, so prompt-surface --task filters survive regeneration.
  - Rewrite four README links (optimizer-loop.svg, docs/reference/*.md, CONTRIBUTING.md) to absolute github.com URLs; docs/ and CONTRIBUTING.md are not in the npm tarball's files field, so relative paths 404 when users view the README on npmjs.com.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: final 1.1.0 polish - CHANGELOG, error message, README consistency

  - CHANGELOG.md: fill in all additions and fixes that landed on development after the initial 1.1.0 bump (stable task IDs, Codex auth, SKILL folder, diagram, model-ID slug overhaul, error message).
  - src/errors.ts + docs/reference/errors.md: fix E_MODEL_ID_FORMAT: was "missing the openrouter/ prefix"; now lists all three valid provider prefixes (openrouter/, anthropic/, openai/).
  - README.md: use catalog-correct openrouter/google/gemini-2.5-flash (dot, not hyphen) in answers.json example; change "skill-optimizer benchmark" to "skill-optimizer run" for consistency with other examples.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): stable dedup suffix order + accurate CHANGELOG wording

  Sort validated tasks by (id, prompt) before the dedup counter loop so that numeric suffixes assigned to same-action-hash tasks are determined by content order, not LLM output order. Previously, if the model regenerated two create_wallet tasks in swapped order, the -1/-2 suffixes would swap between runs, making --task filters unstable for multi-variant cases.

  Also soften the CHANGELOG entry for "stable task IDs": clarifies that SDK/CLI/MCP IDs are stable across regenerations (action names come from discovered code), while prompt-surface IDs are only stable when the LLM produces identical wording.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
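The stable-ID scheme described in these commits can be sketched roughly as follows. This is a minimal illustration, not the code in src/tasks/generate.ts: the `Task` shape, the 12-character truncation, the sorting of action names before hashing, and the function names `baseTaskId` and `assignStableIds` are all assumptions made for the example. Only the overall idea (SHA-1 of the action names, prompt-text fallback, sort by (id, prompt) before assigning -1/-2 suffixes) comes from the commit messages.

```typescript
import { createHash } from "node:crypto";

// Hypothetical task shape; field names are illustrative only.
interface Task {
  prompt: string;
  expectedActions: string[];
}

const cmp = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

// Hash the action names; fall back to the prompt text when there are none
// (the prompt-surface case). Sorting the names and truncating to 12 hex
// characters are assumptions of this sketch.
function baseTaskId(task: Task): string {
  const source =
    task.expectedActions.length > 0
      ? [...task.expectedActions].sort().join(",")
      : task.prompt;
  return createHash("sha1").update(source).digest("hex").slice(0, 12);
}

// Sort by (id, prompt) first so that suffixes depend on task content,
// not on the order in which the LLM happened to emit the tasks.
function assignStableIds(tasks: Task[]): { id: string; task: Task }[] {
  const keyed = tasks
    .map((task) => ({ id: baseTaskId(task), task }))
    .sort((a, b) => cmp(a.id, b.id) || cmp(a.task.prompt, b.task.prompt));

  // Count how many tasks share each base ID.
  const totals = new Map<string, number>();
  for (const { id } of keyed) totals.set(id, (totals.get(id) ?? 0) + 1);

  // Only colliding IDs receive a numeric suffix.
  const seen = new Map<string, number>();
  return keyed.map(({ id, task }) => {
    if ((totals.get(id) ?? 0) < 2) return { id, task };
    const n = (seen.get(id) ?? 0) + 1;
    seen.set(id, n);
    return { id: `${id}-${n}`, task };
  });
}
```

With this arrangement, two create_wallet tasks receive the same suffixes whether the model emits them as [A, B] or [B, A], which is the property a --task filter relies on.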

* Update README with Fast team and payment info

  Added information about the Fast team and payment infrastructure for AI agents. Requested by Jessy.

* Update README.md

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: dmn <damian.ovidiu27@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: exempt openai/ model IDs from dot→hyphen rewrite

  OpenAI's direct-API model IDs use dots in version numbers (gpt-5.4, gpt-4.1). fix.ts already exempted openrouter/ but not openai/, so a manufactured model-id-bad-format issue would corrupt an openai/ ID. Defense-in-depth: validate.ts already skips emitting the issue for openai/, but fix.ts must independently respect the documented invariant in CLAUDE.md.

  Also wires smoke-model-ids.ts into npm test; it existed on disk but was not in the test script.

* chore: remove unused src/discovery/prompt.ts and its tests

  The active prompt discovery lives in src/project/discover-prompt.ts, imported by snapshot.ts and benchmark/runner.ts. src/discovery/prompt.ts was a dead parallel implementation, only referenced by its own test file. Removing both the dead module and its test file to keep the codebase lean. The live discover-prompt.ts API (discoverPromptCapabilities) is covered by other smoke tests in this PR.

* fix: prompt surface not blocked by coverage violation

  Prompt-surface tasks don't guarantee 1:1 capability coverage the way SDK/CLI/MCP tasks do, so coverageViolation=true was hard-FAILing every prompt benchmark regardless of actual scores. computeVerdict now only appends the coverage-violation reason when config.surface !== 'prompt'. Coverage is still computed and appears in the report.

  Regression guard in new smoke-verdict-prompt.ts locks in both halves of the behavior: prompt PASSes with coverageViolation=true + scores above floor; mcp still FAILs under identical conditions.

* refactor: extract resolveCriteriaForTask from runner

  Pure refactor. Moves the caps→criteria lookup out of runner.ts into src/benchmark/prompt-criteria.ts so it can be unit-tested without running the full LLM pipeline. Behavior is unchanged in this commit: tasks missing capabilityId are logged as eval errors (FAIL with message) rather than silently vacuously passing; capabilityId tagging by the generator lands in a later commit.

  Adds optional capabilityId to GeneratedTask (SDK/CLI/MCP generators don't set it). Runtime enforcement (throw on missing/unknown) lives in resolveCriteriaForTask: no silent fallback, per the no-legacy-compat policy.

  New smoke-prompt-criteria.ts locks in: match, distinct-per-capability, throws-on-unknown, throws-on-missing, and noActiveCriteria flagging.

* fix: evaluator flags noActiveCriteria instead of vacuous 1.0 pass

  Previously, when every criteria category was empty the evaluator returned score: 1.0, so any response (including an empty string) scored a perfect pass. Now the evaluator returns score: 0 and noActiveCriteria: true. The runner treats that flag as an evaluation error with an actionable message pointing at the SKILL.md section for the offending capability.

  Evaluator stays dumb (no pass/fail policy). Runner is the policy layer.

* feat: per-capability prompt scoring via capabilityId

  The runner's caps[0] global was collapsing every prompt-surface task into evaluation against the first discovered capability regardless of what the task actually exercised. With this commit, each generated prompt-surface task is tagged at generation time with the action key of the capability it exercises, and the runner looks up criteria per-task via resolveCriteriaForTask (wired in previous commits).

  No legacy compat. Prompt-surface tasks lacking capabilityId fail to load; users regenerate with `skill-optimizer generate-tasks`.

  Regression guards:
  - smoke-generation.ts: valid tagging plus rejection of unknown ids
  - smoke-verdict-prompt.ts: three caps produce distinct criteria (caps[0]-collapse detector), P3 regression guard via evaluator, and mock-LLM verdict matrix (threshold + weight math)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: v1.1.0 correctness fixes

  CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0. README prompt templates section reflects per-capability scoring. SKILL.md audit for any guidance contradicting the fixed behavior.

* test: release-readiness coverage check

  smoke-changelog-coverage.ts parses the top block of CHANGELOG.md and asserts every item in Added/Fixed has at least one test file referencing relevant keywords. Guards against 'shipped feature, forgot the test', the class that let P1/P2/P3 slip past v1.1.0 before this PR.

  smoke-release.ts also gains an assertion that the CHANGELOG contains a section header matching the current package.json version.

* fix: plumb capabilityId through TaskDefinition and tighten changelog coverage check

  - Add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition to read and pass it through; without this, resolveCriteriaForTask threw on every prompt-surface task because loadTasks silently dropped the field
  - smoke-changelog-coverage: require ≥2 tokens to co-occur in a single test file (whole-word match) instead of any one token anywhere in the corpus; this prevents false-passes on generic words like "prompt" or "coverage"
  - generate.ts comment: clarify that parseGeneratedTasks attaches capabilityId; membership validation is in ground.ts, not here

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: output capability criteria and prompt coverage computation

  - discover-prompt.ts: _output capabilities now store section: section.body (full markdown with fences) instead of section: snippet (extracted content without fences). generateCriteriaFromCapability requires fences to extract format patterns, so passing bare snippet produced empty criteria and forced noActiveCriteria/FAIL for every output-format task.
  - coverage.ts: actionNamesOf falls back to capabilityId when expected_actions is empty; prompt tasks always have expected_actions:[] so coverage showed 0/N covered. capabilityId matches action.name for prompt capabilities (key===name in capabilityToAction), so this correctly attributes coverage.
  - Tests: regression guards added in smoke-prompt-criteria and smoke-coverage.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
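The split between a "dumb" evaluator and a policy layer, together with the prompt-surface exemption from coverage violations, might look roughly like the sketch below. It is a simplified illustration under stated assumptions: criteria are reduced to a flat list of required phrases, and the types, parameter names, and the signature given to `computeVerdict` here are hypothetical rather than the project's actual API. Only the behavior (score 0 plus noActiveCriteria when there are no criteria, and no coverage-violation reason when the surface is 'prompt') is taken from the commit messages.

```typescript
type Surface = "sdk" | "cli" | "mcp" | "prompt";

interface EvalResult {
  score: number;
  noActiveCriteria: boolean;
}

// Evaluator: reports facts only, applies no pass/fail policy.
// Criteria are simplified to required phrases for this sketch.
function evaluate(response: string, requiredPhrases: string[]): EvalResult {
  if (requiredPhrases.length === 0) {
    // Nothing to check: flag it instead of returning a vacuous 1.0.
    return { score: 0, noActiveCriteria: true };
  }
  const hits = requiredPhrases.filter((p) => response.includes(p)).length;
  return { score: hits / requiredPhrases.length, noActiveCriteria: false };
}

// Hypothetical verdict input; field names are assumptions.
interface VerdictInput {
  surface: Surface;
  averageScore: number;
  floor: number;
  coverageViolation: boolean;
}

// Policy layer: a coverage violation only fails non-prompt surfaces.
function computeVerdict(input: VerdictInput): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (input.averageScore < input.floor) {
    reasons.push(`average score ${input.averageScore} is below floor ${input.floor}`);
  }
  if (input.coverageViolation && input.surface !== "prompt") {
    reasons.push("coverage violation");
  }
  return { pass: reasons.length === 0, reasons };
}
```

Keeping the flag on the evaluator result and the decision in the caller means the same evaluator output can be treated as an error by the runner without the evaluator needing to know about surfaces or thresholds.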

… error quality) (#30)

* docs: add implementation plan for PR #26 review fixes (13 issues)

* fix(preflight): exempt prompt surface from maxTasks check; surface-aware discovery hints

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(init): add prompt surface next-steps guidance

* fix(wizard): accept anthropic/ and openai/ model IDs in custom model validator

* fix(generate): allow missing expected_actions on prompt surface in validateTask

  When `knownCapabilityKeys` is defined (prompt surface), LLMs may omit `expected_actions` entirely even though the prompt requests an empty array. Fall back to `[]` instead of throwing, so task generation is not blocked.

* fix(docs): correct config path to .skill-optimizer/ and update stale model ID

  Replace all occurrences of `skill-optimizer/skill-optimizer.json` (without dot) with `.skill-optimizer/skill-optimizer.json` (with dot) to match the actual path written by `src/init/scaffold.ts`. Also update stale `openrouter/openai/gpt-4o` model ID in `SKILL/references/setup.md` to `openrouter/openai/gpt-4o-mini`.

* fix(snapshot): include snapshotPath in unsupported-format error message

* fix(runner): set toolPrecision=1.0 for prompt surface tasks

* fix(docs): correct apiKeyEnv description and loop.ts agent cwd comment

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(generate): guard generateCandidateTasksWithCoverage against prompt surface

* fix(tasks): replace brittle string match with NoTextBlocksError class

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
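Replacing a string match on an error message with a dedicated error class, as the last fix above does, generally follows the pattern sketched here. Only the class name `NoTextBlocksError` comes from the commit; the `ContentBlock` shape, the helper names `extractText` and `describeFailure`, and the messages are hypothetical and chosen purely for illustration.

```typescript
// Dedicated error type: callers can test with instanceof instead of
// comparing message strings, which break when the wording changes.
class NoTextBlocksError extends Error {
  constructor(message = "LLM response contained no text blocks") {
    super(message);
    this.name = "NoTextBlocksError";
    // Keeps instanceof working even when compiling to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical response block shape, for illustration only.
interface ContentBlock {
  type: string;
  text?: string;
}

// Collect the text blocks, or throw the typed error when there are none.
function extractText(blocks: ContentBlock[]): string {
  const texts: string[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      texts.push(block.text);
    }
  }
  if (texts.length === 0) throw new NoTextBlocksError();
  return texts.join("\n");
}

// Callers branch on the error type, not on its message.
function describeFailure(err: unknown): string {
  return err instanceof NoTextBlocksError
    ? "model returned no text; consider retrying"
    : "unexpected error";
}
```

A plain `Error` whose message happens to mention "no text blocks" is no longer misclassified, which is the brittleness the string match had.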

…ard, doc fixes

- fix(anthropic): apply tool-name codec to sanitize dotted tool names
- fix(runner): skip prompt eval when model call failed; set toolPrecision=1.0
- fix(docs): correct .skill-optimizer path, optimize.model default, and schema description

Reflects that the default API key env var is determined by model provider prefix (openrouter/ → OPENROUTER_API_KEY, etc.), not by benchmark.format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
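A prefix-based default of this kind could be resolved along the following lines. This is a hedged sketch: only the openrouter/ → OPENROUTER_API_KEY mapping is stated above; the ANTHROPIC_API_KEY and OPENAI_API_KEY names are assumed from common convention, and `defaultApiKeyEnv` is a hypothetical helper name, not a function confirmed to exist in the project.

```typescript
// Provider prefix → default environment variable name.
const PROVIDER_ENV = new Map<string, string>([
  ["openrouter", "OPENROUTER_API_KEY"], // stated in the commit message
  ["anthropic", "ANTHROPIC_API_KEY"],   // assumption (conventional name)
  ["openai", "OPENAI_API_KEY"],         // assumption (conventional name)
]);

// Hypothetical helper: pick the default env var from the model ID's
// provider prefix (the text before the first slash).
function defaultApiKeyEnv(modelId: string): string {
  const slash = modelId.indexOf("/");
  const prefix = slash === -1 ? "" : modelId.slice(0, slash);
  const env = PROVIDER_ENV.get(prefix);
  if (env === undefined) {
    throw new Error(`Unknown provider prefix in model ID: ${modelId}`);
  }
  return env;
}
```

For example, `defaultApiKeyEnv("openrouter/google/gemini-2.5-flash")` resolves from the `openrouter` prefix alone, independent of any benchmark.format setting.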

* feat(workbench): replace benchmark stack with eval workbench
* chore(workbench): untrack local eval corpora
* feat(workbench): harden eval runner and add examples
* feat(workbench): simplify eval outputs and skill docs
* feat(workbench): pass through eval credentials
* feat(skill): add cross-agent plugin distribution
* chore: update package author metadata
* fix(codex): add plugin marketplace metadata
* docs: refresh agent context files
* feat(workbench): add hidden MCP services
* fix(plugin): align OpenCode entrypoint
* refactor(workbench): remove reference solutions
* docs(workbench): add optimization loop
* docs(release): refresh Skill Optimizer positioning
* style(plugin): normalize OpenCode plugin quotes
* fix(workbench): harden env and MCP service handling
* fix(workbench): align skill docs with current examples

  Clarify command-skill eval guidance and keep packaged examples from drifting from the supported workbench schema.

* fix(workbench): validate models before runs

  Validate standalone run-case model refs early and keep partially-started MCP service containers visible to cleanup.

Mirror provider-specific install instructions in the README and keep packaged plugin descriptions consistent across supported agents.

OpenClaw Agent (basd) and others added 19 commits April 16, 2026 21:28

chore: ignore .worktrees/ directory

f91c326

Merge staging into development: resolve conflicts for dev→stg PR #29

fbcbd50

Merge branch 'staging' into development

d9d30ec

Merge remote-tracking branch 'origin/staging' into development

ea4ee1b

docs: regenerate config-schema.md with updated apiKeyEnv description

efd5af7

Merge remote-tracking branch 'origin/staging' into development

fdaa38e

Merge remote-tracking branch 'origin/staging' into development

94512a0

Merge remote-tracking branch 'origin/staging' into development

d4f5fef

Merge remote-tracking branch 'origin/staging' into development

de5ffe4

Merge remote-tracking branch 'origin/staging' into development

53a1ea5

Merge branch 'staging' into development

9aa17f9

Merge branch 'staging' into development

1b320b9

docs(install): align provider installation guidance (#41)

eeb37df

bucurdavid wants to merge 19 commits into staging from development

3 participants
