v2.0.0 #43

Open
bucurdavid wants to merge 19 commits into staging from development

Conversation

@bucurdavid
Collaborator

No description provided.

OpenClaw Agent (basd) and others added 19 commits April 16, 2026 21:28
* chore: ignore .worktrees/ directory

* feat: stable task IDs + optimizer loop diagram in README

- fix(tasks): derive task IDs from sha1(action names) instead of
  LLM-supplied id field, which changed on every regeneration and
  broke --task filters. Action names come from the discovered surface
  and are stable across runs when the surface hasn't changed.
  Duplicate action-name sets get a -1/-2 numeric suffix.

- docs: add horizontal optimizer-loop SVG diagram to README top,
  showing the full init → baseline → iterate (analyze/mutate/
  re-benchmark/accept-reject) → output flow at a glance.
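
The ID derivation described in the first bullet could look roughly like the sketch below. It is an illustration only, not the code in src/tasks/generate.ts: the helper name `stableTaskId`, the sorting of action names, the newline join, and the 12-character truncation are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a task ID from the action names a task exercises.
// Sorting (an assumption) makes the ID independent of the order the names arrive in.
function stableTaskId(actionNames: string[]): string {
  const canonical = [...actionNames].sort().join("\n");
  // Truncating to 12 hex characters is an illustrative choice, not the project's.
  return createHash("sha1").update(canonical).digest("hex").slice(0, 12);
}
```

Because the input is the discovered action names rather than an LLM-supplied field, the same surface yields the same ID on every regeneration.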

Closes #17

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add PNG version of optimizer loop diagram

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: use SVG in README (PNG kept as companion file)

SVG renders natively on GitHub and scales without pixelation.
PNG is included alongside as a companion for external use cases
(email, Office docs, tools that don't render SVG).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Copilot review on PR #24

- Delete SAFE_TASK_ID / isSafeTaskId from src/tasks/generate.ts — dead
  after the stable-ID refactor (still used in src/benchmark/config.ts
  for external task-file validation, so that copy stays).
- Extend stable task IDs to the prompt surface: fall back to a SHA-1
  hash of the prompt text when expected_actions is empty, so
  prompt-surface --task filters survive regeneration.
- Rewrite four README links (optimizer-loop.svg, docs/reference/*.md,
  CONTRIBUTING.md) to absolute github.com URLs — docs/ and
  CONTRIBUTING.md are not in the npm tarball's files field, so relative
  paths 404 when users view the README on npmjs.com.
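
The prompt-surface fallback in the second bullet comes down to choosing which text gets hashed. A minimal sketch of that selection step follows; the hashing itself is left out. The interface and the name `idHashInput` are hypothetical, and the sort-and-join of action names is an assumption.

```typescript
interface HashableTask {
  prompt: string;
  expected_actions: string[];
}

// Hypothetical helper: pick the text that the SHA-1 hash is computed over.
function idHashInput(task: HashableTask): string {
  if (task.expected_actions.length > 0) {
    // SDK/CLI/MCP surfaces: action names come from the discovered surface.
    return [...task.expected_actions].sort().join("\n");
  }
  // Prompt surface: expected_actions is empty, so fall back to the prompt text.
  return task.prompt;
}
```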

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: final 1.1.0 polish — CHANGELOG, error message, README consistency

- CHANGELOG.md: fill in all additions and fixes that landed on
  development after the initial 1.1.0 bump (stable task IDs, Codex
  auth, SKILL folder, diagram, model-ID slug overhaul, error message).
- src/errors.ts + docs/reference/errors.md: fix E_MODEL_ID_FORMAT —
  was "missing the openrouter/ prefix"; now lists all three valid
  provider prefixes (openrouter/, anthropic/, openai/).
- README.md: use catalog-correct openrouter/google/gemini-2.5-flash
  (dot, not hyphen) in answers.json example; change "skill-optimizer
  benchmark" to "skill-optimizer run" for consistency with other examples.
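
For illustration, a prefix check that produces the corrected E_MODEL_ID_FORMAT wording might look like this. The three prefixes come from the commit message; the function name, return shape, and exact message text are assumptions and may not match src/errors.ts.

```typescript
// The three provider prefixes named in the commit message.
const VALID_PREFIXES = ["openrouter/", "anthropic/", "openai/"] as const;

// Hypothetical validator: returns null when the ID is acceptable,
// otherwise an error message that lists every valid prefix.
function checkModelIdPrefix(modelId: string): string | null {
  if (VALID_PREFIXES.some((prefix) => modelId.startsWith(prefix))) {
    return null;
  }
  return `E_MODEL_ID_FORMAT: "${modelId}" must start with one of: ${VALID_PREFIXES.join(", ")}`;
}
```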

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): stable dedup suffix order + accurate CHANGELOG wording

Sort validated tasks by (id, prompt) before the dedup counter loop so
that numeric suffixes assigned to same-action-hash tasks are determined
by content order, not LLM output order. Previously, if the model
regenerated two create_wallet tasks in swapped order, the -1/-2 suffixes
would swap between runs, making --task filters unstable for multi-variant
cases.
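
A sketch of the sort-then-suffix step, under stated assumptions: the `Task` shape and the name `dedupTaskIds` are hypothetical, and the sketch assumes every member of a duplicate group receives a suffix starting at -1.

```typescript
interface Task {
  id: string;
  prompt: string;
}

// Plain code-unit comparison keeps the order deterministic across locales.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Hypothetical sketch: sort by (id, prompt) so suffixes follow content order,
// then number the tasks whose base ID occurs more than once.
function dedupTaskIds(tasks: Task[]): Task[] {
  const sorted = [...tasks].sort(
    (a, b) => compareStrings(a.id, b.id) || compareStrings(a.prompt, b.prompt),
  );
  const totals = new Map<string, number>();
  for (const t of sorted) totals.set(t.id, (totals.get(t.id) ?? 0) + 1);

  const seen = new Map<string, number>();
  return sorted.map((t) => {
    if ((totals.get(t.id) ?? 0) < 2) return t; // unique ID: leave untouched
    const n = (seen.get(t.id) ?? 0) + 1;
    seen.set(t.id, n);
    return { ...t, id: `${t.id}-${n}` };
  });
}
```

With this ordering, swapping the two create_wallet variants in the LLM output leaves each prompt attached to the same suffix.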

Also soften the CHANGELOG entry for "stable task IDs": clarifies that
SDK/CLI/MCP IDs are stable across regenerations (action names come from
discovered code), while prompt-surface IDs are only stable when the LLM
produces identical wording.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update README with Fast team and payment info

Added information about the Fast team and payment infrastructure for AI agents. Requested by Jessy.

* Update README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: dmn <damian.ovidiu27@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: exempt openai/ model IDs from dot→hyphen rewrite

OpenAI's direct-API model IDs use dots in version numbers (gpt-5.4,
gpt-4.1). fix.ts already exempted openrouter/ but not openai/, so a
manufactured model-id-bad-format issue would corrupt an openai/ ID.
Defense-in-depth: validate.ts already skips emitting the issue for
openai/, but fix.ts must independently respect the documented
invariant in CLAUDE.md.
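
The exemption can be pictured as below. This is a simplified sketch, not the contents of fix.ts: the function name is hypothetical, and the assumption is that the rewrite is a blanket dot-to-hyphen replacement for every non-exempt ID.

```typescript
// Prefixes whose model IDs legitimately contain dots (per the commit message).
const DOT_EXEMPT_PREFIXES = ["openrouter/", "openai/"];

// Hypothetical fixer: rewrite dots to hyphens unless the provider is exempt.
function fixModelIdDots(modelId: string): string {
  if (DOT_EXEMPT_PREFIXES.some((prefix) => modelId.startsWith(prefix))) {
    return modelId; // leave e.g. openai/gpt-4.1 untouched
  }
  return modelId.replace(/\./g, "-");
}
```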

Also wires smoke-model-ids.ts into npm test — it existed on disk but
was not in the test script.

* chore: remove unused src/discovery/prompt.ts and its tests

The active prompt discovery lives in src/project/discover-prompt.ts,
imported by snapshot.ts and benchmark/runner.ts. src/discovery/prompt.ts
was a dead parallel implementation — only referenced by its own test file.
Removing both the dead module and its test file to keep the codebase lean.
The live discover-prompt.ts API (discoverPromptCapabilities) is covered by
other smoke tests in this PR.

* fix: prompt surface not blocked by coverage violation

Prompt-surface tasks don't guarantee 1:1 capability coverage the way
SDK/CLI/MCP tasks do, so coverageViolation=true was hard-FAILing every
prompt benchmark regardless of actual scores.

computeVerdict now only appends the coverage-violation reason when
config.surface !== 'prompt'. Coverage is still computed and appears
in the report.

Regression guard in new smoke-verdict-prompt.ts locks in both halves
of the behavior: prompt PASSes with coverageViolation=true + scores
above floor; mcp still FAILs under identical conditions.
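
The surface check can be sketched as follows. The types, the single `score`/`floor` pair, and the name `computeVerdictSketch` are simplifications assumed for illustration; the real computeVerdict takes the project's config and report structures.

```typescript
type Surface = "sdk" | "cli" | "mcp" | "prompt";

interface VerdictInput {
  surface: Surface;
  coverageViolation: boolean;
  score: number;
  floor: number;
}

interface Verdict {
  pass: boolean;
  reasons: string[];
}

// Hypothetical sketch: coverage is still reported upstream, but it only
// contributes a failure reason on non-prompt surfaces.
function computeVerdictSketch(input: VerdictInput): Verdict {
  const reasons: string[] = [];
  if (input.score < input.floor) {
    reasons.push(`score ${input.score} is below floor ${input.floor}`);
  }
  if (input.coverageViolation && input.surface !== "prompt") {
    reasons.push("coverage violation");
  }
  return { pass: reasons.length === 0, reasons };
}
```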

* refactor: extract resolveCriteriaForTask from runner

Pure refactor. Moves the caps→criteria lookup out of runner.ts into
src/benchmark/prompt-criteria.ts so it can be unit-tested without
running the full LLM pipeline. Behavior is unchanged in this commit —
tasks missing capabilityId are logged as eval errors (FAIL with
message) rather than silently vacuously passing; capabilityId tagging
by the generator lands in a later commit.

Adds optional capabilityId to GeneratedTask (SDK/CLI/MCP generators
don't set it). Runtime enforcement (throw on missing/unknown) lives
in resolveCriteriaForTask — no silent fallback, per the
no-legacy-compat policy.

New smoke-prompt-criteria.ts locks in: match, distinct-per-capability,
throws-on-unknown, throws-on-missing, and noActiveCriteria flagging.
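
A minimal sketch of the throw-on-missing and throw-on-unknown behavior, assuming criteria are held in a map keyed by capability ID. The `Criteria` and `PromptTask` shapes and the function name are hypothetical; the real resolveCriteriaForTask lives in src/benchmark/prompt-criteria.ts.

```typescript
interface Criteria {
  patterns: string[];
}

interface PromptTask {
  id: string;
  capabilityId?: string;
}

// Hypothetical lookup with no silent fallback: both failure modes throw.
function resolveCriteriaSketch(
  task: PromptTask,
  criteriaByCapability: Map<string, Criteria>,
): Criteria {
  if (task.capabilityId === undefined) {
    throw new Error(`task ${task.id} has no capabilityId`);
  }
  const criteria = criteriaByCapability.get(task.capabilityId);
  if (criteria === undefined) {
    throw new Error(`task ${task.id} references unknown capability "${task.capabilityId}"`);
  }
  return criteria;
}
```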

* fix: evaluator flags noActiveCriteria instead of vacuous 1.0 pass

Previously, when every criteria category was empty the evaluator
returned score: 1.0 — any response (including an empty string) scored
a perfect pass. Now the evaluator returns score: 0 and
noActiveCriteria: true. The runner treats that flag as an evaluation
error with an actionable message pointing at the SKILL.md section for
the offending capability.

Evaluator stays dumb (no pass/fail policy). Runner is the policy layer.
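
The changed return value can be illustrated with a toy evaluator. Substring matching and the flat scoring formula are assumptions made only so the sketch runs; the point is the empty-criteria branch.

```typescript
interface CriteriaSet {
  [category: string]: string[];
}

interface EvalResult {
  score: number;
  noActiveCriteria: boolean;
}

// Hypothetical evaluator: reports facts only and leaves pass/fail to the runner.
function evaluateSketch(response: string, criteria: CriteriaSet): EvalResult {
  const checks: string[] = [];
  for (const category of Object.keys(criteria)) {
    checks.push(...(criteria[category] ?? []));
  }
  if (checks.length === 0) {
    // Previously this case scored 1.0; now it is flagged instead.
    return { score: 0, noActiveCriteria: true };
  }
  const hits = checks.filter((check) => response.includes(check)).length;
  return { score: hits / checks.length, noActiveCriteria: false };
}
```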

* feat: per-capability prompt scoring via capabilityId

The runner's caps[0] global was collapsing every prompt-surface task
into evaluation against the first discovered capability regardless of
what the task actually exercised. With this commit, each generated
prompt-surface task is tagged at generation time with the action key
of the capability it exercises, and the runner looks up criteria
per-task via resolveCriteriaForTask (wired in previous commits).

No legacy compat. Prompt-surface tasks lacking capabilityId fail to
load; users regenerate with `skill-optimizer generate-tasks`.

Regression guards:
  - smoke-generation.ts: valid tagging plus rejection of unknown ids
  - smoke-verdict-prompt.ts: three caps produce distinct criteria
    (caps[0]-collapse detector), P3 regression guard via evaluator,
    and mock-LLM verdict matrix (threshold + weight math)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: v1.1.0 correctness fixes

CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0.
README prompt templates section reflects per-capability scoring.
SKILL.md audited for any guidance contradicting the fixed behavior.

* test: release-readiness coverage check

smoke-changelog-coverage.ts parses the top block of CHANGELOG.md and
asserts every item in Added/Fixed has at least one test file
referencing relevant keywords. Guards against 'shipped feature,
forgot the test' — the class that let P1/P2/P3 slip past v1.1.0
before this PR.

smoke-release.ts also gains an assertion that the CHANGELOG contains
a section header matching the current package.json version.
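
The version-header assertion could be approximated as below. The `## [x.y.z]` / `## x.y.z` header shapes are an assumption borrowed from common changelog conventions, and the function name is hypothetical.

```typescript
// Hypothetical check: does the changelog contain a level-2 header for this version?
function changelogHasVersion(changelog: string, version: string): boolean {
  // Escape regex metacharacters so "1.1.0" does not match "1x1y0".
  const escaped = version.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const header = new RegExp(`^##[ \\t]*\\[?v?${escaped}\\]?(\\s|$)`, "m");
  return header.test(changelog);
}
```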

* fix: plumb capabilityId through TaskDefinition and tighten changelog coverage check

- Add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition
  to read and pass it through — without this, resolveCriteriaForTask threw on
  every prompt-surface task because loadTasks silently dropped the field
- smoke-changelog-coverage: require ≥2 tokens to co-occur in a single test file
  (whole-word match) instead of any one token anywhere in the corpus — prevents
  false-passes on generic words like "prompt" or "coverage"
- generate.ts comment: clarify that parseGeneratedTasks attaches capabilityId;
  membership validation is in ground.ts, not here
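
The tightened co-occurrence rule from the second bullet can be sketched like this. Test files are passed in as strings to keep the example self-contained; the function names and the case-insensitive match are assumptions.

```typescript
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical check: true when at least `minTokens` distinct tokens appear
// as whole words inside one and the same test file.
function isCovered(tokens: string[], testFiles: string[], minTokens = 2): boolean {
  const unique = Array.from(new Set(tokens.map((t) => t.toLowerCase())));
  return testFiles.some((content) => {
    const matched = unique.filter((token) =>
      new RegExp(`\\b${escapeRegExp(token)}\\b`, "i").test(content),
    );
    return matched.length >= minTokens;
  });
}
```

Requiring two tokens in a single file is what stops a generic word such as "prompt" in one file and "coverage" in another from counting as coverage.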

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: output capability criteria and prompt coverage computation

- discover-prompt.ts: _output capabilities now store section: section.body
  (full markdown with fences) instead of section: snippet (extracted content
  without fences). generateCriteriaFromCapability requires fences to extract
  format patterns, so passing bare snippet produced empty criteria and forced
  noActiveCriteria/FAIL for every output-format task.
- coverage.ts: actionNamesOf falls back to capabilityId when expected_actions
  is empty — prompt tasks always have expected_actions:[] so coverage showed
  0/N covered. capabilityId matches action.name for prompt capabilities
  (key===name in capabilityToAction), so this correctly attributes coverage.
- Tests: regression guards added in smoke-prompt-criteria and smoke-coverage.
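
The coverage fallback in the second bullet might be sketched as follows. Treating expected_actions as a plain string array is an assumption, as are the `CoverageTask` shape and both function names.

```typescript
interface CoverageTask {
  expected_actions: string[];
  capabilityId?: string;
}

// Hypothetical version of actionNamesOf: prompt tasks carry no expected
// actions, so their capabilityId stands in as the covered action name.
function actionNamesOfSketch(task: CoverageTask): string[] {
  if (task.expected_actions.length > 0) return task.expected_actions;
  return task.capabilityId !== undefined ? [task.capabilityId] : [];
}

// Count how many known action names are exercised by at least one task.
function coveredCount(tasks: CoverageTask[], actionNames: string[]): number {
  const hit = new Set<string>();
  for (const task of tasks) {
    for (const name of actionNamesOfSketch(task)) hit.add(name);
  }
  return actionNames.filter((name) => hit.has(name)).length;
}
```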

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… error quality) (#30)

* docs: add implementation plan for PR #26 review fixes (13 issues)

* fix(preflight): exempt prompt surface from maxTasks check; surface-aware discovery hints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(init): add prompt surface next-steps guidance

* fix(wizard): accept anthropic/ and openai/ model IDs in custom model validator

* fix(generate): allow missing expected_actions on prompt surface in validateTask

When `knownCapabilityKeys` is defined (prompt surface), LLMs may omit
`expected_actions` entirely even though the prompt requests an empty array.
Fall back to `[]` instead of throwing, so task generation is not blocked.
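
A sketch of the relaxed validation, assuming the prompt surface is signalled by `knownCapabilityKeys` being defined as the commit describes. The helper name, the `RawTask` shape, and the error messages are hypothetical.

```typescript
interface RawTask {
  prompt?: unknown;
  expected_actions?: unknown;
}

// Hypothetical helper: normalize expected_actions from untrusted LLM output.
function normalizeExpectedActions(
  raw: RawTask,
  knownCapabilityKeys?: Set<string>,
): string[] {
  if (raw.expected_actions === undefined) {
    // Prompt surface: a missing field is treated as an empty array.
    if (knownCapabilityKeys !== undefined) return [];
    throw new Error("expected_actions is required");
  }
  const value = raw.expected_actions;
  if (!Array.isArray(value) || !value.every((item) => typeof item === "string")) {
    throw new Error("expected_actions must be an array of strings");
  }
  return value as string[];
}
```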

* fix(docs): correct config path to .skill-optimizer/ and update stale model ID

Replace all occurrences of `skill-optimizer/skill-optimizer.json` (without dot) with
`.skill-optimizer/skill-optimizer.json` (with dot) to match the actual path written by
`src/init/scaffold.ts`. Also update stale `openrouter/openai/gpt-4o` model ID in
`SKILL/references/setup.md` to `openrouter/openai/gpt-4o-mini`.

* fix(snapshot): include snapshotPath in unsupported-format error message

* fix(runner): set toolPrecision=1.0 for prompt surface tasks

* fix(docs): correct apiKeyEnv description and loop.ts agent cwd comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(generate): guard generateCandidateTasksWithCoverage against prompt surface

* fix(tasks): replace brittle string match with NoTextBlocksError class
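
The last item swaps a message comparison for a type check. A minimal sketch of that pattern follows; only the class name NoTextBlocksError comes from the commit title, while the block shape, default message, and helper names are assumptions.

```typescript
class NoTextBlocksError extends Error {
  constructor(message = "response contained no text blocks") {
    super(message);
    this.name = "NoTextBlocksError";
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, NoTextBlocksError.prototype);
  }
}

interface ContentBlock {
  type: string;
  text?: string;
}

// Hypothetical extractor that throws the dedicated error class.
function extractText(blocks: ContentBlock[]): string {
  const texts = blocks
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string);
  if (texts.length === 0) throw new NoTextBlocksError();
  return texts.join("\n");
}

// Callers branch on the class instead of matching on err.message.
function isNoTextBlocks(err: unknown): boolean {
  return err instanceof NoTextBlocksError;
}
```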

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ard, doc fixes

- fix(anthropic): apply tool-name codec to sanitize dotted tool names
- fix(runner): skip prompt eval when model call failed; set toolPrecision=1.0
- fix(docs): correct .skill-optimizer path, optimize.model default, and schema description
Reflects that the default API key env var is determined by model provider
prefix (openrouter/ → OPENROUTER_API_KEY, etc.), not by benchmark.format.
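
The prefix-to-env-var rule could be expressed as a small lookup. Only the openrouter/ mapping is stated in the commit message; the other two variable names are assumed by analogy, and the function name is hypothetical.

```typescript
const ENV_BY_PREFIX: ReadonlyArray<readonly [string, string]> = [
  ["openrouter/", "OPENROUTER_API_KEY"], // stated in the commit message
  ["anthropic/", "ANTHROPIC_API_KEY"], // assumption, by analogy
  ["openai/", "OPENAI_API_KEY"], // assumption, by analogy
];

// Hypothetical helper: choose the default API key env var from the model ID's
// provider prefix, independent of benchmark.format.
function defaultApiKeyEnv(modelId: string): string | undefined {
  for (const [prefix, envVar] of ENV_BY_PREFIX) {
    if (modelId.startsWith(prefix)) return envVar;
  }
  return undefined;
}
```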

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(workbench): replace benchmark stack with eval workbench

* chore(workbench): untrack local eval corpora

* feat(workbench): harden eval runner and add examples

* feat(workbench): simplify eval outputs and skill docs

* feat(workbench): pass through eval credentials

* feat(skill): add cross-agent plugin distribution

* chore: update package author metadata

* fix(codex): add plugin marketplace metadata

* docs: refresh agent context files

* feat(workbench): add hidden MCP services

* fix(plugin): align OpenCode entrypoint

* refactor(workbench): remove reference solutions

* docs(workbench): add optimization loop

* docs(release): refresh Skill Optimizer positioning

* style(plugin): normalize OpenCode plugin quotes

* fix(workbench): harden env and MCP service handling

* fix(workbench): align skill docs with current examples

Clarify command-skill eval guidance and keep packaged examples from drifting from the supported workbench schema.

* fix(workbench): validate models before runs

Validate standalone run-case model refs early and keep partially-started MCP service containers visible to cleanup.
Mirror provider-specific install instructions in the README and keep packaged plugin descriptions consistent across supported agents.


3 participants