
Commit de0712f

fix: remove broken references to missing scripts in skill-creator
Replace all references to non-existent files (generate_review.py, aggregate_benchmark, run_loop, package_skill, agent .md files, eval_review.html) with self-contained instructions using Claude's built-in tools. The skill is now fully standalone.
1 parent 4051cc0 · commit de0712f

2 files changed

Lines changed: 57 additions & 98 deletions


plugins/skill-creator/README.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Advanced features include blind A/B comparison between skill versions and benchm
 
 ## Included
 
-- `SKILL.md` -- comprehensive skill creation guide (~480 lines)
+- `SKILL.md` -- comprehensive skill creation guide (~440 lines), fully self-contained with no external script dependencies
 - `references/schemas.md` -- JSON schemas for evals, grading, benchmarks, and other data structures
 
 ## License

plugins/skill-creator/skills/skill-creator/SKILL.md

Lines changed: 56 additions & 97 deletions
@@ -14,7 +14,7 @@ At a high level, the process of creating a skill goes like this:
 - Create a few test prompts and run claude-with-access-to-the-skill on them
 - Help the user evaluate the results both qualitatively and quantitatively
 - While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist)
-- Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics
+- Present results to the user directly in conversation — show outputs, diffs, and metrics inline. For file outputs, tell the user where they're saved so they can inspect them
 - Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
 - Repeat until you're satisfied
 - Expand the test set and try again at larger scale

@@ -222,71 +222,44 @@ This is the only opportunity to capture this data — it comes through the task
 
 Once all runs are done:
 
-1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
-
-2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:
-```bash
-python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
-```
-This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean +/- stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
-Put each with_skill version before its baseline counterpart.
-
-3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
-
-4. **Launch the viewer** with both qualitative outputs and quantitative data:
-```bash
-nohup python <skill-creator-path>/eval-viewer/generate_review.py \
-<workspace>/iteration-N \
---skill-name "my-skill" \
---benchmark <workspace>/iteration-N/benchmark.json \
-> /dev/null 2>&1 &
-VIEWER_PID=$!
+1. **Grade each run** — spawn a grader subagent (or grade inline) to evaluate each assertion against the outputs. For each assertion, record whether it passed and cite specific evidence from the outputs. Save results to `grading.json` in each run directory using this structure:
+```json
+{
+"expectations": [
+{"text": "assertion text", "passed": true, "evidence": "what you found"}
+],
+"summary": {"passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67}
+}
 ```
-For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
-
-**Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
+For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
 
-Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
+2. **Aggregate into benchmark** — read all `grading.json` and `timing.json` files from the iteration directory, compute pass_rate, time, and tokens for each configuration (with_skill vs without_skill), and calculate mean +/- stddev and the delta. Save as `benchmark.json` and `benchmark.md`. See `references/schemas.md` for the exact schema. Put each with_skill version before its baseline counterpart.
 
-5. **Tell the user** something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
+3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide: assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs. Save observations to the `notes` field of benchmark.json.
 
-### What the user sees in the viewer
+4. **Present results to the user** — show a summary of each test case directly in conversation:
+- The prompt that was given
+- Key outputs (inline or file paths for large files)
+- Grading results (pass/fail per assertion)
+- Benchmark comparison table (with_skill vs without_skill)
+- Ask: "How do these look? Any feedback on specific test cases?"
 
-The "Outputs" tab shows one test case at a time:
-- **Prompt**: the task that was given
-- **Output**: the files the skill produced, rendered inline where possible
-- **Previous Output** (iteration 2+): collapsed section showing last iteration's output
-- **Formal Grades** (if grading was run): collapsed section showing assertion pass/fail
-- **Feedback**: a textbox that auto-saves as they type
-- **Previous Feedback** (iteration 2+): their comments from last time, shown below the textbox
+### Step 5: Collect feedback
 
-The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
+Ask the user for feedback on each test case. Empty or positive feedback means it looked fine. Focus improvements on the test cases where the user had specific complaints.
 
-Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to `feedback.json`.
-
-### Step 5: Read the feedback
-
-When the user tells you they're done, read `feedback.json`:
+You can also save structured feedback to `feedback.json` in the iteration directory for reference in future iterations:
 
 ```json
 {
 "reviews": [
-{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
-{"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
-{"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
-],
-"status": "complete"
+{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels"},
+{"run_id": "eval-1-with_skill", "feedback": ""},
+{"run_id": "eval-2-with_skill", "feedback": "perfect, love this"}
+]
 }
 ```
 
-Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.
-
-Kill the viewer server when you're done with it:
-
-```bash
-kill $VIEWER_PID 2>/dev/null
-```
-
 ---
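
A minimal sketch of the aggregation described in step 2, assuming run directories are named like `eval-0-with_skill` / `eval-0-without_skill` and that each `timing.json` carries `seconds` and `tokens` fields (the directory naming and timing field names are assumptions here, not something the skill prescribes):

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path

def aggregate_benchmark(iteration_dir: str) -> dict:
    """Roll per-run grading.json / timing.json files up into per-configuration stats."""
    groups = defaultdict(lambda: defaultdict(list))
    for run_dir in sorted(Path(iteration_dir).iterdir()):
        grading_path = run_dir / "grading.json"
        if not grading_path.is_file():
            continue
        # Assumed naming convention: "<eval-id>-with_skill" or "<eval-id>-without_skill".
        config = "without_skill" if "without_skill" in run_dir.name else "with_skill"
        grading = json.loads(grading_path.read_text())
        groups[config]["pass_rate"].append(grading["summary"]["pass_rate"])
        timing_path = run_dir / "timing.json"  # hypothetical per-run timing file
        if timing_path.is_file():
            timing = json.loads(timing_path.read_text())
            groups[config]["seconds"].append(timing.get("seconds", 0.0))
            groups[config]["tokens"].append(timing.get("tokens", 0))

    def summarize(values):
        # mean +/- stddev; a single sample gets stddev 0 so the report stays well-formed
        return {"mean": statistics.mean(values),
                "stddev": statistics.stdev(values) if len(values) > 1 else 0.0}

    benchmark = {config: {metric: summarize(vals) for metric, vals in metrics.items()}
                 for config, metrics in groups.items()}
    (Path(iteration_dir) / "benchmark.json").write_text(json.dumps(benchmark, indent=2))
    return benchmark
```

The per-configuration delta and the `benchmark.md` rendering can be derived from the returned dict; `references/schemas.md` has the exact shape the skill expects.
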
 
 ## Improving the skill
@@ -324,7 +297,13 @@ Keep going until:
 
 ## Advanced: Blind comparison
 
-For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
+For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), use a blind comparison approach:
+
+1. **Compare** — spawn a subagent that receives two outputs labeled "A" and "B" (randomize which is which). The subagent scores each on content (correctness, completeness, accuracy) and structure (organization, formatting, usability), picks a winner, and explains why. Save to `comparison.json`.
+
+2. **Analyze** — spawn another subagent to analyze why the winner won: what instructions led to better output, what the loser's skill was missing, and concrete improvement suggestions. Save to `analysis.json`.
+
+See `references/schemas.md` for the comparison.json and analysis.json schemas.
 
 This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
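
For the cases where it is wanted, the mechanic that keeps step 1 blind is that the parent agent holds the A/B mapping and only unblinds after the verdict. A minimal sketch, assuming each version's output is a single text file (the file names and the exact `comparison.json` fields below are illustrative; see `references/schemas.md` for the real schema):

```python
import json
import random
from pathlib import Path

def prepare_blind_comparison(old_output: str, new_output: str, workdir: str) -> dict:
    """Copy the two outputs under shuffled A/B labels and stash the mapping for later unblinding."""
    labels = ["A", "B"]
    random.shuffle(labels)
    mapping = {labels[0]: "old_version", labels[1]: "new_version"}
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    # The judging subagent is only ever shown these relabeled copies.
    (work / f"output_{labels[0]}.txt").write_text(Path(old_output).read_text())
    (work / f"output_{labels[1]}.txt").write_text(Path(new_output).read_text())
    (work / "label_mapping.json").write_text(json.dumps(mapping, indent=2))
    return mapping

def record_verdict(workdir: str, winner_label: str, rationale: str) -> dict:
    """Translate the subagent's 'A'/'B' verdict back to a version name and save comparison.json."""
    work = Path(workdir)
    mapping = json.loads((work / "label_mapping.json").read_text())
    result = {"winner_label": winner_label, "winner": mapping[winner_label], "rationale": rationale}
    (work / "comparison.json").write_text(json.dumps(result, indent=2))
    return result
```
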

@@ -359,39 +338,26 @@ The key thing to avoid: don't make should-not-trigger queries obviously irreleva
 
 ### Step 2: Review with user
 
-Present the eval set to the user for review using the HTML template:
-
-1. Read the template from `assets/eval_review.html`
-2. Replace the placeholders:
-- `__EVAL_DATA_PLACEHOLDER__` -> the JSON array of eval items (no quotes around it — it's a JS variable assignment)
-- `__SKILL_NAME_PLACEHOLDER__` -> the skill's name
-- `__SKILL_DESCRIPTION_PLACEHOLDER__` -> the skill's current description
-3. Write to a temp file (e.g., `/tmp/eval_review_<skill-name>.html`) and open it: `open /tmp/eval_review_<skill-name>.html`
-4. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
-5. The file downloads to `~/Downloads/eval_set.json` — check the Downloads folder for the most recent version in case there are multiple (e.g., `eval_set (1).json`)
+Present the eval set to the user for review directly in conversation. Show each query with its should_trigger label and ask: "Do these look right? Want to add, remove, or change any?" Adjust based on feedback.
 
 This step matters — bad eval queries lead to bad descriptions.
 
 ### Step 3: Run the optimization loop
 
-Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
+Tell the user: "This will take some time — I'll iterate on the description to improve triggering accuracy."
 
-Save the eval set to the workspace, then run in the background:
+The optimization loop works as follows:
 
-```bash
-python -m scripts.run_loop \
---eval-set <path-to-trigger-eval.json> \
---skill-path <path-to-skill> \
---model <model-id-powering-this-session> \
---max-iterations 5 \
---verbose
-```
-
-Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
-
-While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
-
-This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+1. Split the eval set into 60% train and 40% held-out test
+2. For each iteration (up to 5):
+a. Test the current description against each query using `claude -p "<query>" --skill-prompt-file <skill-path>` (run each query 3 times for reliability)
+b. Check whether the skill triggered (look for skill invocation in the output)
+c. Calculate trigger accuracy (correct triggers + correct non-triggers)
+d. Analyze what failed — which queries triggered incorrectly or didn't trigger when they should have
+e. Rewrite the description to address the failures, using your understanding of how skill triggering works
+f. Evaluate the new description on both train and test sets
+3. Select the best description by test score (not train score) to avoid overfitting
+4. Report results per iteration to the user
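
A minimal sketch of steps 2a-2c, assuming the `claude` CLI is on PATH and accepts the `--skill-prompt-file` flag used above; how a triggered run shows up in the transcript varies, so the detection check below is a placeholder to adapt:

```python
import subprocess

def trigger_rate(query: str, skill_path: str, runs: int = 3) -> float:
    """Run one query several times and return the fraction of runs where the skill triggered."""
    triggered = 0
    for _ in range(runs):
        result = subprocess.run(
            ["claude", "-p", query, "--skill-prompt-file", skill_path],
            capture_output=True, text=True, timeout=300,
        )
        # Placeholder check: look for evidence of skill invocation in the transcript.
        if "skill" in result.stdout.lower():
            triggered += 1
    return triggered / runs

def trigger_accuracy(eval_set: list[dict], skill_path: str) -> float:
    """Correct triggers plus correct non-triggers, over all queries in the (sub)set."""
    correct = 0
    for item in eval_set:  # each item: {"query": "...", "should_trigger": true/false}
        fired = trigger_rate(item["query"], skill_path) >= 0.5  # majority of the repeated runs
        correct += int(fired == item["should_trigger"])
    return correct / len(eval_set)
```

Run this on the train split each iteration, and on the held-out split only when picking the best description (steps 1 and 3).
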
 
 ### How skill triggering works
 
@@ -405,15 +371,15 @@ Take `best_description` from the JSON output and update the skill's SKILL.md fro
 
 ---
 
-### Package and Present (only if `present_files` tool is available)
+### Package and Present
 
-Check whether you have access to the `present_files` tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
+If the user wants a portable `.skill` archive, create one by zipping the skill directory:
 
 ```bash
-python -m scripts.package_skill <path/to/skill-folder>
+cd <parent-of-skill-folder> && zip -r <skill-name>.skill <skill-name>/
 ```
 
-After packaging, direct the user to the resulting `.skill` file path so they can install it.
+Direct the user to the resulting `.skill` file path so they can share or install it. Skip this step if the user doesn't need packaging.
 
 ---
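
If the `zip` binary isn't available in the environment, the same archive can be produced with the Python standard library, since a `.skill` file is just a zip of the skill folder (the helper name below is illustrative):

```python
import zipfile
from pathlib import Path

def package_skill(skill_dir: str) -> Path:
    """Zip <skill-name>/ into <skill-name>.skill next to it, preserving relative paths."""
    skill_path = Path(skill_dir).resolve()
    archive = skill_path.parent / f"{skill_path.name}.skill"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(skill_path.rglob("*")):
            if file.is_file():
                # Store paths relative to the parent so the archive unpacks into <skill-name>/...
                zf.write(file, file.relative_to(skill_path.parent))
    return archive
```
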

@@ -433,7 +399,7 @@ In Claude.ai, the core workflow is the same (draft -> test -> review -> improve
 
 **Blind comparison**: Requires subagents. Skip it.
 
-**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
+**Packaging**: Use a simple `zip` command to package the skill. On Claude.ai, you can create it and the user can download the resulting `.skill` file.
 
 ---
 
@@ -442,24 +408,17 @@ In Claude.ai, the core workflow is the same (draft -> test -> review -> improve
 If you're in Cowork, the main things to know are:
 
 - You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
-- You don't have a browser or display, so when generating the eval viewer, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
-- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.py` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating inputs yourself. You want to get them in front of the human ASAP!
-- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
-- Packaging works — `package_skill.py` just needs Python and a filesystem.
-- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+- You don't have a browser or display, so present all results directly in conversation — show outputs inline, provide file paths for large files, and collect feedback through the chat.
+- After running tests, always present results to the human BEFORE revising the skill yourself. Get them in front of the human ASAP!
+- Packaging works with a simple `zip` command — no external scripts needed.
+- Description optimization uses `claude -p` via subprocess, not a browser, so it works fine in Cowork. Save it until the skill is in good shape.
 
 ---
 
 ## Reference files
 
-The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
-
-- `agents/grader.md` — How to evaluate assertions against outputs
-- `agents/comparator.md` — How to do blind A/B comparison between two outputs
-- `agents/analyzer.md` — How to analyze why one version beat another
-
 The references/ directory has additional documentation:
-- `references/schemas.md` — JSON structures for evals.json, grading.json, etc.
+- `references/schemas.md` — JSON structures for evals.json, grading.json, benchmark.json, comparison.json, analysis.json, and other data structures
 
 ---
 
@@ -469,7 +428,7 @@ Repeating one more time the core loop here for emphasis:
 - Draft or edit the skill
 - Run claude-with-access-to-the-skill on test prompts
 - With the user, evaluate the outputs:
-- Create benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
+- Create benchmark.json and present results to the user for review
 - Run quantitative evals
 - Repeat until you and the user are satisfied
 - Package the final skill and return it to the user.
