fix: copy skill resources to workspace and improve trigger detection by melanie531 · Pull Request #2 · aws-samples/sample-agent-skill-eval

melanie531 · 2026-03-18T22:57:14Z

Problem

Two issues significantly lower functional and trigger evaluation scores:

1. Skill scripts/references not available in functional eval workspace

_execute_eval_pair() in functional.py creates a temporary workspace directory for each eval run. However, only the eval case files are copied into this workspace — the skill's own scripts/, references/, and assets/ directories are not copied.

When SKILL.md instructs the agent to run a script (e.g., python3 scripts/check.py), the file does not exist in the workspace. This causes:

Script execution failures
Lower functional scores because assertions checking for script output fail
No meaningful difference between with-skill and without-skill runs

2. Trigger detection fails for all `--append-system-prompt` injected skills

ClaudeRunner injects skill content via --append-system-prompt, meaning the agent receives the skill instructions in its system prompt. However, _detect_skill_trigger_from_parsed() primarily looks for Read tool calls targeting SKILL.md to detect skill activation.

Since the agent already has the skill content in its system prompt, it never needs to read SKILL.md from disk. This results in:

All positive trigger queries failing (agent used the skill knowledge but didn't read the file)
All negative trigger queries passing (correctly detected as not triggered)
Exactly 50% trigger score for any skill with balanced positive/negative queries
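
The 50% figure follows from simple arithmetic. The sketch below assumes the trigger score is the fraction of queries whose detected result matches the expected one. That scoring rule and the name `trigger_score` are illustrative assumptions, not code taken from trigger.py.

```python
def trigger_score(expected, detected):
    """Fraction of queries where detection matches the expected outcome."""
    correct = sum(e == d for e, d in zip(expected, detected))
    return correct / len(expected)


# Balanced query set: 5 positive (should trigger) and 5 negative (should not).
expected = [True] * 5 + [False] * 5

# A detector that never fires, as happens when SKILL.md is never read.
never_fires = [False] * len(expected)

score = trigger_score(expected, never_fires)
print(f"{score:.0%}")  # 50%
```

Any balanced query set gives the same result, whatever the agent actually did.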

Fix

Functional workspace (functional.py)

Copy scripts/, references/, and assets/ from the skill directory into the temp workspace
Also copy SKILL.md so the agent can read it if needed
This allows the agent to actually execute the scripts described in SKILL.md
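
A minimal sketch of the copy step described above, using only the standard library. The helper names `copy_skill_resources` and `make_with_skill_workspace` are hypothetical; the actual change lives inside `_execute_eval_pair()` and may be structured differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Resource directories named in the PR description.
SKILL_RESOURCE_DIRS = ("scripts", "references", "assets")


def copy_skill_resources(skill_dir: Path, workspace: Path) -> List[str]:
    """Copy the skill's resources into the workspace; return what was copied."""
    copied = []
    for name in SKILL_RESOURCE_DIRS:
        src = skill_dir / name
        if src.is_dir():
            # dirs_exist_ok merges into a directory that eval case files
            # may already have created (Python 3.8+).
            shutil.copytree(src, workspace / name, dirs_exist_ok=True)
            copied.append(name)
    skill_md = skill_dir / "SKILL.md"
    if skill_md.is_file():
        shutil.copy2(skill_md, workspace / "SKILL.md")
        copied.append("SKILL.md")
    return copied


def make_with_skill_workspace(skill_dir: Path) -> Path:
    """Create a fresh temp workspace populated with the skill's resources."""
    workspace = Path(tempfile.mkdtemp(prefix="skill-eval-"))
    copy_skill_resources(skill_dir, workspace)
    return workspace
```

Missing directories are skipped rather than treated as errors, since not every skill ships all three.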

Trigger detection (trigger.py)

Add word-boundary matching for skill name in agent text output
Add detection of skill script filenames referenced in agent text output
These complement the existing tool-call detection, covering cases where the agent demonstrates skill awareness through its response text
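
The text-based checks could look roughly like the sketch below. The function name `text_mentions_skill`, the case-insensitive name match, and the exact regular expressions are assumptions for illustration. The script check uses the path-level form (`scripts/{filename}`) that the commit message describes, so a bare filename does not count.

```python
import re
from typing import Iterable


def text_mentions_skill(text: str, skill_name: str,
                        script_names: Iterable[str]) -> bool:
    """Return True if the agent's text output shows awareness of the skill."""
    # Word-boundary match on the skill name, so "lint-check" does not
    # match inside "lint-checker".
    name_pattern = re.compile(r"\b" + re.escape(skill_name) + r"\b",
                              re.IGNORECASE)
    if name_pattern.search(text):
        return True

    # Path-level match on script references: "scripts/check.py" counts,
    # a bare "check.py" does not.
    for script in script_names:
        path_pattern = re.compile(
            r"(?<!\w)scripts/" + re.escape(script) + r"(?!\w)"
        )
        if path_pattern.search(text):
            return True
    return False
```

For example, `text_mentions_skill("Running python3 scripts/check.py", "lint-check", ["check.py"])` returns `True`, while the same call on `"I wrote check.py for you"` returns `False`.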

Testing

6 new unit tests added (1 functional + 5 trigger)
All 651 existing tests continue to pass
No changes to public API or CLI interface

Evidence

Tested with 4 skills in a CI/CD pipeline. Before this fix:

Metric	Score	Reason
Trigger	50%	All positive queries fail (SKILL.md never read)
Functional	28-36%	Scripts not found in workspace

The trigger score of exactly 50% is the signature of this bug: the detector never fires, so every negative query passes and every positive query fails, regardless of what the agent actually did.

…d workspace hint

Three issues that significantly impact functional and trigger scores:

1. Functional eval creates a temp workspace but doesn't copy the skill's scripts/, references/, or assets/ directories into it. When the agent tries to execute skill scripts (e.g., 'python3 scripts/check.py'), the files don't exist, causing script-based assertions to fail.

Fix: Copy scripts/, references/, assets/, and SKILL.md from the skill directory into the with-skill workspace. Use separate workspaces for with-skill and without-skill runs to prevent contamination.

2. Trigger detection only recognizes skill activation through Read tool calls targeting SKILL.md. However, ClaudeRunner injects skill content via --append-system-prompt, so the agent never reads SKILL.md from disk.

Fix: Add word-boundary matching for skill name in agent text output, and path-level matching (scripts/{filename}) for script references. Bare filenames like 'check.py' no longer trigger false positives.

3. The system prompt injection doesn't tell the agent that skill scripts are available in the working directory, so the agent may not attempt to execute them even when they're present.

Fix: Append a workspace hint to the injected system prompt informing the agent that scripts/ is available in the working directory.

Also removes 'Skill' from --allowedTools since it refers to Claude Code's ~/.claude/commands/ mechanism which is not used with --append-system-prompt.

All 652 tests pass (8 new tests added).
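
The workspace hint from the third item of the commit message can be sketched as a small string-building step. The function name `build_appended_prompt` and the wording of the hint are hypothetical; the source states only that a hint about scripts/ is appended to the injected system prompt.

```python
# Hypothetical hint wording; the actual text used in the PR is not shown.
WORKSPACE_HINT = (
    "\n\nNote: this skill's scripts/ directory is available in the "
    "current working directory."
)


def build_appended_prompt(skill_md_text: str, has_scripts: bool) -> str:
    """Return the text passed via --append-system-prompt."""
    # Only add the hint when scripts were actually copied to the workspace,
    # so the agent is never told about files that do not exist.
    if has_scripts:
        return skill_md_text + WORKSPACE_HINT
    return skill_md_text
```

Gating the hint on `has_scripts` keeps the prompt unchanged for skills that ship no scripts.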

melanie531 force-pushed the fix/workspace-skill-files branch from 8a03bde to 274ee1f Compare March 18, 2026 23:02

melanie531 merged commit b1c01aa into aws-samples:main Mar 18, 2026
2 checks passed

melanie531 merged 1 commit into aws-samples:main from melanie531:fix/workspace-skill-files
