fix: copy skill resources to workspace and improve trigger detection #2
Merged
melanie531 merged 1 commit into aws-samples:main on Mar 18, 2026
…d workspace hint
Three issues that significantly impact functional and trigger scores:
1. Functional eval creates a temp workspace but doesn't copy the skill's
scripts/, references/, or assets/ directories into it. When the agent
tries to execute skill scripts (e.g., 'python3 scripts/check.py'),
the files don't exist, causing script-based assertions to fail.
Fix: Copy scripts/, references/, assets/, and SKILL.md from the skill
directory into the with-skill workspace. Use separate workspaces for
with-skill and without-skill runs to prevent contamination.
2. Trigger detection only recognizes skill activation through Read tool
calls targeting SKILL.md. However, ClaudeRunner injects skill content
via --append-system-prompt, so the agent never reads SKILL.md from disk.
Fix: Add word-boundary matching for skill name in agent text output,
and path-level matching (scripts/{filename}) for script references.
Bare filenames like 'check.py' no longer trigger false positives.
3. The system prompt injection doesn't tell the agent that skill scripts
are available in the working directory, so the agent may not attempt
to execute them even when they're present.
Fix: Append a workspace hint to the injected system prompt informing
the agent that scripts/ is available in the working directory.
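
A minimal sketch of the hint described in item 3, for illustration only. The function name `build_system_prompt`, the constant `WORKSPACE_HINT`, and the hint wording are hypothetical and may not match what the PR actually injects:

```python
# Hypothetical hint text; the exact wording used by the PR is not shown here.
WORKSPACE_HINT = (
    "\n\nThe skill's scripts/ directory is available in the current "
    "working directory. Run its scripts with relative paths, "
    "e.g. `python3 scripts/check.py`."
)


def build_system_prompt(skill_md: str, scripts_available: bool) -> str:
    """Return the text that would be passed via --append-system-prompt."""
    prompt = skill_md.rstrip()
    # Only mention scripts/ when the directory was actually copied in,
    # so the agent is never pointed at files that do not exist.
    if scripts_available:
        prompt += WORKSPACE_HINT
    return prompt
```

Gating the hint on `scripts_available` is one possible design choice: it keeps the prompt honest for skills that ship no scripts.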
Also removes 'Skill' from --allowedTools since it refers to Claude Code's
~/.claude/commands/ mechanism which is not used with --append-system-prompt.
All 652 tests pass (8 new tests added).
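
Fix 1 above (copying skill resources and using separate workspaces) could look roughly like the following. This is a sketch under assumptions, not the code in functional.py: the helper names `populate_workspace` and `make_workspaces` and the temp-directory prefixes are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Directories named in the commit message.
SKILL_RESOURCE_DIRS = ("scripts", "references", "assets")


def populate_workspace(skill_dir: Path, workspace: Path) -> list[str]:
    """Copy the skill's resources into workspace; return the names copied."""
    copied = []
    for name in SKILL_RESOURCE_DIRS:
        src = skill_dir / name
        if src.is_dir():  # a skill may ship only some of these directories
            shutil.copytree(src, workspace / name, dirs_exist_ok=True)
            copied.append(name)
    skill_md = skill_dir / "SKILL.md"
    if skill_md.is_file():
        shutil.copy2(skill_md, workspace / "SKILL.md")
        copied.append("SKILL.md")
    return copied


def make_workspaces(skill_dir: Path) -> tuple[Path, Path]:
    """Create separate with-skill and without-skill workspaces."""
    with_skill = Path(tempfile.mkdtemp(prefix="with-skill-"))
    without_skill = Path(tempfile.mkdtemp(prefix="without-skill-"))
    # Only the with-skill run gets the resources; the baseline stays clean.
    populate_workspace(skill_dir, with_skill)
    return with_skill, without_skill
```

With this layout, `python3 scripts/check.py` resolves inside the with-skill workspace, while the without-skill run cannot pick up skill files by accident.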
8a03bde to 274ee1f
Problem
Two issues that significantly impact functional and trigger evaluation scores:
1. Skill scripts/references not available in functional eval workspace
_execute_eval_pair() in functional.py creates a temporary workspace directory for each eval run. However, only the eval case files are copied into this workspace; the skill's own scripts/, references/, and assets/ directories are not.
When SKILL.md instructs the agent to run a script (e.g., python3 scripts/check.py), the file does not exist in the workspace. This causes script-based assertions to fail.
2. Trigger detection fails for all --append-system-prompt injected skills
ClaudeRunner injects skill content via --append-system-prompt, meaning the agent receives the skill instructions in its system prompt. However, _detect_skill_trigger_from_parsed() primarily looks for Read tool calls targeting SKILL.md to detect skill activation.
Since the agent already has the skill content in its system prompt, it never needs to read SKILL.md from disk. As a result, skill activation goes undetected for these skills.
Fix
Functional workspace (functional.py)
Copy scripts/, references/, and assets/ from the skill directory into the temp workspace
Also copy SKILL.md so the agent can read it if needed
Trigger detection (trigger.py)
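
The commit message describes the trigger fix as word-boundary matching on the skill name plus path-level matching on scripts/{filename}. A rough sketch of that idea follows; the function name `skill_triggered`, its signature, and the case-insensitive match are assumptions, and the real logic in trigger.py may differ:

```python
import re


def skill_triggered(agent_text: str, skill_name: str, script_files: list[str]) -> bool:
    """Return True if the agent's text output suggests the skill was used."""
    # Word-boundary match on the skill name, so the name embedded in a
    # longer word does not count. Case-insensitivity is an assumption.
    name_pattern = re.compile(r"\b" + re.escape(skill_name) + r"\b", re.IGNORECASE)
    if name_pattern.search(agent_text):
        return True
    # Path-level match: require the scripts/ prefix, so a bare filename
    # such as check.py does not trigger a false positive.
    return any(f"scripts/{filename}" in agent_text for filename in script_files)
```

For example, `skill_triggered("Running python3 scripts/check.py", "linter", ["check.py"])` is true, while the same call with the text "I wrote check.py for you" is false.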
Testing
Evidence
Tested with 4 skills in a CI/CD pipeline. Before this fix:
The trigger score of exactly 50% is the signature of this bug: detection is no better than chance, because only the negative queries pass.
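
To make the arithmetic concrete, here is a small illustration. It assumes an eval set with equal numbers of should-trigger and should-not-trigger queries, which is not stated in this PR, and a detector that never fires. The function name `trigger_score` and the query counts are made up for the example.

```python
def trigger_score(expected: list[bool], detected: list[bool]) -> float:
    """Fraction of queries where detection matches the expected outcome."""
    passed = sum(e == d for e, d in zip(expected, detected))
    return passed / len(expected)


# Assumed balanced set: 5 queries that should trigger, 5 that should not.
expected = [True] * 5 + [False] * 5
# A detector that never fires passes only the negative queries.
never_fires = [False] * len(expected)
score = trigger_score(expected, never_fires)  # 5 of 10 queries pass
```

Under these assumptions the score is 0.5 regardless of how the agent actually behaves.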