A benchmarking tool for evaluating OpenCode skill discoverability and workflow compliance.
- Skill Discovery Testing: Test whether LLMs correctly discover and use skills
- Multi-Mode Evaluation: Test explicit (command-triggered) and implicit (autonomous) skill usage
- Workflow Compliance Scoring: Measure how well agents follow skill instructions
- Baseline Comparison: Track improvements and detect regressions
- Preset System: Reusable scoring configurations for different skill types
```sh
bun add -D skill-discovery-bench
```

```sh
# Run evaluation
skill-discovery-bench -s myskill -p ./my-plugin

# Run both modes with multiple runs
skill-discovery-bench -s myskill -p ./my-plugin --mode both --runs 3

# Set baseline
skill-discovery-bench -s myskill -p ./my-plugin --set-baseline
```

```
Required:
  -s, --skill <name>    Skill name to evaluate
  -p, --plugin <path>   Path to plugin directory

Options:
  -m, --mode <mode>     Test mode: explicit, implicit, or both (default: explicit)
  -r, --runs <n>        Number of test runs (default: 1)
  --model <model>       Model to use (default: pre-configured)
  --preset <name>       Scoring preset to use (default: default)
  --presets             List available presets
  -v, --verbose         Show detailed output
  -b, --set-baseline    Save results as new baseline
  --show-results        Show results from latest run
  -h, --help            Show help
```
```ts
import { runTests, getPreset } from "skill-discovery-bench";

const preset = getPreset("intellisearch");

const { metrics, resultsDir } = await runTests({
  config: {
    mode: "explicit",
    runs: 3,
    skill: {
      name: "myskill",
      pluginPath: "/path/to/plugin",
    },
    queryFile: "test-queries/search.md",
    projectDir: process.cwd(),
  },
  presetPatterns: preset?.solutionExtraction?.patterns,
});

console.log(`Workflow Score: ${metrics.workflowScore}`);
console.log(`Skill Loaded: ${metrics.skillLoaded}`);
```

| Preset | Description |
|---|---|
| `default` | Basic scoring for any skill |
| `intellisearch` | Scoring for library/repository discovery skills |
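The README does not spell out how `workflowScore` is derived. One plausible model is a weighted sum of skill-load credit, tool-pattern credit, and violation penalties; the sketch below illustrates that assumed model only (the function, its parameters, and the formula are illustrative assumptions, not the library's actual implementation):

```typescript
// Assumed scoring model: weighted sum of skill-load credit plus
// per-tool-pattern credit, minus violation impacts, clamped to [0, 1].
// This is an illustration, not skill-discovery-bench's real code.
type ToolWeight = { pattern: string; weight: number };
type Violation = { rule: string; impact: number };

function computeWorkflowScore(opts: {
  skillLoaded: boolean;
  skillLoadedWeight: number;
  toolsUsed: string[];
  toolWeights: ToolWeight[];
  triggeredViolations: Violation[];
}): number {
  let score = opts.skillLoaded ? opts.skillLoadedWeight : 0;
  for (const tw of opts.toolWeights) {
    // Credit each weighted pattern that matches a tool the agent called.
    if (opts.toolsUsed.some((t) => t.includes(tw.pattern))) score += tw.weight;
  }
  for (const v of opts.triggeredViolations) score += v.impact; // impacts are negative
  return Math.max(0, Math.min(1, score));
}

// With weights like those in the preset example below:
const score = computeWorkflowScore({
  skillLoaded: true,
  skillLoadedWeight: 0.5,
  toolsUsed: ["my_tool_search"],
  toolWeights: [{ pattern: "my_tool", weight: 0.25 }],
  triggeredViolations: [],
});
console.log(score); // 0.75
```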
```ts
import type { SkillPreset } from "skill-discovery-bench";

const myPreset: SkillPreset = {
  name: "my-skill",
  description: "Custom scoring rules",
  scoring: {
    weights: {
      skillLoaded: 0.50,
      toolsUsed: [
        { pattern: "my_tool", weight: 0.25 },
      ],
    },
    violations: [
      { rule: "must_use_my_tool", impact: -0.20 },
    ],
    thresholds: {
      minWorkflowScore: 0.60,
      maxTokenIncrease: 5000,
      minSolutionsFound: 1,
      scoreTolerance: 0.1,
    },
  },
};
```

Results are saved to `results/{mode}-{timestamp}/` with:

- `run-metrics.json` - Per-run metrics
- `token-metrics.json` - Token usage analysis
- `consistency-report.json` - Jaccard similarity, variance, violations
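For context on the consistency report: Jaccard similarity over the sets of tools called in each run is a standard way to measure cross-run agreement. A minimal sketch of the metric (the run contents are made up, and this is not the tool's actual code):

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|, in [0, 1].
// Two identical tool-call sets score 1; disjoint sets score 0.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // convention: two empty sets agree
  let intersection = 0;
  for (const x of a) if (b.has(x)) intersection++;
  const union = a.size + b.size - intersection;
  return intersection / union;
}

// Hypothetical tool-call sets from two runs:
const run1 = new Set(["grep", "read", "my_tool"]);
const run2 = new Set(["grep", "my_tool", "edit"]);
// intersection = 2 (grep, my_tool), union = 4
console.log(jaccard(run1, run2)); // 0.5
```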
- Promptfoo `tool-call-f1`: F1 scoring for tool selection accuracy
- ToolBench Pass Rate: Completion within N API calls
- Frontmatter Scorer: Score skill metadata quality
- Multi-model Comparison: Run the same query across multiple models
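The planned tool-call F1 metric presumably compares the set of tools the agent called against an expected set, in the standard precision/recall sense. A sketch under that assumption (illustrative names and data, not Promptfoo's implementation):

```typescript
// F1 over tool selections: harmonic mean of precision (fraction of
// called tools that were expected) and recall (fraction of expected
// tools that were called). Assumed shape of the planned metric.
function toolCallF1(expected: string[], actual: string[]): number {
  const exp = new Set(expected);
  const act = new Set(actual);
  let tp = 0;
  for (const t of act) if (exp.has(t)) tp++;
  if (tp === 0) return 0;
  const precision = tp / act.size;
  const recall = tp / exp.size;
  return (2 * precision * recall) / (precision + recall);
}

// One correct tool out of two expected and two called:
// precision = 0.5, recall = 0.5, F1 = 0.5
console.log(toolCallF1(["grep", "read"], ["grep", "edit"])); // 0.5
```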
MIT