# Skill Discovery Bench

A benchmarking tool for evaluating OpenCode skill discoverability and workflow compliance.

## Features

- **Skill Discovery Testing**: Test whether LLMs correctly discover and use skills
- **Multi-Mode Evaluation**: Test explicit (command-triggered) and implicit (autonomous) skill usage
- **Workflow Compliance Scoring**: Measure how well agents follow skill instructions
- **Baseline Comparison**: Track improvements and detect regressions
- **Preset System**: Reusable scoring configurations for different skill types

## Installation

```sh
bun add -D skill-discovery-bench
```

## Quick Start

```sh
# Run evaluation
skill-discovery-bench -s myskill -p ./my-plugin

# Run both modes with multiple runs
skill-discovery-bench -s myskill -p ./my-plugin --mode both --runs 3

# Set baseline
skill-discovery-bench -s myskill -p ./my-plugin --set-baseline
```

## CLI Options

```
Required:
  -s, --skill <name>     Skill name to evaluate
  -p, --plugin <path>    Path to plugin directory

Options:
  -m, --mode <mode>      Test mode: explicit, implicit, or both (default: explicit)
  -r, --runs <n>         Number of test runs (default: 1)
  --model <model>        Model to use (default: pre-configured)
  --preset <name>        Scoring preset to use (default: default)
  --presets              List available presets
  -v, --verbose          Show detailed output
  -b, --set-baseline     Save results as new baseline
  --show-results         Show results from latest run
  -h, --help             Show help
```

## Programmatic API

```ts
import { runTests, getPreset } from "skill-discovery-bench";

const preset = getPreset("intellisearch");

const { metrics, resultsDir } = await runTests({
  config: {
    mode: "explicit",
    runs: 3,
    skill: {
      name: "myskill",
      pluginPath: "/path/to/plugin",
    },
    queryFile: "test-queries/search.md",
    projectDir: process.cwd(),
  },
  presetPatterns: preset?.solutionExtraction?.patterns,
});

console.log(`Workflow Score: ${metrics.workflowScore}`);
console.log(`Skill Loaded: ${metrics.skillLoaded}`);
```
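In CI, the returned metrics can be gated against a preset's `thresholds` block. A minimal sketch, where `meetsThresholds` is a hypothetical helper (not part of the package) and the `solutionsFound` field name is assumed to mirror the `minSolutionsFound` threshold:

```ts
// Hypothetical helper: fail fast when a run falls below preset thresholds.
// Field names mirror the SkillPreset `thresholds` block; `solutionsFound`
// is an assumed metric name, not confirmed by the package's types.
interface Thresholds {
  minWorkflowScore: number;
  minSolutionsFound: number;
}

interface RunMetrics {
  workflowScore: number;
  skillLoaded: boolean;
  solutionsFound: number;
}

function meetsThresholds(m: RunMetrics, t: Thresholds): boolean {
  // A run passes only if the skill loaded and every threshold is met.
  return (
    m.skillLoaded &&
    m.workflowScore >= t.minWorkflowScore &&
    m.solutionsFound >= t.minSolutionsFound
  );
}

console.log(
  meetsThresholds(
    { workflowScore: 0.75, skillLoaded: true, solutionsFound: 2 },
    { minWorkflowScore: 0.6, minSolutionsFound: 1 },
  ),
); // true
```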

## Presets

### Built-in Presets

| Preset | Description |
| --- | --- |
| `default` | Basic scoring for any skill |
| `intellisearch` | Scoring for library/repository discovery skills |

### Creating Custom Presets

```ts
import type { SkillPreset } from "skill-discovery-bench";

const myPreset: SkillPreset = {
  name: "my-skill",
  description: "Custom scoring rules",
  scoring: {
    weights: {
      skillLoaded: 0.50,
      toolsUsed: [
        { pattern: "my_tool", weight: 0.25 },
      ],
    },
    violations: [
      { rule: "must_use_my_tool", impact: -0.20 },
    ],
    thresholds: {
      minWorkflowScore: 0.60,
      maxTokenIncrease: 5000,
      minSolutionsFound: 1,
      scoreTolerance: 0.1,
    },
  },
};
```

## Output

Results are saved to `results/{mode}-{timestamp}/` with:

- `run-metrics.json` - Per-run metrics
- `token-metrics.json` - Token usage analysis
- `consistency-report.json` - Jaccard similarity, variance, violations
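The Jaccard similarity in the consistency report compares runs as sets: identical runs score 1.0, fully disjoint runs 0.0. A rough sketch of the metric over tool names (illustrative only, not the tool's internal implementation):

```ts
// Jaccard similarity: |A ∩ B| / |A ∪ B| over the tool names used in two runs.
function jaccard(runA: string[], runB: string[]): number {
  const a = new Set(runA);
  const b = new Set(runB);
  const intersection = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  // Two empty runs are trivially identical.
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccard(["grep", "read", "my_tool"], ["grep", "my_tool"])); // 2/3
```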

## Future Integrations

- **Promptfoo `tool-call-f1`**: F1 scoring for tool selection accuracy
- **ToolBench Pass Rate**: Completion within N API calls
- **Frontmatter Scorer**: Score skill metadata quality
- **Multi-model Comparison**: Run the same query across multiple models

## License

MIT
