Design, run & learn from hiring case studies — a markdown-native toolkit shipped as a Claude Code plugin. Give it a role, what you want to learn about candidates, and a real challenge from your team; get back a fair, candidate-ready case pack with an evidence-based scoring rubric — and a learning loop that makes the next case better.
| Agent | What it does |
|---|---|
designer |
Scaffolds a brief.md, then turns it into a sanitized candidate case pack (case.md) + a frozen BARS scoring rubric (rubric.md), grounded in the assessment-practices library and your past lessons |
evaluator |
Turns an interviewer's raw notes into an evidence-first scorecard.md — quotes first, then scores, "insufficient evidence" where notes are silent. Also preps calibration across scorecards |
debrief |
After the round: rates how well the case did its job (signal vs. prediction, discrimination, anchor quality) and extracts reusable lessons into lessons-learned/ |
| Skill | What it holds |
|---|---|
assessment-practices |
The extensible library: 8 case formats (work sample, take-home, live problem-solving, role-play, presentation & defense, reverse critique, data interpretation, inbox exercise), 4 scoring methods (BARS, evidence-first scorecard, independent-then-calibrate, signal mapping), 3 principles (structured-over-unstructured, bias interrupters, candidate experience). Add a practice by dropping a markdown file in practices/ |
brief.md ──designer──▶ case.md + rubric.md ──run the sessions──▶ notes
│
lessons-learned/ ◀──debrief── debrief.md ◀──evaluator── scorecard-*.md
└────────────────── feeds the next design ──────────────────┘
# develop / try locally
claude --plugin-dir .
# then, in Claude Code:
# "New case study for a Senior Product Manager" → designer scaffolds the brief
# "Design the case from case-studies/<slug>/brief.md" → case pack + rubric
# "Score this session: <paste notes>" → evaluator
# "Debrief the round for <slug>: <brain-dump>" → debrief + lessonsUser content (case-studies/, lessons-learned/) lives in your workspace, not the plugin.
- Rubrics contain only job-related competencies (max 5), each with behaviorally anchored levels.
- Same case, same scripted probes, same rubric for every candidate in a round.
- Evidence before scores; no evidence → null, never a guessed midpoint.
- Independent scoring before any discussion; dissent recorded, not averaged.
- The agents never infer or use protected characteristics — and flag notes that do.
- Scores support a human decision. They never make it.
MIT