Flux is an evolutionary loop that discovers, evaluates, and refines domain-specific skills for Claude Code — without a human writing each skill by hand and without a labeled benchmark to grade against.
You give it a domain ("React 19 frontend development", "technical writing for API docs", "Unity 6 AR rendering on Android"). It generates its own test challenges, attempts them, grades the results, finds the failures, proposes a skill to close the gap, builds that skill, and only keeps it if it measurably beats what came before. Repeat.
Status: Working v1. Run end-to-end across 5 domains. It is a research-grade system, not a productized tool — the "engine" is a set of agent specifications executed by Claude Code itself, not a standalone runtime. See Honest Limitations.
Claude Code "skills" are Markdown instruction files that make the agent better at a specific kind of work. They're powerful, but authoring them has three problems:
- They're hand-written. Quality depends entirely on the author remembering every relevant pitfall.
- There's no feedback loop. Nothing tells you whether a skill actually makes the agent better, or whether your "improvement" was a regression.
- They go stale. A skill written for React 18 silently misleads on React 19.
Flux attacks all three by treating skill authoring as an optimization problem with a closed evaluation loop.
The hard part of building an evolutionary loop for skills is that there is no ground truth. There's no benchmark dataset for "React 19 frontend skill quality." Flux's central design bet is:
If you can synthesize a fair rubric, you don't need a benchmark.
So Flux manufactures its own training signal:
- A Domain Analyst generates synthetic challenges, each with a weighted scoring rubric, split into train/validation sets.
- An LLM-as-judge Evaluator scores solutions against those rubrics — and is explicitly designed to be skeptical of itself (it penalizes solutions that look written-for-the-grader, and penalizes overclaiming).
Everything else is a fairly standard generate → evaluate → analyze → rebuild loop wrapped around that signal.
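As a rough illustration of that signal, the sketch below scores a solution as a weighted average over rubric criteria and splits challenge IDs 70/30 into train and validation sets. The names `Criterion`, `rubric_score`, and `split_challenges`, the 0..1 score scale, and the seeded shuffle are all assumptions made for this example; Flux's agents are specifications executed by Claude Code, and their real file formats may differ.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical shape): weight plus the judge's score."""
    description: str
    weight: float
    score: float  # assumed scale: 0.0 (missed) to 1.0 (fully met)


def rubric_score(criteria: List[Criterion]) -> float:
    """Weighted average of per-criterion scores, normalized to 0..1."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positive weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


def split_challenges(
    ids: List[str], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then cut into (train, validation)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Example: one heavily weighted criterion met, one light criterion missed.
example = [Criterion("handles errors", 3.0, 1.0), Criterion("adds tests", 1.0, 0.0)]
print(rubric_score(example))  # (3*1.0 + 1*0.0) / 4 = 0.75

train_ids, validation_ids = split_challenges([f"c{i}" for i in range(20)])
print(len(train_ids), len(validation_ids))  # 14 6
```

Keeping the validation IDs out of everything the Proposer and Builder see is what makes the later promote/reject comparison meaningful.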
        ┌─────────────────────────────────────────────┐
        │            /flux evolve "domain"            │
        └──────────────────────┬──────────────────────┘
                               │
                ┌──────────────▼──────────────┐
                │       Domain Analyst        │  once per domain
                │  challenges.json + rubrics  │
                │ (70% train / 30% validation)│
                └──────────────┬──────────────┘
                               │
    ┌──────────────────────────▼──────────────────────────┐
    │           EVOLUTION LOOP (per iteration)            │
    │                                                     │
    │ Executor ─► attempts challenges with current skills │
    │     │                                               │
    │ Evaluator ─► scores vs. rubric, finds failures      │
    │     │                                               │
    │ Proposer ─► clusters failures → ONE proposal        │
    │     │                                               │
    │ Builder ─► materializes candidate SKILL.md          │
    │     │      (in staging/, not yet live)              │
    │     │                                               │
    │ Validate ─► run candidate on HELD-OUT challenges    │
    │     │                                               │
    │ new_score > frontier_score ?                        │
    │     ├── yes ─► PROMOTE (skill goes live)            │
    │     └── no  ─► REJECT (archived with eval data)     │
    └─────────────────────────────────────────────────────┘
Each iteration either advances the frontier or is discarded — the frontier only moves on a strict improvement against challenges the candidate was never trained on.
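The promote/reject rule can be sketched in a few lines. This is a minimal illustration, assuming a single-entry frontier and one scalar validation score per candidate; `Frontier`, `consider`, and the example skill IDs are hypothetical names, not part of Flux.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Frontier:
    """Tracks the single best skill so far (frontier size 1)."""
    score: float  # held-out score of the current best, or of the baseline
    skill_id: Optional[str] = None
    rejected: List[Tuple[str, float]] = field(default_factory=list)

    def consider(self, candidate_id: str, validation_score: float) -> bool:
        """Promote only on a strict improvement over the current frontier."""
        if validation_score > self.score:
            self.skill_id, self.score = candidate_id, validation_score
            return True
        # Ties and regressions are archived with their score, never promoted.
        self.rejected.append((candidate_id, validation_score))
        return False


frontier = Frontier(score=0.80)
print(frontier.consider("skill-v1", 0.86))  # True: strict improvement
print(frontier.consider("skill-v2", 0.86))  # False: a tie is not enough
print(frontier.consider("skill-v3", 0.84))  # False: regression
print(frontier.skill_id, frontier.score)    # skill-v1 0.86
```

Using `>` rather than `>=` is what makes a plateau show up as a run of rejections instead of a series of equal-scoring replacements.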
Flux is built from five single-purpose subagents. Four are restricted to the minimum tools they need (least privilege, declared in each file's tools: frontmatter). The executor is deliberately left unrestricted — it has to attempt arbitrary domain challenges and can't know in advance which tools a given solution will require.
| Agent | Job | Tools |
|---|---|---|
| `flux-domain-analyst` | Research the domain; generate challenges + rubrics | Read, Write, Bash, Glob, Grep, WebSearch, WebFetch |
| `flux-executor` | Attempt challenges as a practitioner; record honest traces | unrestricted (inherits all); intentional, see above |
| `flux-evaluator` | Score solutions against rubrics (LLM-as-judge) | Read, Write, Bash, Glob, Grep |
| `flux-proposer` | Analyze failures → one high-impact proposal | Read, Write, Glob, Grep (read-only on the world) |
| `flux-builder` | Materialize the proposal into a SKILL.md | Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch |
The /flux orchestrator is the state machine that sequences them, manages per-domain state, and makes the promote/reject decision.
Flux carries two kinds of learned knowledge, borrowed from the XSkill paper:
- Skills: task-level Markdown procedures (`SKILL.md`). What the agent loads and follows.
- Experiences: action-level insights (`{condition, action, rationale}`) accumulated in a per-domain experience bank.
The deliberate twist: experiences are not retrieved at inference time as a separate stream. The Builder reads the experience bank and rewrites relevant experiences directly into the skill prose as workflow tips, pitfalls, and error-handling notes. At runtime there's only the skill, but the skill is materially shaped by accumulated experience. This adds no retrieval step or extra cost at runtime, at the price of giving up dynamic cross-skill recombination (a deliberate v1 simplification).
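To make the build-time merge concrete, here is a small sketch under stated assumptions. In Flux the Builder is an LLM agent that rewrites experiences into prose; the template-based `bake_experiences` below is only a stand-in for that step, and the `Experience` class, the section title, and the sample experience are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Experience:
    """One experience-bank entry, mirroring {condition, action, rationale}."""
    condition: str
    action: str
    rationale: str


def bake_experiences(skill_body: str, experiences: List[Experience]) -> str:
    """Append experiences to the skill text as a plain Markdown tip list."""
    if not experiences:
        return skill_body
    tips = [
        f"- When {e.condition}, {e.action}. Why: {e.rationale}."
        for e in experiences
    ]
    return skill_body.rstrip() + "\n\nPitfalls and tips:\n" + "\n".join(tips) + "\n"


bank = [
    Experience(
        condition="the challenge mentions a deprecated API",
        action="check the current release notes before writing code",
        rationale="patterns from an older version can silently mislead",
    )
]
print(bake_experiences("Follow the steps below for every challenge.", bank))
```

The output is a single text artifact, so nothing has to be looked up when the skill is loaded; the cost is that a tip baked into one skill is not visible to another.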
Run across 5 domains spanning code and non-code work:
| Domain | Iterations | Challenges | Skills promoted | Final score |
|---|---|---|---|---|
| React 19 (core) | 10 | 20 | 2 | 0.934 |
| Next.js App Router | 10 | 18 | 2 | 0.886 |
| Technical writing (API docs) | 5 | 20 | 1 | 0.926 |
| Unity 6 AR combat | 5 | 18 | 1 | 0.881 |
| Unity 6 AR / URP (Android) | 5 | 22 | 2 | 0.877 |
| Total | 35 | 98 | 8 | — |
Plateau behavior was observable and correct: the React 19 core domain stopped improving at 0.934 and the loop rejected the next two proposals rather than promoting churn.
Read this honestly: these scores come from a Claude-family LLM judge, not an independent benchmark. They are a meaningful relative signal across iterations of the same domain (did this candidate beat the last frontier on held-out challenges?), not an absolute capability measurement. There is no external-benchmark or no-skills control comparison yet. See Honest Limitations.
| Decision | Why | Tradeoff accepted |
|---|---|---|
| No ground truth — synthesize rubrics | No benchmark exists for arbitrary domains | Evaluation quality depends on rubric quality |
| LLM-as-judge with anti-gaming penalties | A naive judge starts rewarding answers written for the grader within ~2 iterations | Judge isn't independently calibrated against humans |
| Strict frontier (promote only on improvement) | Prevents quality churn / regression | Greedy; can get stuck in local optima (frontier size = 1 in v1) |
| Held-out validation set | Stops the loop from overfitting to training challenges | Fewer challenges per role |
| Least-privilege agents (4 of 5) | Smaller blast radius; clearer contracts | The executor stays unrestricted by necessity |
| Spec, not a runtime | Installs in seconds; lives natively in Claude Code | Every iteration is interactive and token-heavy; no unattended CI runs |
git clone https://github.com/brentmercado/flux-skill-evolution.git
cd flux-skill-evolution
bash install.sh # copies the skill + agents into ~/.claude/
Then, inside Claude Code:
/flux evolve "React 19 frontend development" --iterations 5
/flux evolve "Python data pipelines" --threshold 0.85
/flux domains # list everything you've evolved
/flux inspect react-19-frontend
Evolved skills land under ~/.claude/skills/{domain}/ and become available to Claude Code globally.
Flux loosely synthesizes ideas from two papers — it does not reimplement either faithfully or reproduce their experiments. See docs/CITATIONS.md for full references.
- EvoSkill — the executor → proposer → builder loop, a greedy frontier, and a persistent proposal history to avoid retreading rejected ideas.
- XSkill — the dual-stream (skills + experiences) knowledge model.
What's original to Flux: operating with no benchmark (the Domain Analyst + LLM-as-judge replace the dataset), the anti-gaming evaluator, and baking experiences into skill prose at build time rather than retrieving them at inference time.
- Scores are judge-relative, not benchmarked. No external benchmark, no no-skills control run.
- Auto-promotion to the global skills directory is not fully wired — in practice promoted skills are reviewed in staging before going live.
- Greedy frontier (size 1). No multi-program search, no parallel rollouts.
- Experience recombination is static — merged at build time, not retrieved dynamically.
- Version lifecycle is schema-only. Skills carry version metadata (`domain_snapshot`, `valid_when`, `superseded_by`) but nothing yet detects staleness or auto-supersedes.
- It's a spec executed by Claude Code, not a standalone program. A thin SDK-driven runtime is the obvious v2.
- Multi-program frontier (keep N candidates, not 1)
- Parallel rollouts per iteration
- Embedding-based experience retrieval at inference time
- Automatic staleness detection from version metadata
- A real runtime (Claude Agent SDK) for unattended / CI runs
- A `feedback` command to fold real-usage signals back into evolution
MIT — see LICENSE.