Flux — Autonomous Skill Evolution for Claude Code

Flux is an evolutionary loop that discovers, evaluates, and refines domain-specific skills for Claude Code — without a human writing each skill by hand and without a labeled benchmark to grade against.

You give it a domain ("React 19 frontend development", "technical writing for API docs", "Unity 6 AR rendering on Android"). It generates its own test challenges, attempts them, grades the results, finds the failures, proposes a skill to close the gap, builds that skill, and only keeps it if it measurably beats what came before. Repeat.

Status: Working v1, run end-to-end across 5 domains. It is a research-grade system, not a productized tool: the "engine" is a set of agent specifications executed by Claude Code itself, not a standalone runtime. See Honest limitations.

The problem

Claude Code "skills" are Markdown instruction files that make the agent better at a specific kind of work. They're powerful, but authoring them has three problems:

They're hand-written. Quality depends entirely on the author remembering every relevant pitfall.
There's no feedback loop. Nothing tells you whether a skill actually makes the agent better, or whether your "improvement" was a regression.
They go stale. A skill written for React 18 silently misleads on React 19.

Flux attacks all three by treating skill authoring as an optimization problem with a closed evaluation loop.

The core idea

The hard part of building an evolutionary loop for skills is that there is no ground truth. There's no benchmark dataset for "React 19 frontend skill quality." Flux's central design bet is:

If you can synthesize a fair rubric, you don't need a benchmark.

So Flux manufactures its own training signal:

A Domain Analyst generates synthetic challenges, each with a weighted scoring rubric, split into train/validation sets.
An LLM-as-judge Evaluator scores solutions against those rubrics — and is explicitly designed to be skeptical of itself (it penalizes solutions that look written-for-the-grader, and penalizes overclaiming).

Everything else is a fairly standard generate → evaluate → analyze → rebuild loop wrapped around that signal.
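
To make the "weighted scoring rubric" concrete, here is a minimal sketch of how a rubric score could be computed. It is an illustration, not Flux's actual evaluator: the real judge is an LLM agent. The names `Criterion` and `rubric_score`, the example criteria, and the flat-deduction penalty model are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance within the rubric


def rubric_score(criteria, marks, penalties=()):
    """Weighted mean of per-criterion marks in [0, 1], minus penalties.

    `marks` maps criterion name -> the judge's mark in [0, 1].
    `penalties` are hypothetical flat deductions, e.g. for overclaiming
    or for answers that look written for the grader.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    weighted = sum(c.weight * marks.get(c.name, 0.0) for c in criteria)
    score = weighted / total_weight - sum(penalties)
    return max(0.0, min(1.0, score))  # clamp to [0, 1]


# Invented example rubric for a single challenge.
rubric = [
    Criterion("correctness", 0.5),
    Criterion("idiomatic_api_use", 0.3),
    Criterion("explains_tradeoffs", 0.2),
]
marks = {"correctness": 1.0, "idiomatic_api_use": 0.5, "explains_tradeoffs": 0.0}

clean = rubric_score(rubric, marks)
penalized = rubric_score(rubric, marks, penalties=[0.1])
print(clean, penalized)
```

A criterion the judge did not mark counts as 0, so a missing mark can only lower the score.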

How it works

                 ┌─────────────────────────────────────────────┐
                 │              /flux evolve "domain"            │
                 └─────────────────────┬───────────────────────┘
                                       │
                        ┌──────────────▼──────────────┐
                        │      Domain Analyst          │  once per domain
                        │  challenges.json + rubrics   │
                        │  (70% train / 30% validation)│
                        └──────────────┬───────────────┘
                                       │
            ┌──────────────────────────▼──────────────────────────┐
            │                  EVOLUTION LOOP (per iteration)       │
            │                                                       │
            │   Executor ─► attempts challenges with current skills │
            │       │                                               │
            │   Evaluator ─► scores vs. rubric, finds failures      │
            │       │                                               │
            │   Proposer ─► clusters failures → ONE proposal        │
            │       │                                               │
            │   Builder  ─► materializes candidate SKILL.md         │
            │       │         (in staging/, not yet live)           │
            │       │                                               │
            │   Validate ─► run candidate on HELD-OUT challenges    │
            │       │                                               │
            │   new_score > frontier_score ?                        │
            │       ├── yes ─► PROMOTE  (skill goes live)           │
            │       └── no  ─► REJECT   (archived with eval data)   │
            └───────────────────────────────────────────────────────┘

Each iteration either advances the frontier or is discarded — the frontier only moves on a strict improvement against challenges the candidate was never trained on.
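
The promote/reject rule is small enough to sketch directly. This is a hedged illustration of a size-1 greedy frontier with strict improvement, not the orchestrator's real code (which is an agent specification). The `Frontier` class, its field names, and the example scores are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frontier:
    score: float                      # best held-out score so far
    skill: Optional[str] = None       # currently live skill, if any
    rejected: list = field(default_factory=list)  # archive with eval data

    def consider(self, candidate: str, validation_score: float) -> str:
        """Promote only on a strict improvement over the frontier."""
        if validation_score > self.score:
            self.skill, self.score = candidate, validation_score
            return "PROMOTE"
        # Ties and regressions are both rejected and archived.
        self.rejected.append((candidate, validation_score))
        return "REJECT"


# Invented scores: one promotion, then a tie and a regression.
frontier = Frontier(score=0.80)
decisions = [
    frontier.consider("skill-v1", 0.85),
    frontier.consider("skill-v2", 0.85),
    frontier.consider("skill-v3", 0.79),
]
print(decisions, frontier.skill, frontier.score)
```

Because the comparison is `>` rather than `>=`, a candidate that merely matches the frontier is discarded, which is what keeps a plateau from turning into churn.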

The five agents

Flux is built from five single-purpose subagents. Four are restricted to the minimum tools they need (least privilege, declared in each file's tools: frontmatter). The executor is deliberately left unrestricted — it has to attempt arbitrary domain challenges and can't know in advance which tools a given solution will require.

| Agent | Job | Tools |
| --- | --- | --- |
| `flux-domain-analyst` | Research the domain; generate challenges + rubrics | Read, Write, Bash, Glob, Grep, WebSearch, WebFetch |
| `flux-executor` | Attempt challenges as a practitioner; record honest traces | unrestricted (inherits all); intentional, see above |
| `flux-evaluator` | Score solutions against rubrics (LLM-as-judge) | Read, Write, Bash, Glob, Grep |
| `flux-proposer` | Analyze failures → one high-impact proposal | Read, Write, Glob, Grep (read-only on the world) |
| `flux-builder` | Materialize the proposal into a SKILL.md | Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch |

The /flux orchestrator is the state machine that sequences them, manages per-domain state, and makes the promote/reject decision.
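
The tool table can be read as a set of allow-lists. The sketch below models it in Python purely for illustration: in Flux the restriction lives in each agent file's `tools:` frontmatter and is enforced by Claude Code, not by code like this. `AGENT_TOOLS` and `may_use` are hypothetical names, and `None` is an assumed convention for "unrestricted".

```python
# Allow-lists transcribed from the table above; None means "inherits all tools".
AGENT_TOOLS = {
    "flux-domain-analyst": {"Read", "Write", "Bash", "Glob", "Grep", "WebSearch", "WebFetch"},
    "flux-executor": None,
    "flux-evaluator": {"Read", "Write", "Bash", "Glob", "Grep"},
    "flux-proposer": {"Read", "Write", "Glob", "Grep"},
    "flux-builder": {"Read", "Write", "Edit", "Bash", "Glob", "Grep", "WebSearch", "WebFetch"},
}


def may_use(agent: str, tool: str) -> bool:
    """Return True if the agent's allow-list permits the tool."""
    allowed = AGENT_TOOLS[agent]
    return allowed is None or tool in allowed


restricted = [name for name, tools in AGENT_TOOLS.items() if tools is not None]
print(len(restricted), may_use("flux-proposer", "Bash"), may_use("flux-executor", "Bash"))
```

Read this way, the proposer is the narrowest agent: it can write its proposal but cannot run commands or reach the network.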

Dual-stream knowledge: skills + experiences

Flux carries two kinds of learned knowledge, borrowed from the XSkill paper:

Skills — task-level Markdown procedures (SKILL.md). What the agent loads and follows.
Experiences — action-level insights ({condition, action, rationale}) accumulated in a per-domain experience bank.

The deliberate twist: experiences are not retrieved at inference time as a separate stream. The Builder reads the experience bank and rewrites relevant experiences directly into the skill prose as workflow tips, pitfalls, and error-handling notes. At runtime there is only the skill, but that skill is materially shaped by accumulated experience. This adds no retrieval step (and so no retrieval cost) at runtime, at the price of giving up dynamic cross-skill recombination (a deliberate v1 simplification).
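
As a rough picture of "baking" experiences into a skill at build time, the sketch below appends `{condition, action, rationale}` records to a skill body as plain bullet points. It is a mechanical stand-in: the real Builder is an LLM agent that rewrites prose, not a template. `Experience`, `bake_experiences`, the section heading, and the sample content are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    condition: str  # when the insight applies
    action: str     # what to do
    rationale: str  # why it helps


def bake_experiences(skill_body, experiences, heading="Pitfalls and tips"):
    """Return the skill text with experiences folded in as bullet points."""
    if not experiences:
        return skill_body
    lines = [skill_body.rstrip(), "", heading, ""]
    for e in experiences:
        lines.append(f"- When {e.condition}, {e.action} ({e.rationale}).")
    return "\n".join(lines) + "\n"


# Invented sample data.
bank = [
    Experience(
        condition="a form submission can fail",
        action="render the error next to the field that caused it",
        rationale="users need to see what to fix",
    ),
]
skill = bake_experiences("Build forms step by step.\n", bank)
print(skill)
```

After this step nothing outside the skill text is needed at runtime, which is the tradeoff described above: no retrieval, but also no per-task recombination.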

Results

Run across 5 domains spanning code and non-code work:

| Domain | Iterations | Challenges | Skills promoted | Final score |
| --- | --- | --- | --- | --- |
| React 19 (core) | 10 | 20 | 2 | 0.934 |
| Next.js App Router | 10 | 18 | 2 | 0.886 |
| Technical writing (API docs) | 5 | 20 | 1 | 0.926 |
| Unity 6 AR combat | 5 | 18 | 1 | 0.881 |
| Unity 6 AR / URP (Android) | 5 | 22 | 2 | 0.877 |
| Total | 35 | 98 | 8 | n/a |

Plateau behavior was observable and correct: the React 19 core domain stopped improving at 0.934 and the loop rejected the next two proposals rather than promoting churn.

Read this honestly: these scores come from a Claude-family LLM judge, not an independent benchmark. They are a meaningful relative signal across iterations of the same domain (did this candidate beat the last frontier on held-out challenges?), not an absolute capability measurement. There is no comparison yet against an external benchmark or a no-skills control run. See Honest limitations.

Design decisions & tradeoffs

| Decision | Why | Tradeoff accepted |
| --- | --- | --- |
| No ground truth: synthesize rubrics | No benchmark exists for arbitrary domains | Evaluation quality depends on rubric quality |
| LLM-as-judge with anti-gaming penalties | The naive judge rewards test-prep answers within ~2 iterations | Judge isn't independently calibrated against humans |
| Strict frontier (promote only on improvement) | Prevents quality churn / regression | Greedy; can get stuck in local optima (frontier size = 1 in v1) |
| Held-out validation set | Stops the loop from overfitting to training challenges | Fewer challenges per role |
| Least-privilege agents (4 of 5) | Smaller blast radius; clearer contracts | The executor stays unrestricted by necessity |
| Spec, not a runtime | Installs in seconds; lives natively in Claude Code | Every iteration is interactive and token-heavy; no unattended CI runs |
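
The held-out validation set comes from the 70% train / 30% validation split made by the Domain Analyst. How challenges are actually assigned to each set is not specified here, so the following is only one plausible sketch: a seeded shuffle followed by a cut. `split_challenges`, its parameters, and the challenge IDs are assumptions.

```python
import random


def split_challenges(challenges, train_fraction=0.7, seed=0):
    """Deterministically split challenges into (train, validation)."""
    shuffled = list(challenges)            # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Invented IDs for a 20-challenge domain.
challenge_ids = [f"challenge-{i:02d}" for i in range(20)]
train, validation = split_challenges(challenge_ids)
print(len(train), len(validation))
```

With 20 challenges this leaves 6 for validation, which is the "fewer challenges per role" tradeoff from the table in concrete terms.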

Install & usage

git clone https://github.com/brentmercado/flux-skill-evolution.git
cd flux-skill-evolution
bash install.sh          # copies the skill + agents into ~/.claude/

Then, inside Claude Code:

/flux evolve "React 19 frontend development" --iterations 5
/flux evolve "Python data pipelines" --threshold 0.85
/flux domains            # list everything you've evolved
/flux inspect react-19-frontend

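Evolved skills are meant to land under ~/.claude/skills/{domain}/, where they become available to Claude Code globally. Auto-promotion into that directory is not fully wired yet, so in practice promoted skills are reviewed in staging first (see Honest limitations).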

Research lineage

Flux loosely synthesizes ideas from two papers — it does not reimplement either faithfully or reproduce their experiments. See docs/CITATIONS.md for full references.

EvoSkill — the executor → proposer → builder loop, a greedy frontier, and a persistent proposal history to avoid retreading rejected ideas.
XSkill — the dual-stream (skills + experiences) knowledge model.

What's original to Flux: operating with no benchmark (the Domain Analyst + LLM-as-judge replace the dataset), the anti-gaming evaluator, and baking experiences into skill prose at build time rather than retrieving them at inference time.

Honest limitations

Scores are judge-relative, not benchmarked. No external benchmark, no no-skills control run.
Auto-promotion to the global skills directory is not fully wired — in practice promoted skills are reviewed in staging before going live.
Greedy frontier (size 1). No multi-program search, no parallel rollouts.
Experience recombination is static — merged at build time, not retrieved dynamically.
Version lifecycle is schema-only. Skills carry version metadata (domain_snapshot, valid_when, superseded_by) but nothing yet detects staleness or auto-supersedes.
It's a spec executed by Claude Code, not a standalone program. A thin SDK-driven runtime is the obvious v2.
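
For the version-lifecycle point, the sketch below shows what the metadata could look like as a typed record, plus the kind of staleness check that does not exist yet. Only the three field names (`domain_snapshot`, `valid_when`, `superseded_by`) come from Flux. The field types, the `SkillVersion` class, the `is_stale` rule, and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkillVersion:
    domain_snapshot: str                 # assumed: domain version the skill targets
    valid_when: str                      # assumed: free-text applicability condition
    superseded_by: Optional[str] = None  # assumed: name of a newer skill, if any


def is_stale(meta: SkillVersion, current_snapshot: str) -> bool:
    """Hypothetical rule: stale if superseded or the domain has moved on."""
    return meta.superseded_by is not None or meta.domain_snapshot != current_snapshot


# Invented sample values.
meta = SkillVersion(domain_snapshot="React 18", valid_when="project uses React 18")
print(is_stale(meta, "React 19"), is_stale(meta, "React 18"))
```

A real detector would need a trustworthy source for the current snapshot, which is part of why this remains a roadmap item.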

Roadmap

Multi-program frontier (keep N candidates, not 1)
Parallel rollouts per iteration
Embedding-based experience retrieval at inference time
Automatic staleness detection from version metadata
A real runtime (Claude Agent SDK) for unattended / CI runs
A feedback command to fold real-usage signals back into evolution

License

MIT — see LICENSE.
