brentmercado/flux-skill-evolution

Flux — Autonomous Skill Evolution for Claude Code

Flux is an evolutionary loop that discovers, evaluates, and refines domain-specific skills for Claude Code — without a human writing each skill by hand and without a labeled benchmark to grade against.

You give it a domain ("React 19 frontend development", "technical writing for API docs", "Unity 6 AR rendering on Android"). It generates its own test challenges, attempts them, grades the results, finds the failures, proposes a skill to close the gap, builds that skill, and only keeps it if it measurably beats what came before. Repeat.

Status: Working v1, run end-to-end across 5 domains. It is a research-grade system, not a productized tool: the "engine" is a set of agent specifications executed by Claude Code itself, not a standalone runtime. See Honest limitations.


The problem

Claude Code "skills" are Markdown instruction files that make the agent better at a specific kind of work. They're powerful, but authoring them has three problems:

  1. They're hand-written. Quality depends entirely on the author remembering every relevant pitfall.
  2. There's no feedback loop. Nothing tells you whether a skill actually makes the agent better, or whether your "improvement" was a regression.
  3. They go stale. A skill written for React 18 silently misleads on React 19.

Flux attacks all three by treating skill authoring as an optimization problem with a closed evaluation loop.

The core idea

The hard part of building an evolutionary loop for skills is that there is no ground truth. There's no benchmark dataset for "React 19 frontend skill quality." Flux's central design bet is:

If you can synthesize a fair rubric, you don't need a benchmark.

So Flux manufactures its own training signal:

  • A Domain Analyst generates synthetic challenges, each with a weighted scoring rubric, split into train/validation sets.
  • An LLM-as-judge Evaluator scores solutions against those rubrics — and is explicitly designed to be skeptical of itself (it penalizes solutions that look written-for-the-grader, and penalizes overclaiming).

Everything else is a fairly standard generate → evaluate → analyze → rebuild loop wrapped around that signal.
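As a minimal sketch of that signal, the snippet below scores one solution against a weighted rubric and splits a challenge pool 70/30. The names `Criterion`, `rubric_score`, and `split_challenges`, the flat `penalty` argument, and the example criteria are illustrative assumptions, not Flux's actual schema or the Evaluator's real scoring rule.

```python
import random
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float  # relative importance within the rubric (assumed field)


def rubric_score(criteria, marks, penalty=0.0):
    """Weighted mean of per-criterion marks in [0, 1], minus a judge penalty.

    `penalty` stands in for the anti-gaming deductions (grader-targeted or
    overclaiming solutions). How Flux sizes it is not specified; this is a
    placeholder.
    """
    total_weight = sum(c.weight for c in criteria)
    raw = sum(c.weight * marks[c.name] for c in criteria) / total_weight
    return max(0.0, raw - penalty)


def split_challenges(challenges, train_fraction=0.7, seed=0):
    """Shuffle deterministically, then split into (train, validation)."""
    shuffled = list(challenges)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rubric and marks for a single solution.
criteria = [Criterion("correctness", 3.0), Criterion("idiomatic_api_use", 1.0)]
marks = {"correctness": 1.0, "idiomatic_api_use": 0.5}
score = rubric_score(criteria, marks, penalty=0.1)

# 20 challenge ids split into 14 train and 6 validation.
train, validation = split_challenges(range(20))
```

Fixing the shuffle seed keeps the validation set stable across iterations, which is what makes scores from different iterations comparable.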

How it works

                 ┌─────────────────────────────────────────────┐
                 │              /flux evolve "domain"            │
                 └─────────────────────┬───────────────────────┘
                                       │
                        ┌──────────────▼──────────────┐
                        │      Domain Analyst          │  once per domain
                        │  challenges.json + rubrics   │
                        │  (70% train / 30% validation)│
                        └──────────────┬───────────────┘
                                       │
            ┌──────────────────────────▼──────────────────────────┐
            │                  EVOLUTION LOOP (per iteration)       │
            │                                                       │
            │   Executor ─► attempts challenges with current skills │
            │       │                                               │
            │   Evaluator ─► scores vs. rubric, finds failures      │
            │       │                                               │
            │   Proposer ─► clusters failures → ONE proposal        │
            │       │                                               │
            │   Builder  ─► materializes candidate SKILL.md         │
            │       │         (in staging/, not yet live)           │
            │       │                                               │
            │   Validate ─► run candidate on HELD-OUT challenges    │
            │       │                                               │
            │   new_score > frontier_score ?                        │
            │       ├── yes ─► PROMOTE  (skill goes live)           │
            │       └── no  ─► REJECT   (archived with eval data)   │
            └───────────────────────────────────────────────────────┘

Each iteration either advances the frontier or is rejected and archived: the frontier moves only on a strict improvement, measured on held-out challenges the candidate was never trained on.
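The promote/reject rule can be sketched in a few lines. This is an illustration under assumptions, not the orchestrator's code: `evolve` and its callback parameters are hypothetical names, `propose` stands in for the Executor, Evaluator, and Proposer steps on the training challenges, and the validation scores are stubbed.

```python
def evolve(frontier_skill, frontier_score, iterations, propose, build, validate):
    """Greedy loop with a frontier of one: keep a candidate only if it
    strictly beats the current frontier score on held-out challenges."""
    history = []
    for _ in range(iterations):
        proposal = propose(frontier_skill, history)
        candidate = build(frontier_skill, proposal)  # staged, not yet live
        score = validate(candidate)                  # held-out challenges only
        promoted = score > frontier_score            # a tie is a rejection
        history.append({"proposal": proposal, "score": score, "promoted": promoted})
        if promoted:
            frontier_skill, frontier_score = candidate, score
    return frontier_skill, frontier_score, history


# Stubbed run: four iterations with made-up validation scores.
scores = iter([0.80, 0.78, 0.85, 0.85])
skill, best, history = evolve(
    frontier_skill="v0",
    frontier_score=0.75,
    iterations=4,
    propose=lambda current, hist: f"proposal-{len(hist) + 1}",
    build=lambda current, proposal: f"{current}+{proposal}",
    validate=lambda candidate: next(scores),
)
promotions = [entry["promoted"] for entry in history]
```

In this stubbed run the second and fourth candidates are rejected. The fourth only ties the frontier at 0.85, which a strict comparison treats as no improvement.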

The five agents

Flux is built from five single-purpose subagents. Four are restricted to the minimum tools they need (least privilege, declared in each file's tools: frontmatter). The executor is deliberately left unrestricted — it has to attempt arbitrary domain challenges and can't know in advance which tools a given solution will require.

| Agent | Job | Tools |
|---|---|---|
| flux-domain-analyst | Research the domain; generate challenges + rubrics | Read, Write, Bash, Glob, Grep, WebSearch, WebFetch |
| flux-executor | Attempt challenges as a practitioner; record honest traces | unrestricted (inherits all); intentional, see above |
| flux-evaluator | Score solutions against rubrics (LLM-as-judge) | Read, Write, Bash, Glob, Grep |
| flux-proposer | Analyze failures → one high-impact proposal | Read, Write, Glob, Grep (read-only on the world) |
| flux-builder | Materialize the proposal into a SKILL.md | Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch |

The /flux orchestrator is the state machine that sequences them, manages per-domain state, and makes the promote/reject decision.
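To make the least-privilege split concrete, the allowlists from the agent list above can be written out as data. This is only an illustrative model: in Flux the restrictions live in each agent file's tools: frontmatter, and `AGENT_TOOLS` and `may_use` are hypothetical names that do not exist in the repository.

```python
# Tool allowlists transcribed from the agent list above.
# None models "unrestricted (inherits all)".
AGENT_TOOLS = {
    "flux-domain-analyst": {"Read", "Write", "Bash", "Glob", "Grep", "WebSearch", "WebFetch"},
    "flux-executor": None,
    "flux-evaluator": {"Read", "Write", "Bash", "Glob", "Grep"},
    "flux-proposer": {"Read", "Write", "Glob", "Grep"},
    "flux-builder": {"Read", "Write", "Edit", "Bash", "Glob", "Grep", "WebSearch", "WebFetch"},
}


def may_use(agent, tool):
    """Return True if the agent's allowlist permits the tool."""
    allowed = AGENT_TOOLS[agent]
    return allowed is None or tool in allowed


# The four restricted agents, i.e. everything except the executor.
restricted = [name for name, tools in AGENT_TOOLS.items() if tools is not None]
```

Laid out this way, the proposer has neither shell nor web access, and only the analyst and builder can reach the web.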

Dual-stream knowledge: skills + experiences

Flux carries two kinds of learned knowledge, borrowed from the XSkill paper:

  • Skills — task-level Markdown procedures (SKILL.md). What the agent loads and follows.
  • Experiences — action-level insights ({condition, action, rationale}) accumulated in a per-domain experience bank.

The deliberate twist: experiences are not retrieved at inference time as a separate stream. The Builder reads the experience bank and rewrites relevant experiences directly into the skill prose as workflow tips, pitfalls, and error-handling notes. At runtime there's only the skill, but that skill is materially shaped by accumulated experience. This keeps runtime retrieval cost at zero, at the price of giving up dynamic cross-skill recombination (a deliberate v1 simplification).
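A rough sketch of the build-time merge follows. It assumes each experience is a plain {condition, action, rationale} record and replaces the Builder's LLM rewrite with a deterministic template. `Experience`, `render_pitfalls`, `bake_into_skill`, and the sample experience are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    condition: str
    action: str
    rationale: str


def render_pitfalls(experiences):
    """Turn experience records into a plain-text pitfalls section."""
    lines = ["Pitfalls and tips", ""]
    for exp in experiences:
        lines.append(f"- When {exp.condition}: {exp.action} ({exp.rationale})")
    return "\n".join(lines)


def bake_into_skill(skill_body, experiences):
    """Append the rendered experiences to the skill text at build time,
    so nothing has to be retrieved separately at runtime."""
    if not experiences:
        return skill_body
    return skill_body.rstrip() + "\n\n" + render_pitfalls(experiences) + "\n"


# Made-up experience, for illustration only.
bank = [
    Experience(
        condition="the challenge names a target library version",
        action="check each API against that version before writing code",
        rationale="stale API usage was a recurring failure",
    )
]
skill_text = bake_into_skill("Follow the workflow below.\n", bank)
```

The real Builder rewrites experiences into the surrounding prose rather than appending a fixed section. The template here only shows the direction of data flow, from the experience bank into the skill text.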

Results

Run across 5 domains spanning code and non-code work:

| Domain | Iterations | Challenges | Skills promoted | Final score |
|---|---|---|---|---|
| React 19 (core) | 10 | 20 | 2 | 0.934 |
| Next.js App Router | 10 | 18 | 2 | 0.886 |
| Technical writing (API docs) | 5 | 20 | 1 | 0.926 |
| Unity 6 AR combat | 5 | 18 | 1 | 0.881 |
| Unity 6 AR / URP (Android) | 5 | 22 | 2 | 0.877 |
| Total | 35 | 98 | 8 | |

Plateau behavior was observable and correct: the React 19 core domain stopped improving at 0.934 and the loop rejected the next two proposals rather than promoting churn.

Read this honestly: these scores come from a Claude-family LLM judge, not an independent benchmark. They are a meaningful relative signal across iterations of the same domain (did this candidate beat the last frontier on held-out challenges?), not an absolute capability measurement. There is not yet a comparison against an external benchmark or against a no-skills control run. See Honest limitations.

Design decisions & tradeoffs

| Decision | Why | Tradeoff accepted |
|---|---|---|
| No ground truth: synthesize rubrics | No benchmark exists for arbitrary domains | Evaluation quality depends on rubric quality |
| LLM-as-judge with anti-gaming penalties | The naive judge rewards test-prep answers within ~2 iterations | Judge isn't independently calibrated against humans |
| Strict frontier (promote only on improvement) | Prevents quality churn / regression | Greedy; can get stuck in local optima (frontier size = 1 in v1) |
| Held-out validation set | Stops the loop from overfitting to training challenges | Fewer challenges per role |
| Least-privilege agents (4 of 5) | Smaller blast radius; clearer contracts | The executor stays unrestricted by necessity |
| Spec, not a runtime | Installs in seconds; lives natively in Claude Code | Every iteration is interactive and token-heavy; no unattended CI runs |

Install & usage

    git clone https://github.com/brentmercado/flux-skill-evolution.git
    cd flux-skill-evolution
    bash install.sh          # copies the skill + agents into ~/.claude/

Then, inside Claude Code:

    /flux evolve "React 19 frontend development" --iterations 5
    /flux evolve "Python data pipelines" --threshold 0.85
    /flux domains            # list everything you've evolved
    /flux inspect react-19-frontend

Evolved skills land under ~/.claude/skills/{domain}/ and become available to Claude Code globally.

Research lineage

Flux loosely synthesizes ideas from two papers — it does not reimplement either faithfully or reproduce their experiments. See docs/CITATIONS.md for full references.

  • EvoSkill — the executor → proposer → builder loop, a greedy frontier, and a persistent proposal history to avoid retreading rejected ideas.
  • XSkill — the dual-stream (skills + experiences) knowledge model.

What's original to Flux: operating with no benchmark (the Domain Analyst + LLM-as-judge replace the dataset), the anti-gaming evaluator, and baking experiences into skill prose at build time rather than retrieving them at inference time.

Honest limitations

  • Scores are judge-relative, not benchmarked. No external benchmark, no no-skills control run.
  • Auto-promotion to the global skills directory is not fully wired — in practice promoted skills are reviewed in staging before going live.
  • Greedy frontier (size 1). No multi-program search, no parallel rollouts.
  • Experience recombination is static — merged at build time, not retrieved dynamically.
  • Version lifecycle is schema-only. Skills carry version metadata (domain_snapshot, valid_when, superseded_by) but nothing yet detects staleness or auto-supersedes.
  • It's a spec executed by Claude Code, not a standalone program. A thin SDK-driven runtime is the obvious v2.

Roadmap

  • Multi-program frontier (keep N candidates, not 1)
  • Parallel rollouts per iteration
  • Embedding-based experience retrieval at inference time
  • Automatic staleness detection from version metadata
  • A real runtime (Claude Agent SDK) for unattended / CI runs
  • A feedback command to fold real-usage signals back into evolution

License

MIT — see LICENSE.
