1. Integration tests for Stages 4-6:
- Stage 4: validateAndDeploy path (move proposal → validate → deploy).
Exercises the EXACT code path where the _skillAction scoping bug lived.
Would have caught that critical bug before any manual review.
- Stage 5: Activation tracking (recordActivation, getSkillStats,
listActiveSkills, getSkillMaturity). Verifies the health pipeline
records and reports correctly.
- Stage 6: Evolution trigger path (checkMilestone, executeEvolve,
isShortCircuitCandidate). Verifies the evolution modules connect
to deployed skills without crashing.
Full lifecycle: trace → analyze → propose → deploy → activate → evolve.
2. PHILOSOPHY.md: Mission statement on human-in-the-loop design.
Why nothing auto-deploys. Why validation before deployment.
Why milestone-based evolution. Why format types, not channel names.
Why provider agnosticism. Why research grounding.
Written as a standalone document, not a competitor comparison.
3. Multi-agent shared skills: ACEFORGE_SHARED_SKILLS=true deploys
approved skills to ~/.openclaw/skills/ (visible to ALL agents on
the same machine) in addition to the per-workspace copy. Follows
OpenClaw's native skill precedence: workspace > shared > bundled.
4. Composition execution bridge: proposeCompositionSkills() converts
co-activation detections into actual workflow skill proposals via
generateWorkflowSkillWithLLm. Wired into the agent_end Phase 2
cycle. Closes the gap between "these skills activate together"
and "here's a workflow that combines them." (Sketched below.)
AceForge generates skills. It proposes them. It validates them against 23 attack patterns. It scores them on structural quality and trace coverage. It even has an LLM judge evaluate borderline cases.
But it never deploys a skill without your explicit approval.
This is not a limitation — it is the core design constraint. Every other decision in AceForge follows from this one.
## Why Human-in-the-Loop
The [ClawHavoc campaign](https://www.antiy.net/p/clawhavoc-analysis-of-large-scale-poisoning-campaign-targeting-the-openclaw-skill-market-for-ai-agents/) distributed 1,184 malicious skills through ClawHub. Security researchers found that 20% of skills on the registry contained malicious payloads — reverse shells, credential exfiltration, prompt injection. These skills passed basic checks. They looked legitimate. They had reasonable names and descriptions.
An auto-deploying system would have installed them.
AceForge's position: **the person running the agent is the final authority on what that agent learns.** Skills generated from trace data are proposals, not mandates. The `/forge preview` command exists so you can read what a skill teaches in plain language before deciding. The `/forge quality` command exists so you can see the structural score. The unified diff in `/forge evolve` exists so you can see exactly what changed, line by line.
Auto-deployment optimizes for speed. Human approval optimizes for trust. We chose trust.
## Why Validation Before Deployment
Every skill passes through a security validator before it can be deployed:
- **Credential scanning** — API keys, tokens, passwords in skill text
- **Path traversal** — attempts to read `~/.ssh`, `/etc/shadow`, or escape the workspace
- **Shell history access** — attempts to read `.bash_history` or `.zsh_history`
- **SOUL.md injection** — attempts to override the agent's identity
- **23 adversarial mutations** — the test suite generates known-bad skills and verifies they're caught
This validation runs on every skill — LLM-generated, manually proposed, or upgraded. If a skill fails security validation, it's blocked and the user is told exactly why. No silent failures, no "warnings" that get ignored.
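As a rough illustration of how checks like these can be expressed (the rule set and function below are illustrative assumptions, not AceForge's actual validator):

```typescript
// Illustrative only: these regexes mirror the categories listed above,
// not the real validator's rules.
const SECURITY_PATTERNS: Array<{ name: string; pattern: RegExp }> = [
  { name: "credential leak", pattern: /(api[_-]?key|token|password)\s*[:=]\s*\S+/i },
  { name: "path traversal", pattern: /~\/\.ssh|\/etc\/shadow|\.\.\// },
  { name: "shell history access", pattern: /\.(bash|zsh)_history/ },
  { name: "SOUL.md override", pattern: /SOUL\.md/i },
];

function validateSkillText(skillText: string): { ok: boolean; reasons: string[] } {
  const reasons = SECURITY_PATTERNS
    .filter(({ pattern }) => pattern.test(skillText))
    .map(({ name }) => `blocked: matched ${name} pattern`);
  // Fail closed: any match blocks deployment, and every reason is reported.
  return { ok: reasons.length === 0, reasons };
}
```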
## Why Milestone-Based Evolution, Not Continuous Mutation
AceForge distills trace data at activation milestones (500, 2,000, 5,000 uses) rather than continuously mutating skills after every use. This follows [K2-Agent's SRLR loop](https://arxiv.org/abs/2603.00676) and [SAGE's Sequential Rollout](https://arxiv.org/abs/2512.17102).
The reasoning: continuous mutation creates unstable skills that change faster than you can evaluate them. Milestone-based distillation gives skills time to accumulate operational wisdom before triggering a revision cycle. When a skill reaches 500 activations, it has enough data for statistically meaningful divergence detection. The revision at that point is informed, not reactive.
And even then — the revision is a proposal. It goes through the same human approval gate as every other skill.
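A minimal sketch of the milestone trigger, assuming a per-skill activation counter (the thresholds are the ones above; `checkMilestone`'s real signature in AceForge may differ):

```typescript
// Thresholds from the text above; the function shape is an assumption.
const MILESTONES = [500, 2_000, 5_000];

function checkMilestone(previousCount: number, currentCount: number): number | null {
  // Fire only when a threshold is crossed by this activation, so a
  // revision cycle triggers once per milestone, not on every use.
  for (const m of MILESTONES) {
    if (previousCount < m && currentCount >= m) return m;
  }
  return null;
}
```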
## Why Format Types, Not Channel Names
The notification formatting layer operates on format types (`html`, `markdown`, `mrkdwn`, `plain`), not channel names (`telegram`, `slack`, `discord`). Channel names appear exactly once, in a lookup table called `FORMAT_MAP`.
This isn't academic purity. It prevents a real bug: Slack's `*` means bold, Discord's `*` means italic. If you hardcode channel names into formatting functions, adding a new channel means touching every function. With format types, adding a channel is one line in a table.
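A sketch of what that looks like: the format types and the `FORMAT_MAP` name come from the text above, but the channel mappings here are illustrative assumptions.

```typescript
type FormatType = "html" | "markdown" | "mrkdwn" | "plain";

// The one place channel names appear. Entries are illustrative.
const FORMAT_MAP: Record<string, FormatType> = {
  telegram: "html",
  slack: "mrkdwn",     // Slack's mrkdwn: a single * is bold
  discord: "markdown", // Discord's markdown: ** is bold, * is italic
  sms: "plain",        // hypothetical channel: adding it is this one line
};

// Formatting functions take a format type, never a channel name.
function bold(text: string, format: FormatType): string {
  switch (format) {
    case "html": return `<b>${text}</b>`;
    case "markdown": return `**${text}**`;
    case "mrkdwn": return `*${text}*`;
    case "plain": return text;
  }
}
```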
## Why Provider-Agnostic LLM Pipeline
AceForge's LLM pipeline supports both OpenAI-compatible (`/chat/completions`) and Anthropic-native (`/v1/messages`) API formats. The format is auto-detected from the provider name or the `api` field in openclaw.json.
This matters because vendor lock-in in LLM tooling is a trap. Models improve and change pricing monthly. The generator that works best today might not be the right choice in three months. AceForge should never be the reason you can't switch.
13 providers have correct default URLs built in. Adding a new one is one line in `PROVIDER_DEFAULTS`.
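A sketch of the detection logic: the two API shapes come from the text, while the provider entries and the `detectFormat` helper are illustrative assumptions, not the full table of 13.

```typescript
type ApiFormat = "openai-chat" | "anthropic-messages";

const PROVIDER_DEFAULTS: Record<string, { baseUrl: string; format: ApiFormat }> = {
  openai:    { baseUrl: "https://api.openai.com/v1", format: "openai-chat" },
  anthropic: { baseUrl: "https://api.anthropic.com", format: "anthropic-messages" },
  // Adding a provider is one line here.
};

function detectFormat(provider: string, apiField?: string): ApiFormat {
  // An explicit `api` field in openclaw.json wins; otherwise fall back to
  // the provider table, defaulting to the OpenAI-compatible shape.
  if (apiField === "anthropic") return "anthropic-messages";
  if (apiField === "openai") return "openai-chat";
  return PROVIDER_DEFAULTS[provider]?.format ?? "openai-chat";
}
```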
## Why Research Grounding
Every major design decision in AceForge cites a specific paper and explains how the paper's finding informed the implementation. This is not decoration — it's engineering discipline.
When [SkillsBench](https://arxiv.org/abs/2602.12670) found that 56% of agent skills are never invoked because their descriptions don't match how users phrase requests, that directly informed AceForge's trigger phrase check in the reviewer prompt and the description optimizer module.
When [Single-Agent scaling](https://arxiv.org/abs/2601.04748) found that more skills don't always help and selection quality degrades at scale, that directly informed the escalating crystallization threshold (3→5 at 20+ skills).
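As a sketch, that escalation is a one-line policy (the function name and shape are assumptions):

```typescript
// Require more supporting evidence before crystallizing a new skill
// once the deployed skill count is high (3 → 5 at 20+ skills).
function crystallizationThreshold(deployedSkillCount: number): number {
  return deployedSkillCount >= 20 ? 5 : 3;
}
```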
Research without implementation is theory. Implementation without research is guessing.