feat: Autopilot Supervisor v2 — full implementation (M0–M7 + DoD) #36
chinkan wants to merge 59 commits into
Co-authored-by: chinkan.ai <chinkan@gmail.com>
…-M7 + DoD) Co-authored-by: chinkan.ai <chinkan@gmail.com>
Claude finished @chinkan's task in 3m 8s
PR #36 Code Review — Autopilot Supervisor v2 (docs)
…th deferred to M3) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…→artifacts) Co-authored-by: chinkan.ai <chinkan@gmail.com>
… decode (review) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…orts) Co-authored-by: chinkan.ai <chinkan@gmail.com>
… (review) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…review) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…review) Co-authored-by: chinkan.ai <chinkan@gmail.com>
Adds Supervisor::workspace_mgr (Option) and a new_for_test_with_repo constructor. In execute_now, after the Plan artifact is written, branches on TaskType::CodeChange|BugFix|Refactor and (when a WorkspaceManager is configured) records a Plan->PrepareWorkspace transition, calls WorkspaceManager::prepare, and writes a workspace artifact before transitioning to Execute. Also persists classification (task_type/risk_level/execution_mode) in TaskStore::update_classification so execute_now sees the classifier output when re-reading the task from the DB. Co-authored-by: chinkan.ai <chinkan@gmail.com>
…mands Co-authored-by: chinkan.ai <chinkan@gmail.com>
Adds regex-based redaction of credential-style tokens (api_key, password, secret, token, bearer) so secrets never reach disk or escape via ArtifactManager::write_text. Also derives Default on RiskThresholdsConfig and PolicyEngine to satisfy clippy::derivable_impls. Co-authored-by: chinkan.ai <chinkan@gmail.com>
…ction Supervisor (review) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…ery workflow) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…eview I-1) Co-authored-by: chinkan.ai <chinkan@gmail.com>
… (final review I-2) Co-authored-by: chinkan.ai <chinkan@gmail.com>
…gistry (final review I-3) Co-authored-by: chinkan.ai <chinkan@gmail.com>
Claude finished @chinkan's task in 6m 28s
PR #36 Code Review — Autopilot Supervisor v2 (Implementation)
@copilot please fix all the C, S, D, and A items.
C1: Evidence::ExitCode tuple variant → struct variant {code: i32} for
serde internally-tagged compat; fixes latent serialization panic.
Updated all 6 construction/match sites.
C2: required_capabilities never persisted in sup_tasks.
- Added required_capabilities TEXT column to DB schema
- Added to INSERT in store::create()
- Added to SELECT+deserialize in store::get()
- Added to UPDATE in store::update_classification()
C3: Orchestrator ultimate fallback to reasoning backend when both
select_by_name and select_for(capability) return None; prevents
coding/research/document tasks failing silently with no backend.
C4: Replace (_, Cancelled) catch-all with explicit non-terminal
from-states; Done→Cancelled and Failed→Cancelled now disallowed.
Also added Route/Plan/PrepareWorkspace→Paused for pause() callers.
Added debug_assert in record_transition to catch violations in tests.
Added test assertions for !transition_allowed(Done, Cancelled) etc.
D1: Filter jobs by current plan IDs before verification in execute_now()
so resumed tasks don't see orphan rows from aborted prior runs.
A1: Extract shared run_cli_process() helper to backend/mod.rs;
ClaudeCodeCliBackend, CodexCliBackend, ScriptBackend all delegate
to it. Gracefully ignores EPIPE on stdin write (process may exit
before reading all stdin).
A2: Fix WorkspaceManager::prepare worktree path — with_extension
replaced by proper parent().join() to avoid mangling repo names.
A3: Replace time-based parallel test with AtomicUsize count assertion.
A4: Rename HeuristicClassifier::classify inherent method to
classify_as_task to disambiguate from Classifier trait impl.
Minor: Fix silent serde_json::from_str().unwrap_or() → map_err in
jobs_for_task (store.rs lines 224, 228).
Agent-Logs-Url: https://github.com/chinkan/RustFox/sessions/45817859-c1c7-4605-a948-b3798210809c
Co-authored-by: chinkan <16433287+chinkan@users.noreply.github.com>
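The C4 change in the commit message above can be pictured with a small sketch. The state names come from this thread, but the set of legal edges below is an assumption made for illustration and is not a copy of `state.rs`.

```rust
/// Task states named in this PR; the real enum in state.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus {
    Route,
    Plan,
    PrepareWorkspace,
    Execute,
    Review,
    Verify,
    Paused,
    Done,
    Failed,
    Cancelled,
}

/// Hypothetical transition table: only the edges listed here are legal.
fn transition_allowed(from: TaskStatus, to: TaskStatus) -> bool {
    use TaskStatus::*;
    match (from, to) {
        // Forward pipeline edges (an assumed subset).
        (Route, Plan)
        | (Plan, PrepareWorkspace)
        | (Plan, Execute)
        | (PrepareWorkspace, Execute)
        | (Execute, Review)
        | (Execute, Verify)
        | (Review, Verify)
        | (Verify, Done)
        | (Verify, Failed) => true,
        // Pause from the early stages, resume into Execute.
        (Route | Plan | PrepareWorkspace | Execute, Paused) | (Paused, Execute) => true,
        // Cancellation is spelled out per non-terminal state, so there is no
        // `(_, Cancelled)` catch-all that would also admit Done or Failed.
        (Route | Plan | PrepareWorkspace | Execute | Review | Verify | Paused, Cancelled) => true,
        _ => false,
    }
}
```

With this table, `transition_allowed(TaskStatus::Done, TaskStatus::Cancelled)` is `false`, while `transition_allowed(TaskStatus::Execute, TaskStatus::Cancelled)` is `true`. Listing the from-states explicitly costs a longer match arm but means a newly added terminal state is rejected by default.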
Fixed all C, S (documented), D, and A items in commit f81df5b. Here's a summary:
- C1 — …
- C2 — …
- C3 — Orchestrator now falls back to …
- C4 — Replaced …
- D1 — …
- A1 — Shared …
- A2 — …
- A3 — Parallel orchestrator test now asserts …
- A4 — …
- Minor — …
All 275 tests pass, …
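To see why the A2 fix matters, consider what `Path::with_extension` does to a repository directory whose name contains a dot. The sketch below contrasts it with a `parent().join()` construction. The function names and the `-worktrees` suffix are assumptions for illustration, not the layout `WorkspaceManager::prepare` actually uses.

```rust
use std::path::{Path, PathBuf};

/// Problematic shape: with_extension replaces everything after the last dot
/// in the final path component, so "my.repo" loses its ".repo" part.
fn worktree_root_mangled(repo: &Path) -> PathBuf {
    repo.with_extension("worktrees")
}

/// Safer shape: build a sibling directory from the parent and the full name.
/// The "-worktrees" suffix is a hypothetical naming choice.
fn worktree_root(repo: &Path) -> Option<PathBuf> {
    let parent = repo.parent()?;
    let name = repo.file_name()?.to_str()?;
    Some(parent.join(format!("{}-worktrees", name)))
}
```

For `/src/my.repo`, the first function yields `/src/my.worktrees`, while the second yields `/src/my.repo-worktrees` and keeps the repository name intact.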
Claude finished @chinkan's task in 4m 14s
PR #36 Code Review — Autopilot Supervisor v2 (Final Review)
| Item | File | Status |
|---|---|---|
| C1 — `ExitCode { code: i32 }` struct variant | `job.rs:30` | ✅ Fixed |
| C2 — `required_capabilities` persisted | `store.rs:68-149` | ✅ Fixed |
| C3 — Orchestrator fallback to `"reasoning"` | `orchestrator.rs:116` | ✅ Fixed |
| C4 — Terminal states block `Cancelled` | `state.rs:14-40, 57-59` | ✅ Fixed |
| D1 — Resume orphan-job filter | `mod.rs:134-207` | ✅ Fixed |
| A1 — `run_cli_process` shared helper | `backend/mod.rs:129-191` | ✅ Fixed |
| A2 — `WorkspaceManager` path construction | `workspace.rs:34-44` | ✅ Fixed |
| A3 — Counter-based parallel test | `orchestrator.rs:249-253` | ✅ Fixed |
| A4 — `classify_as_task` rename | `classifier.rs:78` | ✅ Fixed |
| I-1,2,3 — Review state, parallel groups, prod registry | `mod.rs`, `orchestrator.rs`, `main.rs` | ✅ Fixed |
New Issues
N1 — execute_now hardcodes Route as from-state — breaks resume audit trail [mod.rs:124-131]
execute_now always opens with record_transition(Route, Plan), but when called via resume(), the task is already in Execute state (just set by the Paused → Execute transition in resume). Since record_transition receives the hardcoded Route (not the actual DB state) as from, the debug_assert!(transition_allowed(from, to)) checks Route → Plan (which is valid and passes), while the DB unconditionally writes Plan. The actual transition is Execute → Plan, which is not a legal transition — but it goes undetected.
The audit log for a resumed task reads: … → Paused → Execute → Plan → …, where Execute → Plan is silently illegal. Functional behaviour is unaffected only because transition enforcement is debug_assert-only.
Fix: read the actual current state from task.status before calling record_transition. For resume, use task.status (which is Execute) as from, not the hardcoded Route.
```rust
// mod.rs: execute_now should derive the from-state from the loaded task
let pre_plan_state = task.status.clone(); // Route on first run, Execute on resume
self.store.record_transition(task_id, pre_plan_state, TaskStatus::Plan, ...).await?;
```
N2 — SkillAwareClassifier is dead code in production [classifier.rs:120-148, mod.rs:97-104]
Both Supervisor::new and Supervisor::new_for_test instantiate HeuristicClassifier directly. SkillAwareClassifier — the M5 skill-pack feature — is never used through the production supervisor path. The classify method on SkillAwareClassifier is not part of the Classifier trait (it returns Task, not ClassificationOutcome), so it cannot be used as a Box<dyn Classifier> either.
Effect: the skill packs loaded from skills/sup-*/SKILL.md at startup never influence task routing. M5's supervisor_required_caps override is untriggerable in production.
Fix: Either wire SkillAwareClassifier into Supervisor::new (inject the SkillRegistry), or add a classify_with_skills method to the Classifier trait, or document this as deferred.
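One possible shape for the first option is a wrapper that implements the `Classifier` trait itself and delegates to an inner classifier. Everything below is a minimal sketch with stand-in types: the field sets, the `SkillPack` struct, the keyword matching, and `build_classifier` are hypothetical and not taken from `classifier.rs`.

```rust
/// Stand-in for the crate's outcome type; the fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct ClassificationOutcome {
    task_type: String,
    required_capabilities: Vec<String>,
}

trait Classifier {
    fn classify(&self, request: &str) -> ClassificationOutcome;
}

struct HeuristicClassifier;

impl Classifier for HeuristicClassifier {
    fn classify(&self, request: &str) -> ClassificationOutcome {
        // Toy heuristic, only here to make the sketch runnable.
        let task_type = if request.contains("fix") { "BugFix" } else { "Unknown" };
        ClassificationOutcome {
            task_type: task_type.to_string(),
            required_capabilities: vec!["reasoning".to_string()],
        }
    }
}

/// Hypothetical skill pack: a trigger keyword plus a capability override.
struct SkillPack {
    trigger: String,
    required_caps: Vec<String>,
}

/// Wraps any Classifier and applies skill-pack overrides to its outcome,
/// so the combination can itself be stored as Box<dyn Classifier>.
struct SkillAwareClassifier {
    inner: Box<dyn Classifier>,
    skills: Vec<SkillPack>,
}

impl Classifier for SkillAwareClassifier {
    fn classify(&self, request: &str) -> ClassificationOutcome {
        let mut outcome = self.inner.classify(request);
        let hit = self.skills.iter().find(|s| request.contains(s.trigger.as_str()));
        if let Some(skill) = hit {
            outcome.required_capabilities = skill.required_caps.clone();
        }
        outcome
    }
}

/// What a supervisor constructor could call instead of building
/// HeuristicClassifier directly.
fn build_classifier(skills: Vec<SkillPack>) -> Box<dyn Classifier> {
    Box::new(SkillAwareClassifier { inner: Box::new(HeuristicClassifier), skills })
}
```

Because the wrapper returns `ClassificationOutcome`, it fits behind the same trait object as the heuristic classifier, and the supervisor would only need a different constructor argument.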
N3 — ShellBackend::run has no timeout [backend/shell.rs:61-88]
ShellBackend::run awaits Command::output() with no upper bound, while the three CLI backends all go through run_cli_process, which applies tokio::time::timeout(Duration::from_secs(job.timeout_secs), ...). A sleep 9999 or a hung pipe will stall that Tokio task indefinitely and block the orchestrator.
Fix: Apply the same pattern as run_cli_process:
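```rust
// Bound the shell command by the job's timeout, as run_cli_process does.
// Command is assumed to be tokio::process::Command.
let output = match tokio::time::timeout(
    Duration::from_secs(job.timeout_secs),
    // kill_on_drop(true) ensures the child is killed when the timeout drops the future.
    Command::new("sh").arg("-c").arg(&cmd).current_dir(&self.sandbox).kill_on_drop(true).output(),
).await {
    Ok(res) => res?,
    Err(_) => return Ok(/* timeout JobOutput */),
};
```
Minor / Accepted Debt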
M1 — store.rs:get silently swallows DB errors [store.rs:127-130]
```rust
Ok(match rows.next() {
    Some(Ok(t)) => Some(t),
    _ => None, // ← also catches Some(Err(db_error))
})
```
A deserialization failure on an existing row returns Ok(None) (task not found) instead of propagating the error. Prefer rows.next().transpose()?.
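The difference between the two shapes can be shown in isolation. This is a sketch under stated assumptions: `Task`, `DbError`, and both function names are hypothetical stand-ins, and the row iterator is a plain `Iterator` instead of the store's real cursor type.

```rust
/// Hypothetical row and error types standing in for the store's own.
#[derive(Debug, PartialEq)]
struct Task {
    id: u32,
}

#[derive(Debug, PartialEq)]
struct DbError(String);

/// Shape of the current code: any non-Ok row collapses to "not found".
fn get_swallowing(
    mut rows: impl Iterator<Item = Result<Task, DbError>>,
) -> Result<Option<Task>, DbError> {
    Ok(match rows.next() {
        Some(Ok(t)) => Some(t),
        _ => None,
    })
}

/// With transpose, Option<Result<T, E>> becomes Result<Option<T>, E>,
/// so a failed row surfaces as Err instead of Ok(None).
fn get_propagating(
    mut rows: impl Iterator<Item = Result<Task, DbError>>,
) -> Result<Option<Task>, DbError> {
    rows.next().transpose()
}
```

Given a single corrupt row, `get_swallowing` reports `Ok(None)` while `get_propagating` returns the `Err`. An empty result set still maps to `Ok(None)` in both.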
M2 — Unknown + Medium risk auto-executes [policy.rs:33-41]
None of the early-return guards trigger for TaskType::Unknown + RiskLevel::Medium, so it falls through to PolicyDecision::AutoExecute. The Unknown + Low path correctly returns Clarify; Medium should too, since unknown intent with non-trivial risk probably warrants clarification.
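The suggested M2 behaviour could be expressed as a single match. This is only a sketch: the enum variants beyond those named in the review, the `RequireApproval` decision, the High-risk rule, and the `decide` function are assumptions, not the logic in `policy.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskType {
    CodeChange,
    Research,
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RiskLevel {
    Low,
    Medium,
    High,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyDecision {
    AutoExecute,
    Clarify,
    RequireApproval, // hypothetical variant for the High-risk guard
}

/// Hypothetical decision function; the guard order is an assumption.
fn decide(task_type: TaskType, risk: RiskLevel) -> PolicyDecision {
    match (task_type, risk) {
        (_, RiskLevel::High) => PolicyDecision::RequireApproval,
        // Suggested change: treat Unknown + Medium like Unknown + Low.
        (TaskType::Unknown, RiskLevel::Low | RiskLevel::Medium) => PolicyDecision::Clarify,
        _ => PolicyDecision::AutoExecute,
    }
}
```

Writing the guards as match arms makes the fall-through case visible: anything not listed reaches `AutoExecute`, so each new `TaskType` deserves a deliberate arm.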
M3 — Capability strings are bare literals with no shared constants
"reasoning", "coding", "shell", "research", "document" appear as string literals across planner.rs, classifier.rs, and backend/mod.rs. A typo silently falls back to reasoning (via C3 fix) rather than producing a visible error. Adding pub const CAP_REASONING: &str = "reasoning" etc. in backend/mod.rs would surface mismatches at compile time.
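A minimal sketch of the M3 suggestion follows. The capability names are the ones listed in the review. The registry shape, the backend names as plain strings, and `backend_for` are assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;

// Capability names from the review, gathered in one place.
pub const CAP_REASONING: &str = "reasoning";
pub const CAP_CODING: &str = "coding";
pub const CAP_SHELL: &str = "shell";
pub const CAP_RESEARCH: &str = "research";
pub const CAP_DOCUMENT: &str = "document";

/// Hypothetical registry mapping a capability name to a backend name.
pub fn default_registry() -> HashMap<&'static str, &'static str> {
    let mut reg = HashMap::new();
    reg.insert(CAP_REASONING, "ReasoningBackend");
    reg.insert(CAP_SHELL, "ShellBackend");
    reg
}

/// Lookup with a C3-style fallback to the reasoning backend.
pub fn backend_for(
    reg: &HashMap<&'static str, &'static str>,
    cap: &str,
) -> Option<&'static str> {
    reg.get(cap).or_else(|| reg.get(CAP_REASONING)).copied()
}
```

A misspelled constant such as `CAP_SHEL` fails to compile, whereas the misspelled literal `"shel"` compiles and quietly resolves to the reasoning backend through the fallback.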
M4 — store.rs:get still hard-codes three fields as Null [store.rs:122-124]
constraints, inputs, and expected_outputs are not included in the SELECT and are always reconstructed as serde_json::Value::Null. This is documented accepted debt, but any future Planner logic that inspects these fields will silently get Null.
What's Working Well
- Secret redaction (`redact.rs`) — preserves separator in `$2`; all test assertions match implementation.
- Resume orphan filtering (`mod.rs:134-207`) — `current_job_ids` HashSet correctly excludes prior-run jobs from verification.
- Rigorous-mode audit trail — `Execute → Review → Verify` correctly recorded; DoD test at `supervisor_dod_smoke.rs:147` verifies the `"REVIEW"` row.
- Capability routing fallback — orchestrator gracefully falls back to `reasoning` for unregistered capabilities.
- Timeout enforcement in CLI backends — `kill_on_drop(true)` + `tokio::time::timeout` in `run_cli_process`.
- State machine terminal-state guard — explicit match arms prevent `Done/Failed → Cancelled`; tests cover both.
- `run_cli_process` deduplication — 90% duplicate code eliminated; all three CLI backends are thin wrappers.
- `required_capabilities` persistence — `store.rs:133-150` correctly UPDATEs the column; `store.rs:70-121` correctly SELECTs and deserializes it.
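The resume orphan filtering noted in the list above comes down to a set-membership filter. The sketch below assumes a simplified `JobRow` and a hypothetical `jobs_for_current_plan` helper. Only the `current_job_ids` name is taken from the review.

```rust
use std::collections::HashSet;

/// Hypothetical job row; the real sup_jobs schema has more columns.
#[derive(Debug, Clone, PartialEq)]
struct JobRow {
    id: String,
    succeeded: bool,
}

/// Keep only rows whose id belongs to the plan built for the current run,
/// so rows left behind by an aborted earlier run are ignored.
fn jobs_for_current_plan(rows: Vec<JobRow>, plan_job_ids: &[&str]) -> Vec<JobRow> {
    let current_job_ids: HashSet<&str> = plan_job_ids.iter().copied().collect();
    rows.into_iter()
        .filter(|row| current_job_ids.contains(row.id.as_str()))
        .collect()
}
```

If the store holds a failed row `old-1` from an aborted run plus `new-1` and `new-2` from the current plan, only the two current rows reach verification, so the stale failure cannot fail the resumed task.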
Overall Assessment
All critical bugs from the prior review are resolved. The two most impactful remaining issues are N2 (M5 SkillAwareClassifier never fires in production) and N3 (ShellBackend missing timeout). N1 (resume state audit drift) affects audit trail correctness but not task outcomes. Everything else is documented accepted debt. The feature is safe to ship given the Telegram dispatcher isn't wired to the supervisor yet.


Status — Ready to Merge ✅
All 8 milestones complete. 275 tests passing (246 lib + 29 integration).
`cargo clippy --all-targets -- -D warnings` clean. `cargo fmt --all -- --check` clean. `cargo build` clean.
Summary
Evolves RustFox from a single-loop AI assistant into a generic autonomous task supervisor with a task-first state machine, pluggable backends (Claude Code CLI, Codex CLI, shell, MCP, the existing in-process Agent), policy-driven autonomy, evidence-based verification, and resumable persisted state.
The new `src/supervisor/` module sits above the existing `Agent`. Existing Telegram bot behavior is unchanged: the supervisor pipeline is exposed via the `Supervisor::*` API; full `/supervise` Telegram dispatcher routing is a deferred follow-up.
Milestone breakdown
- `SupervisorConfig`, 4 SQLite tables (`sup_tasks`, `sup_jobs`, `sup_transitions`, `sup_artifacts`)
- `Task`/`Job`/state machine/`TaskStore`/`IntakeRouter`/`HeuristicClassifier`/`PolicyEngine`/`ArtifactManager`/`Supervisor::submit` end-to-end; crate becomes hybrid bin+lib
- `Backend` trait (with spec §10 methods) + capability `Registry` + 6 concrete backends (Reasoning, Shell, MCP, ClaudeCode, Codex, Script). CLI backends enforce per-job timeout via `kill_on_drop(true)`.
- `WorkflowTemplate`/`Planner`/`Orchestrator`/`VerificationEngine`/`Reporter` + `Supervisor::execute_now` end-to-end + `parse_command` helper for `/supervise`
- `WorkspaceManager` (git branch + optional worktree) + `PREPARE_WORKSPACE` stage for code tasks; classification persisted across `submit`/`execute_now`
- … `supervisor:` block + 5 default workflow skill packs (`sup-coding`, `sup-research`, `sup-writing`, `sup-ops`, `sup-general`) + `SkillAwareClassifier`
- … `RunContext` … `RiskThresholdsConfig`), pause/resume + startup discovery, supervisor command parser, secret redaction in artifact write path
- … `Review`-state audit; `CLAUDE.md` documents the new architecture
Final review fixes applied
- … `Execute → Review → Verify` so the audit trail matches `WorkflowTemplate` (was always `Execute → Verify`).
- … `parallel_groups=[[0,3]]` with serial jobs at indices 1, 2).
- `Supervisor` in `main.rs` now boots with `ReasoningBackend::from_agent` and `ShellBackend` registered (was empty registry).
Known accepted debt (documented at call sites, scoped for follow-up PRs)
- `transition_allowed` table defined but not enforced inside `record_transition` (M3 review note).
- `jobs_for_task` reconstructs `result.evidence` lossily (synth one `OutputValidated` stub); full evidence persistence requires a schema change.
- `/supervise`, `/tasks`, `/resume`, `/cancel`, `/approve`, `/clarify` parser is wired and tested but the Telegram message dispatcher does not yet route to handlers (programmatic API only).
- `ShellBackend::validate` uses naive substring checks (TODO comment names the M2.5 hardening targets).
- `LlmBackedClassifier::inner_llm` is `#[allow(dead_code)]`; the LLM-augmented classification path is deferred per M1.7.
What's new from a runtime perspective
- … `skills/sup-*/SKILL.md`.
- … `[supervisor]` (`default_autonomy_mode`, `artifacts_dir`) + `[supervisor.risk]` (`require_approval_for_low/medium`, `auto_execute_only_low`). All defaults preserve existing behavior.
Verification
- `cargo test`: 275 passed, 0 failed
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo fmt --all -- --check`: clean
- `cargo build`: clean
Process
Implemented via subagent-driven development: per milestone, dispatched a fresh implementer subagent with full task context, then ran independent spec-compliance + code-quality reviewer subagents. Every milestone had its review issues addressed before moving to the next. Final integration review caught three additional issues (I-1, I-2, I-3), all fixed.
35+ commits, one logical change per commit.