Open-source agent orchestration framework. V11's proven patterns, any LLM, pure Python. Amazon Nova AI Hackathon — deadline March 16, 2026. Live at forge.herakles.dev.
Every session uses V11 spec-driven workflow. No ad-hoc implementation.
- Start: `/v11` to detect state, check tasks, and show the dashboard
- Plan: Create tasks via `TaskCreate` with sprint metadata before writing code
- Execute: `TaskUpdate(status="in_progress")` before work, `TaskUpdate(status="completed")` after
- Verify: `pytest tests/ -x -q` after changes (all 1,670 tests must pass), then syntax-check modified files
- Track: Update sprint_history.md and MEMORY.md after completing a sprint
Never implement without tasks. Even quick fixes get a task for traceability.
Phase: Submission ready. Core framework complete (19 sprints, ~30,000 LOC, 35 modules, 1,670 tests).
| Model | Grade | Score | Time | Turns |
|---|---|---|---|---|
| Nova Lite (32K) | S | 100% | 144s | 40 |
| Nova Pro (300K) | S | 100% | 167s | 39 |
| Nova Premier (1M) | S | 100% | 1110s | 33 |
# Tests
pytest tests/ -x -q # Quick (1670 tests)
pytest tests/unit/test_pipeline.py -v # Single module
python3 -c "import py_compile; py_compile.compile('FILE.py', doraise=True)" # Syntax check
# Interactive CLI
python3 forge_cli.py
# Non-interactive
python3 forge.py plan "description" --model nova-lite
python3 forge.py build --model gemini-flash
python3 forge.py preview
python3 forge.py deploy --domain app.example.com
# Benchmarks
source ~/.secrets/hercules.env # REQUIRED for AWS
python3 benchmark_nova_models.py --model nova-lite -v # Single model
python3 benchmark_nova_models.py --all # All 3 Nova models
python3 benchmark_nova_models.py --history # Run history trend
python3 benchmark_nova_models.py --diff-checks benchmarks/runs/2026-03/run_PREV.json
# Website
cd web/ && python3 -m http.server 8160 # Local preview
# Production: Docker container on port 8160, nginx at forge.herakles.dev
- Language: Python 3.11+ (pure Python, no JS/TS)
- CLI: Click (forge.py) + custom interactive shell (forge_cli.py)
- LLM Providers: AWS Bedrock (Nova Lite/Pro/Premier), OpenRouter (Gemini), Anthropic (Claude)
- UI: Rich (live progress, tables, panels, spinners)
- Testing: pytest (1,670 tests, 50 test files)
- Deployment: Docker + nginx + SSL + Cloudflare Tunnels
- Website: Static HTML/CSS/JS at web/ (forge.herakles.dev, port 8160)
User Goal -> Interview (3-phase deep planning)
-> ForgeAgent (Planning) -> spec.md + tasks.json
-> WaveExecutor (Parallel Agents per wave) -> Built Project
-> GateReviewer (Adversarial quality check) -> PASS/FAIL
-> Preview (Cloudflare Tunnel) -> Shareable URL
-> ForgeDeployer (Docker + nginx) -> Live URL
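The wave step in the flow above can be pictured as a level-by-level run of Kahn's algorithm: every task whose dependencies are already satisfied joins the current wave, and finishing a wave unlocks the next one. The following is a minimal sketch of that idea, not the actual `WaveExecutor` or `TaskStore` code. The function name `plan_waves` and the plain dict-of-sets input format are assumptions made for illustration.

```python
from collections import defaultdict


def plan_waves(deps):
    """Group tasks into waves that could run in parallel.

    `deps` maps a task id to the set of task ids it depends on
    (hypothetical input shape, chosen for this sketch).
    """
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            if parent not in deps:
                raise KeyError(f"unknown dependency: {parent!r}")
            children[parent].append(task)

    # Wave 1 is every task with no unmet dependencies.
    ready = sorted(task for task, count in indegree.items() if count == 0)
    waves = []
    scheduled = 0
    while ready:
        waves.append(ready)
        scheduled += len(ready)
        unlocked = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    unlocked.append(child)
        ready = sorted(unlocked)

    # Tasks left unscheduled can only be explained by a cycle.
    if scheduled != len(deps):
        raise ValueError("dependency cycle detected")
    return waves


example = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "tests": {"api", "ui"},
}
waves = plan_waves(example)
```

With the example graph, `api` and `ui` land in the same wave because both depend only on `schema`, so they could be handed to parallel agents.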
- 3-tier prompts: Slim (<=32K), Focused (<=1M), Full (>1M) — model-appropriate system prompts
- Bedrock timeout: 300s read_timeout via botocore.Config (Premier needs ~100s/inference)
- Pre-seeded context: Dependent tasks get upstream file content injected (saves 2-3 turns)
- JSON recovery: `_recover_json()` handles malformed LLM output (trailing commas, truncation, fences)
- Autonomy A0-A5: 6-level trust system with auto-escalation cap at A3
- Context compaction: Budget-based (60%/65%), preserves toolUse/toolResult pairs
- Adaptive turn budgets: `compute_turn_budget()` scales by file count (1-file: 15 soft/19 hard)
- Convergence detector: `ConvergenceTracker` disables writes after 5 idle turns
- Verify phase budget: Capped at soft//4 turns, prevents endless read-back loops
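To make the JSON recovery item concrete, here is a hedged sketch of how the three failure modes named above (fences, trailing commas, truncation) could be handled with the standard library alone. It is not the project's `_recover_json()`. The names `recover_json_sketch` and `_close_open_brackets` are hypothetical, and the trailing-comma regex is deliberately naive: it would also rewrite a `,}` sequence inside a string value.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the sketch can sit inside a fenced block
_FENCE_RE = re.compile(rf"^{FENCE}[\w-]*\s*|\s*{FENCE}$")
_TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def _close_open_brackets(text):
    """Append the quote and brackets that a truncated document is missing."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return text + ('"' if in_string else "") + "".join(reversed(stack))


def recover_json_sketch(raw):
    """Return the parsed value, or None if no repair attempt parses."""
    text = _FENCE_RE.sub("", raw.strip())
    # Try the text as-is first, then with missing closers appended.
    for candidate in (text, _close_open_brackets(text)):
        candidate = _TRAILING_COMMA_RE.sub(r"\1", candidate)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


fenced = FENCE + 'json\n{"a": [1, 2,], }\n' + FENCE
truncated = '{"files": ["a.py", "b.p'
```

Returning `None` instead of raising keeps the decision about retrying the model call with the caller. Truncation in the middle of a key or right after a colon is not repaired by this sketch.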
| Module | LOC | Purpose |
|---|---|---|
| forge_cli.py | 4689 | Interactive shell, deep planning interview, all /commands |
| forge_agent.py | 2004 | Tool-use loop, 12 tools, ConvergenceTracker, verify phase, auto-verify |
| forge_hooks_impl.py | 1057 | 12 hook implementations |
| forge_guards.py | 1030 | RiskClassifier + PathSandbox + AutonomyManager (A0-A5) |
| forge_assistant.py | 1014 | Smart assistant — skill detection, interview, scope summary |
| forge_orchestrator.py | 999 | Plan/build/deploy orchestration + JSON recovery |
| forge_preview.py | 996 | PreviewManager — 14-stack detection + Cloudflare Tunnel |
| prompt_builder.py | 940 | 3-tier prompt system + autonomy-aware + previewability |
| formations.py | 907 | 11 formations + DAAO routing + 5 tool policies |
| model_router.py | 900 | 3 provider adapters (Bedrock 300s timeout, OpenAI, Anthropic) |
| forge_pipeline.py | 870 | WaveExecutor + ArtifactManager + GateReviewer |
| forge.py | 740 | Click CLI commands (14 commands) |
| forge_display.py | 685 | Rich live UI, brand-themed spinners and progress |
| forge_tasks.py | 643 | TaskStore + topological sort (Kahn's algorithm) |
| forge_index.py | 634 | ProjectIndex, export/import scanning, dependency graph |
| forge_session.py | 625 | Session lifecycle + persistence |
| forge_verify.py | 1072 | BuildVerifier — L1 static, L2 server, L3 browser checks |
| forge_deployer.py | 468 | Docker + nginx + SSL deployment |
| forge_web.py | 372 | Web dashboard + docs chat API |
| config.py | 366 | Model configs, context windows, adaptive turn budgets |
| forge_teams.py | 318 | Multi-agent team spawning |
| forge_memory.py | 309 | Persistent memory system |
| forge_migrate.py | 295 | Legacy version migration (V5-V10 → Forge) |
| forge_models.py | 294 | Model definitions and capability profiles |
| forge_hooks.py | 293 | Hook system (V11 compatible) |
| forge_prompt.py | 282 | Selection menus, Escape-to-cancel, brand colors |
| forge_compliance.py | 280 | 10-gate compliance checker |
| forge_registry.py | 276 | Agent definition registry (20 agents) |
| forge_competition.py | 253 | Competition readiness validator (8 checks) |
| forge_livereload.py | 252 | LiveReloadServer for build previews |
| forge_comms.py | 233 | BuildContext, FileClaim, AgentAnnouncement |
| forge_audit.py | 224 | JSONL audit trail |
| forge_theme.py | 189 | Design tokens, brand palette, console, visual helpers |
| forge_schema.py | 134 | 8 JSON schema validators |
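The 3-tier prompt system listed for prompt_builder.py comes down to a threshold lookup on the model's context window. The sketch below shows that lookup under stated assumptions: the function name `select_prompt_tier` is hypothetical, and "32K" and "1M" are read as 32,000 and 1,000,000 tokens, which may not match the exact values in config.py.

```python
SLIM_MAX_TOKENS = 32_000         # assumption: "32K" read as 32,000 tokens
FOCUSED_MAX_TOKENS = 1_000_000   # assumption: "1M" read as 1,000,000 tokens


def select_prompt_tier(context_window):
    """Map a context window size (in tokens) to a prompt tier name."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if context_window <= SLIM_MAX_TOKENS:
        return "slim"
    if context_window <= FOCUSED_MAX_TOKENS:
        return "focused"
    return "full"


# Context windows as listed in the benchmark table (Lite, Pro, Premier).
tiers = {
    "nova-lite": select_prompt_tier(32_000),
    "nova-pro": select_prompt_tier(300_000),
    "nova-premier": select_prompt_tier(1_000_000),
}
```

Under these assumptions Nova Lite gets the slim prompt, while Nova Pro and Nova Premier both fall in the focused tier. Only a model with a window above 1M tokens would receive the full prompt.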
| Module | LOC | Purpose |
|---|---|---|
| benchmark_nova_models.py | 2404 | Model benchmark — auto-save, regressions, pre-seeded context, aligned to CLI |
| benchmarks/benchmark_store.py | 569 | BenchmarkStore, metadata, regressions, diffs, hints |
| benchmark_expense_tracker.py | 835 | E2E benchmark (legacy) |
| demo_nova_e2e.py | 564 | E2E demo script |
| challenge_build.py | 274 | Challenge build runner |
| File | LOC | Purpose |
|---|---|---|
| web/index.html | 802 | Main page — hero, quickstart, ask-nova chat, try-it prompts, architecture |
| web/style.css | 1381 | Design system v4 — brand palette, responsive, dark theme |
| web/app.js | 343 | Interactive UI — chat, nav, asciinema player integration |
| web/demo.cast | — | Asciinema recording (NDJSON) for terminal demo |
nova-forge/
├── bin/forge, bin/herakles # CLI entry points (symlinked)
├── forge.py # Click CLI (14 commands)
├── forge_cli.py # Interactive shell (main file, 4689 LOC)
├── forge_agent.py # Core agent loop + 12 tools
├── forge_assistant.py # Smart assistant (skill, recommendations)
├── model_router.py # LLM provider adapters (Bedrock/OpenAI/Anthropic)
├── prompt_builder.py # 3-tier system prompt (slim/focused/full)
├── config.py # Configuration + context windows
├── forge_orchestrator.py # Plan/build/deploy coordination
├── forge_pipeline.py # WaveExecutor + ArtifactManager
├── forge_guards.py # Security (risk, sandbox, autonomy)
├── forge_preview.py # 14-stack preview + Cloudflare Tunnel
├── forge_deployer.py # Docker + nginx deployment
├── forge_verify.py # BuildVerifier (L1-L3 checks)
├── forge_display.py # Rich live UI
├── forge_theme.py # Brand design tokens
├── formations.py # 11 formations + DAAO routing
├── forge_tasks.py # TaskStore + topo sort
├── forge_*.py # (14 more modules — see Module Map)
├── benchmark_nova_models.py # Model benchmark suite
├── benchmarks/ # Benchmark infrastructure + run history
├── agents/ # 20 YAML agent definitions
├── schemas/ # 8 JSON schemas
├── templates/ # 4 app skeletons
├── scripts/ # Demo recording tools
├── tests/unit/ # 1,670 tests (50 test files)
├── web/ # Website (forge.herakles.dev)
│ ├── index.html, style.css, app.js, demo.cast
├── Dockerfile, docker-compose.yml, requirements.txt
└── CLAUDE.md # This file
- V11 workflow: Always use TaskCreate/TaskUpdate for tracking. No untracked work.
- Read before edit: Always read files before modifying them
- Syntax check after edit: `python3 -c "import py_compile; py_compile.compile('FILE.py', doraise=True)"`
- Test after changes: `pytest tests/ -x -q` (1,670 tests, all must pass)
- No docs unless asked: Don't create README/docs files unprompted
- Secrets: Never commit credentials; load via `source ~/.secrets/hercules.env`
- Benchmark after model changes: `python3 benchmark_nova_models.py --all -v` to verify no regressions
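When several files change in one task, the one-line syntax check from the conventions above can be wrapped in a small helper that checks them all and reports every failure at once. This is a sketch only; `syntax_check` is a hypothetical name and not part of the repository. It relies on `py_compile.compile(..., doraise=True)` exactly as the one-liner does, and writes bytecode to a scratch directory so no `__pycache__` entries are left behind.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_check(paths):
    """Return {path: error message} for every file that fails to compile."""
    failures = {}
    with tempfile.TemporaryDirectory() as scratch:
        for index, path in enumerate(paths):
            target = Path(scratch) / f"{index}.pyc"
            try:
                py_compile.compile(str(path), cfile=str(target), doraise=True)
            except py_compile.PyCompileError as exc:
                failures[str(path)] = exc.msg
    return failures


# Small self-contained demonstration using throwaway files.
with tempfile.TemporaryDirectory() as demo_dir:
    good = Path(demo_dir) / "good.py"
    bad = Path(demo_dir) / "bad.py"
    good.write_text("x = 1\n")
    bad.write_text("def broken(:\n")
    clean_result = syntax_check([good])
    mixed_result = syntax_check([good, bad])
    bad_key = str(bad)
```

An empty dict means every file compiled. A missing file still raises `FileNotFoundError`, which the sketch intentionally leaves uncaught so a mistyped path is not reported as a pass.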
| Sprint | Date | Key Deliverable |
|---|---|---|
| 5 | 03-10 | 12 tools, parallel waves, artifact handoffs, gate review, autonomy, streaming |
| 6 | 03-11 | Multi-agent comms, preview/verification, 8 interface bugs fixed |
| 7 | 03-11 | Light model optimization (SLIM_TOOLS, slim prompts, smart compaction) |
| 8 | 03-11 | Agent intelligence (multi-lang verify, read-before-edit, completeness) |
| 9 | 03-12 | Assistant layer, A0-A5 autonomy, adaptive UX |
| 10 | 03-12 | Agent fine-tuning (file ownership, write enforcement, 883→1000 tests) |
| 11 | 03-13 | 14-stack preview, benchmark infrastructure, 1000 tests |
| 12 | 03-13 | Deep planning interview (3-phase, 8 categories, scope summary) |
| 13 | 03-13 | CLI visual upgrade, JSON recovery, 3-tier prompts, Pro C→S, Premier C→A |
| 14 | 03-13 | Premier S tier (pre-seeded context, Bedrock timeout, removed turn caps) |
| 15 | 03-14 | Preview resilience (3x retry, health monitor), agent circuit breaker, /health + /competition commands, website stats fix, todo-app benchmark scenario |
| 16 | 03-14 | 2 new formations (recovery, all-hands-planning), agent self-correction, demo recording script, benchmark resilience checks |
| 17 | 03-15 | Agent loop convergence: adaptive turn budgets, ConvergenceTracker, verify phase budget, escalation budget reduction, hard limit tightening, benchmark aligned to CLI path, completeness directive |
| 18 | 03-15 | 5-agent architecture review (78 issues found), fix _auto_verify shlex bug, prompt contradictions resolved, agent loop hardening (6 fixes), preview 127.0.0.1, artifact 4KB threshold |
| 19 | 03-16 | 8-agent test swarm (1108→1670 tests), Premier max_tokens 16384 + stop_reason detection, all 3 models S 100%, submission prep |
Full details: memory/sprint_history.md