|
1 | 1 | # Eos Development Roadmap |
2 | 2 |
|
3 | | -**Last Updated**: 2025-10-30 |
| 3 | +**Last Updated**: 2025-10-31 |
4 | 4 | **Version**: 1.2 |
5 | 5 |
|
6 | 6 | --- |
@@ -3455,6 +3455,21 @@ sudo eos update hecate --migrate-to-vault # Migrates existing .env to Vault |
3455 | 3455 | - **Change**: Add `127.0.0.1:2019:2019` to Caddy ports |
3456 | 3456 | - **Impact**: Enables Option B drift detection, Option C precipitate, oauth2-signout injection |
3457 | 3457 | - **Effort**: 5 minutes |
| 3458 | +
|
| 3459 | +4. **Domain Auto-Detection via Redirect URIs** 📅 DEFERRED (P2 - Polish) |
| 3460 | + - **File**: [pkg/hecate/self_enrollment.go:129-164](pkg/hecate/self_enrollment.go#L129-L164) |
| 3461 | + - **Current**: Matches app slug to domain prefix (e.g., "bionicgpt" → "bionicgpt.example.com") |
| 3462 | + - **Problem**: Fails when user chooses different subdomain (e.g., "chat.example.com" for bionicgpt) |
| 3463 | + - **Solution**: Query Authentik application's `redirect_uris` field via API |
| 3464 | + - Extract domain from redirect URI: `https://chat.codemonkey.net.au/akprox/callback` → `chat.codemonkey.net.au` |
| 3465 | + - Match extracted domain against Caddy routes |
| 3466 | + - **Pros**: True auto-detection, works regardless of subdomain naming convention |
| 3467 | + - **Cons**: Additional API call, assumes redirect URIs configured correctly |
| 3468 | + - **Rationale**: Current workaround (explicit `--dns` flag) is acceptable for now |
| 3469 | + - **Target**: 2026-Q1 (low priority, user feedback needed) |
| 3470 | + - **Effort**: 2-3 hours |
| 3471 | + - **Testing**: Test with apps using non-slug subdomains |
| 3472 | + - **Reference**: Authentik API `/api/v3/core/applications/{id}/` returns `redirect_uris` array |
3458 | 3473 | - **Testing**: Verify `curl http://localhost:2019/config/` works from host |
3459 | 3474 |
|
3460 | 3475 | 4. **Missing HTTP/3 UDP Port** ([Issue #8](https://github.com/CodeMonkeyCybersecurity/eos/issues/TBD)) |
@@ -4138,3 +4153,282 @@ authentik-worker: |
4138 | 4153 | 3. **Off-site Storage** - S3 or B2? |
4139 | 4154 | - Recommendation: B2 (cost-effective, good restic support) |
4140 | 4155 |
|
| 4156 | +--- |
| 4157 | +
|
| 4158 | +## 🔐 Hecate Security & Reliability Improvements (2025-10-31 Adversarial Analysis) |
| 4159 | +
|
| 4160 | +**Last Updated**: 2025-10-31 |
| 4161 | +**Status**: P0 Complete, P1-P3 Planned |
| 4162 | +**Owner**: Henry + Claude |
| 4163 | +**Context**: Comprehensive adversarial analysis of 26 command files + 83 package files identified improvements |
| 4164 | +
|
| 4165 | +--- |
| 4166 | +
|
| 4167 | +### ✅ Completed (2025-10-31) |
| 4168 | +
|
| 4169 | +#### P0 #8: Backend Health Check Timeout Feedback ✅ |
| 4170 | +- **Priority**: P0 - Usability |
| 4171 | +- **Status**: ✅ COMPLETE |
| 4172 | +- **Effort**: 30 minutes |
| 4173 | +- **Impact**: Human-centric - users see progress during 10s backend checks |
| 4174 | +- **Implementation**: [pkg/hecate/add/bionicgpt.go:153-181](pkg/hecate/add/bionicgpt.go#L153-L181) |
| 4175 | +- **Changes**: |
| 4176 | + - Added context-aware timeout with progress feedback |
| 4177 | + - Shows "Waiting for backend response... (Xs/10s)" every 2 seconds |
| 4178 | + - Prevents user confusion during network delays |
| 4179 | +- **Evidence**: Follows "Technology serves humans" principle from CLAUDE.md |
| 4180 | +
|
| 4181 | +#### P0 #9: Docker SDK Fallback Logging ✅ |
| 4182 | +- **Priority**: P0 - Observability |
| 4183 | +- **Status**: ✅ COMPLETE |
| 4184 | +- **Effort**: 20 minutes |
| 4185 | +- **Impact**: Production troubleshooting, telemetry-enabled |
| 4186 | +- **Implementation**: [pkg/hecate/caddy_admin_api.go:76-97](pkg/hecate/caddy_admin_api.go#L76-L97) |
| 4187 | +- **Changes**: |
| 4188 | + - Replaced `fmt.Fprintf(stderr)` with structured logging (zap) |
| 4189 | + - Added error context, remediation steps, strategy tracking |
| 4190 | + - Complies with CLAUDE.md Rule #1 (ONLY use otelzap.Ctx) |
| 4191 | +- **Before**: Silent failures, no telemetry |
| 4192 | +- **After**: Structured logs with error details, remediation guidance |
| 4193 | +
|
| 4194 | +--- |
| 4195 | +
|
| 4196 | +### 📅 This Month (November 2025) |
| 4197 | +
|
| 4198 | +#### P1 #6: Admin API Network Segmentation |
| 4199 | +- **Priority**: P1 - Security |
| 4200 | +- **Status**: PLANNED |
| 4201 | +- **Effort**: 2-3 hours |
| 4202 | +- **Deadline**: 2025-11-15 |
| 4203 | +- **CVSS**: 7.2 (High) - Container compromise → full proxy control |
| 4204 | +- **Risk**: Caddy Admin API accessible to ALL containers on Docker bridge |
| 4205 | +- **Attack Scenario**: |
| 4206 | + 1. Attacker compromises any container in Hecate stack |
| 4207 | + 2. From container: `curl http://hecate-caddy:2019/config/` → retrieve full config |
| 4208 | + 3. Attacker modifies config → routes traffic to malicious backend |
| 4209 | +- **Solution**: |
| 4210 | +  ```yaml |
| 4211 | +  # docker-compose.yml |
| 4212 | +  services: |
| 4213 | +    caddy: |
| 4214 | +      networks: |
| 4215 | +        - caddy_admin  # Separate network for Admin API |
| 4216 | +        - caddy_proxy  # Existing proxy network |
| 4217 | +
|
| 4218 | +  networks: |
| 4219 | +    caddy_admin: |
| 4220 | +      internal: true  # No external routing |
| 4221 | +  ``` |
| 4222 | +- **Impact**: Limits blast radius of container compromise |
| 4223 | +- **Vendor Evidence**: Caddy docs 2025: "Protect admin endpoint... bind to permissioned unix socket" |
| 4224 | +- **Files to Change**: |
| 4225 | + - `pkg/hecate/types_docker.go` - Add admin network |
| 4226 | + - `assets/hecate/docker-compose.yml` - Update template |
| 4227 | + - Documentation update |
| 4228 | +
|
| 4229 | +#### P1 #10: Authentik Token Discovery Cleanup |
| 4230 | +- **Priority**: P1 - Reliability/Security |
| 4231 | +- **Status**: PLANNED |
| 4232 | +- **Effort**: 4-6 hours (with migration plan) |
| 4233 | +- **Deadline**: 2025-12-01 (1 month migration window) |
| 4234 | +- **Current Issues**: |
| 4235 | + - 5 different env var names (AUTHENTIK_API_TOKEN, AUTHENTIK_TOKEN, AUTHENTIK_API_KEY, etc.) |
| 4236 | + - 2 file locations (/opt/hecate/.env, /opt/bionicgpt/.env) |
| 4237 | + - Bootstrap token used as API key (never expires, root privileges) |
| 4238 | +- **Target State**: |
| 4239 | + ```yaml |
| 4240 | + # /opt/hecate/.env (SINGLE location) |
| 4241 | + AUTHENTIK_BOOTSTRAP_TOKEN=<admin-login-token> # UI login only |
| 4242 | + AUTHENTIK_API_TOKEN=<dedicated-api-token> # API access, 365d expiry |
| 4243 | + ``` |
| 4244 | +- **Migration Plan**: |
| 4245 | + - **Month 1** (Nov 2025): Add deprecation warnings for legacy vars |
| 4246 | + - **Month 3** (Jan 2026): Fail with error if legacy vars used (with migration steps) |
| 4247 | + - **Month 6** (Apr 2026): Remove legacy code paths entirely |
| 4248 | +- **Files to Change**: |
| 4249 | + - `pkg/hecate/add/bionicgpt.go:390-488` - Simplify token discovery |
| 4250 | + - `pkg/hecate/auth.go:362-423` - Remove legacy fallbacks |
| 4251 | + - `pkg/hecate/authentik/export.go` - Update token retrieval |
| 4252 | +- **Vendor Evidence**: Authentik 2023.2+ invalidates all sessions on logout |
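
A rough sketch of the Month 1 behaviour (prefer the canonical variable, fall back to a legacy one with a deprecation warning) might look like the following. The function and variable names are hypothetical. Only the legacy names spelled out in this item are listed; the ones hidden behind "etc." are deliberately not guessed.

```go
package main

import (
	"errors"
	"fmt"
)

// canonicalTokenVar is the target-state variable name from this roadmap item.
const canonicalTokenVar = "AUTHENTIK_API_TOKEN"

// legacyTokenVars holds only the legacy names the roadmap spells out.
var legacyTokenVars = []string{"AUTHENTIK_TOKEN", "AUTHENTIK_API_KEY"}

var errNoToken = errors.New("no Authentik API token configured")

// resolveAPIToken returns the token and, when a legacy variable supplied it,
// a deprecation warning. lookup abstracts the source (for example os.LookupEnv
// or a parsed .env file) so the policy can be tested in isolation.
func resolveAPIToken(lookup func(string) (string, bool)) (token, warning string, err error) {
	if v, ok := lookup(canonicalTokenVar); ok && v != "" {
		return v, "", nil
	}
	for _, name := range legacyTokenVars {
		if v, ok := lookup(name); ok && v != "" {
			warning = fmt.Sprintf("%s is deprecated; set %s in /opt/hecate/.env instead",
				name, canonicalTokenVar)
			return v, warning, nil
		}
	}
	return "", "", errNoToken
}
```

Under this sketch, the Month 3 step would amount to returning an error instead of a warning from the legacy branch, and the Month 6 step to deleting that branch.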
| 4253 | +
|
| 4254 | +--- |
| 4255 | +
|
| 4256 | +### 📅 Next Quarter (Q1 2026) |
| 4257 | +
|
| 4258 | +#### P2 #14: Implement `--remove` Flag |
| 4259 | +- **Priority**: P2 - Completeness |
| 4260 | +- **Status**: PLANNED |
| 4261 | +- **Effort**: 2-3 weeks |
| 4262 | +- **Deadline**: 2026-01-31 |
| 4263 | +- **Current State**: Returns "not yet implemented" with manual workaround |
| 4264 | +- **Impact**: Completes CRUD operations for Hecate routes |
| 4265 | +- **Design**: Use same 8-phase pattern as `--add`: |
| 4266 | + ``` |
| 4267 | + Phase 1: Validation (service exists) |
| 4268 | + Phase 2: Pre-flight checks (Caddy running) |
| 4269 | + Phase 3: Backup (BEFORE removal) |
| 4270 | + Phase 4: Service-specific cleanup (Authentik resources) |
| 4271 | + Phase 5: Remove route from Caddyfile |
| 4272 | + Phase 6: Validate and reload Caddy |
| 4273 | + Phase 7: Verify route is gone |
| 4274 | + Phase 8: Cleanup backups |
| 4275 | + ``` |
| 4276 | +- **Files to Create**: |
| 4277 | + - `pkg/hecate/remove/remove.go` - Business logic (mirror of add.go) |
| 4278 | + - `pkg/hecate/remove/validation.go` - Input validation |
| 4279 | + - `pkg/hecate/remove/integrators.go` - Service-specific cleanup |
| 4280 | +- **Integration Points**: |
| 4281 | + - `cmd/update/hecate.go:286-302` - Replace stub with delegation |
| 4282 | + - Authentik cleanup: Delete proxy provider, application |
| 4283 | + - Caddyfile: Remove route block, reload Caddy |
| 4284 | +- **Testing**: Add integration test for add → remove → verify gone |
| 4285 | +
|
| 4286 | +#### P2 #12: Backup Integrity Verification |
| 4287 | +- **Priority**: P2 - Reliability |
| 4288 | +- **Status**: PLANNED |
| 4289 | +- **Effort**: 1 week |
| 4290 | +- **Deadline**: 2025-11-30 |
| 4291 | +- **Current Gap**: Backups created but never verified |
| 4292 | +- **Risk**: Corrupt backup discovered only during emergency restore |
| 4293 | +- **Solution**: |
| 4294 | +  ```go |
| 4295 | +  func BackupCaddyfile(rc *RuntimeContext) (string, error) { |
| 4296 | +      // Create backup (sketch: timestamp, copyFile and sha256File are assumed helpers; their error handling is elided) |
| 4297 | +      backupPath := fmt.Sprintf("%s/Caddyfile.backup.%s", BackupDir, timestamp) |
| 4298 | +      copyFile(CaddyfilePath, backupPath) |
| 4299 | +
|
| 4300 | +      // VERIFY: Read back and checksum |
| 4301 | +      originalHash := sha256File(CaddyfilePath) |
| 4302 | +      backupHash := sha256File(backupPath) |
| 4303 | +
|
| 4304 | +      if originalHash != backupHash { |
| 4305 | +          os.Remove(backupPath) // Delete corrupt backup |
| 4306 | +          return "", fmt.Errorf("backup verification failed") |
| 4307 | +      } |
| 4308 | +
|
| 4309 | +      logger.Info("Backup verified", zap.String("checksum", backupHash[:16])) |
| 4310 | +      return backupPath, nil |
| 4311 | +  } |
| 4312 | +  ``` |
| 4313 | +- **Files to Change**: |
| 4314 | + - `pkg/hecate/add/backup.go` - Add verification logic |
| 4315 | + - Add SHA256 helper function |
| 4316 | +- **Testing**: Test with corrupted backup, ensure detection |
| 4317 | +- **Vendor Evidence**: Docker Compose 2025 best practices: "Configure health checks" |
| 4318 | +
|
| 4319 | +#### P2 #11: Rate Limiting on Admin API |
| 4320 | +- **Priority**: P2 - Security (DoS prevention) |
| 4321 | +- **Status**: PLANNED |
| 4322 | +- **Effort**: 1-2 weeks |
| 4323 | +- **Deadline**: 2026-01-15 |
| 4324 | +- **Risk**: Attacker floods Admin API → DoS via resource exhaustion |
| 4325 | +- **Solution**: Token bucket algorithm (10 req/s, burst of 20) |
| 4326 | +  ```go |
| 4327 | +  type RateLimitedCaddyClient struct { |
| 4328 | +      client  *CaddyAdminClient |
| 4329 | +      limiter *rate.Limiter // golang.org/x/time/rate, e.g. rate.NewLimiter(10, 20) |
| 4330 | +  } |
| 4331 | +
|
| 4332 | +  func (r *RateLimitedCaddyClient) LoadConfig(ctx context.Context, config any) error { // config type should match CaddyAdminClient.LoadConfig |
| 4333 | +      if err := r.limiter.Wait(ctx); err != nil { // blocks until a token is free or ctx ends |
| 4334 | +          return fmt.Errorf("rate limiter wait failed: %w", err) |
| 4335 | +      } |
| 4336 | +      return r.client.LoadConfig(ctx, config) |
| 4337 | +  } |
| 4338 | +  ``` |
| 4339 | +- **Files to Change**: |
| 4340 | + - `pkg/hecate/caddy_admin_api.go` - Add rate limiting wrapper |
| 4341 | + - Update all call sites to use rate-limited client |
| 4342 | +- **Monitoring**: Log rate limit violations with source for forensics |
| 4343 | +
|
| 4344 | +#### P2 #7: DNS Validation Strictness |
| 4345 | +- **Priority**: P2 - Usability |
| 4346 | +- **Status**: PLANNED |
| 4347 | +- **Effort**: 1 week |
| 4348 | +- **Deadline**: 2025-11-22 |
| 4349 | +- **Current**: DNS check failure is only a warning (non-fatal) |
| 4350 | +- **Issue**: Users may miss the warning and deploy a broken config |
| 4351 | +- **Solution**: Add `--dev` and `--prod` flags to control strictness |
| 4352 | + ```bash |
| 4353 | + eos update hecate --add app --dns test.local --upstream 10.0.0.1 --dev # Warning |
| 4354 | + eos update hecate --add app --dns prod.com --upstream 10.0.0.1 --prod # Error |
| 4355 | + ``` |
| 4356 | +- **Files to Change**: |
| 4357 | + - `cmd/update/hecate.go` - Add --dev/--prod flags |
| 4358 | + - `pkg/hecate/add/add.go:384-402` - Use flag for DNS validation strictness |
| 4359 | +- **Vendor Evidence**: Docker Compose 2025: Use `compose.production.yaml` for prod config |
| 4360 | +
|
| 4361 | +--- |
| 4362 | +
|
| 4363 | +### 📅 Backlog (Q2 2026) |
| 4364 | +
|
| 4365 | +#### P3 #13: Circuit Breaker for Authentik API |
| 4366 | +- **Priority**: P3 - Resilience |
| 4367 | +- **Status**: BACKLOG |
| 4368 | +- **Effort**: 2-3 weeks |
| 4369 | +- **Deadline**: 2026-04-30 |
| 4370 | +- **Blind Spot**: If the Authentik API is flapping, Eos retries indefinitely |
| 4371 | +- **Solution**: Use `github.com/sony/gobreaker` for circuit breaker |
| 4372 | +- **Pattern**: Open circuit after 3 consecutive failures, retry after 60s |
| 4373 | +- **Impact**: Prevents long hangs when Authentik is down, fails fast with a clear error |
| 4374 | +
|
| 4375 | +#### P3 #15: Metrics/Observability for Caddy |
| 4376 | +- **Priority**: P3 - Operations |
| 4377 | +- **Status**: BACKLOG |
| 4378 | +- **Effort**: 2-3 months |
| 4379 | +- **Deadline**: 2026-06-30 |
| 4380 | +- **Blind Spot**: No visibility into Caddy performance (latency, error rates) |
| 4381 | +- **Solution**: Add `eos read hecate metrics` command |
| 4382 | + ```bash |
| 4383 | + # Output: |
| 4384 | + Caddy Metrics (Last 5 minutes): |
| 4385 | + Total Requests: 15,234 |
| 4386 | + Error Rate: 0.2% |
| 4387 | + P50 Latency: 45ms |
| 4388 | + P95 Latency: 120ms |
| 4389 | +
|
| 4390 | + Backend Health: |
| 4391 | + bionicgpt: Healthy (99.8% uptime) |
| 4392 | + wazuh: Degraded (2 failures in 5min) |
| 4393 | + ``` |
| 4394 | +- **Implementation**: Use Caddy Admin API `/metrics` or parse JSON logs |
| 4395 | +- **Vendor Evidence**: Caddy docs: `/reverse_proxy/upstreams` endpoint for backend status |
| 4396 | +
|
| 4397 | +--- |
| 4398 | +
|
| 4399 | +### 📊 Priority Matrix |
| 4400 | +
|
| 4401 | +| Priority | Items | Timeline | Effort | Impact | |
| 4402 | +|----------|-------|----------|--------|--------| |
| 4403 | +| **P0** | 2 fixes | ✅ Complete | 1 hour | Usability + Observability | |
| 4404 | +| **P1** | 2 items | Nov 2025 | 6-9 hours | Security + Reliability | |
| 4405 | +| **P2** | 4 items | Q1 2026 | 5-7 weeks | Completeness + Resilience | |
| 4406 | +| **P3** | 2 items | Q2 2026 | 3-4 months | Operations + Monitoring | |
| 4407 | +
|
| 4408 | +--- |
| 4409 | +
|
| 4410 | +### 🎯 Success Metrics |
| 4411 | +
|
| 4412 | +**November 2025** (This Month): |
| 4413 | +- [ ] P1 #6: Admin API network segmentation deployed |
| 4414 | +- [ ] P1 #10: Token discovery simplified, migration plan announced |
| 4415 | +
|
| 4416 | +**Q1 2026** (Next Quarter): |
| 4417 | +- [ ] P2 #14: `--remove` flag fully implemented |
| 4418 | +- [ ] P2 #12: All backups verified with SHA256 |
| 4419 | +- [ ] P2 #11: Rate limiting prevents API DoS |
| 4420 | +- [ ] P2 #7: Production deployments fail on DNS issues |
| 4421 | +
|
| 4422 | +**Q2 2026** (Backlog): |
| 4423 | +- [ ] P3 #13: Circuit breaker prevents Authentik cascade failures |
| 4424 | +- [ ] P3 #15: Operators have visibility into Caddy performance |
| 4425 | +
|
| 4426 | +--- |
| 4427 | +
|
| 4428 | +### 📚 References |
| 4429 | +
|
| 4430 | +- **Adversarial Analysis Date**: 2025-10-31 |
| 4431 | +- **Vendor Documentation**: Caddy 2025, Authentik 2025, Docker Compose 2025 |
| 4432 | +- **Industry Standards**: OWASP, NIST, SOC2, PCI-DSS |
| 4433 | +- **Compliance**: Human-centric, Evidence-based, Sustainable Innovation (CLAUDE.md) |
| 4434 | +
|