
Commit 3513f6a

feat: enhance security and reliability of Hecate deployment
- Added comprehensive security roadmap with prioritized improvements (P0-P3)
- Implemented P0 fixes for backend health check feedback and Docker SDK logging
- Added network segmentation plan for Caddy Admin API to prevent container escape
- Enhanced self-enrollment configuration with improved CAPTCHA controls and domain handling
- Updated command deprecation notices for enable subcommand with migration guidance
- Added detailed success
1 parent f039a26 commit 3513f6a

6 files changed

Lines changed: 447 additions & 49 deletions


ROADMAP.md

Lines changed: 295 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# Eos Development Roadmap
22

3-
**Last Updated**: 2025-10-30
3+
**Last Updated**: 2025-10-31
44
**Version**: 1.2
55

66
---
@@ -3455,6 +3455,21 @@ sudo eos update hecate --migrate-to-vault # Migrates existing .env to Vault
34553455
- **Change**: Add `127.0.0.1:2019:2019` to Caddy ports
34563456
- **Impact**: Enables Option B drift detection, Option C precipitate, oauth2-signout injection
34573457
- **Effort**: 5 minutes
3458+
3459+
4. **Domain Auto-Detection via Redirect URIs** 📅 DEFERRED (P2 - Polish)
3460+
- **File**: [pkg/hecate/self_enrollment.go:129-164](pkg/hecate/self_enrollment.go#L129-L164)
3461+
- **Current**: Matches app slug to domain prefix (e.g., "bionicgpt" → "bionicgpt.example.com")
3462+
- **Problem**: Fails when user chooses different subdomain (e.g., "chat.example.com" for bionicgpt)
3463+
- **Solution**: Query Authentik application's `redirect_uris` field via API
3464+
- Extract domain from redirect URI: `https://chat.codemonkey.net.au/akprox/callback` → `chat.codemonkey.net.au`
3465+
- Match extracted domain against Caddy routes
3466+
- **Pros**: True auto-detection, works regardless of subdomain naming convention
3467+
- **Cons**: Additional API call, assumes redirect URIs configured correctly
3468+
- **Rationale**: Current workaround (explicit `--dns` flag) is acceptable for now
3469+
- **Target**: 2026-Q1 (low priority, user feedback needed)
3470+
- **Effort**: 2-3 hours
3471+
- **Testing**: Test with apps using non-slug subdomains
3472+
- **Reference**: Authentik API `/api/v3/core/applications/{id}/` returns `redirect_uris` array
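The domain extraction and route matching proposed above could look roughly like the sketch below. It assumes the redirect URIs are plain URLs (not regex patterns) and that the Caddy routes are already available as a list of hostnames; `domainFromRedirectURI` and `matchRoute` are hypothetical names, not existing Eos functions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// domainFromRedirectURI returns the lowercase host (without port) of a redirect URI.
func domainFromRedirectURI(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse redirect URI %q: %w", raw, err)
	}
	host := u.Hostname()
	if host == "" {
		return "", fmt.Errorf("redirect URI %q has no host", raw)
	}
	return strings.ToLower(host), nil
}

// matchRoute returns the first redirect-URI domain that is also a known Caddy route.
func matchRoute(redirectURIs, caddyRoutes []string) (string, bool) {
	known := make(map[string]struct{}, len(caddyRoutes))
	for _, r := range caddyRoutes {
		known[strings.ToLower(r)] = struct{}{}
	}
	for _, raw := range redirectURIs {
		domain, err := domainFromRedirectURI(raw)
		if err != nil {
			continue // skip malformed entries instead of failing the whole lookup
		}
		if _, ok := known[domain]; ok {
			return domain, true
		}
	}
	return "", false
}

func main() {
	uris := []string{"https://chat.codemonkey.net.au/akprox/callback"}
	routes := []string{"wiki.codemonkey.net.au", "chat.codemonkey.net.au"}
	domain, ok := matchRoute(uris, routes)
	fmt.Println(domain, ok)
}
```

If no redirect URI matches a route, the caller would presumably fall back to the explicit `--dns` flag described in the rationale.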
34583473
- **Testing**: Verify `curl http://localhost:2019/config/` works from host
34593474
34603475
4. **Missing HTTP/3 UDP Port** ([Issue #8](https://github.com/CodeMonkeyCybersecurity/eos/issues/TBD))
@@ -4138,3 +4153,282 @@ authentik-worker:
41384153
3. **Off-site Storage** - S3 or B2?
41394154
- Recommendation: B2 (cost-effective, good restic support)
41404155
4156+
---
4157+
4158+
## 🔐 Hecate Security & Reliability Improvements (2025-10-31 Adversarial Analysis)
4159+
4160+
**Last Updated**: 2025-10-31
4161+
**Status**: P0 Complete, P1-P3 Planned
4162+
**Owner**: Henry + Claude
4163+
**Context**: Comprehensive adversarial analysis of 26 command files + 83 package files identified improvements
4164+
4165+
---
4166+
4167+
### ✅ Completed (2025-10-31)
4168+
4169+
#### P0 #8: Backend Health Check Timeout Feedback ✅
4170+
- **Priority**: P0 - Usability
4171+
- **Status**: ✅ COMPLETE
4172+
- **Effort**: 30 minutes
4173+
- **Impact**: Human-centric - users see progress during 10s backend checks
4174+
- **Implementation**: [pkg/hecate/add/bionicgpt.go:153-181](pkg/hecate/add/bionicgpt.go#L153-L181)
4175+
- **Changes**:
4176+
- Added context-aware timeout with progress feedback
4177+
- Shows "Waiting for backend response... (Xs/10s)" every 2 seconds
4178+
- Prevents user confusion during network delays
4179+
- **Evidence**: Follows "Technology serves humans" principle from CLAUDE.md
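As an illustration of the pattern described in this item (not the code in `bionicgpt.go`), a context-aware wait with periodic progress output might be sketched as follows. The names `waitWithProgress` and `runDemo` are hypothetical, and the demo uses millisecond timings so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitWithProgress runs check in the background and calls report every interval
// until check returns or the timeout elapses.
func waitWithProgress(ctx context.Context, timeout, interval time.Duration,
	check func(context.Context) error, report func(elapsed time.Duration)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- check(ctx) }()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	start := time.Now()

	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			report(time.Since(start))
		case <-ctx.Done():
			return fmt.Errorf("backend did not respond within %s: %w", timeout, ctx.Err())
		}
	}
}

// runDemo simulates a backend that answers after backendDelayMs milliseconds.
func runDemo(backendDelayMs, timeoutMs int) error {
	timeout := time.Duration(timeoutMs) * time.Millisecond
	check := func(ctx context.Context) error {
		select {
		case <-time.After(time.Duration(backendDelayMs) * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	report := func(elapsed time.Duration) {
		fmt.Printf("Waiting for backend response... (%dms/%dms)\n",
			elapsed.Milliseconds(), timeout.Milliseconds())
	}
	return waitWithProgress(context.Background(), timeout, timeout/5, check, report)
}

func main() {
	fmt.Println("fast backend:", runDemo(60, 200))
	fmt.Println("slow backend:", runDemo(300, 100))
}
```

In the roadmap's terms the timeout would be 10 seconds and the interval 2 seconds.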
4180+
4181+
#### P0 #9: Docker SDK Fallback Logging ✅
4182+
- **Priority**: P0 - Observability
4183+
- **Status**: ✅ COMPLETE
4184+
- **Effort**: 20 minutes
4185+
- **Impact**: Production troubleshooting, telemetry-enabled
4186+
- **Implementation**: [pkg/hecate/caddy_admin_api.go:76-97](pkg/hecate/caddy_admin_api.go#L76-L97)
4187+
- **Changes**:
4188+
- Replaced `fmt.Fprintf(stderr)` with structured logging (zap)
4189+
- Added error context, remediation steps, strategy tracking
4190+
- Complies with CLAUDE.md Rule #1 (ONLY use otelzap.Ctx)
4191+
- **Before**: Silent failures, no telemetry
4192+
- **After**: Structured logs with error details, remediation guidance
4193+
4194+
---
4195+
4196+
### 📅 This Month (November 2025)
4197+
4198+
#### P1 #6: Admin API Network Segmentation
4199+
- **Priority**: P1 - Security
4200+
- **Status**: PLANNED
4201+
- **Effort**: 2-3 hours
4202+
- **Deadline**: 2025-11-15
4203+
- **CVSS**: 7.2 (High) - Container compromise → full proxy control
4204+
- **Risk**: Caddy Admin API accessible to ALL containers on Docker bridge
4205+
- **Attack Scenario**:
4206+
1. Attacker compromises any container in Hecate stack
4207+
2. From container: `curl http://hecate-caddy:2019/config/` → retrieve full config
4208+
3. Attacker modifies config → routes traffic to malicious backend
4209+
- **Solution**:
4210+
```yaml
# docker-compose.yml
services:
  caddy:
    networks:
      - caddy_admin  # Separate network for Admin API
      - caddy_proxy  # Existing proxy network

networks:
  caddy_admin:
    internal: true  # No external routing
```
4222+
- **Impact**: Limits blast radius of container compromise
4223+
- **Vendor Evidence**: Caddy docs 2025: "Protect admin endpoint... bind to permissioned unix socket"
4224+
- **Files to Change**:
4225+
- `pkg/hecate/types_docker.go` - Add admin network
4226+
- `assets/hecate/docker-compose.yml` - Update template
4227+
- Documentation update
4228+
4229+
#### P1 #10: Authentik Token Discovery Cleanup
4230+
- **Priority**: P1 - Reliability/Security
4231+
- **Status**: PLANNED
4232+
- **Effort**: 4-6 hours (with migration plan)
4233+
- **Deadline**: 2025-12-01 (1 month migration window)
4234+
- **Current Issues**:
4235+
- 5 different env var names (AUTHENTIK_API_TOKEN, AUTHENTIK_TOKEN, AUTHENTIK_API_KEY, etc.)
4236+
- 2 file locations (/opt/hecate/.env, /opt/bionicgpt/.env)
4237+
- Bootstrap token used as API key (never expires, root privileges)
4238+
- **Target State**:
4239+
```yaml
4240+
# /opt/hecate/.env (SINGLE location)
4241+
AUTHENTIK_BOOTSTRAP_TOKEN=<admin-login-token> # UI login only
4242+
AUTHENTIK_API_TOKEN=<dedicated-api-token> # API access, 365d expiry
4243+
```
4244+
- **Migration Plan**:
4245+
- **Month 1** (Nov 2025): Add deprecation warnings for legacy vars
4246+
- **Month 3** (Jan 2026): Fail with error if legacy vars used (with migration steps)
4247+
- **Month 6** (Apr 2026): Remove legacy code paths entirely
4248+
- **Files to Change**:
4249+
- `pkg/hecate/add/bionicgpt.go:390-488` - Simplify token discovery
4250+
- `pkg/hecate/auth.go:362-423` - Remove legacy fallbacks
4251+
- `pkg/hecate/authentik/export.go` - Update token retrieval
4252+
- **Vendor Evidence**: Authentik 2023.2+ invalidates all sessions on logout
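One way the staged deprecation could be expressed in code is sketched below. It is a minimal illustration, assuming a single preferred variable and a `strict` switch that flips from warning to error at the Month 3 milestone; `resolveToken` is a hypothetical name, and only the two legacy names spelled out above are listed.

```go
package main

import (
	"fmt"
	"os"
)

const preferredTokenVar = "AUTHENTIK_API_TOKEN"

// legacyTokenVars holds the deprecated names named in the roadmap.
// The roadmap mentions five names in total; the rest are elided there ("etc.").
var legacyTokenVars = []string{"AUTHENTIK_TOKEN", "AUTHENTIK_API_KEY"}

// resolveToken looks up the API token through getenv.
// With strict=false a legacy variable still works but yields a warning;
// with strict=true a legacy variable is rejected with migration guidance.
func resolveToken(getenv func(string) string, strict bool) (token, warning string, err error) {
	if v := getenv(preferredTokenVar); v != "" {
		return v, "", nil
	}
	for _, name := range legacyTokenVars {
		v := getenv(name)
		if v == "" {
			continue
		}
		if strict {
			return "", "", fmt.Errorf("%s is no longer supported; set %s in /opt/hecate/.env",
				name, preferredTokenVar)
		}
		return v, fmt.Sprintf("%s is deprecated; rename it to %s", name, preferredTokenVar), nil
	}
	return "", "", fmt.Errorf("no Authentik API token found; set %s", preferredTokenVar)
}

func main() {
	token, warning, err := resolveToken(os.Getenv, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if warning != "" {
		fmt.Println("warning:", warning)
	}
	fmt.Println("token length:", len(token))
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the logic testable without touching the process environment.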
4253+
4254+
---
4255+
4256+
### 📅 Next Quarter (Q1 2026)
4257+
4258+
#### P2 #14: Implement `--remove` Flag
4259+
- **Priority**: P2 - Completeness
4260+
- **Status**: PLANNED
4261+
- **Effort**: 2-3 weeks
4262+
- **Deadline**: 2026-01-31
4263+
- **Current State**: Returns "not yet implemented" with manual workaround
4264+
- **Impact**: Completes CRUD operations for Hecate routes
4265+
- **Design**: Use same 8-phase pattern as `--add`:
4266+
```
4267+
Phase 1: Validation (service exists)
4268+
Phase 2: Pre-flight checks (Caddy running)
4269+
Phase 3: Backup (BEFORE removal)
4270+
Phase 4: Service-specific cleanup (Authentik resources)
4271+
Phase 5: Remove route from Caddyfile
4272+
Phase 6: Validate and reload Caddy
4273+
Phase 7: Verify route is gone
4274+
Phase 8: Cleanup backups
4275+
```
4276+
- **Files to Create**:
4277+
- `pkg/hecate/remove/remove.go` - Business logic (mirror of add.go)
4278+
- `pkg/hecate/remove/validation.go` - Input validation
4279+
- `pkg/hecate/remove/integrators.go` - Service-specific cleanup
4280+
- **Integration Points**:
4281+
- `cmd/update/hecate.go:286-302` - Replace stub with delegation
4282+
- Authentik cleanup: Delete proxy provider, application
4283+
- Caddyfile: Remove route block, reload Caddy
4284+
- **Testing**: Add integration test for add → remove → verify gone
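The eight-phase design above amounts to an ordered pipeline that stops at the first failure. The sketch below shows that control flow only; the `phase` type and `runPhases` function are hypothetical, and every phase body is a stub.

```go
package main

import (
	"errors"
	"fmt"
)

// errRouteNotFound is a stand-in failure used by the demo.
var errRouteNotFound = errors.New("route block not found")

// phase pairs a human-readable name with the work to run.
type phase struct {
	name string
	run  func() error
}

// runPhases executes phases in order and stops at the first failure.
// It returns the names of the phases that completed.
func runPhases(phases []phase) ([]string, error) {
	completed := make([]string, 0, len(phases))
	for i, p := range phases {
		if err := p.run(); err != nil {
			return completed, fmt.Errorf("phase %d (%s) failed: %w", i+1, p.name, err)
		}
		completed = append(completed, p.name)
	}
	return completed, nil
}

func main() {
	ok := func() error { return nil }
	phases := []phase{
		{"validation", ok},
		{"pre-flight checks", ok},
		{"backup", ok},
		{"service-specific cleanup", ok},
		{"remove route from Caddyfile", func() error { return errRouteNotFound }},
		{"validate and reload Caddy", ok},
		{"verify route is gone", ok},
		{"cleanup backups", ok},
	}
	done, err := runPhases(phases)
	fmt.Println(len(done), err)
}
```

Because the backup phase runs before any destructive phase, a failure in phase 5 or later leaves a backup available for restore.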
4285+
4286+
#### P2 #12: Backup Integrity Verification
4287+
- **Priority**: P2 - Reliability
4288+
- **Status**: PLANNED
4289+
- **Effort**: 1 week
4290+
- **Deadline**: 2025-11-30
4291+
- **Current Gap**: Backups created but never verified
4292+
- **Risk**: Corrupt backup discovered only during emergency restore
4293+
- **Solution**:
4294+
```go
func BackupCaddyfile(rc *RuntimeContext) (string, error) {
	// Create backup. copyFile and sha256File are helpers assumed to return errors;
	// the timestamp format is illustrative.
	timestamp := time.Now().Format("20060102-150405")
	backupPath := fmt.Sprintf("%s/Caddyfile.backup.%s", BackupDir, timestamp)
	if err := copyFile(CaddyfilePath, backupPath); err != nil {
		return "", fmt.Errorf("create backup: %w", err)
	}

	// VERIFY: Read back and checksum
	originalHash, err := sha256File(CaddyfilePath)
	if err != nil {
		return "", fmt.Errorf("hash original: %w", err)
	}
	backupHash, err := sha256File(backupPath)
	if err != nil {
		return "", fmt.Errorf("hash backup: %w", err)
	}

	if originalHash != backupHash {
		os.Remove(backupPath) // Delete corrupt backup
		return "", fmt.Errorf("backup verification failed: checksum mismatch")
	}

	logger.Info("Backup verified", zap.String("checksum", backupHash[:16]))
	return backupPath, nil
}
```
4313+
- **Files to Change**:
4314+
- `pkg/hecate/add/backup.go` - Add verification logic
4315+
- Add SHA256 helper function
4316+
- **Testing**: Test with corrupted backup, ensure detection
4317+
- **Vendor Evidence**: Docker Compose 2025 best practices: "Configure health checks"
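The SHA256 helper listed under "Files to Change" could be as small as the sketch below. It streams the file through the hash instead of loading it into memory. `sha256File` matches the name used in the solution snippet, but its signature here is an assumption, and `writeTemp` exists only to make the demo self-contained.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// sha256File returns the hex-encoded SHA-256 digest of the file at path.
func sha256File(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", fmt.Errorf("open %s: %w", path, err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", fmt.Errorf("hash %s: %w", path, err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// writeTemp is a demo-only helper: it writes content to a temp file and returns the path.
func writeTemp(content string) (string, error) {
	f, err := os.CreateTemp("", "caddyfile-demo-*")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	original, err := writeTemp("example.com {\n\treverse_proxy 10.0.0.1:8080\n}\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(original)

	// Simulate a truncated (corrupt) backup.
	backup, err := writeTemp("example.com {\n\treverse_proxy 10.0.0.1:8080\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(backup)

	a, errA := sha256File(original)
	b, errB := sha256File(backup)
	fmt.Println("hashes match:", errA == nil && errB == nil && a == b)
}
```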
4318+
4319+
#### P2 #11: Rate Limiting on Admin API
4320+
- **Priority**: P2 - Security (DoS prevention)
4321+
- **Status**: PLANNED
4322+
- **Effort**: 1-2 weeks
4323+
- **Deadline**: 2026-01-15
4324+
- **Risk**: Attacker floods Admin API → DoS via resource exhaustion
4325+
- **Solution**: Token bucket algorithm (10 req/s, burst of 20)
4326+
```go
type RateLimitedCaddyClient struct {
	client  *CaddyAdminClient
	limiter *rate.Limiter // golang.org/x/time/rate, e.g. rate.NewLimiter(10, 20)
}

// The config parameter type is an assumption; it should match CaddyAdminClient.LoadConfig.
func (r *RateLimitedCaddyClient) LoadConfig(ctx context.Context, config map[string]any) error {
	// Wait blocks until a token is available or ctx is cancelled.
	if err := r.limiter.Wait(ctx); err != nil {
		return fmt.Errorf("rate limit wait failed: %w", err)
	}
	return r.client.LoadConfig(ctx, config)
}
```
4339+
- **Files to Change**:
4340+
- `pkg/hecate/caddy_admin_api.go` - Add rate limiting wrapper
4341+
- Update all call sites to use rate-limited client
4342+
- **Monitoring**: Log rate limit violations with source for forensics
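To make the "10 req/s, burst of 20" parameters concrete, here is a standard-library-only sketch of a token bucket. It is not the `golang.org/x/time/rate` implementation the plan proposes to use, only an illustration of the algorithm; `tokenBucket` and `simulate` are hypothetical names, and the type is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket refills at rate tokens per second up to burst tokens.
type tokenBucket struct {
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

// newTokenBucket starts with a full bucket.
func newTokenBucket(ratePerSec, burst float64, now time.Time) *tokenBucket {
	return &tokenBucket{rate: ratePerSec, burst: burst, tokens: burst, last: now}
}

// allow reports whether one request may proceed at time now.
func (b *tokenBucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate fires `requests` calls at t0 and again after gapMillis,
// returning how many were allowed in each batch.
func simulate(requests, gapMillis int) (first, second int) {
	t0 := time.Unix(0, 0)
	b := newTokenBucket(10, 20, t0)
	for i := 0; i < requests; i++ {
		if b.allow(t0) {
			first++
		}
	}
	t1 := t0.Add(time.Duration(gapMillis) * time.Millisecond)
	for i := 0; i < requests; i++ {
		if b.allow(t1) {
			second++
		}
	}
	return first, second
}

func main() {
	first, second := simulate(25, 500)
	fmt.Println(first, second) // prints "20 5"
}
```

The first batch drains the burst of 20; half a second later the bucket has refilled by 5 tokens, so only 5 more requests pass.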
4343+
4344+
#### P2 #7: DNS Validation Strictness
4345+
- **Priority**: P2 - Usability
4346+
- **Status**: PLANNED
4347+
- **Effort**: 1 week
4348+
- **Deadline**: 2025-11-22
4349+
- **Current**: DNS check is warning (non-fatal)
4350+
- **Issue**: User may not notice warning, deploy broken config
4351+
- **Solution**: Add `--dev` and `--prod` flags to control strictness
4352+
```bash
4353+
eos update hecate --add app --dns test.local --upstream 10.0.0.1 --dev # Warning
4354+
eos update hecate --add app --dns prod.com --upstream 10.0.0.1 --prod # Error
4355+
```
4356+
- **Files to Change**:
4357+
- `cmd/update/hecate.go` - Add --dev/--prod flags
4358+
- `pkg/hecate/add/add.go:384-402` - Use flag for DNS validation strictness
4359+
- **Vendor Evidence**: Docker Compose 2025: Use `compose.production.yaml` for prod config
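The dev/prod split could reduce to one function that either returns a warning or an error depending on a strictness switch, as in the sketch below. `checkDNS` is a hypothetical name; the resolver is injected so the example runs offline, and `net.LookupHost` from the standard library has the same signature as the `lookup` parameter.

```go
package main

import (
	"errors"
	"fmt"
)

// checkDNS resolves domain via lookup. On failure it returns an error when
// strict is true (the proposed --prod behavior) and a warning otherwise (--dev).
func checkDNS(domain string, strict bool, lookup func(string) ([]string, error)) (warning string, err error) {
	addrs, lookupErr := lookup(domain)
	if lookupErr == nil && len(addrs) > 0 {
		return "", nil
	}
	if lookupErr == nil {
		lookupErr = errors.New("no addresses returned")
	}
	if strict {
		return "", fmt.Errorf("DNS check failed for %s: %w", domain, lookupErr)
	}
	return fmt.Sprintf("WARNING: DNS check failed for %s (%v); continuing in dev mode",
		domain, lookupErr), nil
}

func main() {
	// Stub resolver that knows a single name; swap in net.LookupHost for real lookups.
	stub := func(host string) ([]string, error) {
		if host == "prod.com" {
			return []string{"10.0.0.1"}, nil
		}
		return nil, errors.New("no such host")
	}

	warning, err := checkDNS("test.local", false, stub)
	fmt.Println(warning, err)

	_, err = checkDNS("test.local", true, stub)
	fmt.Println(err)
}
```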
4360+
4361+
---
4362+
4363+
### 📅 Backlog (Q2 2026)
4364+
4365+
#### P3 #13: Circuit Breaker for Authentik API
4366+
- **Priority**: P3 - Resilience
4367+
- **Status**: BACKLOG
4368+
- **Effort**: 2-3 weeks
4369+
- **Deadline**: 2026-04-30
4370+
- **Blind Spot**: If the Authentik API is flapping, Eos retries indefinitely
4371+
- **Solution**: Use `github.com/sony/gobreaker` for circuit breaker
4372+
- **Pattern**: Open circuit after 3 consecutive failures, retry after 60s
4373+
- **Impact**: Prevents long hangs when Authentik is down; fails fast with a clear error
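The pattern above (open after 3 consecutive failures, retry after 60s) can be illustrated without any dependency. The sketch below is not `github.com/sony/gobreaker`, just a minimal stand-in with an injectable clock so the behavior can be shown deterministically; `breaker` and `newDemoBreaker` are hypothetical names, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errCircuitOpen = errors.New("circuit open: Authentik API unavailable, failing fast")
	errBackendDown = errors.New("authentik: connection refused") // demo failure
)

// breaker opens after maxFailures consecutive failures and allows
// one trial call once cooldown has elapsed.
type breaker struct {
	maxFailures int
	cooldown    time.Duration
	failures    int
	open        bool
	openedAt    time.Time
	now         func() time.Time // injectable clock
}

// call runs fn unless the circuit is open and still cooling down.
func (b *breaker) call(fn func() error) error {
	if b.open && b.now().Sub(b.openedAt) < b.cooldown {
		return errCircuitOpen
	}
	if err := fn(); err != nil {
		b.failures++
		if b.open || b.failures >= b.maxFailures {
			b.open = true // (re)open and restart the cooldown
			b.openedAt = b.now()
		}
		return err
	}
	b.failures = 0
	b.open = false
	return nil
}

// newDemoBreaker returns a 3-failure/60s breaker on a fake clock,
// plus a function that advances that clock by whole seconds.
func newDemoBreaker() (*breaker, func(seconds int)) {
	current := time.Unix(0, 0)
	b := &breaker{maxFailures: 3, cooldown: 60 * time.Second,
		now: func() time.Time { return current }}
	return b, func(seconds int) { current = current.Add(time.Duration(seconds) * time.Second) }
}

func main() {
	b, advance := newDemoBreaker()
	attempts := 0
	fail := func() error { attempts++; return errBackendDown }

	for i := 0; i < 5; i++ {
		fmt.Println(b.call(fail))
	}
	fmt.Println("backend attempts:", attempts) // prints "backend attempts: 3"

	advance(61)
	fmt.Println(b.call(func() error { return nil }), "open:", b.open)
}
```

After the third failure the remaining calls return immediately without touching the backend, which is the fail-fast behavior this item is after.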
4374+
4375+
#### P3 #15: Metrics/Observability for Caddy
4376+
- **Priority**: P3 - Operations
4377+
- **Status**: BACKLOG
4378+
- **Effort**: 2-3 months
4379+
- **Deadline**: 2026-06-30
4380+
- **Blind Spot**: No visibility into Caddy performance (latency, error rates)
4381+
- **Solution**: Add `eos read hecate metrics` command
4382+
```bash
4383+
# Output:
4384+
Caddy Metrics (Last 5 minutes):
4385+
Total Requests: 15,234
4386+
Error Rate: 0.2%
4387+
P50 Latency: 45ms
4388+
P95 Latency: 120ms
4389+
4390+
Backend Health:
4391+
bionicgpt: Healthy (99.8% uptime)
4392+
wazuh: Degraded (2 failures in 5min)
4393+
```
4394+
- **Implementation**: Use Caddy Admin API `/metrics` or parse JSON logs
4395+
- **Vendor Evidence**: Caddy docs: `/reverse_proxy/upstreams` endpoint for backend status
4396+
4397+
---
4398+
4399+
### 📊 Priority Matrix
4400+
4401+
| Priority | Items | Timeline | Effort | Impact |
|----------|-------|----------|--------|--------|
| **P0** | 2 fixes | ✅ Complete | ~1 hour | Usability + Observability |
| **P1** | 2 items | Nov 2025 | 6-9 hours | Security + Reliability |
| **P2** | 4 items | Q1 2026 | 6-8 weeks | Completeness + Resilience |
| **P3** | 2 items | Q2 2026 | 3-5 months | Operations + Monitoring |
4407+
4408+
---
4409+
4410+
### 🎯 Success Metrics
4411+
4412+
**November 2025** (This Month):
4413+
- [ ] P1 #6: Admin API network segmentation deployed
4414+
- [ ] P1 #10: Token discovery simplified, migration plan announced
4415+
4416+
**Q1 2026** (Next Quarter):
4417+
- [ ] P2 #14: `--remove` flag fully implemented
4418+
- [ ] P2 #12: All backups verified with SHA256
4419+
- [ ] P2 #11: Rate limiting prevents API DoS
4420+
- [ ] P2 #7: Production deployments fail on DNS issues
4421+
4422+
**Q2 2026** (Backlog):
4423+
- [ ] P3 #13: Circuit breaker prevents Authentik cascade failures
4424+
- [ ] P3 #15: Operators have visibility into Caddy performance
4425+
4426+
---
4427+
4428+
### 📚 References
4429+
4430+
- **Adversarial Analysis Date**: 2025-10-31
4431+
- **Vendor Documentation**: Caddy 2025, Authentik 2025, Docker Compose 2025
4432+
- **Industry Standards**: OWASP, NIST, SOC2, PCI-DSS
4433+
- **Compliance**: Human-centric, Evidence-based, Sustainable Innovation (CLAUDE.md)
4434+

cmd/update/hecate.go

Lines changed: 8 additions & 1 deletion
@@ -64,6 +64,11 @@ Examples:
6464
eos update hecate certs # Only renew certificates
6565
eos update hecate k3s # Update k3s deployment
6666
67+
# Enable features (OAuth2 signout, self-enrollment)
68+
eos update hecate --enable oauth2-signout # Add logout handlers to protected routes
69+
eos update hecate --enable self-enrollment --app bionicgpt --dns chat.example.com
70+
eos update hecate --enable self-enrollment --app bionicgpt --dns chat.example.com --dry-run
71+
6772
# Fix Caddy configuration drift (Admin API binding + network name)
6873
eos update hecate --fix caddy # Apply both fixes and restart Caddy
6974
eos update hecate --fix caddy --dry-run # Preview fixes without applying
@@ -215,7 +220,9 @@ func init() {
215220
updateHecateCmd.Flags().String("authentik-host", "hecate-server-1", "Authentik hostname (used with --enable)")
216221
updateHecateCmd.Flags().Int("authentik-port", hecate.AuthentikPort, "Authentik port (used with --enable)")
217222
updateHecateCmd.Flags().Bool("skip-caddyfile", false, "Skip Caddyfile updates (used with --enable, advanced usage)")
218-
updateHecateCmd.Flags().Bool("enable-captcha", false, "Enable captcha for self-enrollment (used with --enable self-enrollment)")
223+
updateHecateCmd.Flags().Bool("enable-captcha", true, "Enable captcha for self-enrollment (default: true, uses test keys initially)")
224+
updateHecateCmd.Flags().Bool("disable-captcha", false, "Disable captcha protection (NOT RECOMMENDED for production)")
225+
updateHecateCmd.Flags().Bool("require-approval", false, "New users inactive until admin approves (default: active immediately)")
219226

220227
// Optional flags for --add
221228
updateHecateCmd.Flags().Bool("sso", false, "Enable SSO for this route (NOTE: BionicGPT always uses Authentik forward auth regardless of this flag)")
