Adaptive Resilience Engineering Framework v2.0.0 Author: Dennis 'dnoice' Smaltz | digiSpace Technical Studio
aref/
| -- Dockerfile # Python 3.11-slim, pip install ., default uvicorn CMD
| -- docker-compose.yml # Full stack: 5 services + dashboard + postgres/redis/prometheus/grafana
| -- pyproject.toml # Hatchling build, all dependencies, CLI entrypoints
| -- README.md # Stub (single heading)
| -- .dockerignore
| -- .hintrc
|
| -- aref/ # Core AREF framework package
| | -- __init__.py # v2.0.0, package root
| |
| | -- core/ # Shared infrastructure
| | | -- __init__.py
| | | -- config.py # Pydantic-settings config (env-driven, per-pillar)
| | | -- events.py # Async in-process event bus (pub/sub + history)
| | | -- metrics.py # Prometheus metrics + AREF formula engine (MTTD/MTTR/CRS)
| | | -- models.py # Domain objects: Incident, ActionItem, ServiceInfo, HealthCheck
| | | -- logging.py # structlog configuration
| |
| | -- detection/ # Pillar I: Early warning & anomaly identification
| | | -- __init__.py
| | | -- engine.py # DetectionEngine — orchestrates all detectors, alert lifecycle
| | | -- anomaly.py # Z-score + Isolation Forest anomaly detection
| | | -- threshold.py # ThresholdDetector — metric > threshold with consecutive confirmation
| | | -- synthetic.py # SyntheticProber — active HTTP health checks
| | | -- sli_tracker.py # SLI/SLO tracking + error budget computation
| |
| | -- absorption/ # Pillar II: Impact containment & graceful degradation
| | | -- __init__.py
| | | -- circuit_breaker.py # CircuitBreaker + CircuitBreakerRegistry (singleton)
| | | -- bulkhead.py # Semaphore-based concurrency limits per partition
| | | -- rate_limiter.py # Token bucket rate limiting
| | | -- blast_radius.py # Dependency graph + blast radius analysis
| | | -- degradation.py # 4-tier graceful degradation (Full/Reduced/Minimal/Emergency)
| |
| | -- adaptation/ # Pillar III: Real-time reconfiguration
| | | -- __init__.py
| | | -- engine.py # AdaptationEngine — subscribes to alerts, executes strategies
| | | -- decision_tree.py # 6-step decision tree: classify anomaly -> select strategy
| | | -- feature_flags.py # FeatureFlagManager — toggle features, shed non-critical load
| | | -- traffic_shifter.py # TrafficShifter — weighted route redistribution
| | | -- scaler.py # AutoScaler — simulated horizontal scaling
| |
| | -- recovery/ # Pillar IV: Service restoration
| | | -- __init__.py
| | | -- engine.py # RecoveryEngine — tiered T0-T4 recovery orchestration
| | | -- runbooks.py # RunbookExecutor — YAML-loadable, step-by-step runbooks
| |
| | -- evolution/ # Pillar V: Post-incident learning
| | | -- __init__.py
| | | -- engine.py # EvolutionEngine — auto-generates reviews on recovery_resolved
| | | -- post_incident.py # PostIncidentReviewGenerator — 6-step structured reviews
| | | -- tracker.py # ActionTracker — action items to completion
| | | -- patterns.py # PatternMatcher — incident recurrence detection
| | | -- knowledge_base.py # KnowledgeBase — incident data + lessons repository
| |
| | -- maturity/ # Maturity assessment + CRS scoring
| | | -- __init__.py
| | | -- model.py # MaturityAssessor — L1-L5 per pillar, gap analysis, CRS
| |
| | -- dashboard/ # Control Plane: FastAPI app + web UI
| | | -- __init__.py
| | | -- app.py # 1176 lines — API + lifespan wiring for all engines
| | | -- templates/
| | | | -- index.html # 1229 lines — SPA dashboard (6 tabs)
| | | -- static/
| | | -- css/dashboard.css # 2149 lines — full dashboard styling
| | | -- js/dashboard.js # 2758 lines — client-side polling, charts, nav
| | | -- svg/icons.svg # 109 lines — SVG icon sprite
| |
| | -- cli/ # Rich terminal CLI
| | -- __init__.py
| | -- main.py # Click commands: status, pillars, maturity, chaos, timeline, serve
|
| -- chaos/ # Chaos engineering (top-level, not inside aref/)
| | -- __init__.py
| | -- injector.py # FaultInjector — latency/error/crash injection + auto-rollback
| | -- experiments.py # 5 pre-defined experiments + ExperimentRunner
|
| -- services/ # Microservice stack (FastAPI apps)
| | -- __init__.py
| | -- base.py # create_service() factory: health, readyz, info, metrics, CORS
| | -- gateway/
| | | -- __init__.py
| | | -- gateway_app.py # API Gateway: routing, retries, circuit awareness, order pipeline
| | -- orders/
| | | -- __init__.py
| | | -- orders_app.py # Order lifecycle state machine, audit trail
| | -- payments/
| | | -- __init__.py
| | | -- payments_app.py # Multi-provider payments, auto-failover, refunds, settlement
| | -- inventory/
| | | -- __init__.py
| | | -- inventory_app.py # Stock management, reservations with TTL, replenishment
| | -- notifications/
| | -- __init__.py
| | -- notifications_app.py # Multi-channel dispatch, priority shedding, retry backoff
|
| -- scripts/
| | -- __init__.py
| | -- demo.py # 700 lines — 12-step Rich terminal demo with CLI args
|
| -- tests/
| | -- __init__.py
| | -- unit/
| | | -- __init__.py
| | | -- test_core.py # Config, EventBus, MetricsEngine, Models (4 test classes)
| | | -- test_pillars.py # All 5 pillars + maturity (10 test classes)
| | -- integration/
| | -- __init__.py
| | -- test_full_pipeline.py # Full D-A-A-R-E pipeline + event correlation (2 tests)
|
| -- config/
| | -- prometheus.yml # Scrape config for all 6 services at /metrics
| | -- grafana/
| | -- provisioning/
| | -- datasources/
| | -- prometheus.yml # Prometheus as default Grafana datasource
|
| -- runbooks/
| | -- payment_failure.yml # T0 + T1 runbook for payment provider outage (YAML)
|
| -- docs/
| -- blueprint.pdf # AREF Framework blueprint document
| -- assets/
| -- AREF_Framework_v2.0.0.docx
| -- five-pillars.jpg
| -- pillar-3.jpg
| -- roadmap.jpg
| -- safety.jpg
| -- standards/
| -- DOCSTRING_STANDARDS.md # ALWAYS ADHERE TO THIS STANDARD FOR ALL FILES
| Module | Owns |
|---|---|
aref/core/ |
Config (pydantic-settings), event bus (pub/sub + history), Prometheus metrics registry, AREF formula engine (MTTD/MTTR/CRS/availability/error budget), domain models (Incident, ActionItem, HealthCheck), structlog setup |
aref/detection/ |
4 detection classes: threshold-based (metric > threshold with consecutive confirmation), anomaly (Z-score + Isolation Forest), synthetic probing (active HTTP health checks), SLI/SLO tracking + error budget computation. Alert lifecycle (fire/ack/resolve), fatigue monitoring |
aref/absorption/ |
Circuit breakers (CLOSED/HALF_OPEN/OPEN state machine + singleton registry), bulkheads (semaphore concurrency limits), token bucket rate limiters, blast radius analyzer (BFS dependency graph traversal + containment %), 4-tier graceful degradation manager |
aref/adaptation/ |
6-step decision tree (classify anomaly -> select minimum-impact strategy), feature flag manager (shed non-critical load), traffic shifter (weighted route redistribution), auto-scaler (simulated horizontal scaling). Subscribes to detection alerts + circuit breaker events |
aref/recovery/ |
Tiered recovery engine (T0-T4 automatic escalation), runbook executor (YAML-loadable step-by-step procedures with role/timeout/escalation), T0/T1 fully automated, T2+ requires human coordination |
aref/evolution/ |
Post-incident review generator (6-step blameless process), action item tracker (target: >85% completion, >8/quarter), pattern matcher (recurrence detection, target: <10%), knowledge base (lessons + incident data store) |
aref/maturity/ |
L1-L5 maturity assessor (Reactive through Optimizing), per-pillar criteria evaluation with lambda checks, gap analysis, CRS scoring across all 4 risk profiles |
aref/dashboard/ |
FastAPI control plane (1176 lines), lifespan wiring for all 5 engines, REST API (19 endpoints), web SPA dashboard (6 tabs: Overview, Detection, Absorption, Adaptation, Services, Timeline), Prometheus /metrics export |
aref/cli/ |
Click-based Rich terminal CLI: aref status, aref pillars, aref maturity, aref chaos [experiment], aref chaos-stop, aref timeline, aref serve |
chaos/ |
FaultInjector (latency/error/crash/timeout injection with safety bounds: max 90% rate, max 5min duration, auto-rollback), 5 pre-defined experiments, ExperimentRunner |
services/ |
5 FastAPI microservices + shared factory. Gateway routes to downstream services with retries/circuit awareness. Each service has /health, /healthz, /readyz, /info, /metrics, /chaos/enable, /chaos/disable |
services/base.py |
create_service() factory: request/error counting, response time headers, X-Service/X-Request-ID headers, /health (uptime, requests, errors, memory), /healthz (liveness), /readyz (readiness), /info (metadata), CORS, unhandled exception middleware |
scripts/ |
demo.py — 12-step end-to-end demo: preflight -> baseline CRS -> traffic -> deep health -> chaos inject -> incident traffic -> pillar deep-dive -> timeline -> chaos rollback -> recovery traffic -> final CRS -> comparison |
tests/ |
Unit tests for core (config, events, metrics, models) + all 5 pillars + maturity (14 test classes). Integration tests for full D-A-A-R-E pipeline + event correlation |
config/ |
Prometheus scrape config (10s interval, all 6 services), Grafana datasource provisioning |
runbooks/ |
YAML runbook definitions (payment_failure.yml has T0 + T1 procedures) |
Uses pydantic-settings with SettingsConfigDict(env_prefix="AREF_..."). Configuration cascades:
- Environment variables (highest priority) — prefix-based:
AREF_ENVIRONMENT,AREF_DEBUG,AREF_LOG_LEVEL,AREF_RISK_PROFILEAREF_DB_HOST,AREF_DB_PORT,AREF_DB_NAME,AREF_DB_USER,AREF_DB_PASSWORDAREF_REDIS_HOST,AREF_REDIS_PORTAREF_DETECTION_MTTD_TARGET_SECONDS,AREF_DETECTION_THRESHOLD_CHECK_INTERVAL, ...AREF_ABSORPTION_CIRCUIT_BREAKER_FAILURE_THRESHOLD, ...AREF_ADAPTATION_LATENCY_TARGET_SECONDS, ...AREF_RECOVERY_MTTR_TARGET_SECONDS,AREF_RECOVERY_T0_TARGET_SECONDS, ...AREF_EVOLUTION_IMPROVEMENT_VELOCITY_TARGET, ...
- Hardcoded defaults (fallback) — all values have sensible defaults aligned with blueprint
- No TOML/YAML config files used — pure env vars + defaults
| Section | Class | Key Fields |
|---|---|---|
| Root | AREFConfig |
environment, debug, log_level, risk_profile (4 profiles), api_host, api_port, dashboard_port |
| Database | DatabaseConfig |
PostgreSQL DSN builder, pool_min=2, pool_max=10 |
| Redis | RedisConfig |
Redis URL builder, db=0 |
| Detection | DetectionConfig |
mttd_target_seconds=300, threshold_check_interval=10, anomaly_check_interval=30, synthetic_probe_interval=15, alert_fatigue_max_per_week=50, alert_to_action_ratio=3.0 |
| Absorption | AbsorptionConfig |
blast_radius_target_pct=95, circuit_breaker_failure_threshold=5, circuit_breaker_recovery_timeout=30, bulkhead_max_concurrent=50, rate_limit_requests_per_second=100 |
| Adaptation | AdaptationConfig |
latency_target_seconds=30, scale_up_cpu_threshold=80, feature_flag_error_budget_trigger=0.5, adaptation_window_seconds=120 |
| Recovery | RecoveryConfig |
mttr_target_seconds=900, t0=300s, t1=900s, t2=3600s, t3=14400s, t4=14 days, runbook_drill_interval_days=90 |
| Evolution | EvolutionConfig |
improvement_velocity_target=8, action_completion_rate_target=85, recurrence_rate_target=10, post_incident_review_deadline_hours=72 |
From blueprint Section 6.3:
| Profile | Detection | Absorption | Adaptation | Recovery | Evolution |
|---|---|---|---|---|---|
| availability_critical | 0.30 | 0.25 | 0.20 | 0.15 | 0.10 |
| data_integrity_critical | 0.20 | 0.20 | 0.15 | 0.30 | 0.15 |
| balanced (default) | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 |
| innovation_heavy | 0.15 | 0.15 | 0.25 | 0.15 | 0.30 |
config/prometheus.yml: Scrapes all 6 services at their/metricsendpoint every 10s: dashboard:8080, gateway:8000, orders:8001, payments:8002, inventory:8003, notifications:8004config/grafana/provisioning/datasources/prometheus.yml: Auto-provisions Prometheus as default datasource athttp://prometheus:9090
On startup, lifespan() wires everything:
1. Initialize all 5 engines (Detection, Adaptation, Recovery, Evolution) + Chaos Injector
2. Start engine event loop subscriptions:
- Adaptation subscribes to: detection.alert_fired, absorption.circuit_breaker_opened
- Recovery subscribes to: adaptation.escalate_to_recovery, detection.alert_fired (EMERGENCY)
- Evolution subscribes to: recovery.recovery_resolved
3. Wire Detection subsystems:
- Synthetic probes: 5 targets (one per service /health endpoint)
- SLI/SLO: 2 SLIs per service (availability + latency_p99) with SLOs
- Anomaly streams: 2 per service (response_latency + error_rate), Z=3.0 threshold
- Threshold rules: 2 per service (error_rate: warn>5%/crit>15%, latency: warn>300ms/crit>1s)
- Background data feed: polls all 5 services /health every 10s, feeds metrics into all 4 subsystems
4. Wire Absorption subsystems:
- Circuit breakers: 12 total (one per service->dependency pair in dependency map)
- Rate limiters: 5 (one per service, 100 req/s, burst 20)
- Bulkheads: 5 (one per service, 50 max concurrent)
- Blast radius graph: 7 nodes (5 services + postgres + redis), edges from dependency map
- Degradation: 5 services x 4 tiers (Full/Reduced/Minimal/Emergency)
5. Wire Adaptation subsystems:
- Feature flags: 30 total (6 per service: analytics, recommendations, search, experimental, core_api, auth)
- Traffic routes: 5 services x 2 routes (primary at 100%, backup at 0%)
- Auto-scaler: 5 services, current=1, min=1, max=10
6. Seed Evolution data:
- 5 incident patterns for recurrence matching
- Knowledge base entries from past incidents
- Seeded post-incident reviews + tracked action items
gateway --> orders, payments, inventory, notifications
orders --> payments, inventory, notifications
payments --> postgres, redis
inventory --> postgres, redis
notifications --> redis
The event bus is in-process pub/sub with topic routing:
Topic format: "category.event_type"
Wildcards: "category.*" or "*" (all events)
Flow:
Detection fires "detection.alert_fired"
-> Adaptation._on_alert() classifies anomaly, selects strategy, executes
-> If adaptation window exceeded: publishes "adaptation.escalate_to_recovery"
-> Recovery.begin_recovery() starts T0, escalates through T1-T4
-> On resolution: publishes "recovery.recovery_resolved"
-> Evolution auto-generates review, extracts actions, matches patterns
Chaos "chaos.fault_injected" / "chaos.fault_rolled_back" tracked in timeline
All events stored in bus._history (capped at 10,000) for timeline reconstruction
POST /api/orders (gateway:8000)
1. Create order -> orders:8001/orders
2. Process payment -> payments:8002/payments/process
3. Reserve stock -> inventory:8003/inventory/reserve
4. Send receipt -> notifications:8004/notifications/send
Returns: {order, payment, stock, notification, pipeline: {total_ms, steps: [{step, status, latency_ms}]}}
@dataclass
class Event:
category: EventCategory # detection | absorption | adaptation | recovery | evolution | system | chaos
event_type: str # "alert_fired", "circuit_breaker_opened", etc.
severity: EventSeverity # info | warning | critical | emergency
payload: dict[str, Any] # Arbitrary data
source: str # "detection_engine", "chaos_injector", etc.
event_id: str # UUID hex[:12]
timestamp: float # time.time()
correlation_id: str | None # Links related events across pillars@dataclass
class Incident:
incident_id: str # "INC-{uuid[:8]}"
severity: IncidentSeverity # sev1 (critical) | sev2 (major) | sev3 (minor) | sev4 (low)
status: IncidentStatus # detected -> absorbing -> adapting -> recovering -> resolved -> post_review -> closed
recovery_tier: RecoveryTier # T0_EMERGENCY(0) | T1_MINIMUM(1) | T2_FUNCTIONAL(2) | T3_FULL(3) | T4_HARDENING(4)
detection_class: DetectionClass # threshold | anomaly | synthetic | change_correlation | human_observation | predictive
anomaly_class: AnomalyClass # transient | persistent | escalating
timeline: list[dict] # Chronological action entries
correlation_id: str # Links to event bus timeline
# Timestamps for MTTD/MTTR: onset_time, detected_time, resolved_time@dataclass
class ActionItem:
action_id: str # "ACT-{uuid[:8]}"
incident_id: str
title: str
description: str
owner: str
priority: str # high | medium | low
status: str # open | completed
due_date: float | None
pillar: str # detection | absorption | adaptation | recovery | evolutionclass CircuitBreaker:
# States: CLOSED -> OPEN -> HALF_OPEN -> CLOSED
name: str # "gateway->orders"
service: str # "gateway"
dependency: str # "orders"
failure_threshold: int # 5 (default from config)
recovery_timeout: float # 30s
half_open_max_calls: int # 3
# async call(fn) -> raises CircuitBreakerError when OPEN@dataclass
class FeatureFlag:
name: str # "gateway.analytics"
service: str # "gateway"
enabled: bool # True
critical: bool # Critical flags cannot be auto-disabled
description: str@dataclass
class TrafficRoute:
target: str # "orders-primary"
weight: int # 0-100 percentage
healthy: bool| Tier | Name | Window | Executor | Approval |
|---|---|---|---|---|
| T0 | Emergency Stabilization | 0-5 min | Automated | None required |
| T1 | Minimum Viable Recovery | 5-15 min | Incident Commander | None required |
| T2 | Functional Recovery | 15-60 min | Engineering Teams | IC approval |
| T3 | Full Restoration | 1-4 hours | Cross-functional | Management |
| T4 | Post-Incident Hardening | 1-2 weeks | Platform team | Planning |
Blueprint critical requirement: "T0 and T1 must execute without external dependencies or approvals."
Default runbooks loaded: T0 (5 automated steps), T1 (6 IC-led steps), T2 (5 engineering steps).
Additional runbooks loadable from YAML (see runbooks/payment_failure.yml).
5 levels per pillar (aligned with ISO 22316):
| Level | Name | Description |
|---|---|---|
| L1 | Reactive | Ad-hoc, tribal knowledge |
| L2 | Managed | Basic monitoring, runbooks exist |
| L3 | Defined | Standardized processes, regular drills |
| L4 | Measured | Quantitative-driven, automated |
| L5 | Optimizing | Predictive, antifragile |
Assessment uses lambda-based criteria checks against a context dict per pillar. CRS = SUM(w_i * M_i) for i=1..5, weighted by risk profile.
| Service | Image/Build | Port | Depends On |
|---|---|---|---|
| postgres | postgres:16-alpine | 5432 | - |
| redis | redis:7-alpine | 6379 | - |
| prometheus | prom/prometheus:latest | 9090 | - |
| grafana | grafana/grafana:latest | 3000 | - |
| gateway | Build from Dockerfile | 8000 | redis |
| orders | Build from Dockerfile | 8001 | postgres, redis |
| payments | Build from Dockerfile | 8002 | - |
| inventory | Build from Dockerfile | 8003 | - |
| notifications | Build from Dockerfile | 8004 | redis |
| dashboard | Build from Dockerfile | 8080 | gateway, orders, payments, inventory, notifications, prometheus |
FROM python:3.11-slim
# Installs gcc, libpq-dev, curl
# COPY . . then pip install --no-cache-dir .
# Default CMD: uvicorn aref.dashboard.app:app --host 0.0.0.0 --port 8080Each service overrides CMD in docker-compose.yml:
uvicorn services.{name}.{name}_app:app --host 0.0.0.0 --port {port}docker compose up --build # Build + start all services
docker compose up -d # Detached mode
docker compose up dashboard # Just the dashboard (starts dependencies)pip install -e ".[dev]"
# Start services individually:
uvicorn aref.dashboard.app:app --port 8080
uvicorn services.gateway.gateway_app:app --port 8000
uvicorn services.orders.orders_app:app --port 8001
uvicorn services.payments.payments_app:app --port 8002
uvicorn services.inventory.inventory_app:app --port 8003
uvicorn services.notifications.notifications_app:app --port 8004| Variable | Purpose | Values |
|---|---|---|
AREF_ENVIRONMENT |
Switches service URL resolution | docker = container hostnames, anything else = localhost |
AREF_DASHBOARD_PORT |
Dashboard port override | Default: 8080 |
AREF_DEBUG |
Debug mode | Default: true |
AREF_LOG_LEVEL |
Log verbosity | Default: INFO |
AREF_RISK_PROFILE |
CRS weight profile | balanced (default), availability_critical, data_integrity_critical, innovation_heavy |
AREF_DB_* |
PostgreSQL connection | HOST, PORT, NAME, USER, PASSWORD |
AREF_REDIS_* |
Redis connection | HOST, PORT, DB, PASSWORD |
AREF_DETECTION_* |
Detection thresholds | See DetectionConfig fields |
AREF_ABSORPTION_* |
Absorption thresholds | See AbsorptionConfig fields |
AREF_ADAPTATION_* |
Adaptation thresholds | See AdaptationConfig fields |
AREF_RECOVERY_* |
Recovery time targets | See RecoveryConfig fields |
AREF_EVOLUTION_* |
Evolution velocity targets | See EvolutionConfig fields |
GF_SECURITY_ADMIN_PASSWORD |
Grafana admin password | Default: aref |
| Component | Status | Notes |
|---|---|---|
| Core framework (config, events, metrics, models, logging) | Complete | Singleton patterns, async event bus, all AREF formulas |
| Detection pillar | Complete | All 4 detection classes operational, live data feed every 10s, alert lifecycle |
| Absorption pillar | Complete | Circuit breakers, bulkheads, rate limiters, blast radius, degradation — all wired |
| Adaptation pillar | Complete | Decision tree, feature flags, traffic shifting, scaling — event-driven |
| Recovery pillar | Complete | T0-T4 tiered engine, runbook executor, YAML loading |
| Evolution pillar | Complete | Auto-review generation, action tracking, pattern matching, knowledge base |
| Maturity model | Complete | L1-L5 assessment, CRS scoring across 4 risk profiles |
| Chaos engineering | Complete | 5 experiments, fault injector with safety bounds + auto-rollback |
| 5 microservices | Complete | Gateway, Orders, Payments, Inventory, Notifications — all enhanced |
| Service factory | Complete | Shared health/readyz/info/metrics endpoints |
| Dashboard API | Complete | 19 REST endpoints covering all pillars |
| Dashboard UI | Complete | 6 tabs: Overview, Detection, Absorption, Adaptation, Services, Timeline |
| CLI | Complete | 7 commands: status, pillars, maturity, chaos, chaos-stop, timeline, serve |
| Demo script | Complete | 12-step Rich terminal demo with CLI args, pillar deep-dive, comparison |
| Prometheus integration | Complete | /metrics endpoints on all services, scrape config |
| Grafana provisioning | Partial | Datasource provisioned, no dashboard JSON defined |
| Tests | Partial | 14 unit test classes + 2 integration tests covering core + all pillars |
- Database persistence: PostgreSQL is provisioned in Docker but not used — all state is in-memory
- Redis usage: Redis is provisioned but not used for caching/queuing — services use in-memory structures
- Grafana dashboards: Datasource is provisioned but no pre-built dashboard JSON panels
- Authentication/authorization: CORS is wide-open (
allow_origins=["*"]), no auth on endpoints - Real scaling: AutoScaler is simulated (in-memory counter), not wired to Docker/K8s
- Predictive detection: Level 5 maturity criteria reference predictive detection, not implemented
- Dashboard Evolution tab: Placeholder or partially built (noted as in-progress in prior sessions)
- Dashboard Services tab: Had visual issues (status dots, empty panels) noted but not fully debugged
- Schema fallback: Listed in adaptation strategies but not implemented
- Alembic migrations: Listed as dependency but no migrations directory exists
- Textual TUI: Listed as dependency but not used (CLI uses Click + Rich instead)
- No .env file: All config relies on env vars or hardcoded defaults, no .env.example template
- No git repository: Project has not been initialized with git
The demo exercises the full D-A-A-R-E lifecycle through the running stack:
- Preflight health check (all 6 services)
- Baseline CRS snapshot
- Baseline traffic (order pipeline through gateway)
- Service deep health (gateway's deep-health endpoint)
- Chaos injection (any of 5 experiments via API)
- Incident traffic (observe failover, provider switching)
- During-incident CRS + pillar deep-dive (Detection, Absorption, Adaptation, Recovery, Evolution)
- Event timeline
- Chaos rollback
- Recovery traffic
- Final CRS snapshot
- Before/after comparison + traffic summary
What it does NOT exercise: maturity assessment, runbook execution, evolution auto-review generation (these happen in-process via event bus, not exposed through the demo's HTTP-only approach).
Runtime:
| Package | Version | Purpose |
|---|---|---|
| fastapi | >=0.110.0 | Web framework for services + dashboard |
| uvicorn[standard] | >=0.27.0 | ASGI server |
| httpx | >=0.27.0 | Async HTTP client (inter-service calls, synthetic probes) |
| sqlalchemy[asyncio] | >=2.0.25 | ORM (provisioned, not actively used) |
| asyncpg | >=0.29.0 | PostgreSQL async driver (provisioned, not actively used) |
| alembic | >=1.13.0 | DB migrations (provisioned, no migrations defined) |
| redis[hiredis] | >=5.0.0 | Redis client (provisioned, not actively used) |
| prometheus-client | >=0.20.0 | Metrics exposition |
| prometheus-fastapi-instrumentator | >=7.0.0 | Auto-instrument FastAPI |
| pydantic | >=2.6.0 | Data validation |
| pydantic-settings | >=2.1.0 | Env-driven config |
| rich | >=13.7.0 | Terminal UI (CLI + demo) |
| textual | >=0.50.0 | TUI framework (listed, not used) |
| click | >=8.1.0 | CLI command framework |
| numpy | >=1.26.0 | Anomaly detection math |
| scikit-learn | >=1.4.0 | Isolation Forest model |
| apscheduler | >=3.10.0 | Task scheduling (listed, not actively used) |
| tenacity | >=8.2.0 | Retry library (listed, not actively used) |
| pyyaml | >=6.0.0 | Runbook YAML parsing |
| structlog | >=24.1.0 | Structured logging |
| jinja2 | >=3.1.0 | Template rendering |
| plotly | >=5.18.0 | Charts (listed, dashboard uses JS canvas instead) |
Dev:
| Package | Version | Purpose |
|---|---|---|
| pytest | >=8.0.0 | Testing |
| pytest-asyncio | >=0.23.0 | Async test support |
| pytest-cov | >=4.1.0 | Coverage |
| ruff | >=0.2.0 | Linting |
| mypy | >=1.8.0 | Type checking |
No Node/npm dependencies. The dashboard is pure HTML/CSS/JS served by FastAPI's StaticFiles.
[project.scripts]
aref = "aref.cli.main:cli" # aref status | aref pillars | aref chaos ...
aref-demo = "scripts.demo:main" # aref-demo --experiment ... --orders ... --fast[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["aref", "services", "chaos", "scripts"]All 5 microservice files follow the {service}_app.py convention:
services/gateway/gateway_app.pyservices/orders/orders_app.pyservices/payments/payments_app.pyservices/inventory/inventory_app.pyservices/notifications/notifications_app.py
This is reflected in both docker-compose.yml uvicorn commands and local development:
command: uvicorn services.gateway.gateway_app:app --host 0.0.0.0 --port 8000All major files use the established metadata format:
"""
Metadata
- Title: ...
- File Name: ...
- Relative Path: ...
- Artifact Type: script
- Version: 2.0.0
- Date: 2026-03-13
- Author: Dennis 'dnoice' Smaltz
- A.I. Acknowledgement: Anthropic - Claude Opus 4
- Signature: ︻デ═─── ... | Aim Twice, Shoot Once!
Description:
...
Key Features:
...
Usage Instructions:
...
---------
"""- Package
__init__.pyfiles contain single-line docstrings describing the module - Engine classes follow the pattern:
__init__,start(),stop(),get_status() -> dict - Singleton access via
get_X()/reset_X()module-level functions (config, event bus, metrics engine, circuit breaker registry) - Test files:
test_{module}.pywithTest{Class}classes
- Version: 2.0.0 (defined in
aref/__init__.pyandpyproject.toml) - Git: Not initialized — no
.gitdirectory, no branches, no commit history - Recent major work: Service enhancement (all 5 services expanded 2-4x), service factory (
base.py) rewrite, dashboard Timeline tab build, demo script rewrite with Rich UI
| Method | Endpoint | Returns |
|---|---|---|
| GET | / |
Dashboard HTML SPA |
| GET | /api/aref/status |
CRS, pillar scores, risk profile, chaos status |
| GET | /api/aref/services |
All service health + versions |
| GET | /api/aref/alerts |
Active alerts from detection engine |
| GET | /api/aref/timeline |
Events (up to 100) + summary (by_category, by_severity, by_source) |
| GET | /api/aref/detection |
Threshold status, anomaly stats, SLI tracker, active alerts |
| GET | /api/aref/absorption |
Circuit breakers, rate limiters, bulkheads, blast radius, degradation |
| GET | /api/aref/adaptation |
Feature flags, traffic shifter, scaler, active/total adaptations |
| GET | /api/aref/recovery |
Active recoveries, total recovered, runbooks |
| GET | /api/aref/evolution |
Reviews, action tracker stats, knowledge base, improvement velocity, patterns |
| GET | /api/aref/maturity |
Full maturity assessment across all profiles |
| GET | /api/aref/metrics |
MTTD, MTTR, availability, error budgets, CRS per profile |
| POST | /api/aref/chaos/start |
Start chaos experiment {"experiment": "..."} |
| POST | /api/aref/chaos/stop |
Stop all chaos + rollback |
| GET | /api/aref/chaos/status |
Active injections |
| GET | /metrics |
Prometheus metrics (text format) |
| Endpoint | Purpose |
|---|---|
/health |
Uptime, total_requests, total_errors, error_rate, memory_mb |
/healthz |
K8s liveness probe (200 OK) |
/readyz |
K8s readiness probe (200 or 503) |
/info |
Service name, version, dependencies, python version, platform, PID |
/metrics |
Prometheus metrics |
/chaos/enable |
Enable chaos injection (type, rate, delay) |
/chaos/disable |
Disable chaos injection |
Gateway (8000): /api/orders (full pipeline), /api/services/deep-health, /api/gateway/stats, /api/gateway/requests
Orders (8001): /orders (CRUD + filters), /orders/{id}/confirm, /orders/{id}/cancel, /orders/audit/log, /orders/stats/summary
Payments (8002): /payments/process, /payments/{id}/refund, /payments/settle, /payments/providers/health, /payments/providers/reset, /payments/stats/summary
Inventory (8003): /inventory/stock, /inventory/reserve, /inventory/release/{id}, /inventory/replenish, /inventory/adjust, /inventory/movements, /inventory/catalog, /inventory/stats
Notifications (8004): /notifications/send, /notifications/batch, /notifications/shed-level, /notifications/templates, /notifications/history, /notifications/stats
| Name | Target | Fault | Rate | Duration | Hypothesis |
|---|---|---|---|---|---|
payment_provider_failure |
payments | error | 80% | 120s | CB opens in 30s, system switches to backup provider |
order_service_latency |
orders | latency (3s) | 70% | 90s | Anomaly detection fires in 3min, auto-scaler adds instances |
inventory_degradation |
inventory | error | 60% | 60s | Inventory degrades to cached/minimal mode, orders still process |
notification_overload |
notifications | latency (5s) | 90% | 60s | Feature flags disable notifications, core flow unaffected |
cascading_failure |
payments | error | 95% | 180s | Full D-A-A-R-E pipeline activates, recovery in 15min, auto-review |
Safety bounds enforced: max 90% fault rate, max 5 minute duration, auto-rollback on expiry.
Goal: Develop a comprehensive refactoring and restructuring plan for the AREF v2.0.0 codebase, focused on four pillars of code quality: separation of concerns, modularity, maintainability, and extensibility for ongoing improvements and fixes.
What we need to produce:
-
Architectural Review — Evaluate the current codebase against software engineering best practices, identifying where responsibilities are tangled, where modules are doing too much, where coupling is too tight, and where the boundaries between framework, services, and infrastructure are blurred.
-
Dependency Audit — Flag the provisioned-but-unused dependencies (SQLAlchemy, asyncpg, alembic, redis, textual, apscheduler, tenacity, plotly) and recommend keep/remove/defer decisions for each. Clean up the dependency footprint so what's declared matches what's actually used.
-
Separation of Concerns Assessment — Specifically examine:
- The 1176-line
dashboard/app.pythat handles lifespan wiring, engine initialization, seed data, AND 19 API endpoints in a single file - Service files that mix business logic, chaos injection, and metrics registration in flat scripts
- The boundary between the AREF framework package (
aref/) and the demo microservices (services/) — are they properly decoupled? - Whether
aref/core/is the right home for everything currently in it, or if some pieces should be elevated or redistributed - The monolithic
dashboard.css(2149 lines) anddashboard.js(2758 lines) — the stylesheet bundles design tokens, reset, layout, sidebar, cards, every pillar-specific panel style, metrics page components, animations, and timeline into a single file with no modular structure. The JavaScript bundles configuration, state management, API client, polling loop, DOM rendering for all six tabs, SVG chart generation, gauge animation, chaos controls, and navigation into a single IIFE. Both need to be decomposed into properly separated modules (CSS partials or scoped files per component/view, JS modules per concern — API layer, state, rendering per tab, shared utilities, chart components).
- The 1176-line
-
Modularity Plan — Propose concrete file splits, module reorganizations, and interface boundaries that would allow:
- Individual pillars to be developed, tested, and deployed independently
- The dashboard to be a thin API layer over the engines, not the wiring hub
- Seed data and demo concerns to be fully separated from production code paths
- New pillar features (predictive detection, schema fallback, etc.) to be added without touching existing files
-
Maintainability Roadmap — Address the known gaps that impact long-term health:
- Git initialization and branching strategy
.env.exampletemplate for onboarding new developers- Test coverage expansion strategy (what's tested, what's not, priority order)
- Documentation gaps (README is a stub, no CONTRIBUTING guide, no architecture diagram)
- Linting/formatting enforcement (ruff and mypy are declared as dev deps but no config or CI)
-
Prioritized Action Plan — Deliver the refactoring work as a phased, ordered list — not a wish list, but a sequenced plan where each phase builds on the previous one, with clear "done" criteria per phase. Phase 1 should be achievable in a single working session; later phases can span multiple sessions.
Constraints:
- No functional regressions — the demo, dashboard, CLI, and all services must continue to work after each phase
- Preserve the existing docstring standard on all new/modified files
- Maintain the
*_app.pynaming convention for services - Keep the in-memory architecture for now — database persistence is a future phase, not part of this refactor
- All recommendations should be concrete (specific files, specific splits, specific moves) — not abstract principles
︻デ═─── ✦ ✦ ✦