From d0ab0e27912a256a29427386072faefa31a8538b Mon Sep 17 00:00:00 2001
From: weroperking <139503221+weroperking@users.noreply.github.com>
Date: Wed, 29 Apr 2026 01:19:43 +0000
Subject: [PATCH] docs: add self-hosted production readiness roadmap

Comprehensive audit and gap analysis scoped to self-hosted production.
Identifies real blockers across security, stability, operations, and
multi-instance scaling. Excludes SaaS-scale items not relevant for
teams running their own instance. Includes prioritized 3-phase plan.
---
 docs/guides/production-readiness-roadmap.md | 332 ++++++++++++++++++++
 1 file changed, 332 insertions(+)
 create mode 100644 docs/guides/production-readiness-roadmap.md

diff --git a/docs/guides/production-readiness-roadmap.md b/docs/guides/production-readiness-roadmap.md
new file mode 100644
index 0000000..045521e
--- /dev/null
+++ b/docs/guides/production-readiness-roadmap.md
@@ -0,0 +1,332 @@
+# Self-Hosted Production Readiness Roadmap
+
+A pragmatic technical audit scoped specifically for **self-hosted production**: teams running BetterBase on their own infrastructure (Docker Compose, VPS, or Kubernetes). This document strips away SaaS-scale distractions and focuses only on what is required to run a stable, secure, single-tenant instance that teams can confidently deploy and maintain.
+
+## Executive Summary
+
+BetterBase's self-hosted stack (Postgres, MinIO, Inngest, Nginx, server + dashboard) is already functional and deployable. The gaps between "it runs" and "a team can rely on it in production" are smaller than a generic SaaS audit would suggest.
+
+| Pillar | Status | Real Blockers for Self-Hosting |
+|--------|--------|-------------------------------|
+| Security | Beta | Rate limiting, scope enforcement, HTML escaping |
+| Stability | Beta | Transactional migrations, deep health checks, graceful shutdown |
+| Operations | Alpha | Backups, Prometheus metrics, K8s manifests |
+| Multi-Instance Scaling | Alpha | Redis bridge for WS + rate limits across replicas |
+| Developer Experience | Good | React-only client is limiting but not production-blocking |
+
+**Overall readiness: 80%.** The remaining 20% is operational hardening and multi-replica support. Several items from a generic enterprise audit are **not required** for self-hosting and have been intentionally excluded below.
+
+---
+
+## What We Are NOT Doing (Out of Scope)
+
+These are valid features for a managed SaaS or unicorn-scale platform, but they are unnecessary overhead for teams self-hosting their own backend.
+
+| Item | Why It's Excluded |
+|------|-------------------|
+| Read replica routing | Self-hosted teams scale vertically or use managed Postgres (Neon/RDS) which handles this |
+| Multi-language SDKs (Python, Go, etc.) | TypeScript client is sufficient for v1; OpenAPI export can come later |
+| React-free client SDK refactor | Important for DX, but not a production stability issue |
+| API versioning (`/v1/`) | Self-hosted teams control their own upgrade cadence |
+| OpenAPI/Swagger generation | Nice to have, but docs and examples are sufficient for self-hosted teams |
+| Chaos engineering / failure injection | Overkill for typical self-hosted deployments |
+| Circuit breakers | Single-tenant instances with local MinIO/Postgres do not need SaaS-style circuit breaking |
+| Terraform / Pulumi IaC | Teams bring their own infrastructure; Docker Compose and K8s are the interface |
+| One-click deploy scripts for Railway/Render/Fly | These platforms already support Docker; maintain official Compose instead |
+| MFA / TOTP for admin accounts | A single team's admin login can be protected by SSO proxy or VPN in practice |
+| IP allowlisting | Same as above: handled at the network/VPN layer for most self-hosted setups |
+| Application-level column encryption | Postgres at-rest encryption is sufficient for single-tenant self-hosting |
+| RS256 JWT / asymmetric signing | HS256 is fine when the secret is rotated and stored in container env/secrets |
+| Replacing Neon CDC polling | Self-hosted teams can use standard Postgres with `LISTEN/NOTIFY`, which already works |
+
+---
+
+## 1. Security (Must-Have for Self-Hosting)
+
+### 1.1 Rate Limiting on Auth Endpoints
+
+**Gap:** The `rate_limits` table exists (`packages/server/migrations/019_rate_limits.sql`) but no middleware uses it. Admin login and device verification are vulnerable to brute force and credential stuffing.
+
+**Evidence:**
+
+```typescript
+// packages/server/src/routes/admin/auth.ts
+authRoutes.post("/login", async (c) => {
+  // No rate limiting applied
+});
+```
+
+**Required:**
+
+1. Implement a sliding-window rate limiter backed by `betterbase_meta.rate_limits`.
+2. Apply to `/admin/auth/login`, `/device/verify`, and `/device/token`.
+3. Add progressive delays after repeated failures (5 min, 15 min, 1 hour).
+4. Emit structured security events for audit logging.
+
+### 1.2 API Key Scope Enforcement
+
+**Gap:** API keys include `scopes` in the database (`008_api_keys.sql`), but `requireAdmin` middleware does not validate them. A read-only key currently grants full admin access.
+
+**Evidence:**
+
+```typescript
+// packages/server/src/lib/admin-middleware.ts
+if (keyRows.length === 0) return c.json({ error: "Invalid API key" }, 401);
+// Scope check is missing
+```
+
+**Required:**
+
+1. Add scope requirements to privileged routes.
+2. Reject requests where the API key lacks the required scope.
+3. Add tests proving scope-based denial.
+
+### 1.3 HTML Injection in Device Verification
+
+**Gap:** The device verification page interpolates `userCode` directly into HTML without escaping.
+
+**Evidence:**
+
+```typescript
+// packages/server/src/routes/device/index.ts
+``
+```
+
+**Required:**
+
+1. Escape HTML special characters (`<`, `>`, `"`, `'`, `&`) before interpolation.
+2. Add `Content-Security-Policy: default-src 'self'` to the response.
+
+### 1.4 Security Headers
+
+**Gap:** No global security headers are set. `X-Frame-Options`, `X-Content-Type-Options`, and `Strict-Transport-Security` are all missing.
+
+**Required:**
+
+1. Add a Hono middleware that sets:
+
+```typescript
+c.header("X-Content-Type-Options", "nosniff");
+c.header("X-Frame-Options", "DENY");
+c.header("Strict-Transport-Security", "max-age=31536000; includeSubDomains");
+```
+
+### 1.5 Secret Management
+
+**Gap:** `.env.self-hosted.example` uses placeholder secrets. No validation enforces strong secrets or rotation.
+
+**Required:**
+
+1. Add startup validation: `BETTERBASE_JWT_SECRET` must be >= 32 chars and not a known weak value.
+2. Document secret rotation procedure in [SELF_HOSTED.md](../../SELF_HOSTED.md).
+
+---
+
+## 2. Stability & Reliability
+
+### 2.1 Transactional Migrations
+
+**Gap:** The migration runner executes SQL without per-file transactions. A failed migration can leave the database in an inconsistent state.
+
+**Evidence:**
+
+```typescript
+// packages/server/src/lib/migrate.ts
+const sql = await readFile(join(MIGRATIONS_DIR, file), "utf-8");
+await pool.query(sql); // No BEGIN/COMMIT
+await pool.query("INSERT INTO betterbase_meta.migrations (filename) VALUES ($1)", [file]);
+```
+
+**Required:**
+
+1. Wrap each migration and its `betterbase_meta.migrations` insert in one `BEGIN; ... COMMIT;` on a single checked-out client (separate `pool.query` calls may run on different connections), with rollback on failure.
+2. Abort server startup if any migration fails.
+3. Add idempotency checks (`IF NOT EXISTS`) to all DDL.
+
+### 2.2 Deep Health Checks
+
+**Gap:** `/health` returns a static JSON object. It does not verify database connectivity, storage accessibility, or Inngest reachability.
+
+**Required:**
+
+1. Implement deep health checks:
+
+```typescript
+app.get("/health", async (c) => {
+  const db = await checkDatabase();
+  const storage = await checkStorage();
+  const inngest = await checkInngest();
+  const ok = db && storage && inngest;
+  return c.json({ db, storage, inngest }, ok ? 200 : 503);
+});
+```
+
+2. Add `/ready` (dependencies up, migrations complete) and `/live` (process alive) for orchestrators.
+
+### 2.3 Graceful Shutdown
+
+**Gap:** The server does not handle `SIGTERM`. In-flight requests and database connections may be dropped during restarts or deploys.
+
+**Required:**
+
+1. On `SIGTERM`, stop accepting new connections.
+2. Wait for active requests to finish (with a timeout, e.g., 30s).
+3. Close the `pg` pool and WebSocket server cleanly.
+4. Exit with code 0.
+
+### 2.4 E2E Test Suite
+
+**Gap:** Extensive unit tests exist, but no end-to-end tests validate the full self-hosted stack (client → Nginx → server → Postgres → MinIO).
+
+**Required:**
+
+1. Add E2E tests for critical self-hosted flows:
+   - Admin login → project creation → function call
+   - Storage upload via signed URL
+   - Realtime subscription over WebSocket
+2. Run E2E tests in CI against `docker-compose.self-hosted.yml`.
+
+---
+
+## 3. Operations (Running It Day-to-Day)
+
+### 3.1 Backup & Recovery
+
+**Gap:** No automated backup strategy exists for Postgres or MinIO. Self-hosted teams are responsible for their own data but have no guidance or tooling.
+
+**Required:**
+
+1. Add a Postgres backup sidecar to `docker-compose.self-hosted.yml` (e.g., `pg_dump` cron or `wal-g`).
+2. Document restore procedure in [SELF_HOSTED.md](../../SELF_HOSTED.md).
+3. Add MinIO bucket versioning and replication guidance.
+
+### 3.2 Metrics & Monitoring
+
+**Gap:** No Prometheus metrics endpoint exists. Operators cannot observe request rates, error rates, or database latency.
+
+**Required:**
+
+1. Add `/metrics` endpoint exporting Prometheus format:
+   - `http_requests_total` (method, route, status)
+   - `db_query_duration_seconds`
+   - `ws_connections_active`
+2. Provide a Grafana dashboard JSON for self-hosted deployments.
+3. Optionally add `docker-compose.observability.yml` with Prometheus + Grafana.
+
+### 3.3 Log Aggregation
+
+**Gap:** Pino logs are structured but not shipped anywhere by default. Debugging a multi-container self-hosted deployment is painful.
+
+**Required:**
+
+1. Document how to forward logs to Loki, Datadog, or CloudWatch.
+2. Ensure `request_id` is present in every log line for traceability.
+
+### 3.4 Upgrade Path
+
+**Gap:** No documented procedure for upgrading a self-hosted instance to a new BetterBase release.
+
+**Required:**
+
+1. Document upgrade checklist:
+   - Backup database
+   - Pull new image / rebuild
+   - Run migrations (automatic on startup)
+   - Verify `/health`
+2. Provide a `docker-compose.self-hosted.yml` that pins image tags for reproducible upgrades.
+
+---
+
+## 4. Multi-Instance Scaling (For Teams Running Multiple Replicas)
+
+### 4.1 Shared State for WebSockets
+
+**Gap:** WebSocket tickets and subscriptions are stored in-memory. If two server containers run behind a load balancer, a client connected to instance A will miss invalidations triggered on instance B.
+
+**Required:**
+
+1. Add optional Redis integration.
+2. Use Redis for:
+   - WebSocket ticket storage (instead of `Map`)
+   - Cross-instance pub/sub for realtime invalidations
+3. Document: "Run a single server container, or add Redis and run multiple."
+
+### 4.2 Shared Rate Limiting
+
+**Gap:** The `rate_limits` table is Postgres-backed, which is good, but a limiter that caches counters in process memory would give each replica its own count and let clients exceed the limit. The implementation must read and write shared state in the database or Redis.
+
+**Required:**
+
+1. Implement rate limiting against Postgres or Redis so it works consistently across replicas.
+
+### 4.3 Kubernetes Manifests
+
+**Gap:** Only Docker Compose exists. Teams running K8s must write their own manifests.
+
+**Required:**
+
+1. Provide basic K8s manifests:
+   - `Deployment` for `betterbase-server`
+   - `Deployment` for dashboard (nginx)
+   - `Service` and `Ingress`
+   - `StatefulSet` or external managed Postgres note
+2. Add a simple Helm chart with `values.yaml` for configuration.
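+
+A minimal sketch of what the `Deployment` for `betterbase-server` could look like, assuming the `/ready` and `/live` endpoints proposed in 2.2 have been implemented. The image reference, port `3001`, and the `betterbase-secrets` Secret are hypothetical placeholders for illustration, not values taken from the repository:
+
+```yaml
+# Hypothetical sketch: image, port, and Secret names are placeholders.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: betterbase-server
+spec:
+  replicas: 1                      # stay at 1 until the Redis bridge from 4.1 is configured
+  selector:
+    matchLabels:
+      app: betterbase-server
+  template:
+    metadata:
+      labels:
+        app: betterbase-server
+    spec:
+      terminationGracePeriodSeconds: 40   # longer than the 30s drain timeout in 2.3
+      containers:
+        - name: server
+          image: registry.example.com/betterbase-server:1.0.0   # placeholder; pin a real tag
+          ports:
+            - containerPort: 3001  # assumed port
+          env:
+            - name: BETTERBASE_JWT_SECRET
+              valueFrom:
+                secretKeyRef:
+                  name: betterbase-secrets   # hypothetical Secret
+                  key: jwt-secret
+          readinessProbe:          # proposed in 2.2: dependencies up, migrations complete
+            httpGet:
+              path: /ready
+              port: 3001
+            periodSeconds: 10
+          livenessProbe:           # proposed in 2.2: process alive
+            httpGet:
+              path: /live
+              port: 3001
+            periodSeconds: 10
+```
+
+Keeping `replicas: 1` follows the guidance in 4.1. The grace period is set above the 30s drain timeout from 2.3 so that Kubernetes does not send `SIGKILL` while in-flight requests are still finishing.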
+
+---
+
+## Implementation Roadmap
+
+### Phase 1: Security & Stability (Weeks 1–3)
+
+| Task | Severity | Deliverable |
+|------|----------|-------------|
+| Rate limiting middleware | Critical | `packages/server/src/middleware/rate-limit.ts` |
+| Enforce API key scopes | Critical | Updated `requireAdmin` + tests |
+| Transactional migrations | High | Updated `migrate.ts` with `BEGIN/COMMIT` |
+| Escape HTML in device verify | High | Updated `device/index.ts` + CSP |
+| Security headers middleware | Medium | Global Hono middleware |
+| Deep health checks | High | `/health`, `/ready`, `/live` |
+| Graceful shutdown | Medium | `SIGTERM` handler in server entry |
+
+### Phase 2: Operations (Weeks 3–5)
+
+| Task | Severity | Deliverable |
+|------|----------|-------------|
+| Prometheus `/metrics` | High | Metrics endpoint + Grafana dashboard |
+| Backup sidecar + docs | High | Compose service + restore guide |
+| E2E test suite | High | Tests against self-hosted Compose stack |
+| Upgrade documentation | Medium | Update [SELF_HOSTED.md](../../SELF_HOSTED.md) |
+| Log aggregation guide | Low | Docs for Loki/CloudWatch forwarding |
+
+### Phase 3: Multi-Instance (Weeks 5–7)
+
+| Task | Severity | Deliverable |
+|------|----------|-------------|
+| Redis integration for WS | Medium | Cross-instance pub/sub |
+| Redis-backed rate limits | Medium | Shared state across replicas |
+| Kubernetes manifests | Medium | `k8s/` directory + Helm chart |
+
+---
+
+## Acceptance Criteria for Self-Hosted Production
+
+All of the following must be true to declare BetterBase **production-ready for self-hosting**:
+
+1. **Auth abuse gate:** Login and device verification enforce rate limits and temporary lockouts.
+2. **Scope gate:** API key scopes are enforced on protected routes.
+3. **Migration gate:** Each migration executes atomically; startup aborts on failure.
+4. **Health gate:** `/health` checks database, storage, and Inngest; `/ready` and `/live` exist.
+5. **Shutdown gate:** Server handles `SIGTERM` gracefully without dropping in-flight requests.
+6. **E2E gate:** Critical flows (auth, CRUD, storage, realtime) pass E2E tests in CI against the self-hosted Compose stack.
+7. **Backup gate:** Automated Postgres backups are documented and runnable via Compose.
+8. **Metrics gate:** Prometheus metrics are available at `/metrics` with a reference Grafana dashboard.
+
+---
+
+## Related
+
+- [SELF_HOSTED.md](../../SELF_HOSTED.md) - Self-hosted deployment guide
+- [Production Checklist](./production-checklist.md) - Pre-deployment checklist
+- [Deployment](./deployment.md) - Deployment platform guides
+- [Security Best Practices](./security-best-practices.md) - Security hardening
+- [Hardening Review v3](../core/hardening-review-v3.md) - Backend security audit