Self-Serve MDR: Security review / threat model #937

@bjagg

Umbrella tracking issue for the security posture of the self-serve MDR feature shipped under issues #882, #883, and #884. Filed at PM request for visibility.

Each item below is a gap or concern that warrants its own analysis (and, in most cases, its own follow-up issue or PR). Pre-existing security findings caught during the 2026-05-26 dev re-deploy session are at the bottom.

Authentication / identity

Cryptographic material

  • HMAC secret sharing: mdr__auth__jwt_secret_key currently signs HS256 JWTs, HMACs the workspace cookie, and HMACs invite tokens. Rotating the key invalidates all three at once (operationally noisy). Consider a separate key per concern. See components/lif/mdr_auth/{workspace_cookie,invite_token}.py.
  • Secret rotation runbook — How is MdrAuthJwtSecretKey rotated, and what's the downstream blast radius? Not currently documented.
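
One way to picture the "separate key per concern" idea is a small keyring that holds an independent key list per purpose, so rotating one purpose leaves the others untouched. This is a minimal sketch, not the code in components/lif/mdr_auth; the class name `PurposeKeyring`, the purpose labels, and the `key_id.mac` tag format are all assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass, field


@dataclass
class PurposeKeyring:
    """Hypothetical keyring: an independent list of keys for each purpose."""

    # purpose -> list of (key_id, secret); the newest key is last.
    keys: dict[str, list[tuple[str, bytes]]] = field(default_factory=dict)

    def add_key(self, purpose: str, key_id: str, secret: bytes) -> None:
        """Register a new key; it becomes the signing key for that purpose."""
        self.keys.setdefault(purpose, []).append((key_id, secret))

    def sign(self, purpose: str, message: bytes) -> str:
        """Sign with the newest key for the purpose; tag is 'key_id.hexmac'."""
        key_id, secret = self.keys[purpose][-1]
        mac = hmac.new(secret, message, hashlib.sha256).hexdigest()
        return f"{key_id}.{mac}"

    def verify(self, purpose: str, message: bytes, tag: str) -> bool:
        """Accept any key still registered for this purpose (rotation overlap)."""
        key_id, _, mac = tag.partition(".")
        for known_id, secret in self.keys.get(purpose, []):
            if known_id == key_id:
                expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
                return hmac.compare_digest(expected.encode(), mac.encode())
        return False
```

With this shape, adding a second key under a hypothetical `"workspace_cookie"` purpose would not affect tags issued under `"invite_token"`, and old cookie tags keep verifying until the old key is removed from the list.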

Invite tokens

  • Reusable until expiry — Invite tokens are self-contained (no server-side store), so they are effectively reusable until the 7-day TTL. Single-use enforcement is explicitly deferred from v1. Decide if this is acceptable for the demo audience or needs to ship before broader rollout.
  • Inviter accountability — Token carries the inviter's Cognito sub but no audit log on the server side; if an invite is misused, we can't trace it back without DB-level forensics.

Tenant isolation correctness

  • Cross-tenant query test — Tenant routing depends entirely on SET search_path per request. We should have an integration test that asserts a request bearing tenant A's cookie cannot read data in tenant B's schema, including via SQL injection vectors and accidental superuser fallthrough.
  • search_path fallback — When the resolved tenant schema doesn't exist (e.g., the bug we hit on 2026-05-26 where Lambda failed silently to provision), what's the documented fallback? Should it be "deny" rather than "silently use the next schema in search_path"?
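
A "deny" fallback could look roughly like the following: validate the resolved schema name, refuse the request if the schema is not provisioned, and put only the tenant schema on the search path so there is no next schema to fall through to. This is a sketch under assumptions: the function name, the `TenantSchemaError` exception, the naming pattern, and passing the set of existing schemas as an argument (rather than querying the catalog) are all hypothetical.

```python
import re


class TenantSchemaError(Exception):
    """Raised when a request must be denied instead of falling through."""


# Assumed naming rule: lowercase identifier, at most 63 characters.
_SCHEMA_NAME = re.compile(r"[a-z_][a-z0-9_]{0,62}")


def search_path_statement(tenant_schema: str, existing_schemas: set[str]) -> str:
    """Build the per-request SET search_path statement, or deny the request."""
    if not _SCHEMA_NAME.fullmatch(tenant_schema):
        # Rejects quotes, semicolons, and whitespace before any SQL is built.
        raise TenantSchemaError(f"invalid tenant schema name: {tenant_schema!r}")
    if tenant_schema not in existing_schemas:
        # Deny instead of letting name resolution continue to another schema.
        raise TenantSchemaError(f"tenant schema not provisioned: {tenant_schema}")
    # Only the tenant schema is listed; no shared schema is appended as a fallback.
    return f'SET search_path TO "{tenant_schema}"'
```

A cross-tenant integration test could then assert both halves: a request for tenant A never produces a statement naming tenant B's schema, and a request for an unprovisioned tenant raises rather than returning data.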

IAM / least privilege

  • Post-confirmation Lambda IAM scope — The Lambda's role currently grants SSM read for one specific key + Cognito AdminAddUserToGroup. Audit whether it's exactly that or wider.
  • MDR API task role — Does it have any privileges (e.g., S3, KMS) beyond what tenant routing needs?
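
The "exactly that or wider" audit can be partly mechanized by diffing a role's policy document against an allow-list. The sketch below assumes the expected grants are `ssm:GetParameter` and `cognito-idp:AdminAddUserToGroup` (the issue names the Cognito action but not the exact SSM action), and it operates on an already-fetched policy JSON; the function name and finding strings are hypothetical.

```python
# Assumed allow-list for the post-confirmation Lambda role (lowercased,
# since IAM action names are case-insensitive).
EXPECTED_ACTIONS = {"ssm:getparameter", "cognito-idp:adminaddusertogroup"}


def _as_list(value) -> list:
    """IAM policy JSON allows a single string where a list is accepted."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def excess_grants(policy: dict, expected: set[str] = EXPECTED_ACTIONS) -> list[str]:
    """Return findings for Allow statements wider than the allow-list."""
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        for action in actions:
            if "*" in action or action.lower() not in expected:
                findings.append(f"unexpected action: {action}")
        if "*" in _as_list(stmt.get("Resource")):
            findings.append(f"wildcard resource for: {', '.join(actions)}")
    return findings
```

This only covers identity policies with plain `Action` and `Resource` keys; `NotAction`, conditions, and managed policies attached elsewhere would need separate handling, so an empty result is a starting point rather than proof of least privilege.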

Logging / audit / leakage

Operational (CFN drift; demo prep finding)

  • Stack drift detection — On 2026-05-26 we discovered the dev CFN stacks were 6 weeks stale: code in main had moved past what was deployed. Should we wire a drift detector that alerts when dev-lif-mdr-cognito or dev-lif-mdr-api falls more than N commits behind main? Same risk on demo.
  • Flyway migration application gating — The 6-week drift hid the fact that V1.2/V1.3/V1.4 migrations had never run on dev. ECS task expected new schema state; DB was on V1.1. Recovery required a SAM redeploy. Consider failing MDR startup health-check if flyway_schema_history doesn't match the expected highest version baked into the image.
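
The startup gate suggested in the last bullet could reduce to a comparison like the one below. It is a sketch, assuming the health check has already read `(version, success)` pairs from flyway_schema_history and that the expected version is baked into the image as a dotted string such as "1.4"; the function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '1.4' into (1, 4) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def schema_is_current(history_rows, expected_version: str) -> bool:
    """True only if the highest successfully applied version equals the expected one.

    history_rows: iterable of (version, success) pairs. Rows with a NULL
    version (repeatable migrations) and failed rows are ignored.
    """
    applied = [
        parse_version(version)
        for version, success in history_rows
        if version is not None and success
    ]
    if not applied:
        return False  # empty or unreadable history: fail the health check
    return max(applied) == parse_version(expected_version)
```

For the situation described above, rows ending at "1.1" checked against an expected "1.4" would fail the check, so the task would report unhealthy instead of running against a stale schema.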

Related issues
