Add graceful degradation for external service dependencies #192
Merged
Problem
When Postgres is unavailable at startup, the application crashes or enters a boot loop, and when Redis is down, the health endpoint returns 500 instead of a meaningful degraded status. There is no structured startup logging to show which services are reachable, which makes transient dependency failures hard to diagnose.
Solution
Added a lazy DB availability probe (`probe_db`/`is_db_available`) in `db/session.py` that never raises: the engine object is created unconditionally (SQLAlchemy defers the TCP handshake to the first query), and a cheap `SELECT 1` is used to detect reachability. The lifespan in `main.py` now probes Postgres, Redis, and Metrics at startup and emits structured `[STARTUP] Service: STATE (note)` log lines followed by a summary line (`Application ready (all services healthy)` or `Application ready (degraded mode)`). The root `/health` endpoint now returns 503 with `"postgres": "unavailable"` when the DB probe fails, and the deep `/api/v1/health` endpoint returns 503 when Postgres is down while keeping 200 for Redis/Discord failures (non-critical). The `_boot_guard` middleware is unchanged: it only blocks on hard config errors, not DB unavailability, so the app continues to serve all routes in degraded mode.
Changes
backend/app/db/session.py
backend/app/api/v1/endpoints/health.py
backend/app/main.py
Generated by Railway
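For reviewers who want a feel for the degradation flow described in the Solution, here is a minimal, illustrative sketch. It is not the code from this PR. The standard-library `sqlite3` module stands in for Postgres so the example runs anywhere. The function signatures, the `health_response` and `startup_lines` helpers, the `CRITICAL_SERVICES` set, and the `ok`/`OK`/`UNAVAILABLE` labels are assumptions made for illustration. Only the names `probe_db` and `is_db_available`, the `"postgres": "unavailable"` value, the 200/503 split, and the log line shapes come from the description.

```python
import sqlite3
from typing import Callable, Dict, List, Optional, Tuple

# Cached result of the last probe; None means "not probed yet".
_db_available: Optional[bool] = None

# Hypothetical: services whose failure turns the health check into a 503.
CRITICAL_SERVICES = {"postgres"}


def probe_db(connect: Callable[[], sqlite3.Connection]) -> bool:
    """Run a cheap SELECT 1 and record reachability. Never raises."""
    global _db_available
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        _db_available = True
    except Exception:
        # Any failure (connect or query) just marks the DB as unavailable.
        _db_available = False
    return _db_available


def is_db_available() -> bool:
    """Return the cached probe result; False if no probe has succeeded."""
    return bool(_db_available)


def health_response(states: Dict[str, bool]) -> Tuple[int, Dict[str, str]]:
    """Map per-service reachability to an HTTP status and a JSON-like body."""
    body = {name: "ok" if up else "unavailable" for name, up in states.items()}
    critical_down = any(not states.get(name, False) for name in CRITICAL_SERVICES)
    return (503 if critical_down else 200), body


def startup_lines(states: Dict[str, Tuple[bool, str]]) -> List[str]:
    """Build '[STARTUP] Service: STATE (note)' lines plus a summary line."""
    lines = [
        f"[STARTUP] {name}: {'OK' if up else 'UNAVAILABLE'} ({note})"
        for name, (up, note) in states.items()
    ]
    healthy = all(up for up, _ in states.values())
    summary = "all services healthy" if healthy else "degraded mode"
    lines.append(f"Application ready ({summary})")
    return lines


if __name__ == "__main__":
    probe_db(lambda: sqlite3.connect(":memory:"))
    print(health_response({"postgres": is_db_available(), "redis": False}))
    for line in startup_lines({
        "Postgres": (is_db_available(), "SELECT 1 succeeded"),
        "Redis": (False, "connection refused"),
    }):
        print(line)
```

The probe swallows every exception on purpose, so a caller such as a startup hook or health handler can branch on a boolean instead of wrapping each call in its own `try`. Keeping the critical set separate from the probing logic is one way to let Redis or Discord outages report as degraded without failing the check.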