
feat: resilient background job manager with retry & monitoring #382

Open
robotica4us-collab wants to merge 1 commit into rohitdash08:main from robotica4us-collab:feat/resilient-background-jobs

Conversation

@robotica4us-collab

Implements a production-ready background job system that wraps APScheduler with automatic retry, dead-letter tracking, and full observability.

Features:

  • Configurable exponential backoff retry (per-job RetryPolicy)
  • Dead-letter queue for permanently failed jobs with admin reset
  • Per-job execution history with timing and error details
  • Prometheus metrics (executions, retries, dead-letters, duration)
  • Redis-backed state persistence for crash recovery
  • Admin API endpoints (/jobs/status, /jobs/dead-letters, /jobs/<id>/reset)
  • Unauthenticated health check (/jobs/health) for monitoring systems
  • Automatic Flask app context injection for job functions
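
To make the retry and history features above concrete, here is a minimal sketch of what a per-job retry policy with capped exponential backoff and a bounded execution history could look like. Only the names RetryPolicy and JobState come from this PR; every field name, default value, and method below is an assumption made for illustration, not the code in this branch.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Illustrative retry settings; field names are assumptions."""
    max_attempts: int = 3
    base_delay: float = 1.0   # seconds to wait before the first retry
    multiplier: float = 2.0   # growth factor applied per retry
    max_delay: float = 60.0   # upper bound on any single delay

    def delay_for(self, attempt: int) -> float:
        """Delay after the given 1-based failed attempt, capped at max_delay."""
        raw = self.base_delay * self.multiplier ** (attempt - 1)
        return min(raw, self.max_delay)


@dataclass
class JobState:
    """Illustrative per-job state with a bounded execution history."""
    job_id: str
    attempts: int = 0
    dead_lettered: bool = False
    # deque(maxlen=...) silently drops the oldest entry once full.
    history: deque = field(default_factory=lambda: deque(maxlen=50))

    def record(self, ok: bool, duration: float, error: Optional[str] = None) -> None:
        """Append one execution record with timing and error details."""
        self.history.append({"ok": ok, "duration": duration, "error": error})
```

With base_delay=1.0, multiplier=2.0 and max_delay=5.0, the delays for attempts 1 through 4 would be 1, 2, 4 and 5 seconds, the last one being capped.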

The reminder processing job is wired up as the first managed job, replacing the manual /reminders/run endpoint with automatic processing.
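
As a rough illustration of the app context injection mentioned in the feature list, a managed job can be wrapped so that it always runs inside a context supplied by a factory. The helper name with_context, the stand-in context manager, and the process_reminders function are all hypothetical; the sketch uses only the standard library so it runs without Flask installed.

```python
import functools
from contextlib import AbstractContextManager, contextmanager
from typing import Callable

events = []  # records the order of operations for the demo below


def with_context(context_factory: Callable[[], AbstractContextManager],
                 func: Callable) -> Callable:
    """Return a wrapper that enters a fresh context around every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with context_factory():
            return func(*args, **kwargs)
    return wrapper


@contextmanager
def fake_app_context():
    """Stand-in for an application context (hypothetical)."""
    events.append("push")
    try:
        yield
    finally:
        events.append("pop")


def process_reminders():
    """Placeholder job body (hypothetical)."""
    events.append("run")
    return "done"


managed = with_context(fake_app_context, process_reminders)
```

In a Flask application the factory would presumably be app.app_context, so the scheduler thread can use extensions that require an active application context.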

Includes 17 tests covering:

  • RetryPolicy exponential backoff and delay capping
  • JobState history capping
  • Job registration and status reporting
  • Dead-letter detection and reset
  • Flask integration (health endpoint, admin auth)
  • Successful execution resets attempts
  • Failing jobs retry then dead-letter
  • Transient failures recover on retry
  • Redis persistence and crash recovery
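
The retry-related cases in this list can be pictured with a small, synchronous stand-in. This is a sketch under stated assumptions, not the test suite in this PR: run_with_retry is a hypothetical helper that retries inline, whereas the real manager wraps APScheduler, and the sleep function is injected so the tests do not actually wait.

```python
def run_with_retry(func, max_attempts, delay_for, sleep):
    """Call func until it succeeds or attempts run out.

    Returns (succeeded, attempts_used). A False result is what a
    job manager would treat as a dead-lettered job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            func()
            return True, attempt
        except Exception:
            # Back off only if another attempt will follow.
            if attempt < max_attempts:
                sleep(delay_for(attempt))
    return False, max_attempts


def test_failing_job_retries_then_dead_letters():
    slept = []

    def always_fails():
        raise RuntimeError("boom")

    ok, attempts = run_with_retry(
        always_fails, 3, lambda n: 2 ** (n - 1), slept.append
    )
    assert not ok and attempts == 3
    assert slept == [1, 2]  # two backoff waits between three attempts


def test_transient_failure_recovers_on_retry():
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("temporary")

    ok, attempts = run_with_retry(flaky, 5, lambda n: 0, lambda s: None)
    assert ok and attempts == 3
```

Injecting the sleep and delay functions keeps the tests fast and lets them assert on the exact backoff sequence.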

Closes #130

Summary

  • What changed:
  • Why:

Validation

  • Frontend lint: cd app && npm run lint
  • Frontend tests: cd app && npm test -- --runInBand
  • Backend tests: ./scripts/test-backend.ps1
  • Updated docs if needed

Security and Ownership

  • PR opened from a fork (not direct push to main)
  • CODEOWNERS review requested

Checklist

  • No secrets added
  • No unrelated files changed
  • Breaking changes documented

Development

Successfully merging this pull request may close these issues.

Resilient background job retry & monitoring