feat: resilient background job manager with retry & monitoring by robotica4us-collab · Pull Request #382 · rohitdash08/FinMind

robotica4us-collab · 2026-03-13T03:18:08Z

Implements a production-ready background job system that wraps APScheduler with automatic retry, dead-letter tracking, and full observability.

Features:

- Configurable exponential backoff retry (per-job RetryPolicy)
- Dead-letter queue for permanently failed jobs with admin reset
- Per-job execution history with timing and error details
- Prometheus metrics (executions, retries, dead-letters, duration)
- Redis-backed state persistence for crash recovery
- Admin API endpoints (/jobs/status, /jobs/dead-letters, /jobs/<id>/reset)
- Unauthenticated health check (/jobs/health) for monitoring systems
- Automatic Flask app context injection for job functions
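
For reviewers who want a concrete picture of the retry behaviour, here is a minimal sketch of exponential backoff with delay capping and a dead-letter outcome. It is not the code in this PR: the field names (`max_attempts`, `base_delay`, `multiplier`, `max_delay`), the `delay_for` method, and the `run_with_retry` helper are all assumptions made for illustration, and the real manager presumably reschedules retries through APScheduler rather than sleeping inline.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-job retry settings; field names are assumptions."""
    max_attempts: int = 3
    base_delay: float = 1.0   # seconds to wait before the first retry
    multiplier: float = 2.0   # growth factor applied on each further retry
    max_delay: float = 60.0   # upper bound on any single delay

    def delay_for(self, attempt: int) -> float:
        """Return the delay to wait after the given 1-based failed attempt."""
        if attempt < 1:
            raise ValueError("attempt must be >= 1")
        raw = self.base_delay * self.multiplier ** (attempt - 1)
        return min(raw, self.max_delay)


def run_with_retry(func, policy, sleep=time.sleep):
    """Call func until it succeeds or the policy is exhausted.

    Returns ("ok", result) on success, or ("dead_letter", last_error)
    once max_attempts consecutive calls have raised.
    """
    last_error = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return "ok", func()
        except Exception as exc:  # illustrative: treat every error as retryable
            last_error = exc
            if attempt < policy.max_attempts:
                sleep(policy.delay_for(attempt))
    return "dead_letter", last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a test can pass `delays.append` and inspect the recorded delays.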

The reminder processing job is wired up as the first managed job, replacing the manual /reminders/run endpoint with automatic processing.

Includes 17 tests covering:

- RetryPolicy exponential backoff and delay capping
- JobState history capping
- Job registration and status reporting
- Dead-letter detection and reset
- Flask integration (health endpoint, admin auth)
- Successful execution resets attempts
- Failing jobs retry then dead-letter
- Transient failures recover on retry
- Redis persistence and crash recovery
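
Several of the behaviours listed above (history capping, attempt reset on success, dead-letter detection and reset) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the `JobState` class from this PR: the attribute names, the `record` and `reset` methods, and the `ExecutionRecord` type are hypothetical, and Redis persistence is left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionRecord:
    """One run of a job (hypothetical shape)."""
    succeeded: bool
    duration_seconds: float
    error: Optional[str] = None


@dataclass
class JobState:
    """In-memory stand-in for per-job state; names are assumptions."""
    max_history: int = 50
    attempts: int = 0
    dead_lettered: bool = False
    history: deque = field(init=False, repr=False)

    def __post_init__(self):
        # A bounded deque silently drops the oldest record once it is full.
        self.history = deque(maxlen=self.max_history)

    def record(self, run: ExecutionRecord, max_attempts: int) -> None:
        """Store a run and update the consecutive-failure counter."""
        self.history.append(run)
        if run.succeeded:
            self.attempts = 0
        else:
            self.attempts += 1
            if self.attempts >= max_attempts:
                self.dead_lettered = True

    def reset(self) -> None:
        """Admin reset: clear the dead-letter flag and failure counter."""
        self.attempts = 0
        self.dead_lettered = False
```

Using `deque(maxlen=...)` is one simple way to cap history; the PR may trim a list instead.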

Closes #130

Summary

What changed:
Why:

Validation

Frontend lint: cd app && npm run lint
Frontend tests: cd app && npm test -- --runInBand
Backend tests: ./scripts/test-backend.ps1
Updated docs if needed

Security and Ownership

PR opened from a fork (not direct push to main)
CODEOWNERS review requested

Checklist

No secrets added
No unrelated files changed
Breaking changes documented


robotica4us-collab requested a review from rohitdash08 as a code owner March 13, 2026 03:18


robotica4us-collab wants to merge 1 commit into rohitdash08:main from robotica4us-collab:feat/resilient-background-jobs
