Overview

This PR introduces Celery-based distributed task processing for Poolboy, enabling horizontal scalability and improved performance through asynchronous background processing of resource management operations.

Motivation

Poolboy supports a manager-mode architecture in which multiple operator replicas distribute event handling across pods. This approach provides good reliability and basic load distribution.

Celery integration complements this architecture by adding:

  • True parallel processing - Heavy operations (templating, provisioning) run in dedicated worker processes without blocking event handling
  • Independent scaling - Workers scale separately from the operator via HPA, responding dynamically to workload
  • Task prioritization - Different queue configurations for different operation types
  • Enhanced observability - Per-task metrics, execution tracking, and monitoring via Flower dashboard
  • Robust retry handling - Automatic retry with exponential backoff for transient failures

Solution Architecture

The implementation adds three new Kubernetes components alongside the existing Poolboy operator (a minimal Celery-to-Redis wiring sketch follows the list):

  1. Redis - Message broker and result backend, also used for distributed locks and shared cache
  2. Celery Workers - Execute resource management tasks in parallel
  3. Celery Beat Scheduler (optional) - Periodic task scheduling for reconciliation
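
For illustration, a minimal Celery application wired to Redis as both broker and result backend could look like the sketch below. The module name, Redis hostname, and queue names are assumptions for this sketch, not taken from the PR:

# celery_app.py - hypothetical Celery application configuration
from celery import Celery

app = Celery(
    "poolboy",
    broker="redis://redis:6379/0",   # Redis as message broker (hostname is illustrative)
    backend="redis://redis:6379/1",  # Redis as result backend
)

# Route resource-management tasks to dedicated queues (illustrative names)
app.conf.task_routes = {
    "tasks.manage_resource_handle": {"queue": "resource-handle"},
    "tasks.manage_resource_claim": {"queue": "resource-claim"},
}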

Processing Flow

When workers are enabled:

  1. The operator receives Kubernetes events via Kopf handlers
  2. Instead of processing directly, it dispatches a Celery task with the resource definition
  3. Workers pick up tasks from partitioned queues and execute the same business logic
  4. Distributed locks ensure only one worker processes a given resource at a time
  5. Periodic maintenance uses Celery's group primitive to batch process multiple resources in parallel

The operator falls back to synchronous processing when workers are disabled, ensuring backward compatibility.
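
A minimal sketch of this dispatch-or-fallback pattern, using a hypothetical Kopf handler and Celery task; the real handler signatures, task names, and API group in the PR may differ:

# Hypothetical Kopf event handler that dispatches to Celery when workers are enabled
import kopf

from tasks import manage_resource_claim  # hypothetical Celery task module

USE_WORKERS = True  # would be driven by the useWorkers.resourceClaim.enabled flag


async def process_resource_claim(definition):
    """Placeholder for the existing synchronous business logic."""


@kopf.on.event("poolboy.example.com", "v1", "resourceclaims")  # placeholder API group
async def resource_claim_event(event, **_):
    definition = event["object"]
    if USE_WORKERS:
        # Hand the resource definition to a worker; the handler returns immediately
        manage_resource_claim.delay(definition)
    else:
        # Synchronous fallback: run the same business logic in-process
        await process_resource_claim(definition)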


Key Features

Feature Flags per Resource Type

Each resource class can be individually enabled for worker processing:

  • useWorkers.resourcePool.enabled
  • useWorkers.resourceHandle.enabled
  • useWorkers.resourceClaim.enabled

This allows gradual migration and rollback per component.

Operation Modes

Three modes control how periodic reconciliation is handled; a Beat-schedule sketch follows the table:

Mode        Description                              Use Case
scheduler   Celery Beat dispatches periodic tasks    Default - recommended
daemon      Kopf daemons dispatch to workers         Lower latency
both        Both mechanisms run                      High availability
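
In scheduler mode, Celery Beat owns the periodic dispatch. A hypothetical beat schedule, reusing the app object sketched earlier; task names and intervals are illustrative:

# Hypothetical Celery Beat schedule for periodic reconciliation
from celery.schedules import schedule

from celery_app import app  # hypothetical module from the earlier sketch

app.conf.beat_schedule = {
    "reconcile-resource-pools": {
        "task": "tasks.reconcile_resource_pools",   # illustrative task name
        "schedule": schedule(run_every=60.0),       # every 60 seconds
    },
    "reconcile-resource-handles": {
        "task": "tasks.reconcile_resource_handles",
        "schedule": 120.0,                          # plain seconds also work
    },
}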

Partitioned Queues

Tasks are routed to partitioned queues using consistent hashing of namespace/name; a sketch of the mapping follows this list:

  • Same resource → same partition → sequential processing (event ordering preserved)
  • Different resources → parallel processing across partitions and workers
  • Configurable partition count per resource type
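
One plausible implementation of the namespace/name-to-partition mapping, as a sketch; the actual hash function and queue naming in the PR may differ:

# Hypothetical partition routing: the same namespace/name always maps to the same queue
import hashlib

def queue_for(resource_type: str, namespace: str, name: str, partitions: int) -> str:
    """Map a resource to one of N partitioned queues via a stable hash."""
    key = f"{namespace}/{name}".encode()
    partition = int(hashlib.sha256(key).hexdigest(), 16) % partitions
    return f"{resource_type}-{partition}"

# With partitions=4, events for "user-ns/my-claim" always land on the same queue,
# preserving their order, while other claims are processed in parallel.
queue = queue_for("resource-claim", "user-ns", "my-claim", partitions=4)
# manage_resource_claim.apply_async(args=[definition], queue=queue)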

Distributed Locking

Redis-based distributed locks prevent concurrent processing of the same resource across workers (see the sketch after this list):

  • Lock acquired before execution, released after
  • Tasks retry with exponential backoff if lock is unavailable
  • Lock timeout prevents deadlocks from crashed workers
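
A sketch of the locking pattern using redis-py's built-in lock together with Celery's automatic retry and backoff; the task, key format, and business-logic function are assumptions:

# Hypothetical worker task guarded by a Redis lock, retried with exponential backoff
import redis

from celery_app import app  # hypothetical module from the earlier sketch

redis_client = redis.Redis(host="redis", port=6379, db=0)


def handle_resource(definition):
    """Placeholder for the existing ResourceHandle business logic."""


@app.task(bind=True, autoretry_for=(redis.exceptions.LockError,),
          retry_backoff=True, max_retries=5)
def manage_resource_handle(self, definition):
    meta = definition["metadata"]
    lock_key = f"lock:resourcehandle:{meta['namespace']}/{meta['name']}"
    # The timeout releases the lock even if this worker crashes mid-task
    lock = redis_client.lock(lock_key, timeout=300, blocking_timeout=0)
    if not lock.acquire():
        # Another worker holds the lock; LockError triggers an automatic retry
        raise redis.exceptions.LockError("resource is locked by another worker")
    try:
        handle_resource(definition)
    finally:
        lock.release()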

Shared Redis Cache

A unified cache system enables state sharing between the operator and workers (a rough sketch follows this list):

  • Object caches (ResourceHandle, ResourceClaim, ResourcePool) stored in Redis
  • Workers can access cached objects without API round-trips
  • Eliminates the need for separate "processed" markers
  • Automatic backend selection: in-memory for standalone, Redis for distributed
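
The backend selection could be sketched roughly as follows; the class name, key format, and REDIS_URL variable are assumptions for illustration:

# Hypothetical unified cache: in-memory for standalone mode, Redis when distributed
import json
import os

import redis

class ResourceCache:
    def __init__(self):
        url = os.environ.get("REDIS_URL")          # assumed configuration variable
        self._redis = redis.Redis.from_url(url) if url else None
        self._local = {}                           # in-memory fallback

    def _key(self, kind, namespace, name):
        return f"cache:{kind}:{namespace}/{name}"

    def set(self, kind, namespace, name, obj):
        if self._redis:
            # Stored in Redis so workers see it without an API round-trip
            self._redis.set(self._key(kind, namespace, name), json.dumps(obj), ex=3600)
        else:
            self._local[self._key(kind, namespace, name)] = obj

    def get(self, kind, namespace, name):
        if self._redis:
            raw = self._redis.get(self._key(kind, namespace, name))
            return json.loads(raw) if raw else None
        return self._local.get(self._key(kind, namespace, name))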

Horizontal Pod Autoscaler

Workers support HPA based on CPU/memory utilization for automatic scaling under load.


Components Migrated

Component          Status         Notes
ResourcePool       Complete       Pool maintenance via workers
ResourceHandle     Complete       Handle provisioning and lifecycle management via workers
ResourceClaim      Complete       Claim reconciliation via workers (with binding caveat)
ResourceProvider   Not migrated   Static configuration, no worker benefit
ResourceWatch      Not migrated   Watch management remains in operator

Note: ResourceClaim binding still occurs in the operator to leverage in-memory cache for pool handle discovery. Once bound, subsequent updates are processed by workers.


Configuration

All configuration is managed through Helm values.yaml:

# Enable worker infrastructure
worker:
  enabled: true
  replicas: 2
  config:
    concurrency: 4
    pool: prefork

# Enable workers per resource type
useWorkers:
  resourcePool:
    enabled: true
    daemonMode: "scheduler"
    partitions: 2
  resourceHandle:
    enabled: true
    daemonMode: "scheduler"
    partitions: 4
  resourceClaim:
    enabled: true
    daemonMode: "scheduler"
    partitions: 4

# Redis for message broker and cache
redis:
  enabled: true
  persistence:
    enabled: true

# Optional: Celery Beat scheduler for periodic tasks
scheduler:
  enabled: true

Observability

  • Task execution tracked via Celery's built-in mechanisms
  • Flower dashboard available for task monitoring (optional deployment)
  • Redis metrics available via standard Redis exporters
  • Worker pod metrics for HPA scaling decisions

Introduce Celery workers for asynchronous processing of resource
management operations, enabling horizontal scalability and improved
performance for Poolboy.

Key changes:
- Add Redis as message broker and shared cache backend
- Implement Celery workers for ResourcePool, ResourceHandle, and
  ResourceClaim processing
- Add partitioned queues with consistent hashing for event ordering
- Implement distributed locking to prevent concurrent resource access
- Create unified cache system shared between operator and workers
- Add Celery Beat scheduler for periodic reconciliation tasks
- Support three operation modes: daemon, scheduler, or both
- Add HPA configuration for worker auto-scaling
- Maintain backward compatibility with synchronous fallback

The implementation uses feature flags to enable workers per resource
type, allowing gradual migration and easy rollback. All existing
business logic is preserved - workers execute the same class methods
that previously ran synchronously in the operator.

The 'operator' directory conflicts with Python's stdlib 'operator' module
when Celery imports it. Renaming the entry point to main.py and using the
KOPF_OPERATORS env var eliminates the need for the poolboy_worker.py
workaround.

Changes:
- Rename operator/operator.py to operator/main.py
- Add KOPF_OPERATORS=main.py to operator deployment
- Simplify worker/scheduler to use direct celery command
- Remove poolboy_worker.py workaround

The celery command requires the operator directory in PYTHONPATH
to find the 'tasks' module during autodiscovery.

Changes:
- Add workingDir: /opt/app-root/operator to worker and scheduler
- Add PYTHONPATH=/opt/app-root/operator environment variable
- Use direct celery command instead of shell wrapper