Suggestion for: Domain-Aware Database Routing for Pulpcore #7819

@YasenT

Domain-Aware Database Routing for Pulpcore

Problem Statement

A single RDS instance hosts hundreds of Pulp domains. A handful of hot domains (heavy sync, large content libraries) are bottlenecking the shared instance. The goal is to selectively offload specific domains to dedicated satellite RDS instances while the majority remain on the original.

This requires solving four design problems: (1) migration orchestration against N RDS instances, (2) failure handling when one instance fails mid-migration, (3) startup gating so pods verify all instances are ready, and (4) rollback of partial multi-DB migrations and domain moves.


Current State

  • DOMAIN_ENABLED = True, hundreds of domains on one RDS
  • Single DATABASES["default"] entry pointing to that RDS
  • No DATABASE_ROUTERS, no .using() calls, no multi-DB code
  • Workers coordinate via pg_notify and advisory locks on the single instance
  • All pods (API, content, worker) connect to one database

Target Architecture

flowchart TB
  subgraph pods [Pulp Pods]
    API["API pods (all connect to all RDS)"]
    Content["Content pods (all connect to all RDS)"]
    Worker["Worker pods (all connect to all RDS)"]
  end

  subgraph originalRDS ["Original RDS (default)"]
    ControlPlane["Control plane: Task, AppStatus, Role,\nAccessPolicy, ProgressReport, SystemID,\nDjango auth/contenttypes/sessions"]
    DefaultDomains["Data plane: ~hundreds of domains\n(repos, content, artifacts, distributions...)"]
    DomainTable["Domain table (authoritative)\nwith database_alias field"]
  end

  subgraph sat1 ["Satellite RDS 1"]
    Sat1Schema["Full schema (identical)"]
    Sat1Data["Data plane: 1-3 hot domains"]
    Sat1Domain["Domain table (replica)"]
  end

  subgraph sat2 ["Satellite RDS 2"]
    Sat2Schema["Full schema (identical)"]
    Sat2Data["Data plane: 1-3 hot domains"]
    Sat2Domain["Domain table (replica)"]
  end

  pods --> originalRDS
  pods --> sat1
  pods --> sat2

Key properties:

  • The original RDS is BOTH control plane AND default data plane. It does not change structurally.
  • Satellite RDS instances start empty, receive the full schema, then receive moved domain data.
  • Every pod connects directly to every RDS instance. 95%+ of queries still hit the original.
  • A domain starts on default and can be moved to a satellite. Moving it back is equally valid.

Model Classification

Control plane -- always on the original RDS (default), never routed:

  • Domain (authoritative copy; replicated read-only to satellites)
  • Task, TaskGroup, TaskSchedule, CreatedResource, ProfileArtifact
  • AppStatus, SystemID
  • AccessPolicy, Role, UserRole, GroupRole
  • ProgressReport, GroupProgressReport
  • Django built-ins (auth_*, django_content_type, django_migrations, django_admin_log, django_session)

Data plane -- routed by Domain.database_alias:

  • Repository, RepositoryVersion, RepositoryContent, RepositoryVersionContentDetails
  • Content, Artifact, ContentArtifact, RemoteArtifact, PulpTemporaryFile
  • Remote
  • Publication, PublishedArtifact, PublishedMetadata
  • Distribution, ContentGuard (and all subtypes)
  • Upload, UploadChunk
  • Exporter, Export, Importer, Import and subtypes
  • AlternateContentSource, AlternateContentSourcePath
  • SigningService and subtypes
  • UpstreamPulp
  • ALL plugin-defined models

Why Control Plane Stays on the Original RDS

Worker coordination (pg_notify, advisory locks, SELECT ... FOR UPDATE SKIP LOCKED) requires a single PostgreSQL instance. Tasks, progress reports, and scheduling are tightly coupled to this coordination. Splitting them would require rearchitecting the entire tasking system for no immediate benefit -- the control-plane tables are small and not the bottleneck.


Placement Map

A new field on the Domain model:

# pulpcore/app/models/domain.py
class Domain(BaseModel, AutoAddObjPermsMixin):
    # ... existing fields ...
    database_alias = models.SlugField(
        default="default",
        help_text="DATABASES alias where this domain's data-plane objects reside.",
    )

  • Defaults to "default" -- zero behavioral change for all existing domains
  • Only changed when a domain is explicitly moved to a satellite via move-domain tooling
  • Cached in-process (Django cache framework, invalidated on Domain save)
  • Validated against settings.DATABASES keys on save
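
The validation rule in the last bullet could look roughly like the following. This is a minimal sketch, not the proposed implementation: `validate_database_alias` is a hypothetical helper name, and the configured aliases are passed in explicitly so the example runs without Django. In Pulp it would presumably read `settings.DATABASES` and raise a Django `ValidationError` instead of `ValueError`.

```python
def validate_database_alias(alias, configured_aliases):
    """Return alias if it names a configured database, else raise ValueError.

    `configured_aliases` stands in for settings.DATABASES.keys() (assumption).
    """
    if alias not in configured_aliases:
        known = ", ".join(sorted(configured_aliases))
        raise ValueError(f"Unknown database alias {alias!r}; configured: {known}")
    return alias


# Illustrative settings only; real entries would carry connection parameters.
DATABASES = {"default": {}, "data_1": {}}
print(validate_database_alias("data_1", DATABASES))
```

Rejecting unknown aliases at save time keeps a typo from silently routing a domain to a database that no pod can reach.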

Django DB Router

New file: pulpcore/app/db_router.py

from pulpcore.app.util import get_domain  # existing helper that returns the current domain


class PulpDomainRouter:
    CONTROL_PLANE_LABELS = {
        "core.domain", "core.task", "core.taskgroup", "core.taskschedule",
        "core.createdresource", "core.appstatus", "core.systemid",
        "core.accesspolicy", "core.role", "core.userrole", "core.grouprole",
        "core.progressreport", "core.groupprogressreport", "core.profileartifact",
    }
    DJANGO_APPS = {"auth", "contenttypes", "admin", "sessions"}

    def _is_control_plane(self, model):
        label = f"{model._meta.app_label}.{model._meta.model_name}"
        return (label in self.CONTROL_PLANE_LABELS
                or model._meta.app_label in self.DJANGO_APPS)

    def _resolve_db(self, model, **hints):
        if self._is_control_plane(model):
            return "default"
        # 1. Check instance hint (most reliable -- Django passes the object being saved)
        instance = hints.get("instance")
        # Compare against None explicitly; truthiness of an object can invoke __bool__/__len__.
        if instance is not None and hasattr(instance, "pulp_domain"):
            domain = instance.pulp_domain
            if domain:
                return getattr(domain, "database_alias", "default")
        # 2. Check ContextVar (set by middleware for HTTP requests, by task runner for tasks)
        domain = get_domain()
        if domain:
            return getattr(domain, "database_alias", "default")
        # 3. Safe default: original RDS
        return "default"

    def db_for_read(self, model, **hints):
        return self._resolve_db(model, **hints)

    def db_for_write(self, model, **hints):
        return self._resolve_db(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return True  # Identical schema everywhere

Known Router Limitations and Mitigations

Problem: db_for_read cannot see queryset filters. When code does Repository.objects.filter(pulp_domain=X), the router sees Repository but not which domain.

Mitigation: The router falls back to the ContextVar, which the middleware sets for every HTTP request and with_task_context() sets for every task. For the common API/task codepath, this works correctly.
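
The resolution order (instance hint, then ambient context, then "default") can be illustrated without Django. The sketch below is an assumption-laden stand-in: `_current_alias`, `domain_alias`, and `resolve_alias` are hypothetical names, and the real router would read `Domain.database_alias` from the domain object rather than carrying a bare alias string.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical stand-in for Pulp's per-request / per-task domain context.
_current_alias = ContextVar("current_alias", default=None)


@contextmanager
def domain_alias(alias):
    """Set the ambient alias for the duration of a request or task."""
    token = _current_alias.set(alias)
    try:
        yield
    finally:
        # Always restore the previous value, even if the body raised.
        _current_alias.reset(token)


def resolve_alias(instance_alias=None):
    """Instance hint wins, then the ambient context, then 'default'."""
    return instance_alias or _current_alias.get() or "default"
```

Inside `with domain_alias("data_1"):` a call to `resolve_alias()` yields `"data_1"`; outside any context it falls back to `"default"`, which mirrors the safe default described in this section.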

Remaining gaps requiring explicit .using() calls:

  • Management commands that iterate over multiple domains -- must wrap each iteration in with_domain(d) context manager AND call .using(d.database_alias) on querysets
  • Orphan cleanup and data repair tasks that scan all domains -- must iterate per-domain with explicit routing
  • Import/export operations crossing domains -- must explicitly .using() the source and target

These gaps are enumerated and tracked as Phase 1 work items, not hand-waved away.

Safe failure mode (reads): If the router has no domain context, it returns "default" (the original RDS). For a domain that has been moved to a satellite, a read therefore hits the original RDS: after cleanup it returns empty results rather than corrupt data, and during the migration window (before old data is cleaned up) it returns the stale copy. A write issued without domain context would also land on the original RDS, which is why the gaps listed above need explicit routing rather than reliance on the fallback.


Problem 1: Migration Orchestration Against N RDS Instances

How It Works

All RDS instances (original + satellites) run identical schemas. Django's migrate is invoked once per database alias.

New management command: migrate-all

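from django.conf import settings
from django.core.management import call_command
from django.core.management.base import BaseCommand
from django.utils.timezone import now

# advisory_lock and MigrationStatus are proposed helpers/models (see "New files");
# they do not exist in pulpcore today.


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("--target", nargs=2, metavar=("APP", "MIGRATION"),
                            help="Target migration (for rollback)")
        # Parsed but not yet acted on in this sketch.
        parser.add_argument("--parallel", action="store_true",
                            help="Migrate databases in parallel (use with caution)")

    def handle(self, *args, **options):
        aliases = list(settings.DATABASES.keys())
        aliases.remove("default")
        if options.get("target"):
            # Rollback: satellites first, 'default' last (reverse of apply)
            aliases.append("default")
        else:
            # Apply: 'default' first (control plane + Domain table)
            aliases.insert(0, "default")

        with advisory_lock("pulp_migration_orchestrator"):
            for alias in aliases:
                self._migrate_one(alias, options)

    def _migrate_one(self, alias, options):
        args = ["migrate", "--database", alias, "--noinput"]
        if options.get("target"):
            args.extend(options["target"])
        try:
            call_command(*args)
            MigrationStatus.objects.update_or_create(
                database_alias=alias,
                defaults={"status": "complete", "completed_at": now()})
        except Exception as e:
            # Record the failure, then re-raise so the job exits non-zero.
            MigrationStatus.objects.update_or_create(
                database_alias=alias,
                defaults={"status": "failed", "error": str(e)})
            raise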

Critical: default migrates first. This ensures the Domain table and control-plane schema are up to date before satellites are migrated. After default is migrated, the Domain replication sync runs to populate Domain rows on satellites. Then satellite migrations proceed (they may reference Domain PKs in FK defaults via get_domain_pk()).
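
The ordering rule, together with the reverse order used for rollback under Problem 4, can be expressed independently of Django. This is a sketch under stated assumptions: `ordered_aliases` and `migrate_all` are hypothetical names, `migrate_one` is an injected callable standing in for `call_command("migrate", ...)`, and "default" is assumed to be among the aliases.

```python
def ordered_aliases(aliases, rollback=False):
    """'default' first when applying, last when rolling back."""
    satellites = sorted(a for a in aliases if a != "default")
    if rollback:
        return satellites + ["default"]
    return ["default"] + satellites


def migrate_all(aliases, migrate_one, rollback=False):
    """Run migrate_one(alias) in order and stop at the first failure.

    Returns a dict of alias -> 'complete' | 'failed' | 'pending'.
    """
    status = {alias: "pending" for alias in aliases}
    for alias in ordered_aliases(aliases, rollback):
        try:
            migrate_one(alias)
        except Exception:
            status[alias] = "failed"
            break  # later aliases stay 'pending' for the next run
        status[alias] = "complete"
    return status
```

Stopping at the first failure leaves the remaining aliases untouched, so a re-run after the problem is fixed picks up where the previous run stopped.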

get_domain_pk() Bootstrap Fix

The existing get_domain_pk() function uses raw SQL against connection.cursor(), which during migrate --database=data_1 queries data_1. If the Domain table on data_1 is empty, the default domain PK lookup fails.

Fix: Modify get_domain_pk() to explicitly query default when called during migration:

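from django.db import connections


# _inside_migration() is a proposed helper, not an existing pulpcore function.
def get_domain_pk():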
    if _inside_migration():
        # During migration, always read from control DB
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT pulp_id FROM core_domain WHERE name = 'default'")
            ...
    else:
        # Normal runtime path (unchanged)
        ...

post_migrate Hook Fix

Pulpcore's post_migrate hooks populate AccessPolicy and Role objects. These are control-plane models that must only be written to default.

Fix: Guard the hooks:

def _populate_access_policies(sender, **kwargs):
    db_alias = kwargs.get("using", "default")
    if db_alias != "default":
        return  # Only populate on control DB
    # ... existing logic ...

Problem 2: Failure Handling

| Scenario | What happens | Recovery |
| --- | --- | --- |
| Satellite RDS unreachable during migration | migrate-all fails at connection for that alias. default and prior satellites already migrated. | Fix connectivity, re-run migrate-all. Already-migrated DBs are no-ops. |
| Satellite fails mid-transactional-migration | PostgreSQL rolls back the transaction. django_migrations not updated. | Re-run. Migration retries cleanly. |
| Satellite fails mid-non-transactional migration (AddIndexConcurrently) | Partial state: a failed CREATE INDEX CONCURRENTLY can leave an INVALID index behind. django_migrations not updated. | Drop the invalid index (or REINDEX it), then re-run. A plain re-run is not sufficient: the leftover index either blocks the retry or, with IF NOT EXISTS, is silently kept in its invalid state. |
| Original RDS fails | Lock acquisition fails. Nothing migrates. | Fix original RDS, re-run. |
| Pod crashes mid-migrate-all | Advisory lock released on disconnect. Some DBs migrated, some not. | New pod re-runs. Idempotent. |

Schema version skew tolerance: Because Pulp already requires code to be backward-compatible with the previous schema version (enforced by RequireVersion), a state where default is at migration N and a satellite is at N-1 is safe. The code running against the satellite simply doesn't use the new column/index yet.

Non-transactional migration safety rule: All RunPython operations in data migrations must be idempotent. Enforced via code review and a CI check.


Problem 3: Startup Gating

Current flow (single DB)

init container: wait_on_postgres.py -> wait_on_database_migrations.sh
then: start API/content/worker

Multi-DB flow

sequenceDiagram
  participant MigrationJob as Migration Job
  participant OrigRDS as Original RDS
  participant Sat1 as Satellite 1
  participant Sat2 as Satellite 2
  participant Init as Init Container
  participant Pod as API / Worker / Content

  MigrationJob->>OrigRDS: migrate-all (default first)
  MigrationJob->>Sat1: migrate-all (satellites after)
  MigrationJob->>Sat2: migrate-all

  Init->>OrigRDS: wait for connectivity + check showmigrations
  Init->>Sat1: wait for connectivity + check showmigrations
  Init->>Sat2: wait for connectivity + check showmigrations
  Note over Init: All RDS instances migrated?
  Init->>Pod: Start

New startup gate script: wait_on_all_databases.py

  • Reads DATABASES from settings
  • For each alias: connect (with retry + backoff), run showmigrations --database=<alias>, verify no unapplied migrations
  • Exit 0 only when ALL instances are ready
  • Configurable per-instance timeout (satellite being provisioned may take longer)
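
The gate logic in the bullets above might be sketched as follows. Everything here is illustrative: `wait_for_databases` is a hypothetical name, and `is_ready` is an injected probe that, in the real script, would connect to the alias and confirm that no migrations are unapplied.

```python
import time


def wait_for_databases(aliases, is_ready, timeouts=None, default_timeout=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready(alias) for every alias, with backoff and a per-alias timeout.

    Returns the aliases that were still not ready at their deadline; an empty
    list means the gate may open (exit code 0).
    """
    timeouts = timeouts or {}
    not_ready = []
    for alias in aliases:
        deadline = clock() + timeouts.get(alias, default_timeout)
        delay = 0.5
        while not is_ready(alias):
            if clock() >= deadline:
                not_ready.append(alias)
                break
            sleep(delay)
            delay = min(delay * 2, 10.0)  # exponential backoff, capped
    return not_ready
```

A caller would exit non-zero when the returned list is non-empty, for example `sys.exit(1 if wait_for_databases(...) else 0)`. Passing `timeouts={"data_2": 600}` gives a satellite that is still being provisioned more time than the others.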

Updated /status/ endpoint:

  • Adds databases array to response:
{
  "database_connection": {"connected": true},
  "databases": [
    {"alias": "default", "connected": true, "migrations_complete": true},
    {"alias": "data_1", "connected": true, "migrations_complete": true},
    {"alias": "data_2", "connected": false, "migrations_complete": null}
  ]
}
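
A response of this shape could be assembled from per-alias probe results along these lines. This is a sketch only: `build_database_status` is a hypothetical name, and it assumes that the existing `database_connection` key keeps reflecting the `default` database.

```python
def build_database_status(probes):
    """Build the /status/ fragment from probe results.

    probes: alias -> (connected, migrations_complete).
    """
    databases = [
        {
            "alias": alias,
            "connected": connected,
            # Migration state is unknown when the instance is unreachable.
            "migrations_complete": migrated if connected else None,
        }
        for alias, (connected, migrated) in probes.items()
    ]
    return {
        # Assumption: the legacy key reports on 'default' only.
        "database_connection": {"connected": probes["default"][0]},
        "databases": databases,
    }
```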

Graceful degradation (post-startup):

  • If a satellite goes offline, requests for domains on that satellite return 503 Service Unavailable with a clear error: "Database for domain 'X' is currently unavailable"
  • Tasks for affected domains remain in waiting state
  • /status/ reports the degraded satellite
  • All other domains (on the original RDS and healthy satellites) continue normally
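  • A try/except OperationalError wrapper translates connection failures to the 503. Because the router only selects an alias and never executes a query, this wrapper belongs where queries actually run (middleware or the API exception handler), not in the router itself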
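
The translation from a connection failure to a 503 might be sketched as below. To keep the example runnable without Django, `DatabaseUnavailable` is a hypothetical stand-in for `django.db.OperationalError`, and `respond` is an illustrative wrapper name rather than an existing Pulp function.

```python
class DatabaseUnavailable(Exception):
    """Stand-in for django.db.OperationalError so the sketch runs without Django."""


def respond(handler, domain_name):
    """Run handler(); map a connection failure to a 503 status and payload."""
    try:
        return 200, handler()
    except DatabaseUnavailable:
        detail = f"Database for domain '{domain_name}' is currently unavailable"
        return 503, {"detail": detail}
```

Only the connection-failure exception is caught, so programming errors and other database errors still surface as they do today.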

Problem 4: Rollback

Schema migration rollback

# Roll back all instances to migration core 0151
pulpcore-manager migrate-all --target core 0151

Runs migrate core 0151 --database=<alias> for each alias. Order: satellites first (reverse of apply), then default last.

Most built-in Django schema operations are reversible automatically. RunPython data migrations must define reverse_code, and RunSQL operations must define reverse_sql (enforced via lint).

Domain move rollback

This is the more important rollback scenario. If a domain was moved to a satellite and something goes wrong:

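  1. Before old data cleanup: Flip Domain.database_alias back to "default". Immediate rollback, with zero data loss provided no writes have landed on the satellite since cutover; any such writes must be copied back first or they are lost. The copy on the satellite is then orphaned but harmless.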
  2. After old data cleanup: Reverse the move -- copy data from satellite back to original, flip alias. Same process as the original move, just in the opposite direction.

Design rule: old data on the original RDS is NOT deleted until the move is verified AND an explicit cleanup command is run. This provides a rollback window of arbitrary length.


Domain Movement Procedure (Phase 2 -- Critical Path)

This is the hardest operational problem. Moving a domain with millions of content units between RDS instances.

Two movement strategies are available. Choose based on how much downtime is acceptable for the domain being moved.

Strategy A: Read-Only Cutover (simpler, longer downtime)

The domain is set to read-only for the entire duration of the data copy. Simpler to implement but the domain is unavailable for writes during the full copy, which may take hours for large domains.

stateDiagram-v2
  [*] --> Preparation
  Preparation --> ReadOnlyMode: set domain.moving=true
  ReadOnlyMode --> DataCopy: reject writes for this domain
  DataCopy --> Verification: copy rows filtered by pulp_domain_id
  Verification --> Cutover: row counts + checksums match
  Cutover --> Monitoring: update database_alias, clear moving flag
  Monitoring --> Cleanup: verify in production for N days
  Cleanup --> [*]: delete old rows from original RDS

Step 1 -- Preparation:

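  • Estimate domain size: SELECT COUNT(*) FROM <table> WHERE pulp_domain_id = <pk> for each table. pg_total_relation_size('<table>') reports the size of the whole table, so scale it by the domain's share of rows to approximate the domain's footprint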
  • Verify satellite RDS has sufficient storage and is fully migrated
  • Verify no in-flight tasks for the domain

Step 2 -- Read-only mode:

  • Set Domain.moving = True (new boolean field)
  • Middleware rejects write operations (POST/PUT/PATCH/DELETE) for the domain with 409 Conflict: "Domain is being migrated"
  • In-flight tasks for the domain are allowed to complete but no new tasks are dispatched
  • Content serving (GET) continues from the original RDS
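
The write-rejection rule is small enough to sketch directly. The function name `check_request` is hypothetical; in Pulp this check would live in the middleware and read `Domain.moving` from the resolved domain.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def check_request(method, domain_moving):
    """Return (status, detail) if the request must be rejected, else None."""
    if domain_moving and method.upper() in WRITE_METHODS:
        return 409, "Domain is being migrated"
    return None  # reads, and all requests for non-moving domains, proceed
```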

Step 3 -- Data copy:

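  • Option A: Per-table COPY (SELECT * FROM <table> WHERE pulp_domain_id = <pk>) TO STDOUT on the original, loaded with COPY <table> FROM STDIN on the satellite (for example via psql's \copy). Fastest for large datasets. Note that pg_dump has no row-level filter (--table selects whole tables only), so a plain pg_dump/pg_restore cannot extract a single domain.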
  • Option B: Application-level Django dumpdata with domain filtering + loaddata --database=<satellite>. Slower but portable.
  • Option C: Direct INSERT INTO satellite.table SELECT * FROM original.table WHERE pulp_domain_id = <pk> via dblink or postgres_fdw. Requires network access between RDS instances.

Step 4 -- Verification:

  • Compare row counts per table between original and satellite for the domain
  • Compare checksums (e.g., md5(string_agg(pulp_id::text, ',' ORDER BY pulp_id))) per table; md5() takes text, so the aggregated keys must be cast before hashing
  • Verify FK integrity on the satellite
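
The count-plus-checksum comparison can be illustrated on in-memory key lists. This is a sketch, not the verification tooling: `table_fingerprint` and `verify_copy` are hypothetical names, the inputs stand in for per-table PK queries against each database, and SHA-256 is swapped in for the MD5 mentioned above so the example also runs on FIPS-restricted Python builds.

```python
import hashlib


def table_fingerprint(pks):
    """Row count plus a digest over the sorted primary keys."""
    ordered = sorted(str(pk) for pk in pks)
    digest = hashlib.sha256(",".join(ordered).encode()).hexdigest()
    return len(ordered), digest


def verify_copy(original, satellite):
    """Return the tables whose fingerprints differ.

    original / satellite: table name -> iterable of PKs for the domain.
    """
    return sorted(
        table for table in original
        if table_fingerprint(original[table])
        != table_fingerprint(satellite.get(table, ()))
    )
```

An empty result means every table matched. A fingerprint over primary keys detects missing or extra rows but not changed column values, so a stricter check would hash full rows.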

Step 5 -- Cutover:

  • Update Domain.database_alias to the satellite alias
  • Replicate the Domain change to all satellites
  • Clear Domain.moving flag
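  • Invalidate the domain cache in every running process (if the cache is truly per-process, an invalidation triggered by the Domain save only reaches the process that performed the save)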
  • All new queries for this domain now route to the satellite

Step 6 -- Monitoring:

  • Observe for N days (configurable, default 7)
  • Verify no errors for the moved domain
  • Verify performance is acceptable on the satellite

Step 7 -- Cleanup:

  • Explicit command: pulpcore-manager cleanup-moved-domain <domain-name>
  • Deletes rows from the original RDS where pulp_domain_id = <pk>
  • Until this runs, rollback is instant (flip alias back)

Strategy B: Incremental Sync with Final Blocking Cutover (minimal downtime)

The domain remains fully operational during the bulk data copy. Only a brief blocking window is needed at the end to sync the last delta and cut over. This significantly reduces downtime at the cost of more complex tooling.

stateDiagram-v2
  [*] --> Preparation
  Preparation --> BulkSync: domain stays fully operational
  BulkSync --> DeltaSync: repeat until delta is small
  DeltaSync --> BlockingCutover: set domain.moving=true
  BlockingCutover --> FinalSync: sync remaining delta
  FinalSync --> Verification: row counts + checksums match
  Verification --> Cutover: update database_alias, clear moving flag
  Cutover --> Monitoring: verify in production for N days
  Monitoring --> Cleanup: delete old rows from original RDS
  Cleanup --> [*]

Step 1 -- Preparation:

  • Same as Strategy A: estimate size, verify satellite, verify migrations

Step 2 -- Bulk sync (non-blocking):

  • Copy all existing domain data to the satellite while the domain is fully operational
  • Users continue reading and writing normally against the original RDS
  • Track the sync point: record a high-watermark timestamp (sync_started_at) or use pulp_created/pulp_last_updated to identify what was copied
  • This is the longest step (hours for large domains) but has zero user impact

Step 3 -- Delta sync (non-blocking, repeatable):

  • After the bulk sync completes, sync only rows created or modified since the bulk sync started
  • Use pulp_last_updated > sync_started_at to identify the delta
  • The domain is still fully operational -- new writes continue landing on the original RDS
  • Each delta sync is smaller and faster than the previous one
  • Repeat until the delta is small enough that the final blocking sync will be fast (target: seconds to low minutes)
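
The repeat-until-small loop can be sketched on in-memory rows. The names `delta_sync` and `sync_until_small` are hypothetical, dictionaries stand in for the two databases, and the comparison is deliberately inclusive (`>=` rather than `>`) on the assumption that re-copying a row is an idempotent upsert, so a row stamped exactly at the watermark is copied again rather than skipped.

```python
def delta_sync(source, target, since):
    """Copy rows whose pulp_last_updated is at or after `since`.

    source / target: dict of pk -> (pulp_last_updated, payload).
    Returns the number of rows copied.
    """
    copied = 0
    for pk, (updated, payload) in source.items():
        if updated >= since:
            target[pk] = (updated, payload)  # upsert; safe to repeat
            copied += 1
    return copied


def sync_until_small(source, target, clock, threshold, max_passes=10):
    """Repeat delta passes until one copies at most `threshold` rows.

    Each pass uses the start time of the previous pass as its watermark, so
    rows written while a pass was running are picked up by the next one.
    Returns the watermark to use for the final blocking sync.
    """
    watermark = float("-inf")  # the first pass is the bulk copy
    for _ in range(max_passes):
        pass_started = clock()
        copied = delta_sync(source, target, watermark)
        watermark = pass_started
        if copied <= threshold:
            break
    return watermark
```

The returned watermark is what the blocking cutover would pass to one last `delta_sync` call after writes have been stopped.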

Step 4 -- Blocking cutover (brief downtime):

  • Set Domain.moving = True -- middleware rejects writes, no new tasks dispatched
  • Wait for in-flight tasks to complete (or timeout)
  • Run one final delta sync to copy any writes that landed between the last non-blocking sync and the block
  • This window should be very short (seconds to minutes) since the delta is small

Step 5 -- Verification:

  • Same as Strategy A: row counts, checksums, FK integrity
  • Additionally verify that no rows on the original have pulp_last_updated after the final sync point

Step 6 -- Cutover:

  • Update Domain.database_alias to the satellite alias
  • Replicate the Domain change to all satellites
  • Clear Domain.moving flag
  • All new queries route to the satellite

Steps 7-8 -- Monitoring and Cleanup:

  • Same as Strategy A

Handling deletes during incremental sync

Rows deleted on the original RDS between sync passes will be missed by the pulp_last_updated delta query. Two approaches:

  • Soft-delete tracking: Log deletes for the domain during the sync window (e.g., a DomainMoveDeleteLog table recording deleted PKs per table). Replay deletes on the satellite during each delta sync.
  • Full reconciliation on final sync: During the blocking cutover, do a full PK comparison between original and satellite for the domain and remove any PKs on the satellite that no longer exist on the original. Acceptable because the blocking window already exists and, after multiple delta syncs, the two copies differ by only a small number of rows; the comparison itself still scans every PK for the domain, so its cost should be measured for large domains.
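
The full-reconciliation approach reduces to a set difference per table. A minimal sketch, with hypothetical function names and in-memory collections standing in for PK queries against each database:

```python
def stale_satellite_pks(original_pks, satellite_pks):
    """PKs that exist on the satellite but were deleted on the original."""
    return set(satellite_pks) - set(original_pks)


def apply_deletes(satellite_rows, stale):
    """Remove stale rows from the satellite copy (dict of pk -> row)."""
    for pk in stale:
        satellite_rows.pop(pk, None)
    return satellite_rows
```

For example, if the original holds PKs `{1, 3}` and the satellite holds `{1, 2, 3}`, the stale set is `{2}` and the satellite ends up with rows 1 and 3.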

Strategy comparison

| Aspect | Strategy A (Read-Only) | Strategy B (Incremental Sync) |
| --- | --- | --- |
| Downtime for writes | Entire copy duration (hours) | Final sync only (seconds to minutes) |
| Implementation complexity | Low | High (delta tracking, delete handling) |
| Risk of data inconsistency | None (domain is frozen) | Low (final blocking sync + verification) |
| Suitable for | Small/medium domains, maintenance windows | Large domains, no tolerance for extended downtime |
| Rollback | Flip alias back | Flip alias back (same) |

Cross-Database Query Handling

No distributed transactions -- accepted trade-off

A task that writes to both planes (e.g., CreatedResource on control DB + Repository on satellite) has no atomicity guarantee. If the satellite write succeeds but the control write fails, data is orphaned.

Mitigation:

  • CreatedResource is advisory (used for task result reporting, not data integrity). An orphaned data-plane object without a CreatedResource entry is benign.
  • A periodic reconciliation task detects and cleans up orphaned references.
  • This is the same class of problem as existing crash-during-task scenarios, which Pulp already handles via orphan cleanup.

Admin cross-domain queries

For admin operations that need to query across all domains (capacity planning, usage reports):

  • Iterate over unique database_alias values from Domain
  • For each alias, run the query with .using(alias)
  • Merge results in Python
  • This is acceptable for admin tooling; it is not needed for the normal API path
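
The iterate-and-merge pattern might look like this. It is a sketch under assumptions: `count_across_databases` is a hypothetical name, and `count_for_alias` is an injected callable standing in for a per-alias query such as a grouped count run with `.using(alias)`.

```python
from collections import Counter


def count_across_databases(domain_aliases, count_for_alias):
    """Merge per-domain counts from every database that hosts a domain.

    domain_aliases: domain name -> database alias.
    count_for_alias(alias) -> {domain name: count} for that database.
    """
    totals = Counter()
    # Query each database once, regardless of how many domains it hosts.
    for alias in sorted(set(domain_aliases.values())):
        totals.update(count_for_alias(alias))
    return dict(totals)
```

Deduplicating the aliases first means the original RDS is queried once even though it hosts hundreds of domains.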

Management commands

Commands that iterate over domains (orphan cleanup, data repair) must be updated:

for domain in Domain.objects.all():
    with with_domain(domain):
        qs = Content.objects.using(domain.database_alias).filter(pulp_domain=domain)
        # ... process ...

This is a Phase 1 audit item -- identify all such commands and add explicit routing.


Upstream Pulpcore Changes Required

New files

  • pulpcore/app/db_router.py -- PulpDomainRouter
  • pulpcore/app/management/commands/migrate-all.py -- multi-DB migration orchestration (Django derives the command name from the file name, so the file is hyphenated to match migrate-all)
  • pulpcore/app/management/commands/move-domain.py -- domain movement tooling (Phase 2)
  • pulpcore/app/management/commands/cleanup-moved-domain.py -- post-move cleanup (Phase 2)
  • pulpcore/app/models/migration_status.py -- MigrationStatus model

Modified files

  • pulpcore/app/models/domain.py -- add database_alias and moving fields; add post_save signal for cross-DB replication
  • pulpcore/app/settings.py -- conditional DATABASE_ROUTERS when len(DATABASES) > 1
  • pulpcore/app/util.py -- fix get_domain_pk() to use default during migrations
  • pulpcore/app/apps.py -- guard post_migrate hooks to only run on default
  • pulpcore/app/views/status.py -- add per-database health to /status/
  • pulpcore/middleware.py -- reject writes for domains with moving=True
  • pulpcore/tasking/tasks.py -- skip dispatch for domains with moving=True

Plugin impact

  • No plugin code changes required for Phase 1 -- routing is transparent via the Django router + ContextVar
  • Plugin management commands that iterate over domains need .using() calls (Phase 1 audit)
  • Plugin post_migrate hooks should be verified for using kwarg awareness

Phased Rollout

Phase 0: Prerequisites (weeks 1-3)

No multi-DB yet. Prepare the codebase.

  • Audit raw SQL: Find all connection.cursor(), RawSQL(), .extra() in pulpcore and plugins. Catalog which need explicit connections[alias] handling.
  • Audit management commands: Identify all commands that iterate over domains or query data-plane models without domain context.
  • Add database_alias field to Domain: Default "default", no behavioral change. Migration is a simple AddField.
  • Add moving field to Domain: Default False. No behavioral change.
  • Fix get_domain_pk(): Make it migration-safe (always query default during migrations).
  • Guard post_migrate hooks: Add if kwargs.get("using") != "default": return guard.
  • Verify RunPython idempotency: Audit all data migrations in core and plugins.
  • Connection limits: Verify RDS instance connection limits can handle the expected pod count. Each Pulp process opens a connection to every configured RDS instance.
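
The connection-limit check in the last item is simple arithmetic. The sketch below uses hypothetical function names and illustrative numbers; it models the worst case stated above, in which every process holds a connection to every instance, so each instance sees the same count no matter how many satellites exist.

```python
def connections_per_instance(pods, processes_per_pod, connections_per_process=1):
    """Worst case: every process holds a connection to this instance."""
    return pods * processes_per_pod * connections_per_process


def fits(limit, pods, processes_per_pod, headroom_pct=20):
    """True if the worst case stays within the limit minus a safety margin."""
    needed = connections_per_instance(pods, processes_per_pod)
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    return needed * 100 <= limit * (100 - headroom_pct)


# Illustrative numbers only: 40 pods with 8 processes each.
print(connections_per_instance(40, 8))
print(fits(limit=500, pods=40, processes_per_pod=8))
```

With these inputs each instance would need to accept 320 connections, which fits under a limit of 500 with 20% headroom but not under a limit of 350.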

Phase 1: Routing Layer (weeks 4-8)

Multi-DB infrastructure. No domains move yet.

  • Implement PulpDomainRouter with control-plane/data-plane classification.
  • Implement migrate-all command with MigrationStatus tracking.
  • Implement Domain table replication via post_save/post_delete signals, with retry logic and a sync-domains management command for manual reconciliation.
  • Update /status/ endpoint with per-database health.
  • Create wait_on_all_databases.py startup gate script.
  • Implement graceful degradation: 503 for unreachable satellites.
  • Integration tests: Configure two database aliases pointing at separate PostgreSQL databases. Verify: migration runs on both, routing works, queries hit the correct DB, graceful degradation on disconnect.
  • Fix management commands identified in Phase 0 audit.
  • Write upstream pulpcore RFC for community review.

Phase 2: Domain Movement (weeks 9-14)

The operational capability to actually move domains.

  • Implement move-domain command with the full procedure: read-only mode, data copy, verification, cutover, monitoring window.
  • Implement cleanup-moved-domain command for post-move deletion.
  • Implement domain size estimation tooling.
  • Choose and implement data copy strategy (filtered COPY export vs. application-level vs. dblink/postgres_fdw).
  • End-to-end testing: Move a test domain with realistic data volume. Verify content serving, task execution, API operations, and rollback.
  • Performance benchmarking: Measure move time for domains of various sizes (1K, 100K, 1M, 10M content units).

Phase 3: Production Hardening (weeks 15-20)

  • Per-database monitoring dashboards: query latency, connection pool utilization, replication lag (Domain sync).
  • Alerting: satellite unreachable, migration status mismatch, Domain replication failure, orphaned cross-plane references.
  • Reconciliation task: periodic check for orphaned data-plane objects and stale cross-plane references.
  • Load testing: simulate production traffic patterns with domains distributed across satellites.
  • Operator runbook: how to provision a new satellite, move a domain, handle satellite failure, emergency rollback.
  • Upstream contribution: submit patches to pulpcore based on RFC feedback.

Risks and Mitigations

  • Router context gaps (management commands, cross-domain operations): Phase 0 audit identifies all cases. Phase 1 adds explicit .using() calls. Safe default is "default" (original RDS).
  • Domain replication signal failure: post_save signal includes retry with exponential backoff. sync-domains command for manual reconciliation. Periodic health check in /status/.
  • Connection count with 6+ satellites: Each Pulp process opens a connection to every RDS instance. With many satellites, verify RDS connection limits are sufficient and monitor per-DB connection metrics. Satellite instances serve few domains so their connection load is light.
  • No distributed transactions: Accepted. Cross-plane writes are advisory (CreatedResource). Reconciliation task handles orphans. Same failure class as crash-during-task (already handled).
  • Large domain move duration: For domains with 10M+ content units, the copy phase may take hours. Under Strategy A the domain is read-only during this time. Mitigation: schedule moves during low-traffic windows, or use Strategy B (incremental sync) so that only the final delta sync blocks writes.
  • Upstream community pushback: Design is additive (no change for single-DB deployments). Present as opt-in feature gated on len(DATABASES) > 1. Prepare alternative approaches (per-plugin splitting, read replicas) for RFC discussion.

Open Questions

  1. Data copy strategy: filtered COPY export/import per table (pg_dump has no row-level --where filter), application-level dump/load (portable but slow), or postgres_fdw (requires cross-instance network). Depends on RDS configuration and data volumes. Needs benchmarking in Phase 2.
  2. Should satellite RDS instances share the same pg_notify channel? No -- task coordination stays entirely on the original RDS. Workers only listen to default for task wakeups. Task execution routes data-plane queries to the satellite transparently.
  3. What about read replicas? This design is orthogonal to read replicas. A satellite RDS instance could itself have read replicas for read-heavy domains. Out of scope for this design.
  4. Content deduplication across domains on different satellites: Currently, Pulp deduplicates Content within a domain (same DB). Content shared between domains on different satellites will be duplicated. This is acceptable -- deduplication across satellites would require cross-DB queries.
