Suggestion for: Domain-Aware Database Routing for Pulpcore #7819

# Domain-Aware Database Routing for Pulpcore

## Problem Statement

A single RDS instance hosts hundreds of Pulp domains. A handful of hot domains (heavy sync, large content libraries) are bottlenecking the shared instance. The goal is to selectively offload specific domains to dedicated satellite RDS instances while the majority remain on the original.

This requires solving four design problems: (1) migration orchestration against N RDS instances, (2) failure handling when one instance fails mid-migration, (3) startup gating so pods verify all instances are ready, and (4) rollback of partial multi-DB migrations and domain moves.

---

## Current State

- `DOMAIN_ENABLED = True`, hundreds of domains on one RDS
- Single `DATABASES["default"]` entry pointing to that RDS
- No `DATABASE_ROUTERS`, no `.using()` calls, no multi-DB code
- Workers coordinate via `pg_notify` and advisory locks on the single instance
- All pods (API, content, worker) connect to one database

---

## Target Architecture

```mermaid
flowchart TB
  subgraph pods [Pulp Pods]
    API["API pods (all connect to all RDS)"]
    Content["Content pods (all connect to all RDS)"]
    Worker["Worker pods (all connect to all RDS)"]
  end

  subgraph originalRDS ["Original RDS (default)"]
    ControlPlane["Control plane: Task, AppStatus, Role,\nAccessPolicy, ProgressReport, SystemID,\nDjango auth/contenttypes/sessions"]
    DefaultDomains["Data plane: ~hundreds of domains\n(repos, content, artifacts, distributions...)"]
    DomainTable["Domain table (authoritative)\nwith database_alias field"]
  end

  subgraph sat1 ["Satellite RDS 1"]
    Sat1Schema["Full schema (identical)"]
    Sat1Data["Data plane: 1-3 hot domains"]
    Sat1Domain["Domain table (replica)"]
  end

  subgraph sat2 ["Satellite RDS 2"]
    Sat2Schema["Full schema (identical)"]
    Sat2Data["Data plane: 1-3 hot domains"]
    Sat2Domain["Domain table (replica)"]
  end

  pods --> originalRDS
  pods --> sat1
  pods --> sat2
```



**Key properties:**

- The original RDS is BOTH control plane AND default data plane. It does not change structurally.
- Satellite RDS instances start empty, receive the full schema, then receive moved domain data.
- Every pod connects directly to every RDS instance. 95%+ of queries still hit the original.
- A domain starts on `default` and can be moved to a satellite. Moving it back is equally valid.

### Model Classification

**Control plane** -- always on the original RDS (`default`), never routed:

- `Domain` (authoritative copy; replicated read-only to satellites)
- `Task`, `TaskGroup`, `TaskSchedule`, `CreatedResource`, `ProfileArtifact`
- `AppStatus`, `SystemID`
- `AccessPolicy`, `Role`, `UserRole`, `GroupRole`
- `ProgressReport`, `GroupProgressReport`
- Django built-ins (`auth_*`, `django_content_type`, `django_migrations`, `django_admin_log`, `django_session`)

**Data plane** -- routed by `Domain.database_alias`:

- `Repository`, `RepositoryVersion`, `RepositoryContent`, `RepositoryVersionContentDetails`
- `Content`, `Artifact`, `ContentArtifact`, `RemoteArtifact`, `PulpTemporaryFile`
- `Remote`
- `Publication`, `PublishedArtifact`, `PublishedMetadata`
- `Distribution`, `ContentGuard` (and all subtypes)
- `Upload`, `UploadChunk`
- `Exporter`, `Export`, `Importer`, `Import` and subtypes
- `AlternateContentSource`, `AlternateContentSourcePath`
- `SigningService` and subtypes
- `UpstreamPulp`
- ALL plugin-defined models

### Why Control Plane Stays on the Original RDS

Worker coordination (`pg_notify`, advisory locks, `SELECT ... FOR UPDATE SKIP LOCKED`) requires a single PostgreSQL instance. Tasks, progress reports, and scheduling are tightly coupled to this coordination. Splitting them would require rearchitecting the entire tasking system for no immediate benefit -- the control-plane tables are small and not the bottleneck.

---

## Placement Map

A new field on the `Domain` model:

```python
# pulpcore/app/models/domain.py
class Domain(BaseModel, AutoAddObjPermsMixin):
    # ... existing fields ...
    database_alias = models.SlugField(
        default="default",
        help_text="DATABASES alias where this domain's data-plane objects reside.",
    )
```

- Defaults to `"default"` -- zero behavioral change for all existing domains
- Only changed when a domain is explicitly moved to a satellite via `move-domain` tooling
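- Cached in-process (Django cache framework, invalidated on Domain save); if the cache is process-local, other pods need a short TTL or a cross-process invalidation path to pick up a changed alias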
- Validated against `settings.DATABASES` keys on save
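The save-time validation could look roughly like the following. This is a minimal sketch: `validate_database_alias` and the sample `DATABASES` dict are hypothetical names used for illustration, not existing Pulp code.

```python
def validate_database_alias(alias, databases):
    """Return the alias if it is a configured DATABASES key, else raise ValueError.

    `databases` stands in for django.conf.settings.DATABASES.
    """
    if alias not in databases:
        known = ", ".join(sorted(databases))
        raise ValueError(f"Unknown database alias {alias!r}; configured aliases: {known}")
    return alias


# Hypothetical settings with one satellite configured.
DATABASES = {"default": {}, "data_1": {}}
validate_database_alias("data_1", DATABASES)  # passes silently
```

Rejecting unknown aliases at save time keeps a typo from routing a whole domain to a database that no pod can reach.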

---

## Django DB Router

New file: `pulpcore/app/db_router.py`

```python
from pulpcore.app.util import get_domain


class PulpDomainRouter:
    CONTROL_PLANE_LABELS = {
        "core.domain", "core.task", "core.taskgroup", "core.taskschedule",
        "core.createdresource", "core.appstatus", "core.systemid",
        "core.accesspolicy", "core.role", "core.userrole", "core.grouprole",
        "core.progressreport", "core.groupprogressreport", "core.profileartifact",
    }
    DJANGO_APPS = {"auth", "contenttypes", "admin", "sessions"}

    def _is_control_plane(self, model):
        label = f"{model._meta.app_label}.{model._meta.model_name}"
        return (label in self.CONTROL_PLANE_LABELS
                or model._meta.app_label in self.DJANGO_APPS)

    def _resolve_db(self, model, **hints):
        if self._is_control_plane(model):
            return "default"
        # 1. Check instance hint (most reliable -- Django passes the object being saved)
        instance = hints.get("instance")
        if instance and hasattr(instance, "pulp_domain"):
            domain = instance.pulp_domain
            if domain:
                return getattr(domain, "database_alias", "default")
        # 2. Check ContextVar (set by middleware for HTTP requests, by task runner for tasks)
        domain = get_domain()
        if domain:
            return getattr(domain, "database_alias", "default")
        # 3. Safe default: original RDS
        return "default"

    def db_for_read(self, model, **hints):
        return self._resolve_db(model, **hints)

    def db_for_write(self, model, **hints):
        return self._resolve_db(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return True  # Identical schema everywhere
```

### Known Router Limitations and Mitigations

**Problem:** `db_for_read` cannot see queryset filters. When code does `Repository.objects.filter(pulp_domain=X)`, the router sees `Repository` but not which domain.

**Mitigation:** The router falls back to the ContextVar, which the middleware sets for every HTTP request and `with_task_context()` sets for every task. For the common API/task codepath, this works correctly.
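The resolution order (instance hint, then ContextVar, then `"default"`) can be illustrated without Django. This is a sketch under assumptions: `_current_domain` and `resolve_alias` are hypothetical stand-ins for Pulp's domain context and the router's internal lookup.

```python
from contextvars import ContextVar
from types import SimpleNamespace

# Stand-in for Pulp's per-request / per-task domain context.
_current_domain = ContextVar("current_domain", default=None)


def resolve_alias(instance=None):
    """Pick a database alias: instance hint first, then context, then 'default'."""
    domain = getattr(instance, "pulp_domain", None)
    if domain is None:
        domain = _current_domain.get()
    if domain is None:
        return "default"
    return getattr(domain, "database_alias", "default")


hot = SimpleNamespace(name="hot", database_alias="data_1")
repo = SimpleNamespace(pulp_domain=hot)

no_context = resolve_alias()        # nothing set: falls back to "default"
token = _current_domain.set(hot)    # what middleware or the task runner would do
from_context = resolve_alias()      # resolved through the ContextVar
_current_domain.reset(token)
from_instance = resolve_alias(repo)  # resolved through the instance hint
```

A plain queryset filter never reaches the first branch, which is why the ContextVar has to be set before any data-plane query runs.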

**Remaining gaps requiring explicit `.using()` calls:**

- Management commands that iterate over multiple domains -- must wrap each iteration in `with_domain(d)` context manager AND call `.using(d.database_alias)` on querysets
- Orphan cleanup and data repair tasks that scan all domains -- must iterate per-domain with explicit routing
- Import/export operations crossing domains -- must explicitly `.using()` the source and target

These gaps are enumerated and tracked as Phase 1 work items, not hand-waved away.

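**Safe failure mode:** If the router has no domain context, it returns `"default"` (the original RDS). For reads against a domain that has been moved to a satellite, the query hits the original RDS: after cleanup it returns empty results rather than corrupt data, and during the migration window (before old data is cleaned up) it returns the stale copy. Writes are the exception: a context-free write (for example `bulk_create()` or `QuerySet.update()`, which carry no instance hint) would land on the original RDS instead of the satellite, which is why the gaps listed above need explicit routing.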

---

## Problem 1: Migration Orchestration Against N RDS Instances

### How It Works

All RDS instances (original + satellites) run identical schemas. Django's `migrate` is invoked once per database alias.

New management command: `migrate-all`

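```python
from contextlib import contextmanager

from django.conf import settings
from django.core.management import BaseCommand, call_command
from django.db import connections
from django.utils.timezone import now

# Proposed model (see "New files"); it does not exist in pulpcore today.
from pulpcore.app.models.migration_status import MigrationStatus


@contextmanager
def advisory_lock(name, using="default"):
    """Sketch: hold a session-level PostgreSQL advisory lock for the block."""
    with connections[using].cursor() as cursor:
        cursor.execute("SELECT pg_advisory_lock(hashtext(%s))", [name])
        try:
            yield
        finally:
            cursor.execute("SELECT pg_advisory_unlock(hashtext(%s))", [name])


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("--target", nargs=2, metavar=("APP", "MIGRATION"),
                            help="Target migration (for rollback)")
        parser.add_argument("--parallel", action="store_true",
                            help="Migrate databases in parallel (use with caution)")

    def handle(self, *args, **options):
        # --parallel is accepted but not implemented in this sketch;
        # databases are migrated sequentially.
        satellites = [a for a in settings.DATABASES if a != "default"]
        if options.get("target"):
            # Treated as a rollback: satellites first, 'default' last.
            aliases = satellites[::-1] + ["default"]
        else:
            # Apply: 'default' first (control plane + Domain table).
            aliases = ["default"] + satellites

        with advisory_lock("pulp_migration_orchestrator"):
            for alias in aliases:
                self._migrate_one(alias, options)

    def _migrate_one(self, alias, options):
        args = ["migrate", "--database", alias, "--noinput"]
        if options.get("target"):
            args.extend(options["target"])
        try:
            call_command(*args)
            MigrationStatus.objects.update_or_create(
                database_alias=alias,
                defaults={"status": "complete", "completed_at": now()})
        except Exception as e:
            # Record the failure on the control DB, then abort the run.
            MigrationStatus.objects.update_or_create(
                database_alias=alias,
                defaults={"status": "failed", "error": str(e)})
            raise
```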

**Critical: `default` migrates first.** This ensures the Domain table and control-plane schema are up to date before satellites are migrated. After `default` is migrated, the Domain replication sync runs to populate Domain rows on satellites. Then satellite migrations proceed (they may reference Domain PKs in FK defaults via `get_domain_pk()`).
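The ordering rule can be captured in a small pure function. This is a minimal sketch, assuming alias names like `data_1`; `migration_order` is a hypothetical helper, and its rollback branch follows the satellites-first order this document describes for rollback.

```python
def migration_order(aliases, rollback=False):
    """Order aliases so 'default' is first when applying and last when rolling back."""
    satellites = sorted(a for a in aliases if a != "default")
    if rollback:
        # Exact reverse of the apply order.
        return satellites[::-1] + ["default"]
    return ["default"] + satellites


apply_order = migration_order(["data_2", "default", "data_1"])
rollback_order = migration_order(["data_2", "default", "data_1"], rollback=True)
```

Sorting the satellites makes the order deterministic across pods, so a re-run after a crash visits databases in the same sequence.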

### `get_domain_pk()` Bootstrap Fix

The existing `get_domain_pk()` function uses raw SQL against `connection.cursor()`, which during `migrate --database=data_1` queries `data_1`. If the Domain table on `data_1` is empty, the default domain PK lookup fails.

**Fix:** Modify `get_domain_pk()` to explicitly query `default` when called during migration:

```python
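from django.db import connections


def get_domain_pk():
    # `_inside_migration()` is an assumed helper (not shown) that reports
    # whether the call happens while migrations are being applied.
    if _inside_migration():
        # During migration, always read from control DB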
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT pulp_id FROM core_domain WHERE name = 'default'")
            ...
    else:
        # Normal runtime path (unchanged)
        ...
```

### `post_migrate` Hook Fix

Pulpcore's `post_migrate` hooks populate `AccessPolicy` and `Role` objects. These are control-plane models that must only be written to `default`.

**Fix:** Guard the hooks:

```python
def _populate_access_policies(sender, **kwargs):
    db_alias = kwargs.get("using", "default")
    if db_alias != "default":
        return  # Only populate on control DB
    # ... existing logic ...
```

---

## Problem 2: Failure Handling


| Scenario                                                                 | What happens                                                                                       | Recovery                                                                 |
| ------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Satellite RDS unreachable during migration                               | `migrate-all` fails at connection for that alias. `default` and prior satellites already migrated. | Fix connectivity, re-run `migrate-all`. Already-migrated DBs are no-ops. |
| Satellite fails mid-transactional-migration                              | PostgreSQL rolls back the transaction. `django_migrations` not updated.                            | Re-run. Migration retries cleanly.                                       |
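| Satellite fails mid-non-transactional migration (`AddIndexConcurrently`) | Partial state: an `INVALID` index may be left behind. `django_migrations` not updated. | Drop the leftover index (`DROP INDEX CONCURRENTLY IF EXISTS <name>`), then re-run. A plain re-run can fail because the index name already exists. |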
| Original RDS fails                                                       | Lock acquisition fails. Nothing migrates.                                                          | Fix original RDS, re-run.                                                |
| Pod crashes mid-`migrate-all`                                            | Advisory lock released on disconnect. Some DBs migrated, some not.                                 | New pod re-runs. Idempotent.                                             |


**Schema version skew tolerance:** Because Pulp already requires code to be backward-compatible with the previous schema version (enforced by `RequireVersion`), a state where `default` is at migration N and a satellite is at N-1 is safe. The code running against the satellite simply doesn't use the new column/index yet.

**Non-transactional migration safety rule:** All `RunPython` operations in data migrations must be idempotent. Enforced via code review and a CI check.

---

## Problem 3: Startup Gating

### Current flow (single DB)

```
init container: wait_on_postgres.py -> wait_on_database_migrations.sh
then: start API/content/worker
```

### Multi-DB flow

```mermaid
sequenceDiagram
  participant MigrationJob as Migration Job
  participant OrigRDS as Original RDS
  participant Sat1 as Satellite 1
  participant Sat2 as Satellite 2
  participant Init as Init Container
  participant Pod as API / Worker / Content

  MigrationJob->>OrigRDS: migrate-all (default first)
  MigrationJob->>Sat1: migrate-all (satellites after)
  MigrationJob->>Sat2: migrate-all

  Init->>OrigRDS: wait for connectivity + check showmigrations
  Init->>Sat1: wait for connectivity + check showmigrations
  Init->>Sat2: wait for connectivity + check showmigrations
  Note over Init: All RDS instances migrated?
  Init->>Pod: Start
```



**New startup gate script: `wait_on_all_databases.py`**

- Reads `DATABASES` from settings
- For each alias: connect (with retry + backoff), run `showmigrations --database=<alias>`, verify no unapplied migrations
- Exit 0 only when ALL instances are ready
- Configurable per-instance timeout (satellite being provisioned may take longer)
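The gate's retry loop might be shaped like the sketch below. It is an illustration under assumptions: `wait_for_databases` is a hypothetical name, each check is an injected callable standing in for "connect and confirm no unapplied migrations", and a single shared timeout is used instead of the per-instance timeout described above.

```python
import time


def wait_for_databases(checks, timeout=30.0, initial_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Block until every check passes; return the aliases that never became ready.

    `checks` maps alias -> zero-argument callable returning True when that
    database is reachable and fully migrated. Exceptions count as "not ready".
    """
    pending = dict(checks)
    deadline = clock() + timeout
    delay = initial_delay
    while pending:
        for alias in list(pending):
            try:
                ready = pending[alias]()
            except Exception:
                ready = False
            if ready:
                del pending[alias]
        if not pending or clock() >= deadline:
            break
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return sorted(pending)
```

A script built on this would exit non-zero when the returned list is non-empty, which keeps the init container from releasing the pod.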

**Updated `/status/` endpoint:**

- Adds `databases` array to response:

```json
{
  "database_connection": {"connected": true},
  "databases": [
    {"alias": "default", "connected": true, "migrations_complete": true},
    {"alias": "data_1", "connected": true, "migrations_complete": true},
    {"alias": "data_2", "connected": false, "migrations_complete": null}
  ]
}
```

**Graceful degradation (post-startup):**

- If a satellite goes offline, requests for domains on that satellite return **503 Service Unavailable** with a clear error: `"Database for domain 'X' is currently unavailable"`
- Tasks for affected domains remain in `waiting` state
- `/status/` reports the degraded satellite
- All other domains (on the original RDS and healthy satellites) continue normally
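- Connection failures surface when a query executes, not when the router picks an alias, so the translation to 503 belongs in middleware or a DRF exception handler that catches `OperationalError`, rather than in the router itself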

---

## Problem 4: Rollback

### Schema migration rollback

```bash
# Roll back all instances to migration core 0151
pulpcore-manager migrate-all --target core 0151
```

Runs `migrate core 0151 --database=<alias>` for each alias. Order: satellites first (reverse of apply), then `default` last.

Django's built-in schema operations are reversible. `RunSQL` operations need `reverse_sql`, and `RunPython` data migrations must define `reverse_code` (enforced via lint).

### Domain move rollback

This is the more important rollback scenario. If a domain was moved to a satellite and something goes wrong:

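1. **Before old data cleanup:** Flip `Domain.database_alias` back to `"default"`. Rollback is immediate, and lossless as long as no writes have landed on the satellite since cutover; any such writes must be copied back first. The stale copy on the satellite is orphaned but harmless.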
2. **After old data cleanup:** Reverse the move -- copy data from satellite back to original, flip alias. Same process as the original move, just in the opposite direction.

**Design rule: old data on the original RDS is NOT deleted until the move is verified AND an explicit cleanup command is run.** This provides a rollback window of arbitrary length.

---

## Domain Movement Procedure (Phase 2 -- Critical Path)

This is the hardest operational problem. Moving a domain with millions of content units between RDS instances.

Two movement strategies are available. Choose based on how much downtime is acceptable for the domain being moved.

### Strategy A: Read-Only Cutover (simpler, longer downtime)

The domain is set to read-only for the entire duration of the data copy. Simpler to implement but the domain is unavailable for writes during the full copy, which may take hours for large domains.

```mermaid
stateDiagram-v2
  [*] --> Preparation
  Preparation --> ReadOnlyMode: set domain.moving=true
  ReadOnlyMode --> DataCopy: reject writes for this domain
  DataCopy --> Verification: copy rows filtered by pulp_domain_id
  Verification --> Cutover: row counts + checksums match
  Cutover --> Monitoring: update database_alias, clear moving flag
  Monitoring --> Cleanup: verify in production for N days
  Cleanup --> [*]: delete old rows from original RDS
```

**Step 1 -- Preparation:**

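- Estimate domain size: `SELECT COUNT(*) FROM <table> WHERE pulp_domain_id = <pk>` per table; `pg_total_relation_size('<table>')` reports the whole table, so scale it by the domain's share of rows for a rough byte estimate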
- Verify satellite RDS has sufficient storage and is fully migrated
- Verify no in-flight tasks for the domain

**Step 2 -- Read-only mode:**

- Set `Domain.moving = True` (new boolean field)
- Middleware rejects write operations (POST/PUT/PATCH/DELETE) for the domain with 409 Conflict: `"Domain is being migrated"`
- In-flight tasks for the domain are allowed to complete but no new tasks are dispatched
- Content serving (GET) continues from the original RDS
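The write-rejection rule is simple enough to express as a pure function that middleware could call. This is a sketch; `check_write_allowed` and `WRITE_METHODS` are hypothetical names, and the real middleware would still need to resolve the domain from the request.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def check_write_allowed(method, domain_moving):
    """Return (status, detail) if the request must be rejected, else None."""
    if domain_moving and method.upper() in WRITE_METHODS:
        return 409, "Domain is being migrated"
    return None
```

Reads fall through untouched, so content serving keeps working against the original RDS for the whole copy.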

**Step 3 -- Data copy:**

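- Option A: Per-table `COPY (SELECT * FROM <table> WHERE pulp_domain_id = <pk>) TO STDOUT` on the original, streamed into `COPY <table> FROM STDIN` on the satellite. Fastest for large datasets. Note that `pg_dump` selects whole tables (`--table`) and has no row-level filter, so it cannot extract a single domain on its own. Tables without their own `pulp_domain_id` column would need to be filtered through a join to a parent table.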
- Option B: Application-level Django `dumpdata` with domain filtering + `loaddata --database=<satellite>`. Slower but portable.
- Option C: Direct `INSERT INTO satellite.table SELECT * FROM original.table WHERE pulp_domain_id = <pk>` via `dblink` or `postgres_fdw`. Requires network access between RDS instances.

**Step 4 -- Verification:**

- Compare row counts per table between original and satellite for the domain
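- Compare checksums per table, e.g. `md5(string_agg(pulp_id::text, ',' ORDER BY pulp_id))` (`md5()` takes text, so the keys are concatenated rather than passed as an array); this confirms that the same set of rows exists, not that every column matches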
- Verify FK integrity on the satellite
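The count-plus-checksum comparison can be prototyped in plain Python before committing to SQL. A minimal sketch, assuming the primary keys for each table have already been fetched from both databases; `table_fingerprint` and `mismatched_tables` are hypothetical helpers.

```python
import hashlib


def table_fingerprint(pks):
    """Return (row_count, md5 hex digest) over the sorted primary keys."""
    ordered = sorted(str(pk) for pk in pks)
    digest = hashlib.md5(",".join(ordered).encode("utf-8")).hexdigest()
    return len(ordered), digest


def mismatched_tables(original, satellite):
    """Compare per-table PK collections; return the table names that differ."""
    tables = set(original) | set(satellite)
    return sorted(
        t for t in tables
        if table_fingerprint(original.get(t, ())) != table_fingerprint(satellite.get(t, ()))
    )
```

Like the SQL version, this fingerprints keys only; detecting a row whose columns differ would require hashing the row contents as well.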

**Step 5 -- Cutover:**

- Update `Domain.database_alias` to the satellite alias
- Replicate the Domain change to all satellites
- Clear `Domain.moving` flag
- Invalidate the in-process domain cache
- All new queries for this domain now route to the satellite

**Step 6 -- Monitoring:**

- Observe for N days (configurable, default 7)
- Verify no errors for the moved domain
- Verify performance is acceptable on the satellite

**Step 7 -- Cleanup:**

- Explicit command: `pulpcore-manager cleanup-moved-domain <domain-name>`
- Deletes rows from the original RDS where `pulp_domain_id = <pk>`
- Until this runs, rollback is instant (flip alias back)

### Strategy B: Incremental Sync with Final Blocking Cutover (minimal downtime)

The domain remains fully operational during the bulk data copy. Only a brief blocking window is needed at the end to sync the last delta and cut over. This significantly reduces downtime at the cost of more complex tooling.

```mermaid
stateDiagram-v2
  [*] --> Preparation
  Preparation --> BulkSync: domain stays fully operational
  BulkSync --> DeltaSync: repeat until delta is small
  DeltaSync --> BlockingCutover: set domain.moving=true
  BlockingCutover --> FinalSync: sync remaining delta
  FinalSync --> Verification: row counts + checksums match
  Verification --> Cutover: update database_alias, clear moving flag
  Cutover --> Monitoring: verify in production for N days
  Monitoring --> Cleanup: delete old rows from original RDS
  Cleanup --> [*]
```

**Step 1 -- Preparation:**

- Same as Strategy A: estimate size, verify satellite, verify migrations

**Step 2 -- Bulk sync (non-blocking):**

- Copy all existing domain data to the satellite while the domain is fully operational
- Users continue reading and writing normally against the original RDS
- Track the sync point: record a high-watermark timestamp (`sync_started_at`) or use `pulp_created`/`pulp_last_updated` to identify what was copied
- This is the longest step (hours for large domains) but has zero user impact

**Step 3 -- Delta sync (non-blocking, repeatable):**

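- After the bulk sync completes, sync only rows created or modified since the bulk sync started; each later pass uses the start time of the previous pass as its watermark
- Use `pulp_last_updated > sync_started_at` to identify the delta. This assumes every write bumps `pulp_last_updated`; `QuerySet.update()` bypasses `auto_now`, so code paths that use it need auditing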
- The domain is still fully operational -- new writes continue landing on the original RDS
- Each delta sync is smaller and faster than the previous one
- Repeat until the delta is small enough that the final blocking sync will be fast (target: seconds to low minutes)
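A single delta pass can be modelled with in-memory dictionaries to make the watermark logic concrete. This is a sketch under assumptions: `delta_sync` is a hypothetical helper, each store maps a primary key to a row dict, and deletes are deliberately not handled here.

```python
from datetime import datetime, timedelta


def delta_sync(source, target, since):
    """Copy rows changed after `since` from source to target; return how many.

    Both stores map pk -> row dict with a `pulp_last_updated` datetime.
    The caller records the pass start time and passes it as `since` next time.
    """
    changed = {pk: dict(row) for pk, row in source.items()
               if row["pulp_last_updated"] > since}
    target.update(changed)
    return len(changed)


bulk_started = datetime(2024, 1, 1, 12, 0)
source = {
    1: {"name": "a", "pulp_last_updated": bulk_started - timedelta(hours=1)},
    2: {"name": "b", "pulp_last_updated": bulk_started + timedelta(minutes=5)},
}
target = {1: dict(source[1])}  # the bulk sync already copied row 1
copied = delta_sync(source, target, since=bulk_started)
```

Repeating the call with a later `since` copies progressively fewer rows, which is the signal that the final blocking sync will be short.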

**Step 4 -- Blocking cutover (brief downtime):**

- Set `Domain.moving = True` -- middleware rejects writes, no new tasks dispatched
- Wait for in-flight tasks to complete (or timeout)
- Run one final delta sync to copy any writes that landed between the last non-blocking sync and the block
- This window should be very short (seconds to minutes) since the delta is small

**Step 5 -- Verification:**

- Same as Strategy A: row counts, checksums, FK integrity
- Additionally verify that no rows on the original have `pulp_last_updated` after the final sync point

**Step 6 -- Cutover:**

- Update `Domain.database_alias` to the satellite alias
- Replicate the Domain change to all satellites
- Clear `Domain.moving` flag
- All new queries route to the satellite

**Step 7-8 -- Monitoring and Cleanup:**

- Same as Strategy A

#### Handling deletes during incremental sync

Rows deleted on the original RDS between sync passes will be missed by the `pulp_last_updated` delta query. Two approaches:

- **Soft-delete tracking:** Log deletes for the domain during the sync window (e.g., a `DomainMoveDeleteLog` table recording deleted PKs per table). Replay deletes on the satellite during each delta sync.
- **Full reconciliation on final sync:** During the blocking cutover, do a full PK comparison between original and satellite for the domain and remove any PKs on the satellite that no longer exist on the original. This reuses the blocking window that already exists, and comparing PKs is far cheaper than copying rows. Its cost grows with the domain's total row count rather than with the delta, so it should be benchmarked on the largest domains.
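The full-reconciliation approach reduces to a set difference per table. A minimal sketch, assuming the PK sets fit in memory; `stale_pks` and `reconcile` are hypothetical helpers operating on dictionaries that stand in for the two databases.

```python
def stale_pks(original_pks, satellite_pks):
    """PKs present on the satellite but no longer on the original."""
    return set(satellite_pks) - set(original_pks)


def reconcile(original, satellite):
    """Delete satellite rows whose PK was removed on the original; return the count."""
    doomed = stale_pks(original.keys(), satellite.keys())
    for pk in doomed:
        del satellite[pk]
    return len(doomed)
```

For very large tables the same comparison could be streamed in sorted PK order instead of materializing both sets.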

#### Strategy comparison

| Aspect | Strategy A (Read-Only) | Strategy B (Incremental Sync) |
|--------|----------------------|-------------------------------|
| Downtime for writes | Entire copy duration (hours) | Final sync only (seconds to minutes) |
| Implementation complexity | Low | High (delta tracking, delete handling) |
| Risk of data inconsistency | None (domain is frozen) | Low (final blocking sync + verification) |
| Suitable for | Small/medium domains, maintenance windows | Large domains, no tolerance for extended downtime |
| Rollback | Flip alias back | Flip alias back (same) |

---

## Cross-Database Query Handling

### No distributed transactions -- accepted trade-off

A task that writes to both planes (e.g., `CreatedResource` on control DB + `Repository` on satellite) has no atomicity guarantee. If the satellite write succeeds but the control write fails, data is orphaned.

**Mitigation:**

- `CreatedResource` is advisory (used for task result reporting, not data integrity). An orphaned data-plane object without a `CreatedResource` entry is benign.
- A periodic reconciliation task detects and cleans up orphaned references.
- This is the same class of problem as existing crash-during-task scenarios, which Pulp already handles via orphan cleanup.

### Admin cross-domain queries

For admin operations that need to query across all domains (capacity planning, usage reports):

- Iterate over unique `database_alias` values from Domain
- For each alias, run the query with `.using(alias)`
- Merge results in Python
- This is acceptable for admin tooling; it is not needed for the normal API path
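The fan-out-and-merge pattern for admin queries might look like this. It is a sketch: `count_across_databases` is a hypothetical helper, and `count_for_alias` stands in for a function that runs the real query with `.using(alias)`.

```python
from collections import Counter


def count_across_databases(domains, count_for_alias):
    """Sum per-database counters across every alias that hosts a domain.

    `domains` is an iterable of (domain_name, database_alias) pairs;
    `count_for_alias(alias)` returns a mapping of label -> count for that database.
    """
    totals = Counter()
    # Query each distinct alias once, however many domains it hosts.
    for alias in sorted({alias for _, alias in domains}):
        totals.update(count_for_alias(alias))
    return dict(totals)
```

Deduplicating aliases first matters because most domains share `default`; querying per domain would repeat the same scan hundreds of times.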

### Management commands

Commands that iterate over domains (orphan cleanup, data repair) must be updated:

```python
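from pulpcore.app.contexts import with_domain
from pulpcore.app.models import Content, Domain

# Domain is a control-plane model, so this query always reads from 'default'.
for domain in Domain.objects.all():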
    with with_domain(domain):
        qs = Content.objects.using(domain.database_alias).filter(pulp_domain=domain)
        # ... process ...
```

This is a Phase 1 audit item -- identify all such commands and add explicit routing.

---

## Upstream Pulpcore Changes Required

### New files

- `pulpcore/app/db_router.py` -- `PulpDomainRouter`
- `pulpcore/app/management/commands/migrate_all.py` -- multi-DB migration orchestration
- `pulpcore/app/management/commands/move_domain.py` -- domain movement tooling (Phase 2)
- `pulpcore/app/management/commands/cleanup_moved_domain.py` -- post-move cleanup (Phase 2)
- `pulpcore/app/models/migration_status.py` -- `MigrationStatus` model

### Modified files

- `pulpcore/app/models/domain.py` -- add `database_alias` and `moving` fields; add `post_save` signal for cross-DB replication
- `pulpcore/app/settings.py` -- conditional `DATABASE_ROUTERS` when `len(DATABASES) > 1`
- `pulpcore/app/util.py` -- fix `get_domain_pk()` to use `default` during migrations
- `pulpcore/app/apps.py` -- guard `post_migrate` hooks to only run on `default`
- `pulpcore/app/views/status.py` -- add per-database health to `/status/`
- `pulpcore/middleware.py` -- reject writes for domains with `moving=True`
- `pulpcore/tasking/tasks.py` -- skip dispatch for domains with `moving=True`

### Plugin impact

- **No plugin code changes required for Phase 1** -- routing is transparent via the Django router + ContextVar
- Plugin management commands that iterate over domains need `.using()` calls (Phase 1 audit)
- Plugin `post_migrate` hooks should be verified for `using` kwarg awareness

---

## Phased Rollout

### Phase 0: Prerequisites (weeks 1-3)

No multi-DB yet. Prepare the codebase.

- **Audit raw SQL:** Find all `connection.cursor()`, `RawSQL()`, `.extra()` in pulpcore and plugins. Catalog which need explicit `connections[alias]` handling.
- **Audit management commands:** Identify all commands that iterate over domains or query data-plane models without domain context.
- **Add `database_alias` field to Domain:** Default `"default"`, no behavioral change. Migration is a simple `AddField`.
- **Add `moving` field to Domain:** Default `False`. No behavioral change.
- **Fix `get_domain_pk()`:** Make it migration-safe (always query `default` during migrations).
- **Guard `post_migrate` hooks:** Add `if kwargs.get("using", "default") != "default": return` guard.
- **Verify `RunPython` idempotency:** Audit all data migrations in core and plugins.
- **Connection limits:** Verify RDS instance connection limits can handle the expected pod count. Django opens connections lazily, but in the worst case each Pulp process (and each thread within it) holds a connection to every configured RDS instance.

### Phase 1: Routing Layer (weeks 4-8)

Multi-DB infrastructure. No domains move yet.

- **Implement `PulpDomainRouter`** with control-plane/data-plane classification.
- **Implement `migrate-all`** command with `MigrationStatus` tracking.
- **Implement Domain table replication** via `post_save`/`post_delete` signals, with retry logic and a `sync-domains` management command for manual reconciliation.
- **Update `/status/` endpoint** with per-database health.
- **Create `wait_on_all_databases.py`** startup gate script.
- **Implement graceful degradation:** 503 for unreachable satellites.
- **Integration tests:** Configure two database aliases pointing at separate PostgreSQL databases. Verify: migration runs on both, routing works, queries hit the correct DB, graceful degradation on disconnect.
- **Fix management commands** identified in Phase 0 audit.
- **Write upstream pulpcore RFC** for community review.

### Phase 2: Domain Movement (weeks 9-14)

The operational capability to actually move domains.

- **Implement `move-domain` command** with the full procedure: read-only mode, data copy, verification, cutover, monitoring window.
- **Implement `cleanup-moved-domain` command** for post-move deletion.
- **Implement domain size estimation tooling.**
- **Choose and implement data copy strategy** (filtered `COPY` vs. application-level vs. `dblink`/`postgres_fdw`).
- **End-to-end testing:** Move a test domain with realistic data volume. Verify content serving, task execution, API operations, and rollback.
- **Performance benchmarking:** Measure move time for domains of various sizes (1K, 100K, 1M, 10M content units).

### Phase 3: Production Hardening (weeks 15-20)

- **Per-database monitoring dashboards:** query latency, connection pool utilization, replication lag (Domain sync).
- **Alerting:** satellite unreachable, migration status mismatch, Domain replication failure, orphaned cross-plane references.
- **Reconciliation task:** periodic check for orphaned data-plane objects and stale cross-plane references.
- **Load testing:** simulate production traffic patterns with domains distributed across satellites.
- **Operator runbook:** how to provision a new satellite, move a domain, handle satellite failure, emergency rollback.
- **Upstream contribution:** submit patches to pulpcore based on RFC feedback.

---

## Risks and Mitigations

- **Router context gaps (management commands, cross-domain operations):** Phase 0 audit identifies all cases. Phase 1 adds explicit `.using()` calls. Safe default is `"default"` (original RDS).
- **Domain replication signal failure:** `post_save` signal includes retry with exponential backoff. `sync-domains` command for manual reconciliation. Periodic health check in `/status/`.
- **Connection count with 6+ satellites:** Each Pulp process opens a connection to every RDS instance. With many satellites, verify RDS connection limits are sufficient and monitor per-DB connection metrics. Satellite instances serve few domains so their connection load is light.
- **No distributed transactions:** Accepted. Cross-plane writes are advisory (`CreatedResource`). Reconciliation task handles orphans. Same failure class as crash-during-task (already handled).
- **Large domain move duration:** For domains with 10M+ content units, the copy phase may take hours. Under Strategy A the domain is read-only during this time. Mitigation: schedule moves during low-traffic windows, or use Strategy B (incremental sync) for domains that cannot tolerate an extended write freeze.
- **Upstream community pushback:** Design is additive (no change for single-DB deployments). Present as opt-in feature gated on `len(DATABASES) > 1`. Prepare alternative approaches (per-plugin splitting, read replicas) for RFC discussion.

---

## Open Questions

1. **Data copy strategy:** filtered per-table `COPY` (`pg_dump` has no row-level `--where` filter), application-level dump/load (portable but slow), or `postgres_fdw` (requires cross-instance network). Depends on RDS configuration and data volumes. Needs benchmarking in Phase 2.
2. **Should satellite RDS instances share the same `pg_notify` channel?** No -- task coordination stays entirely on the original RDS. Workers only listen to `default` for task wakeups. Task execution routes data-plane queries to the satellite transparently.
3. **What about read replicas?** This design is orthogonal to read replicas. A satellite RDS instance could itself have read replicas for read-heavy domains. Out of scope for this design.
4. **Content deduplication across domains on different satellites:** Currently, Pulp deduplicates `Content` within a domain (same DB). Content shared between domains on different satellites will be duplicated. This is acceptable -- deduplication across satellites would require cross-DB queries.


