Domain-Aware Database Routing for Pulpcore
Problem Statement
A single RDS instance hosts hundreds of Pulp domains. A handful of hot domains (heavy sync, large content libraries) are bottlenecking the shared instance. The goal is to selectively offload specific domains to dedicated satellite RDS instances while the majority remain on the original.
This requires solving four design problems: (1) migration orchestration against N RDS instances, (2) failure handling when one instance fails mid-migration, (3) startup gating so pods verify all instances are ready, and (4) rollback of partial multi-DB migrations and domain moves.
Current State
DOMAIN_ENABLED = True, hundreds of domains on one RDS
Single DATABASES["default"] entry pointing to that RDS
No DATABASE_ROUTERS, no .using() calls, no multi-DB code
Workers coordinate via pg_notify and advisory locks on the single instance
All pods (API, content, worker) connect to one database
Target Architecture
flowchart TB
subgraph pods [Pulp Pods]
API["API pods (all connect to all RDS)"]
Content["Content pods (all connect to all RDS)"]
Worker["Worker pods (all connect to all RDS)"]
end
subgraph originalRDS ["Original RDS (default)"]
ControlPlane["Control plane: Task, AppStatus, Role,\nAccessPolicy, ProgressReport, SystemID,\nDjango auth/contenttypes/sessions"]
DefaultDomains["Data plane: ~hundreds of domains\n(repos, content, artifacts, distributions...)"]
DomainTable["Domain table (authoritative)\nwith database_alias field"]
end
subgraph sat1 ["Satellite RDS 1"]
Sat1Schema["Full schema (identical)"]
Sat1Data["Data plane: 1-3 hot domains"]
Sat1Domain["Domain table (replica)"]
end
subgraph sat2 ["Satellite RDS 2"]
Sat2Schema["Full schema (identical)"]
Sat2Data["Data plane: 1-3 hot domains"]
Sat2Domain["Domain table (replica)"]
end
pods --> originalRDS
pods --> sat1
pods --> sat2
Key properties:
The original RDS is BOTH control plane AND default data plane. It does not change structurally.
Satellite RDS instances start empty, receive the full schema, then receive moved domain data.
Every pod connects directly to every RDS instance. 95%+ of queries still hit the original.
A domain starts on default and can be moved to a satellite. Moving it back is equally valid.
Model Classification
Control plane -- always on the original RDS (default), never routed:
Domain (authoritative copy; replicated read-only to satellites)
Task, TaskGroup, TaskSchedule, CreatedResource, ProfileArtifact
AppStatus, SystemID
AccessPolicy, Role, UserRole, GroupRole
ProgressReport, GroupProgressReport
Django tables (auth_*, django_content_type, django_migrations, django_admin_log, django_session)
Data plane -- routed by Domain.database_alias:
Repository, RepositoryVersion, RepositoryContent, RepositoryVersionContentDetails
Content, Artifact, ContentArtifact, RemoteArtifact, PulpTemporaryFile
Remote
Publication, PublishedArtifact, PublishedMetadata
Distribution, ContentGuard (and all subtypes)
Upload, UploadChunk
Exporter, Export, Importer, Import and subtypes
AlternateContentSource, AlternateContentSourcePath
SigningService and subtypes
UpstreamPulp
Why Control Plane Stays on the Original RDS
Worker coordination (pg_notify, advisory locks, SELECT ... FOR UPDATE SKIP LOCKED) requires a single PostgreSQL instance. Tasks, progress reports, and scheduling are tightly coupled to this coordination. Splitting them would require rearchitecting the entire tasking system for no immediate benefit -- the control-plane tables are small and not the bottleneck.
Placement Map
A new field on the Domain model:
# pulpcore/app/models/domain.py
class Domain(BaseModel, AutoAddObjPermsMixin):
    # ... existing fields ...
    database_alias = models.SlugField(
        default="default",
        help_text="DATABASES alias where this domain's data-plane objects reside.",
    )
Defaults to "default" -- zero behavioral change for all existing domains
Only changed when a domain is explicitly moved to a satellite via move-domain tooling
Cached in-process (Django cache framework, invalidated on Domain save)
Validated against settings.DATABASES keys on save
Django DB Router
New file: pulpcore/app/db_router.py
Known Router Limitations and Mitigations
Problem: db_for_read cannot see queryset filters. When code does Repository.objects.filter(pulp_domain=X), the router sees Repository but not which domain.
Mitigation: The router falls back to the ContextVar, which the middleware sets for every HTTP request and with_task_context() sets for every task. For the common API/task codepath, this works correctly.
Remaining gaps requiring explicit .using() calls:
Management commands that iterate over multiple domains -- must wrap each iteration in with_domain(d) context manager AND call .using(d.database_alias) on querysets
Orphan cleanup and data repair tasks that scan all domains -- must iterate per-domain with explicit routing
Import/export operations crossing domains -- must explicitly .using() the source and target
These gaps are enumerated and tracked as Phase 1 work items, not hand-waved away.
Safe failure mode: If the router has no domain context, it returns "default" (the original RDS). For a domain that has been moved to a satellite, this means the query hits the original RDS where the data no longer exists (after cleanup). This returns empty results rather than corrupt data. During the migration window (before old data is cleaned up), it returns the stale copy -- also safe.
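The routing rule and its fallback can be sketched without Django. This is only an illustration under stated assumptions: the source does not show the router's code, so `DomainRouterSketch`, `with_domain_alias`, and the contents of `CONTROL_PLANE_MODELS` are hypothetical names, and the real router would presumably look up `Domain.database_alias` rather than take an alias directly. The method names follow Django's router protocol (`db_for_read`, `db_for_write`).

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical subset of the control-plane classification.
CONTROL_PLANE_MODELS = {"Domain", "Task", "AppStatus", "AccessPolicy", "Role"}

# Alias of the database holding the active domain; None means "no domain context".
_current_alias = ContextVar("current_database_alias", default=None)


@contextmanager
def with_domain_alias(alias):
    """Set the active domain's database alias for the duration of the block."""
    token = _current_alias.set(alias)
    try:
        yield
    finally:
        _current_alias.reset(token)


class DomainRouterSketch:
    """Routes by model class plus ambient context, never by queryset filters."""

    def _route(self, model):
        if model.__name__ in CONTROL_PLANE_MODELS:
            return "default"  # control plane is never routed
        # Safe failure mode: without a domain context, use the original RDS.
        return _current_alias.get() or "default"

    def db_for_read(self, model, **hints):
        return self._route(model)

    def db_for_write(self, model, **hints):
        return self._route(model)


class Task:
    """Stand-in for a control-plane model."""


class Repository:
    """Stand-in for a data-plane model."""
```

Because the router only ever sees the model class, any code path that runs outside the middleware or task context silently lands on `default`, which is why the gaps listed above need explicit `.using()` calls.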
Problem 1: Migration Orchestration Against N RDS Instances
How It Works
All RDS instances (original + satellites) run identical schemas. Django's migrate is invoked once per database alias.
New management command: migrate-all
Critical: default migrates first. This ensures the Domain table and control-plane schema are up to date before satellites are migrated. After default is migrated, the Domain replication sync runs to populate Domain rows on satellites. Then satellite migrations proceed (they may reference Domain PKs in FK defaults via get_domain_pk()).
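The ordering constraint can be expressed as a small planning function. This is a sketch, not the actual migrate-all implementation: `migration_plan` is a hypothetical helper, and it assumes the Domain replication step is the sync-domains command mentioned in the rollout plan.

```python
def migration_plan(aliases):
    """Return management-command invocations: default first, then satellites."""
    if "default" not in aliases:
        raise ValueError('the "default" alias is required')
    satellites = sorted(alias for alias in aliases if alias != "default")
    plan = [["migrate", "--database=default"]]
    # Populate Domain rows on satellites before their migrations run
    # (assumed here to be the sync-domains command).
    plan.append(["sync-domains"])
    plan.extend(["migrate", f"--database={alias}"] for alias in satellites)
    return plan
```

Each entry could then be passed to Django's `call_command(*entry)`, stopping at the first failure so that a re-run resumes with already-migrated aliases as no-ops.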
get_domain_pk() Bootstrap Fix
The existing get_domain_pk() function uses raw SQL against connection.cursor(), which during migrate --database=data_1 queries data_1. If the Domain table on data_1 is empty, the default domain PK lookup fails.
Fix: Modify get_domain_pk() to explicitly query default when called during migration:
# requires: from django.db import connections
def get_domain_pk():
    if _inside_migration():
        # During migration, always read from control DB
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT pulp_id FROM core_domain WHERE name = 'default'")
            ...
    else:
        # Normal runtime path (unchanged)
        ...
post_migrate Hook Fix
Pulpcore's post_migrate hooks populate AccessPolicy and Role objects. These are control-plane models that must only be written to default.
Fix: Guard the hooks:
def _populate_access_policies(sender, **kwargs):
    db_alias = kwargs.get("using", "default")
    if db_alias != "default":
        return  # Only populate on control DB
    # ... existing logic ...
Problem 2: Failure Handling
| Scenario | What happens | Recovery |
| --- | --- | --- |
| Satellite RDS unreachable during migration | migrate-all fails at connection for that alias. default and prior satellites already migrated. | Fix connectivity, re-run migrate-all. Already-migrated DBs are no-ops. |
| Satellite fails mid-transactional-migration | PostgreSQL rolls back the transaction. django_migrations not updated. | Re-run. |
| Satellite fails mid-non-transactional migration (AddIndexConcurrently) | django_migrations not updated. | Re-run. CREATE INDEX CONCURRENTLY IF NOT EXISTS is idempotent. |
| Original RDS fails | Lock acquisition fails. Nothing migrates. | Fix original RDS, re-run. |
| Pod crashes mid-migrate-all | Advisory lock released on disconnect. Some DBs migrated, some not. | New pod re-runs. Idempotent. |
Schema version skew tolerance: Because Pulp already requires code to be backward-compatible with the previous schema version (enforced by RequireVersion), a state where default is at migration N and a satellite is at N-1 is safe. The code running against the satellite simply doesn't use the new column/index yet.
Non-transactional migration safety rule: All RunPython operations in data migrations must be idempotent. Enforced via code review and a CI check.
Problem 3: Startup Gating
Current flow (single DB)
Multi-DB flow
New startup gate script: wait_on_all_databases.py
Read DATABASES from settings
For each alias, run showmigrations --database=<alias> and verify no unapplied migrations
Updated /status/ endpoint:
Add a databases array to the response:
    {
      "database_connection": {"connected": true},
      "databases": [
        {"alias": "default", "connected": true, "migrations_complete": true},
        {"alias": "data_1", "connected": true, "migrations_complete": true},
        {"alias": "data_2", "connected": false, "migrations_complete": null}
      ]
    }
Graceful degradation (post-startup):
If a satellite goes offline, requests for domains on that satellite return 503 Service Unavailable with a clear error: "Database for domain 'X' is currently unavailable"
Tasks for affected domains remain in waiting state
/status/ reports the degraded satellite
All other domains (on the original RDS and healthy satellites) continue normally
A try/except OperationalError wrapper in the router translates connection failures to the 503
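A rough sketch of the error translation follows. The names `DatabaseUnavailable` and `run_on_domain_db` are hypothetical, and the built-in `ConnectionError` stands in for `django.db.OperationalError` so the example runs without Django; in Pulp the tuple passed as `connection_errors` would presumably contain `OperationalError`.

```python
class DatabaseUnavailable(Exception):
    """Signals that the database backing a domain cannot be reached."""

    status_code = 503

    def __init__(self, domain_name):
        super().__init__(
            f"Database for domain '{domain_name}' is currently unavailable"
        )


def run_on_domain_db(domain_name, query, connection_errors=(ConnectionError,)):
    """Run query() and translate connection failures into a 503-style error."""
    try:
        return query()
    except connection_errors as exc:
        raise DatabaseUnavailable(domain_name) from exc
```

An API exception handler could then map `DatabaseUnavailable` to an HTTP 503 response, while domains on healthy instances are unaffected because the failure is raised per call.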
Problem 4: Rollback
Schema migration rollback
# Roll back all instances to migration core 0151
pulpcore-manager migrate-all --target core 0151
Runs migrate core 0151 --database=<alias> for each alias. Order: satellites first (reverse of apply), then default last.
Django schema migrations are reversible by default. RunPython data migrations must define reverse_code (enforced via lint).
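The rollback order is the mirror image of the apply order. A minimal sketch, assuming apply order is default followed by satellites in sorted alias order; `rollback_plan` is a hypothetical helper, not part of the proposed command.

```python
def rollback_plan(aliases, app_label, target):
    """Return migrate invocations: satellites first (reversed), default last."""
    if "default" not in aliases:
        raise ValueError('the "default" alias is required')
    satellites = sorted(
        (alias for alias in aliases if alias != "default"), reverse=True
    )
    return [
        ["migrate", app_label, target, f"--database={alias}"]
        for alias in satellites + ["default"]
    ]
```

Rolling back default last keeps the control-plane schema at the newer version until every satellite has been reverted.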
Domain move rollback
This is the more important rollback scenario. If a domain was moved to a satellite and something goes wrong:
Before old data cleanup: Flip Domain.database_alias back to "default". Immediate rollback, zero data loss. The stale copy on the satellite is orphaned but harmless.
After old data cleanup: Reverse the move -- copy data from satellite back to original, flip alias. Same process as the original move, just in the opposite direction.
Design rule: old data on the original RDS is NOT deleted until the move is verified AND an explicit cleanup command is run. This provides a rollback window of arbitrary length.
Domain Movement Procedure (Phase 2 -- Critical Path)
This is the hardest operational problem: moving a domain with millions of content units between RDS instances.
Two movement strategies are available. Choose based on how much downtime is acceptable for the domain being moved.
Strategy A: Read-Only Cutover (simpler, longer downtime)
The domain is set to read-only for the entire duration of the data copy. Simpler to implement but the domain is unavailable for writes during the full copy, which may take hours for large domains.
stateDiagram-v2
[*] --> Preparation
Preparation --> ReadOnlyMode: set domain.moving=true
ReadOnlyMode --> DataCopy: reject writes for this domain
DataCopy --> Verification: pg_dump filtered by pulp_domain_id
Verification --> Cutover: row counts + checksums match
Cutover --> Monitoring: update database_alias, clear moving flag
Monitoring --> Cleanup: verify in production for N days
Cleanup --> [*]: delete old rows from original RDS
Step 1 -- Preparation:
Estimate domain size: SELECT COUNT(*), pg_total_relation_size(...) FROM <table> WHERE pulp_domain_id = <pk>
Verify satellite RDS has sufficient storage and is fully migrated
Verify no in-flight tasks for the domain
Step 2 -- Read-only mode:
Set Domain.moving = True (new boolean field)
Middleware rejects write operations (POST/PUT/PATCH/DELETE) for the domain with 409 Conflict: "Domain is being migrated"
In-flight tasks for the domain are allowed to complete but no new tasks are dispatched
Content serving (GET) continues from the original RDS
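The write gate reduces to a small predicate that middleware could call. A sketch with hypothetical names (`WRITE_METHODS`, `check_write_allowed`); how the middleware resolves the domain and builds the response is not shown in the source.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def check_write_allowed(method, domain_moving):
    """Return (status, detail) for a rejected request, or None to allow it."""
    if domain_moving and method.upper() in WRITE_METHODS:
        return 409, "Domain is being migrated"
    return None
```

GET and other read-only methods always pass, which matches content serving continuing from the original RDS during the copy.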
Step 3 -- Data copy:
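Option A: per-table export filtered by pulp_domain_id, then restore to the satellite. Fastest for large datasets. Stock pg_dump has no --where row filter, so this likely means COPY (SELECT ... WHERE pulp_domain_id = <pk>) TO STDOUT per table rather than pg_dump with --table alone.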
Option B: Application-level Django dumpdata with domain filtering + loaddata --database=<satellite>. Slower but portable.
Option C: Direct INSERT INTO satellite.table SELECT * FROM original.table WHERE pulp_domain_id = <pk> via dblink or postgres_fdw. Requires network access between RDS instances.
Step 4 -- Verification:
Compare row counts per table between original and satellite for the domain
Compare checksums (e.g., MD5(array_agg(pk ORDER BY pk))) per table
Verify FK integrity on the satellite
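The count-plus-checksum comparison can be prototyped in plain Python. This is an analogue of the SQL check, not the same computation: it hashes the sorted primary keys with `hashlib.md5`, and the function names and the shape of the inputs (table name mapped to the domain's PKs) are assumptions.

```python
import hashlib


def table_fingerprint(pks):
    """Return (row_count, md5 of the sorted PKs) for one table."""
    ordered = sorted(str(pk) for pk in pks)
    digest = hashlib.md5("\n".join(ordered).encode("utf-8")).hexdigest()
    return len(ordered), digest


def mismatched_tables(original, satellite):
    """Return tables whose fingerprints differ between the two databases.

    Both arguments map a table name to an iterable of PKs for the domain.
    """
    tables = original.keys() | satellite.keys()
    return sorted(
        table
        for table in tables
        if table_fingerprint(original.get(table, ()))
        != table_fingerprint(satellite.get(table, ()))
    )
```

A PK-only fingerprint catches missing or extra rows but not rows whose column values differ, so it complements rather than replaces the FK integrity check.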
Step 5 -- Cutover:
Update Domain.database_alias to the satellite alias
Replicate the Domain change to all satellites
Clear Domain.moving flag
Invalidate the in-process domain cache
All new queries for this domain now route to the satellite
Step 6 -- Monitoring:
Verify the domain in production for N days before cleanup
Step 7 -- Cleanup:
Run pulpcore-manager cleanup-moved-domain <domain-name>
Deletes rows from the original RDS where pulp_domain_id = <pk>
Until this runs, rollback is instant (flip alias back)
Strategy B: Incremental Sync with Final Blocking Cutover (minimal downtime)
The domain remains fully operational during the bulk data copy. Only a brief blocking window is needed at the end to sync the last delta and cut over. This significantly reduces downtime at the cost of more complex tooling.
stateDiagram-v2
[*] --> Preparation
Preparation --> BulkSync: domain stays fully operational
BulkSync --> DeltaSync: repeat until delta is small
DeltaSync --> BlockingCutover: set domain.moving=true
BlockingCutover --> FinalSync: sync remaining delta
FinalSync --> Verification: row counts + checksums match
Verification --> Cutover: update database_alias, clear moving flag
Cutover --> Monitoring: verify in production for N days
Monitoring --> Cleanup: delete old rows from original RDS
Cleanup --> [*]
Step 1 -- Preparation:
Same as Strategy A: estimate size, verify satellite, verify migrations
Step 2 -- Bulk sync (non-blocking):
Copy all existing domain data to the satellite while the domain is fully operational
Users continue reading and writing normally against the original RDS
Track the sync point: record a high-watermark timestamp (sync_started_at) or use pulp_created/pulp_last_updated to identify what was copied
This is the longest step (hours for large domains) but has zero user impact
Step 3 -- Delta sync (non-blocking, repeatable):
After the bulk sync completes, sync only rows created or modified since the bulk sync started
Use pulp_last_updated > sync_started_at to identify the delta
The domain is still fully operational -- new writes continue landing on the original RDS
Each delta sync is smaller and faster than the previous one
Repeat until the delta is small enough that the final blocking sync will be fast (target: seconds to low minutes)
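The repeated delta passes can be sketched with in-memory rows. This is an illustration under assumptions: rows are dicts with `pk` and `pulp_last_updated` keys, the satellite is a dict keyed by PK, and `delta_rows` and `sync_pass` are hypothetical helpers. The caller passes the time at which the pass started as the next watermark, so writes landing during a pass are picked up by the following one.

```python
from datetime import datetime


def delta_rows(rows, watermark: datetime):
    """Select rows modified after the previous sync point."""
    return [row for row in rows if row["pulp_last_updated"] > watermark]


def sync_pass(source_rows, target, watermark: datetime, pass_started_at: datetime):
    """Upsert changed rows into target; return (rows copied, next watermark)."""
    changed = delta_rows(source_rows, watermark)
    for row in changed:
        target[row["pk"]] = dict(row)  # insert or overwrite by primary key
    return len(changed), pass_started_at
```

Looping until the returned count falls below a threshold gives the "repeat until the delta is small" behavior; deletes are not covered by this query and need the separate handling described in the next section.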
Step 4 -- Blocking cutover (brief downtime):
Set Domain.moving = True -- middleware rejects writes, no new tasks dispatched
Wait for in-flight tasks to complete (or timeout)
Run one final delta sync to copy any writes that landed between the last non-blocking sync and the block
This window should be very short (seconds to minutes) since the delta is small
Step 5 -- Verification:
Same as Strategy A: row counts, checksums, FK integrity
Additionally verify that no rows on the original have pulp_last_updated after the final sync point
Step 6 -- Cutover:
Update Domain.database_alias to the satellite alias
Replicate the Domain change to all satellites
Clear Domain.moving flag
All new queries route to the satellite
Step 7-8 -- Monitoring and Cleanup:
Same as Strategy A
Handling deletes during incremental sync
Rows deleted on the original RDS between sync passes will be missed by the pulp_last_updated delta query. Two approaches:
Soft-delete tracking: Log deletes for the domain during the sync window (e.g., a DomainMoveDeleteLog table recording deleted PKs per table). Replay deletes on the satellite during each delta sync.
Full reconciliation on final sync: During the blocking cutover, do a full PK comparison between original and satellite for the domain and remove any PKs on the satellite that no longer exist on the original. Acceptable because the blocking window already exists and the domain is small enough that PK comparison is fast after multiple delta syncs have brought the data close.
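The full-reconciliation approach is a set difference per table. A minimal sketch with hypothetical helper names, assuming the PKs for the domain have already been fetched from both databases.

```python
def stale_satellite_pks(original_pks, satellite_pks):
    """PKs present on the satellite that no longer exist on the original."""
    return set(satellite_pks) - set(original_pks)


def reconciliation_plan(original, satellite):
    """Map table name to the PKs to delete on the satellite (empty sets omitted)."""
    plan = {}
    for table, pks in satellite.items():
        stale = stale_satellite_pks(original.get(table, ()), pks)
        if stale:
            plan[table] = stale
    return plan
```

Running this inside the blocking window means the original cannot change underneath the comparison, so the resulting delete set is exact.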
Strategy comparison
| Aspect | Strategy A (Read-Only) | Strategy B (Incremental Sync) |
| --- | --- | --- |
| Downtime for writes | Entire copy duration (hours) | Final sync only (seconds to minutes) |
| Implementation complexity | Low | High (delta tracking, delete handling) |
| Risk of data inconsistency | None (domain is frozen) | Low (final blocking sync + verification) |
| Suitable for | Small/medium domains, maintenance windows | Large domains, no tolerance for extended downtime |
| Rollback | Flip alias back | Flip alias back (same) |
Cross-Database Query Handling
No distributed transactions -- accepted trade-off
A task that writes to both planes (e.g., CreatedResource on control DB + Repository on satellite) has no atomicity guarantee. If the satellite write succeeds but the control write fails, data is orphaned.
Mitigation:
CreatedResource is advisory (used for task result reporting, not data integrity). An orphaned data-plane object without a CreatedResource entry is benign.
A periodic reconciliation task detects and cleans up orphaned references.
This is the same class of problem as existing crash-during-task scenarios, which Pulp already handles via orphan cleanup.
Admin cross-domain queries
For admin operations that need to query across all domains (capacity planning, usage reports):
Iterate over unique database_alias values from Domain
For each alias, run the query with .using(alias)
Merge results in Python
This is acceptable for admin tooling; it is not needed for the normal API path
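The iterate-and-merge pattern looks roughly like this. It is a sketch: `count_across_databases` is a hypothetical helper, and `count_on_alias` stands in for whatever per-database query the report needs (in Django, a queryset built with `.using(alias)`).

```python
from collections import Counter


def count_across_databases(domain_aliases, count_on_alias):
    """Merge per-domain counts from every database that hosts a domain.

    domain_aliases maps domain name to its database alias.
    count_on_alias(alias) returns {domain name: count} for that database.
    """
    totals = Counter()
    for alias in sorted(set(domain_aliases.values())):  # one query per alias
        totals.update(count_on_alias(alias))
    return dict(totals)
```

Querying once per alias rather than once per domain keeps the number of round trips proportional to the number of RDS instances, not the number of domains.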
Management commands
Commands that iterate over domains (orphan cleanup, data repair) must be updated:
for domain in Domain.objects.all():
    with with_domain(domain):
        qs = Content.objects.using(domain.database_alias).filter(pulp_domain=domain)
        # ... process ...
This is a Phase 1 audit item -- identify all such commands and add explicit routing.
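Upstream Pulpcore Changes Required
New files
pulpcore/app/db_router.py -- PulpDomainRouter
pulpcore/app/management/commands/migrate_all.py -- multi-DB migration orchestration
pulpcore/app/management/commands/move_domain.py -- domain movement tooling (Phase 2)
pulpcore/app/management/commands/cleanup_moved_domain.py -- post-move cleanup (Phase 2)
pulpcore/app/models/migration_status.py -- MigrationStatus model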
Modified files
pulpcore/app/models/domain.py -- add database_alias and moving fields; add post_save signal for cross-DB replication
pulpcore/app/settings.py -- conditional DATABASE_ROUTERS when len(DATABASES) > 1
pulpcore/app/util.py -- fix get_domain_pk() to use default during migrations
pulpcore/app/apps.py -- guard post_migrate hooks to only run on default
pulpcore/app/views/status.py -- add per-database health to /status/
pulpcore/middleware.py -- reject writes for domains with moving=True
pulpcore/tasking/tasks.py -- skip dispatch for domains with moving=True
Plugin impact
No plugin code changes required for Phase 1 -- routing is transparent via the Django router + ContextVar
Plugin management commands that iterate over domains need .using() calls (Phase 1 audit)
Plugin post_migrate hooks should be verified for using kwarg awareness
Phased Rollout
Phase 0: Prerequisites (weeks 1-3)
No multi-DB yet. Prepare the codebase.
Audit raw SQL: Find all connection.cursor(), RawSQL(), .extra() in pulpcore and plugins. Catalog which need explicit connections[alias] handling.
Audit management commands: Identify all commands that iterate over domains or query data-plane models without domain context.
Add database_alias field to Domain: Default "default", no behavioral change. Migration is a simple AddField.
Add moving field to Domain: Default False. No behavioral change.
Fix get_domain_pk(): Make it migration-safe (always query default during migrations).
Guard post_migrate hooks: Add if kwargs.get("using") != "default": return guard.
Verify RunPython idempotency: Audit all data migrations in core and plugins.
Connection limits: Verify RDS instance connection limits can handle the expected pod count. Each Pulp process opens a connection to every configured RDS instance.
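The connection-limit check is simple arithmetic once process counts are known. A back-of-the-envelope sketch, assuming exactly one connection per process per configured instance (connection pooling or per-thread connections would change the numbers); the function names and the `reserved` headroom parameter are hypothetical.

```python
def connections_per_instance(process_counts):
    """Every process connects to every instance, so each instance sees the sum."""
    return sum(process_counts.values())


def fits_connection_limit(process_counts, max_connections, reserved=0):
    """Check the expected load against an instance's limit minus reserved slots."""
    return connections_per_instance(process_counts) <= max_connections - reserved
```

Note that adding a satellite does not lower the count on the original instance: the per-instance total depends only on the number of Pulp processes.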
Phase 1: Routing Layer (weeks 4-8)
Multi-DB infrastructure. No domains move yet.
Implement PulpDomainRouter with control-plane/data-plane classification.
Implement migrate-all command with MigrationStatus tracking.
Implement Domain table replication via post_save/post_delete signals, with retry logic and a sync-domains management command for manual reconciliation.
Update /status/ endpoint with per-database health.
Implement graceful degradation: 503 for unreachable satellites.
Integration tests: Configure two database aliases pointing at separate PostgreSQL databases. Verify: migration runs on both, routing works, queries hit the correct DB, graceful degradation on disconnect.
Fix management commands identified in Phase 0 audit.
Write upstream pulpcore RFC for community review.
Phase 2: Domain Movement (weeks 9-14)
The operational capability to actually move domains.
Implement move-domain command with the full procedure: read-only mode, data copy, verification, cutover, monitoring window.
Implement cleanup-moved-domain command for post-move deletion.
Implement domain size estimation tooling.
Choose and implement data copy strategy (pg_dump filtering vs. application-level vs. dblink).
End-to-end testing: Move a test domain with realistic data volume. Verify content serving, task execution, API operations, and rollback.
Performance benchmarking: Measure move time for domains of various sizes (1K, 100K, 1M, 10M content units).
Phase 3: Production Hardening (weeks 15-20)
Per-database monitoring dashboards: query latency, connection pool utilization, replication lag (Domain sync).
Domain replication signal failure: post_save signal includes retry with exponential backoff. sync-domains command for manual reconciliation. Periodic health check in /status/.
Connection count with 6+ satellites: Each Pulp process opens a connection to every RDS instance. With many satellites, verify RDS connection limits are sufficient and monitor per-DB connection metrics. Satellite instances serve few domains so their connection load is light.
No distributed transactions: Accepted. Cross-plane writes are advisory (CreatedResource). Reconciliation task handles orphans. Same failure class as crash-during-task (already handled).
Large domain move duration: For domains with 10M+ content units, the copy phase may take hours. The domain is read-only during this time. Mitigation: schedule moves during low-traffic windows; implement incremental sync for future improvement.
Upstream community pushback: Design is additive (no change for single-DB deployments). Present as opt-in feature gated on len(DATABASES) > 1. Prepare alternative approaches (per-plugin splitting, read replicas) for RFC discussion.
Open Questions
Data copy strategy: filtered per-table export (stock pg_dump has no --where option, so this likely means COPY with a WHERE clause), application-level dump/load (portable but slow), or postgres_fdw (requires cross-instance network). Depends on RDS configuration and data volumes. Needs benchmarking in Phase 2.
Should satellite RDS instances share the same pg_notify channel? No -- task coordination stays entirely on the original RDS. Workers only listen to default for task wakeups. Task execution routes data-plane queries to the satellite transparently.
What about read replicas? This design is orthogonal to read replicas. A satellite RDS instance could itself have read replicas for read-heavy domains. Out of scope for this design.
Content deduplication across domains on different satellites: Currently, Pulp deduplicates Content within a domain (same DB). Content shared between domains on different satellites will be duplicated. This is acceptable -- deduplication across satellites would require cross-DB queries.
The domain remains fully operational during the bulk data copy. Only a brief blocking window is needed at the end to sync the last delta and cut over. This significantly reduces downtime at the cost of more complex tooling.
Step 1 -- Preparation:
Step 2 -- Bulk sync (non-blocking):
sync_started_at) or usepulp_created/pulp_last_updatedto identify what was copiedStep 3 -- Delta sync (non-blocking, repeatable):
pulp_last_updated > sync_started_atto identify the deltaStep 4 -- Blocking cutover (brief downtime):
Domain.moving = True-- middleware rejects writes, no new tasks dispatchedStep 5 -- Verification:
pulp_last_updatedafter the final sync pointStep 6 -- Cutover:
Domain.database_aliasto the satellite aliasDomain.movingflagStep 7-8 -- Monitoring and Cleanup:
Handling deletes during incremental sync
Rows deleted on the original RDS between sync passes will be missed by the
pulp_last_updateddelta query. Two approaches:DomainMoveDeleteLogtable recording deleted PKs per table). Replay deletes on the satellite during each delta sync.Strategy comparison
Cross-Database Query Handling
No distributed transactions -- accepted trade-off
A task that writes to both planes (e.g., CreatedResource on control DB + Repository on satellite) has no atomicity guarantee. If the satellite write succeeds but the control write fails, data is orphaned.
Mitigation:
CreatedResource is advisory (used for task result reporting, not data integrity). An orphaned data-plane object without a CreatedResource entry is benign.
Admin cross-domain queries
For admin operations that need to query across all domains (capacity planning, usage reports):
Collect the distinct database_alias values from Domain
Query each database with .using(alias)
Management commands
Commands that iterate over domains (orphan cleanup, data repair) must be updated:
This is a Phase 1 audit item -- identify all such commands and add explicit routing.
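One way such a command could add explicit routing is to group domains by `database_alias` and then do the work once per database. The sketch below is an illustration under stated assumptions rather than pulpcore code: `DomainRecord` is a plain stand-in for the Domain model, `run_per_database` and `count_domains` are hypothetical helpers, and the alias `satellite_1` is invented for the example. In a real Django command the per-alias step is where querysets would be routed with `.using(alias)`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRecord:
    """Stand-in for a Domain row; only the fields this sketch needs."""

    name: str
    database_alias: str = "default"


def group_by_alias(domains):
    """Map each database alias to the names of the domains stored there."""
    groups = defaultdict(list)
    for domain in domains:
        groups[domain.database_alias].append(domain.name)
    return dict(groups)


def run_per_database(domains, action):
    """Call action(alias, domain_names) once per database, default first."""
    groups = group_by_alias(domains)
    order = sorted(groups, key=lambda alias: (alias != "default", alias))
    return {alias: action(alias, groups[alias]) for alias in order}


def count_domains(alias, names):
    # Hypothetical per-alias step. A real command would route its
    # queries here explicitly, e.g. Model.objects.using(alias).
    return len(names)


domains = [
    DomainRecord("tenant-a"),
    DomainRecord("hot-1", "satellite_1"),
    DomainRecord("tenant-b"),
    DomainRecord("hot-2", "satellite_1"),
]
report = run_per_database(domains, count_domains)
```

Grouping first keeps each database's work together, so one unreachable satellite can be reported or skipped without interleaving its failures with work on the original RDS.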
Upstream Pulpcore Changes Required
New files
pulpcore/app/db_router.py -- PulpDomainRouter
pulpcore/app/management/commands/migrate_all.py -- multi-DB migration orchestration
pulpcore/app/management/commands/move_domain.py -- domain movement tooling (Phase 2)
pulpcore/app/management/commands/cleanup_moved_domain.py -- post-move cleanup (Phase 2)
pulpcore/app/models/migration_status.py -- MigrationStatus model
Modified files
pulpcore/app/models/domain.py -- add database_alias and moving fields; add post_save signal for cross-DB replication
pulpcore/app/settings.py -- conditional DATABASE_ROUTERS when len(DATABASES) > 1
pulpcore/app/util.py -- fix get_domain_pk() to use default during migrations
pulpcore/app/apps.py -- guard post_migrate hooks to only run on default
pulpcore/app/views/status.py -- add per-database health to /status/
pulpcore/middleware.py -- reject writes for domains with moving=True
pulpcore/tasking/tasks.py -- skip dispatch for domains with moving=True
Plugin impact
.using() calls (Phase 1 audit)
post_migrate hooks should be verified for using kwarg awareness
Phased Rollout
Phase 0: Prerequisites (weeks 1-3)
No multi-DB yet. Prepare the codebase.
Audit connection.cursor(), RawSQL(), .extra() in pulpcore and plugins. Catalog which need explicit connections[alias] handling.
Add database_alias field to Domain: Default "default", no behavioral change. Migration is a simple AddField.
Add moving field to Domain: Default False. No behavioral change.
Fix get_domain_pk(): Make it migration-safe (always query default during migrations).
Guard post_migrate hooks: Add if kwargs.get("using") != "default": return guard.
RunPython idempotency: Audit all data migrations in core and plugins.
Phase 1: Routing Layer (weeks 4-8)
Multi-DB infrastructure. No domains move yet.
PulpDomainRouter with control-plane/data-plane classification.
migrate-all command with MigrationStatus tracking.
Domain replication via post_save/post_delete signals, with retry logic and a sync-domains management command for manual reconciliation.
/status/ endpoint with per-database health.
wait_on_all_databases.py startup gate script.
Phase 2: Domain Movement (weeks 9-14)
The operational capability to actually move domains.
move-domain command with the full procedure: read-only mode, data copy, verification, cutover, monitoring window.
cleanup-moved-domain command for post-move deletion.
Phase 3: Production Hardening (weeks 15-20)
Risks and Mitigations
.using() calls. Safe default is "default" (original RDS).
post_save signal includes retry with exponential backoff. sync-domains command for manual reconciliation. Periodic health check in /status/.
CreatedResource). Reconciliation task handles orphans. Same failure class as crash-during-task (already handled).
len(DATABASES) > 1. Prepare alternative approaches (per-plugin splitting, read replicas) for RFC discussion.
Open Questions
pg_dump with --where (requires pg >= 16), application-level dump/load (portable but slow), or postgres_fdw (requires cross-instance network). Depends on RDS configuration and data volumes. Needs benchmarking in Phase 2.
pg_notify channel? No -- task coordination stays entirely on the original RDS. Workers only listen to default for task wakeups. Task execution routes data-plane queries to the satellite transparently.
Content within a domain (same DB). Content shared between domains on different satellites will be duplicated. This is acceptable -- deduplication across satellites would require cross-DB queries.