LadybugDB version migration pipeline #350

@jfrench9

Description

Summary

Implement a Dagster-orchestrated export/import pipeline for safely migrating LadybugDB databases across version upgrades. The pipeline uses LadybugDB's native EXPORT DATABASE / IMPORT DATABASE commands, with system backups to S3 as a safety net, a 503 health-check response during import to drain traffic, and SSE-monitored background tasks for observing long-running operations.

Status: Approved

Problem Statement: Current State

LadybugDB uses an embedded single-file database format (.lbug). When the LadybugDB Python package is upgraded to a new version, the on-disk format may be incompatible with existing databases. Today there is no migration path:

  • Existing .lbug files may fail to open with a new version
  • No version tracking on databases tells us which LadybugDB version created them
  • No automated export/import pipeline exists to migrate data between versions
  • A failed upgrade could leave databases in an unrecoverable state
  • Shared repositories (SEC) and all user databases must be migrated together or not at all

The current version is 0.13.0 (set via LADYBUG_INTERNAL_VERSION in the Dockerfile).

Problem Statement: Desired State

A fully automated, Dagster-orchestrated pipeline that:

  1. Pre-deploy: Creates system backups to S3 + exports all databases to local Parquet on each instance
  2. Deploy: New container with new LadybugDB version (EBS volume persists exported files)
  3. Post-deploy: Imports all databases from Parquet into fresh .lbug files with the new version
  4. Has rollback at every stage (.pre-migration files on disk, system backups in S3, old Docker image in ECR)
  5. Is version-agnostic — can be run with identical source and target versions as a validation/integrity check

Problem Statement: Why Now?

The platform is on LadybugDB 0.13.0. Upstream Kuzu (which LadybugDB forks) regularly changes its on-disk format between releases. Without a migration pipeline we cannot upgrade LadybugDB, which blocks performance improvements, bug fixes, and new features from upstream.

Proposed Solution: Approach

Dagster-orchestrated, instance-level export/import via Graph API endpoints.

Two Dagster jobs orchestrate the migration across the fleet:

  1. Export job (pre-deploy): Discovers all instances from DynamoDB → calls POST /migration/export on each → system backup to S3, then EXPORT DATABASE to local Parquet, writes migration.json manifest
  2. Import job (post-deploy): Discovers all instances → calls POST /migration/import on each → reads manifest, creates fresh empty databases, runs IMPORT DATABASE from Parquet
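Both jobs share the same control flow: discover the fleet, then fan out a per-instance HTTP call and collect task IDs. A minimal, dependency-free sketch of that flow (in the real implementation these would be Dagster ops composed into a job; the registry shape, the injected HTTP `post` callable, and the version parameters shown are assumptions):

```python
# Sketch of the export job's control flow. The DynamoDB registry is modeled
# as a dict, and the HTTP layer is injected so the logic is testable offline.

def discover_instances(registry: dict) -> list[str]:
    """Stand-in for the DynamoDB fleet scan: return writer instance hosts."""
    return [host for host, meta in registry.items() if meta.get("role") == "writer"]

def start_exports(hosts: list[str], post) -> dict[str, str]:
    """Call POST /migration/export on each instance and collect task IDs."""
    tasks = {}
    for host in hosts:
        resp = post(
            f"http://{host}/migration/export",
            params={"source_version": "0.13.0", "target_version": "0.14.0"},
        )
        tasks[host] = resp["task_id"]
    return tasks

# Usage with a fake HTTP layer (real code would use the Graph API client):
registry = {"10.0.1.12": {"role": "writer"}, "10.0.1.13": {"role": "replica"}}
fake_post = lambda url, params: {"task_id": f"migration_{url.split('//')[1].split('/')[0]}"}
tasks = start_exports(discover_instances(registry), fake_post)
```

The import job would mirror this shape, then tail each instance's SSE monitor URL until completion before declaring the fleet healthy.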

Key design decisions:

  • Background tasks with SSE — export/import can take 30+ minutes for large databases
  • Health check 503 during import — same pattern as S3 ATTACH replica warmup; ALB stops routing traffic
  • System backups — new BackupType.SYSTEM (hidden from customers) provides off-disk S3 safety net
  • EBS persistence — exported Parquet files survive container swaps
  • Lazy connections — new Graph API starts without trying to open incompatible .lbug files
  • Replicas don't need migration — SEC replicas S3 ATTACH whatever backup is published; only writers have downtime
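The 503-during-import decision hinges on a process-wide flag that the import task sets and the health endpoint consults. A minimal sketch of such a flag (names and module layout are hypothetical; the proposal places this logic in graph_api/core/migration_service.py):

```python
# Sketch of a migration-in-progress flag. While set, /health returns 503 so
# the ALB marks the instance unhealthy and stops routing traffic to it.
import threading

class MigrationState:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = False

    def begin(self):
        with self._lock:
            self._in_progress = True

    def end(self):
        with self._lock:
            self._in_progress = False

    def is_migration_in_progress(self) -> bool:
        with self._lock:
            return self._in_progress

# Single shared instance, imported by both the import task and the health router.
migration_state = MigrationState()
```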

Components Affected

  • Graph API (/robosystems/graph_api/)
  • Dagster (/robosystems/dagster/)
  • Models (/robosystems/models/)
  • API (/robosystems/routers/)

Key Changes

New Files

| File | Description |
| --- | --- |
| `graph_api/models/migration.py` | Pydantic models for export/import responses, manifest, status |
| `graph_api/core/migration_service.py` | Export/import logic, 503 flag, system backup + EXPORT/IMPORT DATABASE |
| `graph_api/routers/migration.py` | `/migration/export`, `/migration/import`, `/migration/status` endpoints |
| `dagster/jobs/migration.py` | Dagster export + import jobs with DynamoDB fleet discovery + SSE monitoring |

Modified Files

| File | Change |
| --- | --- |
| `graph_api/app.py` | Include migration router |
| `graph_api/client/client.py` | Add `migration_export()`, `migration_import()`, `migration_status()` client methods |
| `graph_api/core/ladybug/pool.py` | Version incompatibility detection in `_create_new_connection()` |
| `graph_api/core/task_manager.py` | Add `migration_task_manager` instance |
| `graph_api/core/task_sse.py` | Add `TaskType.MIGRATION` SSE messages |
| `graph_api/routers/tasks.py` | Register migration manager in `UnifiedTaskManager` |
| `graph_api/routers/health.py` | Add `is_migration_in_progress()` check returning 503 |
| `models/iam/graph_backup.py` | Add `SYSTEM` to `BackupType` enum |
| `routers/graphs/backups/backup.py` | Filter `backup_type != "system"` from customer list |
| `dagster/definitions.py` | Register migration jobs |

Data Model Changes

BackupType enum (models/iam/graph_backup.py):

  • Add SYSTEM = "system" — used for pre-migration safety backups, hidden from customer-facing endpoints

No database migrations required; BackupType is stored as a string column.
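A sketch of the enum change and the customer-facing filter. Only `SYSTEM` is specified by this proposal; the other member names shown are placeholders for whatever already exists in models/iam/graph_backup.py:

```python
# Sketch of the proposed BackupType change plus the list-endpoint filter.
from enum import Enum

class BackupType(str, Enum):
    MANUAL = "manual"        # assumed existing member
    SCHEDULED = "scheduled"  # assumed existing member
    SYSTEM = "system"        # new: pre-migration safety backups, hidden from customers

def customer_visible(backups: list[dict]) -> list[dict]:
    """Filter applied in routers/graphs/backups/backup.py before returning results."""
    return [b for b in backups if b["backup_type"] != BackupType.SYSTEM.value]
```

Because the column stores the raw string, adding the member is backward compatible with existing rows.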

API Changes

```
# Graph API (instance-level, called by Dagster)

POST /migration/export?source_version=0.13.0&target_version=0.14.0
→ 200 {"task_id": "migration_migration_a1b2c3d4", "monitor_url": "/tasks/migration_.../monitor"}

POST /migration/import
→ 200 {"task_id": "migration_migration_e5f6g7h8", "monitor_url": "/tasks/migration_.../monitor"}

GET /migration/status
→ 200 {"migration_pending": true, "migration_in_progress": false, "manifest": {...}, "pre_migration_files": [...]}

# Health endpoint behavior during import:
GET /health
→ 503 {"status": "migrating", "message": "Version migration in progress - not ready for traffic"}
```
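The health endpoint's migration branch can be sketched as a plain function returning a status code and body (framework wiring omitted; the real check lives in graph_api/routers/health.py per this proposal):

```python
# Sketch of the /health migration branch: 503 while an import is running,
# 200 otherwise. The ALB interprets 503 as unhealthy and drains traffic.
def health_response(migration_in_progress: bool) -> tuple[int, dict]:
    if migration_in_progress:
        return 503, {
            "status": "migrating",
            "message": "Version migration in progress - not ready for traffic",
        }
    return 200, {"status": "healthy"}
```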

Implementation Plan

  • Phase 1: Core models and service

    • Create graph_api/models/migration.py — Pydantic models
    • Create graph_api/core/migration_service.py — Export/import logic with 503 flag
    • Add migration_task_manager to graph_api/core/task_manager.py
    • Add TaskType.MIGRATION SSE messages to graph_api/core/task_sse.py
  • Phase 2: Endpoints and health check

    • Create graph_api/routers/migration.py — HTTP endpoints
    • Register migration router in graph_api/app.py
    • Add migration check to graph_api/routers/health.py (503 pattern)
    • Register migration manager in graph_api/routers/tasks.py (UnifiedTaskManager)
  • Phase 3: Backup and safety

    • Add SYSTEM to BackupType enum in models/iam/graph_backup.py
    • Filter system backups from customer-facing list_backups endpoint
    • Add version incompatibility detection in connection pool
  • Phase 4: Client and Dagster

    • Add migration_export(), migration_import(), migration_status() to Graph API client
    • Create dagster/jobs/migration.py — export + import jobs
    • Register migration jobs in dagster/definitions.py
  • Phase 5: Validation

    • Same-version dry run in dev (0.13.0 → 0.13.0)
    • Cross-version test in staging
    • SEC shared repo migration test

Dependencies: None — all building blocks exist (connection pool, backup service, task manager, DynamoDB registry)
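Phase 3's version incompatibility detection could take the shape of a guard around the pool's connection factory. The exception type and the message heuristic below are assumptions about how LadybugDB surfaces a format mismatch; the real check in `_create_new_connection()` would match whatever error the library actually raises:

```python
# Sketch of a version guard for the connection pool. A format mismatch is
# re-raised as a distinct exception so callers can route the database into
# the migration pipeline instead of retrying the open.
class IncompatibleDatabaseVersion(RuntimeError):
    """Raised when a .lbug file appears to predate the current on-disk format."""

def open_with_version_guard(open_db, path: str):
    try:
        return open_db(path)
    except RuntimeError as exc:
        # Heuristic: treat version-related open failures as migration candidates.
        if "version" in str(exc).lower():
            raise IncompatibleDatabaseVersion(
                f"{path} was written by an incompatible LadybugDB version"
            ) from exc
        raise
```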

Testing

  • Unit tests for MigrationService export/import logic
  • Unit tests for manifest parsing and validation
  • Unit tests for health check 503 flag behavior
  • Unit tests for system backup type filtering
  • Integration test: same-version export/import round-trip (0.13.0 → 0.13.0)
  • Integration test: disk space pre-check failure path
  • Integration test: import failure → .pre-migration rollback
  • Manual: full pipeline in dev environment with Dagster jobs

Rollout

Environments: Development → Staging → Production

Rollout strategy:

  1. Same-version validation in dev (export + import with identical version)
  2. Cross-version migration in staging
  3. Production: SEC shared repos first → large/xlarge (fewer customers) → standard (most customers, smallest DBs)

Rollback plan:

  • .pre-migration files on local EBS for immediate rollback
  • System backups in S3 for full recovery
  • Old Docker image in ECR — rollback is just changing the image tag
  • Dagster import job has skip_instances config for partial fleet recovery

Success Criteria

  • Same-version round-trip (export + import) preserves 100% of data (node/relationship count verification)
  • Cross-version migration works end-to-end in staging
  • Health check returns 503 during import, 200 after completion
  • System backups are created and hidden from customer-facing endpoints
  • Dagster jobs successfully orchestrate fleet-wide export and import
  • SEC shared repo migration completes with minimal replica downtime
  • Rollback from .pre-migration files works correctly on import failure
  • Full pipeline documentation for operational runbook
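The count-verification criterion can be sketched as a pair of helpers run before export and after import (the query-runner interface is an assumption; the queries use Kuzu-style Cypher):

```python
# Sketch of round-trip verification: capture node/relationship counts before
# export, capture them again after import, and require an exact match.
def collect_counts(run_query) -> dict[str, int]:
    return {
        "nodes": run_query("MATCH (n) RETURN count(n)"),
        "rels": run_query("MATCH ()-[r]->() RETURN count(r)"),
    }

def verify_round_trip(before: dict[str, int], after: dict[str, int]) -> bool:
    return before == after
```

A fuller check might also compare per-table counts from the migration.json manifest, but total node and relationship counts catch gross data loss cheaply.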

Open Questions

  • Does LadybugDB's EXPORT DATABASE work with an active read connection? If not, we need to drain all connections before exporting, causing brief unavailability during export.
  • What's the maximum practical database size for export/import? Need to benchmark with SEC-scale databases (50GB+).
  • Can we detect the LadybugDB version from a .lbug file header? Would enable version guard without the manifest.
  • Should migrations be opt-in per tier? (SEC first → large/xlarge → standard for progressive confidence)

References

  • Spec: local/docs/specs/ladybug-version-migration.md
  • Kuzu EXPORT/IMPORT docs: https://docs.kuzudb.com/export-import/
  • Existing building blocks: graph_api/core/backup_service.py, graph_api/core/task_manager.py, graph_api/routers/health.py
  • Related: Graph deprovisioning (feature/graph-deprovisioning branch)
