LadybugDB version migration pipeline #350

@jfrench9

Description

Summary

Implement a Dagster-orchestrated export/import pipeline for safely migrating LadybugDB databases across version upgrades. The pipeline uses LadybugDB's native EXPORT DATABASE / IMPORT DATABASE commands, with system backups to S3 as a safety net, a 503 health-check response during import to drain traffic, and SSE-monitored background tasks for observing long-running operations.

Status: Approved

Problem Statement: Current State

LadybugDB uses an embedded single-file database format (.lbug). When the LadybugDB Python package is upgraded to a new version, the on-disk format may be incompatible with existing databases. Today there is no migration path:

  • Existing .lbug files may fail to open with a new version
  • No version tracking on databases tells us which LadybugDB version created them
  • No automated export/import pipeline exists to migrate data between versions
  • A failed upgrade could leave databases in an unrecoverable state
  • Shared repositories (SEC) and all user databases must be migrated together or not at all

The current version is 0.13.0 (set via LADYBUG_INTERNAL_VERSION in the Dockerfile).

Problem Statement: Desired State

A fully automated, Dagster-orchestrated pipeline that:

  1. Pre-deploy: Creates system backups to S3 + exports all databases to local Parquet on each instance
  2. Deploy: New container with new LadybugDB version (EBS volume persists exported files)
  3. Post-deploy: Imports all databases from Parquet into fresh .lbug files with the new version
  4. Has rollback at every stage (.pre-migration files on disk, system backups in S3, old Docker image in ECR)
  5. Is version-agnostic — can be run with identical source and target versions as a validation/integrity check

Problem Statement: Why Now?

The platform is on LadybugDB 0.13.0. Upstream Kuzu (which LadybugDB forks) regularly changes its on-disk format between releases. Without a migration pipeline we cannot upgrade LadybugDB, which blocks performance improvements, bug fixes, and new features from upstream.

Proposed Solution: Approach

Dagster-orchestrated, instance-level export/import via Graph API endpoints.

Two Dagster jobs orchestrate the migration across the fleet:

  1. Export job (pre-deploy): Discovers all instances from DynamoDB → calls POST /migration/export on each → system backup to S3, then EXPORT DATABASE to local Parquet, writes migration.json manifest
  2. Import job (post-deploy): Discovers all instances → calls POST /migration/import on each → reads manifest, creates fresh empty databases, runs IMPORT DATABASE from Parquet
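Both jobs share the same control flow: discover the fleet, then fan out a per-instance HTTP call and collect task IDs. A minimal, dependency-free sketch of that flow (in the real implementation these would be Dagster ops composed into a job; the registry shape, the injected HTTP `post` callable, and the version parameters shown are assumptions):

```python
# Sketch of the export job's control flow. The DynamoDB registry is modeled
# as a dict, and the HTTP layer is injected so the logic is testable offline.

def discover_instances(registry: dict) -> list[str]:
    """Stand-in for the DynamoDB fleet scan: return writer instance hosts."""
    return [host for host, meta in registry.items() if meta.get("role") == "writer"]

def start_exports(hosts: list[str], post) -> dict[str, str]:
    """Call POST /migration/export on each instance and collect task IDs."""
    tasks = {}
    for host in hosts:
        resp = post(
            f"http://{host}/migration/export",
            params={"source_version": "0.13.0", "target_version": "0.14.0"},
        )
        tasks[host] = resp["task_id"]
    return tasks

# Usage with a fake HTTP layer (real code would use the Graph API client):
registry = {"10.0.1.12": {"role": "writer"}, "10.0.1.13": {"role": "replica"}}
fake_post = lambda url, params: {"task_id": f"migration_{url.split('//')[1].split('/')[0]}"}
tasks = start_exports(discover_instances(registry), fake_post)
```

The import job would mirror this shape, then tail each instance's SSE monitor URL until completion before declaring the fleet healthy.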

Key design decisions:

  • Background tasks with SSE — export/import can take 30+ minutes for large databases
  • Health check 503 during import — same pattern as S3 ATTACH replica warmup; ALB stops routing traffic
  • System backups — new BackupType.SYSTEM (hidden from customers) provides off-disk S3 safety net
  • EBS persistence — exported Parquet files survive container swaps
  • Lazy connections — new Graph API starts without trying to open incompatible .lbug files
  • Replicas don't need migration — SEC replicas S3 ATTACH whatever backup is published; only writers have downtime
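The 503-during-import decision hinges on a process-wide flag that the import task sets and the health endpoint consults. A minimal sketch of such a flag (names and module layout are hypothetical; the proposal places this logic in graph_api/core/migration_service.py):

```python
# Sketch of a migration-in-progress flag. While set, /health returns 503 so
# the ALB marks the instance unhealthy and stops routing traffic to it.
import threading

class MigrationState:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = False

    def begin(self):
        with self._lock:
            self._in_progress = True

    def end(self):
        with self._lock:
            self._in_progress = False

    def is_migration_in_progress(self) -> bool:
        with self._lock:
            return self._in_progress

# Single shared instance, imported by both the import task and the health router.
migration_state = MigrationState()
```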

Components Affected

  • Graph API (/robosystems/graph_api/)
  • Dagster (/robosystems/dagster/)
  • Models (/robosystems/models/)
  • API (/robosystems/routers/)

Key Changes

New Files

| File | Description |
| --- | --- |
| `graph_api/models/migration.py` | Pydantic models for export/import responses, manifest, status |
| `graph_api/core/migration_service.py` | Export/import logic, 503 flag, system backup + EXPORT/IMPORT DATABASE |
| `graph_api/routers/migration.py` | `/migration/export`, `/migration/import`, `/migration/status` endpoints |
| `dagster/jobs/migration.py` | Dagster export + import jobs with DynamoDB fleet discovery + SSE monitoring |

Modified Files

| File | Change |
| --- | --- |
| `graph_api/app.py` | Include migration router |
| `graph_api/client/client.py` | Add `migration_export()`, `migration_import()`, `migration_status()` client methods |
| `graph_api/core/ladybug/pool.py` | Version incompatibility detection in `_create_new_connection()` |
| `graph_api/core/task_manager.py` | Add `migration_task_manager` instance |
| `graph_api/core/task_sse.py` | Add `TaskType.MIGRATION` SSE messages |
| `graph_api/routers/tasks.py` | Register migration manager in `UnifiedTaskManager` |
| `graph_api/routers/health.py` | Add `is_migration_in_progress()` check returning 503 |
| `models/iam/graph_backup.py` | Add `SYSTEM` to `BackupType` enum |
| `routers/graphs/backups/backup.py` | Filter `backup_type != "system"` from customer list |
| `dagster/definitions.py` | Register migration jobs |

Data Model Changes

BackupType enum (models/iam/graph_backup.py):

  • Add SYSTEM = "system" — used for pre-migration safety backups, hidden from customer-facing endpoints

No database migrations required; BackupType is stored as a string column.
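A sketch of the enum change and the customer-facing filter. Only `SYSTEM` is specified by this proposal; the other member names shown are placeholders for whatever already exists in models/iam/graph_backup.py:

```python
# Sketch of the proposed BackupType change plus the list-endpoint filter.
from enum import Enum

class BackupType(str, Enum):
    MANUAL = "manual"        # assumed existing member
    SCHEDULED = "scheduled"  # assumed existing member
    SYSTEM = "system"        # new: pre-migration safety backups, hidden from customers

def customer_visible(backups: list[dict]) -> list[dict]:
    """Filter applied in routers/graphs/backups/backup.py before returning results."""
    return [b for b in backups if b["backup_type"] != BackupType.SYSTEM.value]
```

Because the column stores the raw string, adding the member is backward compatible with existing rows.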

API Changes

```
# Graph API (instance-level, called by Dagster)

POST /migration/export?source_version=0.13.0&target_version=0.14.0
→ 200 {"task_id": "migration_migration_a1b2c3d4", "monitor_url": "/tasks/migration_.../monitor"}

POST /migration/import
→ 200 {"task_id": "migration_migration_e5f6g7h8", "monitor_url": "/tasks/migration_.../monitor"}

GET /migration/status
→ 200 {"migration_pending": true, "migration_in_progress": false, "manifest": {...}, "pre_migration_files": [...]}

# Health endpoint behavior during import:
GET /health
→ 503 {"status": "migrating", "message": "Version migration in progress - not ready for traffic"}
```
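The health endpoint's migration branch can be sketched as a plain function returning a status code and body (framework wiring omitted; the real check lives in graph_api/routers/health.py per this proposal):

```python
# Sketch of the /health migration branch: 503 while an import is running,
# 200 otherwise. The ALB interprets 503 as unhealthy and drains traffic.
def health_response(migration_in_progress: bool) -> tuple[int, dict]:
    if migration_in_progress:
        return 503, {
            "status": "migrating",
            "message": "Version migration in progress - not ready for traffic",
        }
    return 200, {"status": "healthy"}
```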

Implementation Plan

  • Phase 1: Core models and service

    • Create graph_api/models/migration.py — Pydantic models
    • Create graph_api/core/migration_service.py — Export/import logic with 503 flag
    • Add migration_task_manager to graph_api/core/task_manager.py
    • Add TaskType.MIGRATION SSE messages to graph_api/core/task_sse.py
  • Phase 2: Endpoints and health check

    • Create graph_api/routers/migration.py — HTTP endpoints
    • Register migration router in graph_api/app.py
    • Add migration check to graph_api/routers/health.py (503 pattern)
    • Register migration manager in graph_api/routers/tasks.py (UnifiedTaskManager)
  • Phase 3: Backup and safety

    • Add SYSTEM to BackupType enum in models/iam/graph_backup.py
    • Filter system backups from customer-facing list_backups endpoint
    • Add version incompatibility detection in connection pool
  • Phase 4: Client and Dagster

    • Add migration_export(), migration_import(), migration_status() to Graph API client
    • Create dagster/jobs/migration.py — export + import jobs
    • Register migration jobs in dagster/definitions.py
  • Phase 5: Validation

    • Same-version dry run in dev (0.13.0 → 0.13.0)
    • Cross-version test in staging
    • SEC shared repo migration test

Dependencies: None — all building blocks exist (connection pool, backup service, task manager, DynamoDB registry)
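Phase 3's version incompatibility detection could take the shape of a guard around the pool's connection factory. The exception type and the message heuristic below are assumptions about how LadybugDB surfaces a format mismatch; the real check in `_create_new_connection()` would match whatever error the library actually raises:

```python
# Sketch of a version guard for the connection pool. A format mismatch is
# re-raised as a distinct exception so callers can route the database into
# the migration pipeline instead of retrying the open.
class IncompatibleDatabaseVersion(RuntimeError):
    """Raised when a .lbug file appears to predate the current on-disk format."""

def open_with_version_guard(open_db, path: str):
    try:
        return open_db(path)
    except RuntimeError as exc:
        # Heuristic: treat version-related open failures as migration candidates.
        if "version" in str(exc).lower():
            raise IncompatibleDatabaseVersion(
                f"{path} was written by an incompatible LadybugDB version"
            ) from exc
        raise
```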

Testing

  • Unit tests for MigrationService export/import logic
  • Unit tests for manifest parsing and validation
  • Unit tests for health check 503 flag behavior
  • Unit tests for system backup type filtering
  • Integration test: same-version export/import round-trip (0.13.0 → 0.13.0)
  • Integration test: disk space pre-check failure path
  • Integration test: import failure → .pre-migration rollback
  • Manual: full pipeline in dev environment with Dagster jobs

Rollout

Environments: Development → Staging → Production

Rollout strategy:

  1. Same-version validation in dev (export + import with identical version)
  2. Cross-version migration in staging
  3. Production: SEC shared repos first → large/xlarge (fewer customers) → standard (most customers, smallest DBs)

Rollback plan:

  • .pre-migration files on local EBS for immediate rollback
  • System backups in S3 for full recovery
  • Old Docker image in ECR — rollback is just changing the image tag
  • Dagster import job has skip_instances config for partial fleet recovery

Success Criteria

  • Same-version round-trip (export + import) preserves 100% of data (node/relationship count verification)
  • Cross-version migration works end-to-end in staging
  • Health check returns 503 during import, 200 after completion
  • System backups are created and hidden from customer-facing endpoints
  • Dagster jobs successfully orchestrate fleet-wide export and import
  • SEC shared repo migration completes with minimal replica downtime
  • Rollback from .pre-migration files works correctly on import failure
  • Full pipeline documentation for operational runbook
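The count-verification criterion can be sketched as a pair of helpers run before export and after import (the query-runner interface is an assumption; the queries use Kuzu-style Cypher):

```python
# Sketch of round-trip verification: capture node/relationship counts before
# export, capture them again after import, and require an exact match.
def collect_counts(run_query) -> dict[str, int]:
    return {
        "nodes": run_query("MATCH (n) RETURN count(n)"),
        "rels": run_query("MATCH ()-[r]->() RETURN count(r)"),
    }

def verify_round_trip(before: dict[str, int], after: dict[str, int]) -> bool:
    return before == after
```

A fuller check might also compare per-table counts from the migration.json manifest, but total node and relationship counts catch gross data loss cheaply.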

Open Questions

  • Does LadybugDB's EXPORT DATABASE work with an active read connection? If not, we need to drain all connections before exporting, causing brief unavailability during export.
  • What's the maximum practical database size for export/import? Need to benchmark with SEC-scale databases (50GB+).
  • Can we detect the LadybugDB version from a .lbug file header? Would enable version guard without the manifest.
  • Should migrations be opt-in per tier? (SEC first → large/xlarge → standard for progressive confidence)

References

  • Spec: local/docs/specs/ladybug-version-migration.md
  • Kuzu EXPORT/IMPORT docs: https://docs.kuzudb.com/export-import/
  • Existing building blocks: graph_api/core/backup_service.py, graph_api/core/task_manager.py, graph_api/routers/health.py
  • Related: Graph deprovisioning (feature/graph-deprovisioning branch)
