feat(webapp): plan-aware compute migration#3957
Conversation
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time. It adds a generic 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Addressed the review feedback, plus a few issues a deeper review pass turned up:
Operational notes for rollout:
|
3cf484d to
697de03
Compare
|
Follow-up: replaced the
Rollout requirement: set Treat |
…egion, drop env map
a1a460e to
b75e18a
Compare
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
commit: |
There was a problem hiding this comment.
🚩 No ClickHouse backfill for existing runs' region column
The ClickHouse migration 032_add_task_runs_v2_region.sql adds the region column with DEFAULT ''. All existing rows in task_runs_v2 will have region = '' until they are re-replicated (which doesn't happen for old rows). New runs going forward will have region populated via the replication service (apps/webapp/app/services/runsReplicationService.server.ts:1126). There's no backfill migration to populate region from worker_queue for existing rows. This is fine for the run list page (which has the if(region != '', region, worker_queue) fallback) but contributes to the Logs page bug. If a backfill is planned as a separate step, this is acceptable; if not, it extends the window during which old runs are invisible to region filters on the Logs page.
Was this helpful? React with 👍 or 👎 to provide feedback.
| clickhouseName: "region", | ||
| ...column("String", { | ||
| description: "Region", | ||
| example: "us-east-1", | ||
| }), | ||
| expression: "if(startsWith(worker_queue, 'cm'), NULL, worker_queue)", | ||
| expression: "multiIf(region != '', region, startsWith(worker_queue, 'cm'), NULL, worker_queue)", |
There was a problem hiding this comment.
🔴 Logs/Query page region filter silently drops all pre-existing runs
Changing clickhouseName from "worker_queue" to "region" makes the WHERE clause filter on the raw region column. The expression (the multiIf fallback) only applies to SELECT output. For all runs inserted before this deploy, the ClickHouse region column is "" (the DEFAULT from 032_add_task_runs_v2_region.sql), so a Logs page filter like region = 'us-east-1' generates WHERE region = 'us-east-1' which silently excludes every pre-existing run — even though worker_queue still carries the correct value.
Contrast with the run-list page in apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts:381 which correctly uses if(region != '', region, worker_queue) IN {regions: ...} for its WHERE clause. The Logs page has no equivalent fallback because clickhouseName is used directly for WHERE while expression is only used for SELECT rendering.
Prompt for agents
The `clickhouseName` field is used by the Logs/Query page query builder for WHERE clauses, while `expression` is only used for SELECT output. Changing `clickhouseName` from `worker_queue` to `region` means WHERE filters now target the raw `region` column, which is empty string for all pre-existing ClickHouse rows (the migration adds it with DEFAULT '').
The run-list page (clickhouseRunsRepository.server.ts:381) handles this correctly with `if(region != '', region, worker_queue)` in its WHERE clause.
To fix this for the Logs/Query page, you need the WHERE path to also use the fallback expression. Options:
1. Keep `clickhouseName: 'worker_queue'` and add a `whereTransform` that handles the mapping, or
2. Add a custom WHERE expression mechanism if the query builder supports it (similar to how `expression` works for SELECT), or
3. Use a ClickHouse materialized column that computes the fallback so both SELECT and WHERE can use a single column name.
The simplest fix is likely option 1: revert `clickhouseName` to `'worker_queue'` (since old runs only have that populated) and keep the `expression` for display. This mirrors the pre-PR behavior for WHERE while still showing the correct region via the expression.
Was this helpful? React with 👍 or 👎 to provide feedback.
Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.
Routing is gated by three global feature flags -
computeMigrationEnabled,computeMigrationFreePercentage,computeMigrationPaidPercentage- plus a per-orgcomputeMigrationEnabledoverride that wins in both directions, and theCOMPUTE_BACKING_MAPenv var that maps a region's worker queue to its compute-backing queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no mapped backing (including the empty default map) is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.The flags are read on the trigger hot path from an in-memory snapshot rather than the database: a small
createReloadingRegistryhelper loads the global flags at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica that hasn't loaded yet falls back to off (the container path). The same migration decision is consulted at deploy-time template creation so a migrated org still gets a compute template built, in shadow mode so it never fails the deploy.Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and
createReloadingRegistrycould later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.