feat(webapp): plan-aware compute migration#3957

Open
nicktrn wants to merge 21 commits into main from feat/compute-migration

@nicktrn nicktrn commented Jun 15, 2026

Collaborator

Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.

Routing is gated by three global feature flags - computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage - plus a per-org computeMigrationEnabled override that wins in both directions, and the COMPUTE_BACKING_MAP env var that maps a region's worker queue to its compute-backing queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no mapped backing (including the empty default map) is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.
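
As a rough illustration of the bucketing described above, here is a minimal sketch of a deterministic percentage gate. It assumes an FNV-1a hash over the org id mapped to 100 buckets; the function shapes, the `PlanType` union, and the rule that a per-org override also beats the global kill switch are assumptions for illustration, not the code in this PR.

```typescript
// Sketch only: names and signatures are assumed, not taken from the PR diff.
type PlanType = "free" | "paid";

interface MigrationFlags {
  computeMigrationEnabled: boolean;
  computeMigrationFreePercentage: number; // 0..100
  computeMigrationPaidPercentage: number; // 0..100
}

// 32-bit FNV-1a over the UTF-16 code units of the id, reduced to a bucket.
// Assumption: the real hashBucket may hash bytes or use a different bucket count.
function hashBucket(id: string, buckets = 100): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash % buckets;
}

function isOrgMigrated(
  orgId: string,
  plan: PlanType,
  flags: MigrationFlags,
  orgOverride?: boolean
): boolean {
  // Assumed precedence: an explicit per-org override wins in both directions.
  if (orgOverride !== undefined) return orgOverride;
  if (!flags.computeMigrationEnabled) return false;
  const percentage =
    plan === "free"
      ? flags.computeMigrationFreePercentage
      : flags.computeMigrationPaidPercentage;
  // An org's bucket never changes, so lowering the percentage only removes orgs.
  return hashBucket(orgId) < percentage;
}
```

Because each org always lands in the same bucket, the set enrolled at 10% is contained in the set enrolled at 50%, which is the "strict subset" property the description relies on.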

The flags are read on the trigger hot path from an in-memory snapshot rather than the database: a small createReloadingRegistry helper loads the global flags at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica that hasn't loaded yet falls back to off (the container path). The same migration decision is consulted at deploy-time template creation so a migrated org still gets a compute template built, in shadow mode so it never fails the deploy.
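
To make the snapshot pattern concrete, the following is a simplified sketch of what a reloading registry could look like. The option names (`load`, `fallback`, `intervalMs`) and the returned methods are assumptions; the sketch also leaves out the retry-on-startup and metrics that the real helper is described as having.

```typescript
// Sketch only: option and method names are assumed for illustration.
interface ReloadingRegistry<T> {
  get(): T; // synchronous read, safe for the trigger hot path
  reload(): Promise<void>; // one refresh attempt
  start(): void; // initial load plus periodic refresh
  stop(): void;
}

function createReloadingRegistry<T>(opts: {
  load: () => Promise<T>;
  fallback: T; // served while the replica is cold
  intervalMs: number;
}): ReloadingRegistry<T> {
  let snapshot: T | undefined;
  let timer: ReturnType<typeof setInterval> | undefined;

  async function reload(): Promise<void> {
    try {
      snapshot = await opts.load();
    } catch {
      // A failed reload keeps the previous snapshot (or the fallback if cold).
    }
  }

  return {
    get: () => (snapshot === undefined ? opts.fallback : snapshot),
    reload,
    start() {
      if (timer !== undefined) return;
      void reload();
      timer = setInterval(() => void reload(), opts.intervalMs);
    },
    stop() {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
  };
}
```

With a fallback of "everything off", a trigger that arrives before the first load completes reads the fallback and stays on the container path, matching the cold-replica behaviour described above.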

Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and createReloadingRegistry could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.

@nicktrn nicktrn self-assigned this Jun 15, 2026
changeset-bot Bot commented Jun 15, 2026

⚠️ No Changeset found

Latest commit: 126a01f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 15, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time. It adds a generic createReloadingRegistry utility with Prometheus metrics, p-retry startup loading, and periodic refresh. A new workerRegionRegistry loads WorkerGroupRegionRow data from the database and exposes regionForQueue and backingForQueue helpers; the WorkerInstanceGroup table gains a nullable region TEXT column via a migration. Three feature flags (computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage) and two new environment variables (GLOBAL_FLAGS_RELOAD_INTERVAL_MS, GLOBAL_FLAGS_READY_TIMEOUT_MS) are added. A globalFlagsRegistry singleton caches global flags from the database. An FNV-1a hashBucket function and the isOrgMigrated and resolveComputeMigration functions implement the enrollment decision and queue rewrite logic. TaskRun gains a region column that RunEngine.trigger persists. The triggerTask and computeTemplateCreation services now evaluate migration at routing time and rewrite worker queues to the compute backing when the org is enrolled. Region derivation across presenters, routes, and the ClickHouse replication service now uses the explicit region field when present. The ClickHouse task_runs_v2 table gains a region column for analytics.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is comprehensive and addresses the core functionality, routing mechanism, registry pattern, and deployment considerations. However, the provided description does not follow the required template format with the specified sections (Closes #, Checklist, Testing, Changelog, Screenshots). | Reformat the description to follow the template: include Closes #, complete the checklist, add a Testing section explaining validation steps, provide a Changelog section with a short summary, and include Screenshots if applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat(webapp): plan-aware compute migration' is specific, concise, and accurately describes the main feature being added: a plan-aware mechanism for compute migration controlled by feature flags and percentage bucketing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

nicktrn commented Jun 15, 2026

Collaborator Author

Addressed the review feedback, plus a few issues a deeper review pass turned up:

  • Replay of a migrated run would have silently produced no run: the stored backing queue (us-east-1-next) was read back as an explicit region override and rejected by the compute-access gate. Replay now reverse-maps the stored backing to its geo region and re-resolves, so migration re-applies with current flags (and an org that's since been excluded replays onto the container path).
  • Backing hidden on customer surfaces: a regionForBacking inverse of COMPUTE_BACKING_MAP is applied at the run API, run list, run detail, replay, and the ClickHouse worker_queue write, so the API / dashboard / Query feature all report the geo region. The raw backing stays on TaskRun.workerQueue in Postgres for internal use - no schema change.
  • Registry: reloads are now sequence-guarded so a slow older reload can't overwrite a newer snapshot (the kill switch can't silently revert), and waitUntilReady clears its timeout instead of leaking one per cold-start trigger.
  • Kill switch uses strict z.boolean() (coercion turned the string "false" into true); the reload interval is now bounded.
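
The registry and kill-switch fixes in the list above can be sketched roughly as follows. This is an illustrative model, not the PR's code: `SequencedSnapshot` and `parseKillSwitch` are hypothetical names, and the strict check is written in plain TypeScript rather than with the schema library the PR uses.

```typescript
// Sketch only: hypothetical helper showing a sequence-guarded snapshot.
class SequencedSnapshot<T> {
  private issued = 0; // ticket handed to the most recently started reload
  private applied = 0; // ticket of the reload whose result is currently stored
  private value: T | undefined;

  // Call when a reload starts; later reloads get larger tickets.
  begin(): number {
    return ++this.issued;
  }

  // Call when a reload finishes. A result from an older reload than the one
  // already applied is dropped, so it cannot overwrite a newer snapshot.
  commit(ticket: number, value: T): boolean {
    if (ticket <= this.applied) return false;
    this.applied = ticket;
    this.value = value;
    return true;
  }

  get(): T | undefined {
    return this.value;
  }
}

// Strict parsing: only real booleans are accepted. JavaScript-style coercion
// would treat the non-empty string "false" as true.
function parseKillSwitch(raw: unknown): boolean | undefined {
  return typeof raw === "boolean" ? raw : undefined;
}

// Usage: a slow reload started first finishes after a faster, newer one.
const flags = new SequencedSnapshot<{ enabled: boolean }>();
const slowOld = flags.begin();
const fastNew = flags.begin();
flags.commit(fastNew, { enabled: false }); // kill switch flipped off
flags.commit(slowOld, { enabled: true }); // stale result is rejected
```

After this sequence `flags.get()` still holds the newer `enabled: false` value, which is the "kill switch can't silently revert" guarantee.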

Operational notes for rollout:

  • Billing should key off machine preset / actual execution, not hasComputeAccess - migrated orgs run on the backing without that flag.
  • The compute backing needs its own :scheduled consumer for scheduled runs.
  • The deprecated V3 batch path doesn't percentage-enroll (it passes skipChecks without a plan type); per-org overrides still apply there.

@nicktrn nicktrn force-pushed the feat/compute-migration branch from 3cf484d to 697de03 Compare June 15, 2026 20:05
nicktrn commented Jun 15, 2026

Collaborator Author

Follow-up: replaced the COMPUTE_BACKING_MAP env var with a region column on WorkerInstanceGroup, so region<->backing resolution comes from data instead of editable config (removes the "edit a config blob and silently break reverse-mapping for historical runs" footgun).

  • New nullable WorkerInstanceGroup.region (migration ..._add_worker_instance_group_region). Container and compute groups for one geo share the value - e.g. both us-east-1 and us-east-1-next get region = "us-east-1".
  • A workerRegionRegistry (same createReloadingRegistry pattern, in-memory snapshot) serves both directions off the hot path: forward (region -> its MICROVM backing) at trigger, reverse (a stored queue -> its geo region) at the presenters / replay / ClickHouse write.
  • COMPUTE_BACKING_MAP and computeBackingMap.server.ts deleted.

Rollout requirement: set region on the live worker groups before enabling migration. It's nullable - unset means that group never migrates and resolves to its own queue (safe no-op). Backfill the container + compute groups of each geo to the same region value.

Treat region as set-once while a group has run history: changing it re-breaks region resolution for existing runs. The durability win is that this is now one immutable data field rather than an editable config map.
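
A rough sketch of how both lookup directions could be derived from the worker group rows is shown below. The row field names (`queue`, `workloadType`) and the `buildRegionSnapshot` function are assumptions for illustration; only `regionForQueue`, `backingForQueue`, the nullable `region`, and the MICROVM backing come from the description.

```typescript
// Sketch only: row shape and builder name are assumed, not taken from the PR.
interface WorkerGroupRegionRow {
  queue: string; // the group's worker queue, e.g. "us-east-1-next"
  region: string | null; // nullable: unset means the group never migrates
  workloadType: "CONTAINER" | "MICROVM";
}

function buildRegionSnapshot(rows: WorkerGroupRegionRow[]) {
  const regionByQueue = new Map<string, string>();
  const backingByRegion = new Map<string, string>();

  for (const row of rows) {
    if (row.region === null) continue; // safe no-op for groups without a region
    regionByQueue.set(row.queue, row.region);
    if (row.workloadType === "MICROVM") {
      backingByRegion.set(row.region, row.queue);
    }
  }

  return {
    // Reverse direction: a stored queue resolves to its geo region,
    // or to itself when the group has no region set.
    regionForQueue: (queue: string): string => regionByQueue.get(queue) ?? queue,
    // Forward direction: a queue resolves to the MICROVM backing of its region,
    // or undefined when there is nothing to migrate to.
    backingForQueue: (queue: string): string | undefined => {
      const region = regionByQueue.get(queue);
      return region === undefined ? undefined : backingByRegion.get(region);
    },
  };
}

const regions = buildRegionSnapshot([
  { queue: "us-east-1", region: "us-east-1", workloadType: "CONTAINER" },
  { queue: "us-east-1-next", region: "us-east-1", workloadType: "MICROVM" },
  { queue: "eu-central-1", region: null, workloadType: "CONTAINER" },
]);
```

In this example `us-east-1` maps forward to `us-east-1-next`, `us-east-1-next` maps back to `us-east-1`, and `eu-central-1` is left untouched because its region is unset.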

@nicktrn nicktrn force-pushed the feat/compute-migration branch from a1a460e to b75e18a Compare June 15, 2026 22:07
pkg-pr-new Bot commented Jun 15, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@b75e18a

trigger.dev

npm i https://pkg.pr.new/trigger.dev@b75e18a

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@b75e18a

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@b75e18a

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@b75e18a

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@b75e18a

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@b75e18a

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@b75e18a

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@b75e18a

commit: b75e18a

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 2 new potential issues.

🚩 No ClickHouse backfill for existing runs' region column

The ClickHouse migration 032_add_task_runs_v2_region.sql adds the region column with DEFAULT ''. All existing rows in task_runs_v2 will have region = '' until they are re-replicated (which doesn't happen for old rows). New runs going forward will have region populated via the replication service (apps/webapp/app/services/runsReplicationService.server.ts:1126). There's no backfill migration to populate region from worker_queue for existing rows. This is fine for the run list page (which has the if(region != '', region, worker_queue) fallback) but contributes to the Logs page bug. If a backfill is planned as a separate step, this is acceptable; if not, it extends the window during which old runs are invisible to region filters on the Logs page.

Comment on lines +195 to +200
clickhouseName: "region",
...column("String", {
description: "Region",
example: "us-east-1",
}),
- expression: "if(startsWith(worker_queue, 'cm'), NULL, worker_queue)",
+ expression: "multiIf(region != '', region, startsWith(worker_queue, 'cm'), NULL, worker_queue)",

🔴 Logs/Query page region filter silently drops all pre-existing runs

Changing clickhouseName from "worker_queue" to "region" makes the WHERE clause filter on the raw region column. The expression (the multiIf fallback) only applies to SELECT output. For all runs inserted before this deploy, the ClickHouse region column is "" (the DEFAULT from 032_add_task_runs_v2_region.sql), so a Logs page filter like region = 'us-east-1' generates WHERE region = 'us-east-1' which silently excludes every pre-existing run — even though worker_queue still carries the correct value.

Contrast with the run-list page in apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts:381 which correctly uses if(region != '', region, worker_queue) IN {regions: ...} for its WHERE clause. The Logs page has no equivalent fallback because clickhouseName is used directly for WHERE while expression is only used for SELECT rendering.

Prompt for agents
The `clickhouseName` field is used by the Logs/Query page query builder for WHERE clauses, while `expression` is only used for SELECT output. Changing `clickhouseName` from `worker_queue` to `region` means WHERE filters now target the raw `region` column, which is empty string for all pre-existing ClickHouse rows (the migration adds it with DEFAULT '').

The run-list page (clickhouseRunsRepository.server.ts:381) handles this correctly with `if(region != '', region, worker_queue)` in its WHERE clause.

To fix this for the Logs/Query page, you need the WHERE path to also use the fallback expression. Options:
1. Keep `clickhouseName: 'worker_queue'` and add a `whereTransform` that handles the mapping, or
2. Add a custom WHERE expression mechanism if the query builder supports it (similar to how `expression` works for SELECT), or
3. Use a ClickHouse materialized column that computes the fallback so both SELECT and WHERE can use a single column name.

The simplest fix is likely option 1: revert `clickhouseName` to `'worker_queue'` (since old runs only have that populated) and keep the `expression` for display. This mirrors the pre-PR behavior for WHERE while still showing the correct region via the expression.
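
For comparison, option 2 could look roughly like the sketch below. It models a column definition where SELECT and WHERE are rendered from separate expressions. The `QueryColumn` shape, the `whereExpression` field, and both render functions are hypothetical; the real query builder may be structured quite differently.

```typescript
// Sketch only: a hypothetical column model, not the webapp's query builder.
interface QueryColumn {
  name: string;
  clickhouseName: string;
  expression?: string; // used when rendering SELECT output
  whereExpression?: string; // hypothetical: used when rendering WHERE filters
}

function selectSql(col: QueryColumn): string {
  return `${col.expression ?? col.clickhouseName} AS ${col.name}`;
}

// Renders an equality filter with a ClickHouse query parameter placeholder.
function whereSql(col: QueryColumn, param: string): string {
  return `${col.whereExpression ?? col.clickhouseName} = {${param}:String}`;
}

const regionColumn: QueryColumn = {
  name: "region",
  clickhouseName: "region",
  expression:
    "multiIf(region != '', region, startsWith(worker_queue, 'cm'), NULL, worker_queue)",
  // Same fallback the run-list repository uses, so rows written before the
  // region column existed still match on worker_queue.
  whereExpression: "if(region != '', region, worker_queue)",
};

const filter = whereSql(regionColumn, "region");
```

Here `filter` is `if(region != '', region, worker_queue) = {region:String}`, so a row with an empty `region` is still matched through its `worker_queue` value.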