Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import by lmac-1 · Pull Request #4562 · OpenFn/lightning

lmac-1 · 2026-03-24T19:57:48Z

Description

This PR fixes a class of bugs where the collaborative editor ends up stuck on a stale version of a workflow after an external write (sandbox merge or CLI deploy).

The bug

When the provisioner (or sandbox merge) writes a new workflow version to the database, live editor sessions are never notified. Two failure modes:

Someone is online: the SharedDoc stays alive but is never updated. Users with the workflow open see the pre-merge state, and their next save silently overwrites the provisioner's work. The unsaved-changes indicator (red dot) stays on permanently because baseWorkflow is never refreshed.
Nobody is online: the stale DocumentState (the persisted Y.Doc snapshot) is loaded the next time a user opens the workflow, giving them the pre-merge state. The correct DB content is ignored.

The fix

Both cases now go through the same code path.

After the provisioner's DB transaction commits, WorkflowReconciler.reconcile_workflow_from_db/2 is called for each updated workflow. The reconciler calls Collaborate.start/1 (joining an existing live SharedDoc if one is running, starting a new one if not), applies a state-based diff via apply_db_state_to_doc, then calls Session.stop. The shutdown chain flushes the corrected state to DocumentState before any real user opens the workflow.

After Session.stop, the reconciler broadcasts :workflow_updated_externally on PubSub. Each connected WorkflowChannel process receives this and pushes a workflow_saved event to its client — the existing frontend handler updates baseWorkflow and clears the unsaved-changes indicator. No frontend changes were needed.

Supporting changes:

Collaborate.start/1 now retries once (after a 1ms sleep) if Session.init returns :shared_doc_not_found. This handles a 0ms auto_exit race: when the reconciler's Session.stop triggers the last observer to leave, the SharedDoc dies via a 0ms timer. A subsequent Collaborate.start can find a dying process still registered in :pg before the timer fires. The retry absorbs this and all callers benefit.
Session now accepts a ProjectRepoConnection as the user option. Presence tracking is guarded by is_struct(user, User) — system actors skip it entirely.
The previous approach (invalidate_document_states inside the provisioner transaction, which deleted stale DocumentState rows) is removed. It only handled the nobody-online case and required the provisioner to know about the online/offline distinction.

Closes #4535

Validation steps

Someone online during a sandbox merge:
- Open a workflow in the editor (Tab A)
- Make a sandbox for the project in Tab B
- In Tab A, make some changes to the workflow but don't save.
- In Tab B, make a different change to the workflow, save and then merge the sandbox.
- Confirm Tab A's Y.Doc is updated in real time (new jobs/edges appear) unsaved changes will be deleted, this is expected
- Confirm the unsaved-changes indicator (red dot) clears after the merge
- Confirm saving from Tab A produces the merged state, not the pre-merge state
Nobody online during a sandbox merge / CLI deploy:
- Make sure the workflow has some unsaved changes and then close it
- Ensure no sessions are open for the workflow (close all tabs)
- Run a sandbox merge or Provisioner.import_document with changes
- Open the workflow in a new tab
- Confirm the new tab shows the merged state, not the pre-merge state
GitHub sync (ProjectRepoConnection actor):
- Trigger a GitHub sync that updates a workflow
- Confirm the same behaviour as steps 1 and 2 depending on whether anyone
  was online

Additional notes for the reviewer

@stuartc, if the reconciler part is not correct, I might need you to jump in to help a bit as it's a bit over my head and I fear we will be going in circles.
At the moment, if a user is active with unsaved changes when an external change happens, their changes will be lost. This PR does not address this but can be discussed in a follow-up issue to improve the user experience.
Commits 1–5 on this branch are the old approach (deleting DocumentState rows in the provisioner transaction). They are superseded by commits 6–7. The final state of the code is what matters; the history reflects the iteration. Happy to squash if you'd prefer a cleaner history.
Concurrent sandbox merges (not fixed here): For CLI deployments (allow_stale: false), DB optimistic locking ensures at most one provisioner succeeds per workflow — so at most one reconciler ever runs. For sandbox merge (allow_stale: true), two concurrent merges to the same workflow could both succeed and both run a reconciler. Collaborate.start/1 has a pre-existing TOCTOU race (check-then-act on the SharedDoc lookup) that could cause one reconciler to fail silently; the other would still complete correctly. This is not introduced by this PR and is a realistic edge case only under concurrent sandbox merges. A follow-on improvement would be for the reconciler to re-load the workflow from DB by ID rather than using the struct passed in at transaction time, so the last writer always applies current state.
No frontend changes. The existing workflowSavedHandler increateSessionContextStore.ts already handles workflow_saved by callingsetBaseWorkflow() and setLatestSnapshotLockVersion().

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

I have used Claude Code
I have used another model
I have not used AI

You can read more details in our
Responsible AI Policy

Pre-submission checklist

I have performed an AI review of my code (we recommend using /review
with Claude Code)
I have implemented and tested all related authorization policies.
(e.g., :owner, :admin, :editor, :viewer)
I have updated the changelog.
I have ticked a box in "AI usage" in this PR

…test

…n write Removes reconcile_or_reset from Persistence.bind and instead has the provisioner delete stale DocumentState for any workflow it changes. Adds DocumentState.delete_for_document/1 for this purpose.

Remove tests for reconcile_or_reset and reconcile_workflow_metadata, which were deleted from Persistence.bind/3. Replace with tests that verify the new behaviour: persisted state loads as-is, and the provisioner deletes stale DocumentState before writing to the DB. Adds three provisioner tests covering the invalidate_document_states filtering logic (changed, unchanged, and soft-deleted workflows).

Previously, the provisioner tried to handle the "nobody online" case by deleting stale DocumentState rows inside the DB transaction. This left the "someone online" case unaddressed and required the provisioner to know about the online/offline distinction. Replace both paths with a single unified approach: after the DB transaction commits, call WorkflowReconciler.reconcile_workflow_from_db/2 for each updated workflow. The reconciler calls Collaborate.start (joining an existing SharedDoc if one is alive, or starting a fresh one if not), applies a state-driven diff via apply_db_state_to_doc, then calls Session.stop. The shutdown chain flushes the corrected state to DocumentState before any real user opens the workflow. Key changes: - workflow_reconciler.ex: new reconcile_workflow_from_db/2 takes the provisioner actor (User or ProjectRepoConnection) and applies the full DB→Y.Doc diff via Collaborate.start/Session.stop. Broadcasts :workflow_updated_externally on PubSub so connected clients update their baseWorkflow and clear the unsaved-changes indicator. - collaborate.ex: start/1 retries once after a 1ms sleep if Session.init returns :shared_doc_not_found — handles the 0ms auto_exit race where a dying SharedDoc is still registered in :pg when the reconciler's fresh session starts. - session.ex: presence tracking and untracking are now guarded by is_struct(user, User) — ProjectRepoConnection actors skip Presence entirely. SharedDoc.observe in init is wrapped in try/catch :exit so a dying SharedDoc returns a clean stop reason instead of crashing the session. - provisioner.ex: passes user_or_repo_connection through to the reconciler; removes invalidate_document_states and the DocumentState alias. - document_state.ex: removes delete_for_document/1 (no longer needed). - workflow_channel.ex: handle_info for :workflow_updated_externally pushes workflow_saved to each connected client so baseWorkflow stays in sync.

- external_workflow_update_test.exs: all three "someone online" tests now start sessions via Collaborate.start instead of start_supervised! Session directly. DocumentSupervisor is still started via start_supervised! (ExUnit must own it for PersistenceWriter flush ordering), but now receives the Registry name so ensure_doc_supervisor_stopped can find it. Each test calls ensure_doc_supervisor_stopped in-body after the last Session.stop and also registers it as an on_exit safety net — prevents DB connection warnings from PersistenceWriter flushes racing the Ecto sandbox teardown. Adds a new "GitHub sync" test: ProjectRepoConnection actor flows through the provisioner and reconciler correctly, no presence tracking fires, and the next user sees the synced state. - persistence_test.exs: removes the delete_for_document test — that function no longer exists. - provisioner_test.exs: removes the three DocumentState invalidation tests (invalidates-on-change, preserves-on-no-change, skips-on-soft-delete) — the invalidation approach has been replaced by the reconciler.

codecov · 2026-03-25T17:56:10Z

Codecov Report

❌ Patch coverage is 89.55224% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.85%. Comparing base (bef2bc2) to head (d769533).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
lib/lightning/collaboration/workflow_reconciler.ex	90.11%	17 Missing ⚠️
lib/lightning/collaboration/registry.ex	73.68%	5 Missing ⚠️
lib/lightning/collaboration/session.ex	83.33%	2 Missing ⚠️
lib/lightning/collaboration.ex	94.73%	1 Missing ⚠️
lib/lightning/collaboration/persistence.ex	75.00%	1 Missing ⚠️
lib/lightning/collaboration/supervisor.ex	85.71%	1 Missing ⚠️
lib/lightning/collaboration/topology.ex	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
- Coverage   89.88%   89.85%   -0.03%     
==========================================
  Files         444      445       +1     
  Lines       21941    22125     +184     
==========================================
+ Hits        19722    19881     +159     
- Misses       2219     2244      +25

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

lmac-1 · 2026-03-25T18:05:50Z

@stuartc would love to hear your thoughts on everything! I will be tying up test cases and doing some more testing tomorrow.

lmac-1 · 2026-03-27T07:55:35Z

@stuartc I am doing my own review on this now and adding tests for provisioner. Feel free to shout if there's something alarming in the meantime.

lmac-1 · 2026-03-27T08:21:28Z

I've completed testing of the CLI ✅

lmac-1 · 2026-03-27T08:35:08Z

+      |> Enum.flat_map(fn {key, db_val} ->
+        case Yex.Map.fetch(ydoc_job, key) do
+          {:ok, ^db_val} -> []
+          _ -> [{:set_field, ydoc_job, key, db_val}]


What came up in my review:

The _ -> branch can't distinguish "provisioner changed this" from "user has unsaved edits here". DB is truth.

This is acknowledged in the PR (note 2). The correct long-term fix would require the provisioner to write through the Y.Doc, producing real CRDT operations so the CRDT can merge provisioner intent with user unsaved changes. That's a much bigger change and is tracked as a follow-up.

find_index_in_array returns nil when an ID is not found (standard Enum.find_index behaviour). Without a nil filter, a phantom ID that can't be located produces a {:delete, array, nil} tuple, and Yex.Array.delete/2 with a nil index has undefined behaviour — in the worst case it raises, aborting the entire reconciliation via the rescue block. Add Enum.reject(&is_nil/1) after the find_index step in all three compute ops functions (jobs, edges, triggers) so an unresolvable phantom is silently skipped rather than crashing the reconciler.

…ler.reconcile_workflow_from_db

lmac-1 · 2026-03-27T09:59:31Z

@elias-ba @stuartc I have not manually tested the validation steps for GitHub sync that updates a workflow still. Will work out how to test this properly later this morning.

lmac-1 · 2026-03-27T12:54:24Z

@stuartc I synced with @elias-ba on this and he flagged about the issue of losing unsaved changes for a user actively editing (which we are already aware of). He's going to dive into the code more and share his thoughts.

I think something to bear in mind when reviewing this is to make sure that it won't be too difficult to add something on top of it to make unsaved changes work more nicely (like the rebase we discussed as a follow-on issue). At the very least we might want to notify a user via toast that their workflow has been updated due to external change.

He also suggested that we should maybe sync with Joe next week to talk through next steps / priority / if there are next steps as he might have some opinions or thoughts about this already.

…oner-2 # Conflicts: # CHANGELOG.md

…ion.get_env topology Replace Mox-based topology dispatch with direct Application.get_env atom lookup, eliminating process-scoping problems that caused OwnershipError flakes. Key changes: - Topology.base/0 now reads the base atom directly from Application.get_env, so every process in the VM (including spawned GenServer children) resolves the correct supervisor tree without Mox plumbing. - Deleted Topology.Default and TopologyMock — no longer needed. - CollaborationCase starts a per-test isolated Supervisor, overrides Application.get_env(:lightning, Topology, base) for the test's duration, and tears the supervisor down in an on_exit that fires *before* the SQL Sandbox connection is checked back in, so PersistenceWriter terminate-time DB writes complete inside the test sandbox. - collaboration.ex do_start/1: propagate start_document errors as {:error, reason} instead of raising a MatchError, so WorkflowReconciler's error branch handles async-sandbox failures gracefully (async DataCase tests spawn processes under the production supervisor without sandbox access — this was previously swallowed by a catch-all rescue).

CollaborationCase doesn't itself use Mox; the import + verify_on_exit! were a courtesy to downstream tests. Drop them so the case stays focused on collaboration setup and tests declare their own Mox usage explicitly. session_test.exs uses unqualified stub/3 calls, so it picks up the import directly. The other CollaborationCase tests already use fully-qualified Mox.* calls and need no change.

…(Blocker 1) Eliminates `Application.put_env(:lightning, Topology, base)` from CollaborationCase so the global app env is no longer a race condition when concurrent tests each write their own base atom. Every spawned process (DocumentSupervisor, Session, PersistenceWriter, SharedDoc via Persistence) now receives its topology base as an explicit keyword opt rather than reading it from Application.get_env at call time. Changes: - topology.ex: add via/2 (explicit base form) - persistence.ex: update_v1 casts directly to the pid in state instead of looking up via Registry (eliminates the Topology.base() call path from inside SharedDoc) - document_supervisor.ex: Keyword.fetch!(:base) in init/1; use Topology.pg_scope(base) and Topology.via(base, key) throughout - session.ex: add :base to struct; init/1 stores it; lookup_shared_doc/2 uses Topology.pg_scope(base) - collaboration.ex: start/1 extracts base (falls back to Topology.base() for production); threads it to do_start, start_document, and lookup_shared_doc - registry.ex: base-aware overloads for all helpers (via/2, whereis/2, lookup/2, select/2, get_group/2, count/2) - workflow_reconciler.ex: reconcile_workflow_changes/3 and reconcile_workflow_from_db/3 accept optional base param - collaboration_case.ex: drop Application.put_env and restore entirely; start_workflow_collab! arity changes to (base, workflow, opts) - collaboration_helpers.ex: ensure_doc_supervisor_stopped/2 takes optional base - All collaboration test files updated to thread collaboration_base from context Also fixes FakeRambo to start its Cachex cache under the ExUnit test supervisor (start_supervised!) so cache lifetime is scoped to each test, preventing ETS table ownership loss across tests. All 143 collaboration tests and 130 channel tests pass.

The $callers hypothesis was tested experimentally: - Processes started via Collaborate.start(base: base) — i.e. through the test-owned isolated DynamicSupervisor — inherit $callers from the test pid, so private-mode Mox stubs propagate without any extra work. - Processes started via start_supervised!({Session, ...}) — i.e. through ExUnit's own test supervisor — do NOT get the test pid in $callers. Session.save_workflow/2 calling Lightning.broadcast in those processes therefore cannot see the test's LightningMock stubs. The fix is Mox.allow/3, which explicitly grants a known pid access to the owner's stubs. This replaces the blunt Mox.set_mox_global(LightningMock) calls, which were setting ALL mocks global as a side-effect (the arg is an ExUnit context map, not a mock-module filter). Changes: - workflow_reconciler_test.exs: remove set_mox_global + dead stub; passes because all sessions start via Collaborate.start(base: base) through the test's isolated supervisor — $callers propagates naturally. - session_test.exs: remove set_mox_global + dead stubs from all three save_workflow/2 setups; add Mox.allow(LightningMock, self(), session_pid) in all five start_supervised!({Session, ...}) sites (three setups + two test bodies that spin up their own sessions). - no_change_snapshot_test.exs: remove set_mox_global + stub; add Mox.allow(LightningMock, self(), session_pid) inline in both test bodies. - external_workflow_update_test.exs: remove set_mox_global + stub from the one test that uses Topology.base() (production topology path, so $callers definitely does not reach Session); add Mox.allow after Collaborate.start. - collaboration_case.ex: using opts do so async: is configurable (default false); updated moduledoc to reflect the actual mechanism — no Application.put_env, no set_mox_global, $callers rules documented.

lmac-1 added 4 commits March 24, 2026 12:50

Adds red tests for version stuck

0244859

adds second test

a6b2abf

Rename provisioner_version_conflict_test to external_workflow_update_…

31c88d4

…test

Fix nobody-online path: provisioner invalidates stale DocumentState o…

01dd46a

…n write Removes reconcile_or_reset from Persistence.bind and instead has the provisioner delete stale DocumentState for any workflow it changes. Adds DocumentState.delete_for_document/1 for this purpose.

github-project-automation Bot added this to Core Mar 24, 2026

github-project-automation Bot moved this to New Issues in Core Mar 24, 2026

lmac-1 added 3 commits March 24, 2026 20:38

lmac-1 changed the title ~~[draft] version stuck~~ Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import Mar 25, 2026

updates changelog

3f29830

theroinaochieng requested a review from stuartc March 26, 2026 08:17

lmac-1 marked this pull request as ready for review March 27, 2026 07:54

lmac-1 commented Mar 27, 2026

View reviewed changes

lmac-1 added 3 commits March 27, 2026 08:54

Adds clarifying comment following self review

9f2111e

Add test coverage for WorkflowChannel handle_info and WorkflowReconci…

e333b09

…ler.reconcile_workflow_from_db

Merge branch 'main' into version-stuck-provisioner-2

8d42984

lmac-1 requested a review from elias-ba March 27, 2026 13:04

stuartc added 3 commits May 4, 2026 13:04

Merge remote-tracking branch 'origin/main' into version-stuck-provisi…

4bab51a

…oner-2 # Conflicts: # CHANGELOG.md

stuartc self-assigned this May 5, 2026

stuartc mentioned this pull request May 7, 2026

Kafka trigger config changes deployed via provisioner are silently reverted on next save #4715

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562

Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562
lmac-1 wants to merge 17 commits into
mainfrom
version-stuck-provisioner-2

lmac-1 commented Mar 24, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

lmac-1 commented Mar 25, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 Mar 27, 2026 •

edited

Loading

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lmac-1 commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

The bug

The fix

Validation steps

Additional notes for the reviewer

AI Usage

Pre-submission checklist

Uh oh!

codecov Bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lmac-1 commented Mar 25, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

lmac-1 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lmac-1 commented Mar 24, 2026 •

edited

Loading

codecov Bot commented Mar 25, 2026 •

edited

Loading

lmac-1 Mar 27, 2026 •

edited

Loading