Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562
Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562lmac-1 wants to merge 17 commits into
Conversation
…n write Removes reconcile_or_reset from Persistence.bind and instead has the provisioner delete stale DocumentState for any workflow it changes. Adds DocumentState.delete_for_document/1 for this purpose.
Remove tests for reconcile_or_reset and reconcile_workflow_metadata, which were deleted from Persistence.bind/3. Replace with tests that verify the new behaviour: persisted state loads as-is, and the provisioner deletes stale DocumentState before writing to the DB. Adds three provisioner tests covering the invalidate_document_states filtering logic (changed, unchanged, and soft-deleted workflows).
Previously, the provisioner tried to handle the "nobody online" case by deleting stale DocumentState rows inside the DB transaction. This left the "someone online" case unaddressed and required the provisioner to know about the online/offline distinction. Replace both paths with a single unified approach: after the DB transaction commits, call WorkflowReconciler.reconcile_workflow_from_db/2 for each updated workflow. The reconciler calls Collaborate.start (joining an existing SharedDoc if one is alive, or starting a fresh one if not), applies a state-driven diff via apply_db_state_to_doc, then calls Session.stop. The shutdown chain flushes the corrected state to DocumentState before any real user opens the workflow. Key changes: - workflow_reconciler.ex: new reconcile_workflow_from_db/2 takes the provisioner actor (User or ProjectRepoConnection) and applies the full DB→Y.Doc diff via Collaborate.start/Session.stop. Broadcasts :workflow_updated_externally on PubSub so connected clients update their baseWorkflow and clear the unsaved-changes indicator. - collaborate.ex: start/1 retries once after a 1ms sleep if Session.init returns :shared_doc_not_found — handles the 0ms auto_exit race where a dying SharedDoc is still registered in :pg when the reconciler's fresh session starts. - session.ex: presence tracking and untracking are now guarded by is_struct(user, User) — ProjectRepoConnection actors skip Presence entirely. SharedDoc.observe in init is wrapped in try/catch :exit so a dying SharedDoc returns a clean stop reason instead of crashing the session. - provisioner.ex: passes user_or_repo_connection through to the reconciler; removes invalidate_document_states and the DocumentState alias. - document_state.ex: removes delete_for_document/1 (no longer needed). - workflow_channel.ex: handle_info for :workflow_updated_externally pushes workflow_saved to each connected client so baseWorkflow stays in sync.
- external_workflow_update_test.exs: all three "someone online" tests now start sessions via Collaborate.start instead of start_supervised! Session directly. DocumentSupervisor is still started via start_supervised! (ExUnit must own it for PersistenceWriter flush ordering), but now receives the Registry name so ensure_doc_supervisor_stopped can find it. Each test calls ensure_doc_supervisor_stopped in-body after the last Session.stop and also registers it as an on_exit safety net — prevents DB connection warnings from PersistenceWriter flushes racing the Ecto sandbox teardown. Adds a new "GitHub sync" test: ProjectRepoConnection actor flows through the provisioner and reconciler correctly, no presence tracking fires, and the next user sees the synced state. - persistence_test.exs: removes the delete_for_document test — that function no longer exists. - provisioner_test.exs: removes the three DocumentState invalidation tests (invalidates-on-change, preserves-on-no-change, skips-on-soft-delete) — the invalidation approach has been replaced by the reconciler.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4562 +/- ##
==========================================
- Coverage 89.88% 89.85% -0.03%
==========================================
Files 444 445 +1
Lines 21941 22125 +184
==========================================
+ Hits 19722 19881 +159
- Misses 2219 2244 +25 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@stuartc would love to hear your thoughts on everything! I will be tying up test cases and doing some more testing tomorrow. |
|
@stuartc I am doing my own review on this now and adding tests for provisioner. Feel free to shout if there's something alarming in the meantime. |
|
I've completed testing of the CLI ✅ |
| |> Enum.flat_map(fn {key, db_val} -> | ||
| case Yex.Map.fetch(ydoc_job, key) do | ||
| {:ok, ^db_val} -> [] | ||
| _ -> [{:set_field, ydoc_job, key, db_val}] |
There was a problem hiding this comment.
What came up in my review:
The _ -> branch can't distinguish "provisioner changed this" from "user has unsaved edits here". DB is truth.
This is acknowledged in the PR (note 2). The correct long-term fix would require the provisioner to write through the Y.Doc, producing real CRDT operations so the CRDT can merge provisioner intent with user unsaved changes. That's a much bigger change and is tracked as a follow-up.
find_index_in_array returns nil when an ID is not found (standard
Enum.find_index behaviour). Without a nil filter, a phantom ID that
can't be located produces a {:delete, array, nil} tuple, and
Yex.Array.delete/2 with a nil index has undefined behaviour — in the
worst case it raises, aborting the entire reconciliation via the rescue
block.
Add Enum.reject(&is_nil/1) after the find_index step in all three
compute ops functions (jobs, edges, triggers) so an unresolvable
phantom is silently skipped rather than crashing the reconciler.
…ler.reconcile_workflow_from_db
|
@stuartc I synced with @elias-ba on this and he flagged about the issue of losing unsaved changes for a user actively editing (which we are already aware of). He's going to dive into the code more and share his thoughts. I think something to bear in mind when reviewing this is to make sure that it won't be too difficult to add something on top of it to make unsaved changes work more nicely (like the rebase we discussed as a follow-on issue). At the very least we might want to notify a user via toast that their workflow has been updated due to external change. He also suggested that we should maybe sync with Joe next week to talk through next steps / priority / if there are next steps as he might have some opinions or thoughts about this already. |
…oner-2 # Conflicts: # CHANGELOG.md
…ion.get_env topology
Replace Mox-based topology dispatch with direct Application.get_env atom lookup,
eliminating process-scoping problems that caused OwnershipError flakes.
Key changes:
- Topology.base/0 now reads the base atom directly from Application.get_env,
so every process in the VM (including spawned GenServer children) resolves
the correct supervisor tree without Mox plumbing.
- Deleted Topology.Default and TopologyMock — no longer needed.
- CollaborationCase starts a per-test isolated Supervisor, overrides
Application.get_env(:lightning, Topology, base) for the test's duration,
and tears the supervisor down in an on_exit that fires *before* the SQL
Sandbox connection is checked back in, so PersistenceWriter terminate-time
DB writes complete inside the test sandbox.
- collaboration.ex do_start/1: propagate start_document errors as
{:error, reason} instead of raising a MatchError, so WorkflowReconciler's
error branch handles async-sandbox failures gracefully (async DataCase
tests spawn processes under the production supervisor without sandbox
access — this was previously swallowed by a catch-all rescue).
CollaborationCase doesn't itself use Mox; the import + verify_on_exit! were a courtesy to downstream tests. Drop them so the case stays focused on collaboration setup and tests declare their own Mox usage explicitly. session_test.exs uses unqualified stub/3 calls, so it picks up the import directly. The other CollaborationCase tests already use fully-qualified Mox.* calls and need no change.
…(Blocker 1) Eliminates `Application.put_env(:lightning, Topology, base)` from CollaborationCase so the global app env is no longer a race condition when concurrent tests each write their own base atom. Every spawned process (DocumentSupervisor, Session, PersistenceWriter, SharedDoc via Persistence) now receives its topology base as an explicit keyword opt rather than reading it from Application.get_env at call time. Changes: - topology.ex: add via/2 (explicit base form) - persistence.ex: update_v1 casts directly to the pid in state instead of looking up via Registry (eliminates the Topology.base() call path from inside SharedDoc) - document_supervisor.ex: Keyword.fetch!(:base) in init/1; use Topology.pg_scope(base) and Topology.via(base, key) throughout - session.ex: add :base to struct; init/1 stores it; lookup_shared_doc/2 uses Topology.pg_scope(base) - collaboration.ex: start/1 extracts base (falls back to Topology.base() for production); threads it to do_start, start_document, and lookup_shared_doc - registry.ex: base-aware overloads for all helpers (via/2, whereis/2, lookup/2, select/2, get_group/2, count/2) - workflow_reconciler.ex: reconcile_workflow_changes/3 and reconcile_workflow_from_db/3 accept optional base param - collaboration_case.ex: drop Application.put_env and restore entirely; start_workflow_collab! arity changes to (base, workflow, opts) - collaboration_helpers.ex: ensure_doc_supervisor_stopped/2 takes optional base - All collaboration test files updated to thread collaboration_base from context Also fixes FakeRambo to start its Cachex cache under the ExUnit test supervisor (start_supervised!) so cache lifetime is scoped to each test, preventing ETS table ownership loss across tests. All 143 collaboration tests and 130 channel tests pass.
The $callers hypothesis was tested experimentally:
- Processes started via Collaborate.start(base: base) — i.e. through the
test-owned isolated DynamicSupervisor — inherit $callers from the test
pid, so private-mode Mox stubs propagate without any extra work.
- Processes started via start_supervised!({Session, ...}) — i.e. through
ExUnit's own test supervisor — do NOT get the test pid in $callers.
Session.save_workflow/2 calling Lightning.broadcast in those processes
therefore cannot see the test's LightningMock stubs.
The fix is Mox.allow/3, which explicitly grants a known pid access to the
owner's stubs. This replaces the blunt Mox.set_mox_global(LightningMock)
calls, which were setting ALL mocks global as a side-effect (the arg is an
ExUnit context map, not a mock-module filter).
Changes:
- workflow_reconciler_test.exs: remove set_mox_global + dead stub; passes
because all sessions start via Collaborate.start(base: base) through the
test's isolated supervisor — $callers propagates naturally.
- session_test.exs: remove set_mox_global + dead stubs from all three
save_workflow/2 setups; add Mox.allow(LightningMock, self(), session_pid)
in all five start_supervised!({Session, ...}) sites (three setups + two
test bodies that spin up their own sessions).
- no_change_snapshot_test.exs: remove set_mox_global + stub; add
Mox.allow(LightningMock, self(), session_pid) inline in both test bodies.
- external_workflow_update_test.exs: remove set_mox_global + stub from the
one test that uses Topology.base() (production topology path, so $callers
definitely does not reach Session); add Mox.allow after Collaborate.start.
- collaboration_case.ex: using opts do so async: is configurable (default
false); updated moduledoc to reflect the actual mechanism — no
Application.put_env, no set_mox_global, $callers rules documented.
Description
This PR fixes a class of bugs where the collaborative editor ends up stuck on a stale version of a workflow after an external write (sandbox merge or CLI deploy).
The bug
When the provisioner (or sandbox merge) writes a new workflow version to the database, live editor sessions are never notified. Two failure modes:
baseWorkflowis never refreshed.DocumentState(the persisted Y.Doc snapshot) is loaded the next time a user opens the workflow, giving them the pre-merge state. The correct DB content is ignored.The fix
Both cases now go through the same code path.
After the provisioner's DB transaction commits,
WorkflowReconciler.reconcile_workflow_from_db/2is called for each updated workflow. The reconciler callsCollaborate.start/1(joining an existing live SharedDoc if one is running, starting a new one if not), applies a state-based diff viaapply_db_state_to_doc, then callsSession.stop. The shutdown chain flushes the corrected state toDocumentStatebefore any real user opens the workflow.After
Session.stop, the reconciler broadcasts:workflow_updated_externallyon PubSub. Each connectedWorkflowChannelprocess receives this and pushes aworkflow_savedevent to its client — the existing frontend handler updatesbaseWorkflowand clears the unsaved-changes indicator. No frontend changes were needed.Supporting changes:
Collaborate.start/1now retries once (after a 1ms sleep) ifSession.initreturns:shared_doc_not_found. This handles a 0ms auto_exit race: when the reconciler'sSession.stoptriggers the last observer to leave, the SharedDoc dies via a 0ms timer. A subsequentCollaborate.startcan find a dying process still registered in:pgbefore the timer fires. The retry absorbs this and all callers benefit.Sessionnow accepts aProjectRepoConnectionas theuseroption. Presence tracking is guarded byis_struct(user, User)— system actors skip it entirely.The previous approach (
invalidate_document_statesinside the provisioner transaction, which deleted staleDocumentStaterows) is removed. It only handled the nobody-online case and required the provisioner to know about the online/offline distinction.Closes #4535
Validation steps
Someone online during a sandbox merge:
Nobody online during a sandbox merge / CLI deploy:
Provisioner.import_documentwith changesGitHub sync (ProjectRepoConnection actor):
was online
Additional notes for the reviewer
DocumentStaterows in the provisioner transaction). They are superseded by commits 6–7. The final state of the code is what matters; the history reflects the iteration. Happy to squash if you'd prefer a cleaner history.allow_stale: false), DB optimistic locking ensures at most one provisioner succeeds per workflow — so at most one reconciler ever runs. For sandbox merge (allow_stale: true), two concurrent merges to the same workflow could both succeed and both run a reconciler.Collaborate.start/1has a pre-existing TOCTOU race (check-then-act on the SharedDoc lookup) that could cause one reconciler to fail silently; the other would still complete correctly. This is not introduced by this PR and is a realistic edge case only under concurrent sandbox merges. A follow-on improvement would be for the reconciler to re-load the workflow from DB by ID rather than using the struct passed in at transaction time, so the last writer always applies current state.workflowSavedHandlerincreateSessionContextStore.tsalready handlesworkflow_savedby callingsetBaseWorkflow()andsetLatestSnapshotLockVersion().AI Usage
Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):
You can read more details in our
Responsible AI Policy
Pre-submission checklist
/reviewwith Claude Code)
(e.g.,
:owner,:admin,:editor,:viewer)