perf(matrix): use DoublyLinkedList + side-map for pending cell changes #27422

Draft
anthony-murphy wants to merge 5 commits into microsoft:main from anthony-murphy:prep-matrix-pending-cell-list

Conversation

@anthony-murphy
Contributor

Description

Replaces PendingCellChanges<T>.local: { localSeq, value }[] with DoublyLinkedList<{ localSeq, value }> and adds a sibling side-map localByLocalSeq: Map<number, ListNode<{ localSeq, value }>> for O(1) findIndex-by-localSeq lookup.

Sites converted:

  • sendSetCellOp push: insert into list, register node in side-map.
  • reSubmitCore: was findIndex + splice per pending op; now localByLocalSeq.get(localSeq) → list.remove(node) → localByLocalSeq.delete(localSeq). O(1).
  • ACK shift: list.shift()?.data + side-map delete.
  • Rollback: pop() removes the tail; the peek-tail read becomes list.last!.data.value.
  • Length probes use list.length.

Why

For a hot cell with k pending writes, the previous reSubmit-per-op was O(k) find + O(k) splice (O(k²) total) and per-ACK was O(k) shift. For n ops to a single hot cell, the whole rebase pass was O(n²). The list-plus-side-map makes every per-op step O(1).
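
The list-plus-side-map idea can be sketched in a few lines. This is a minimal, self-contained illustration and not the PR's code: `PendingListSketch`, `PendingNode`, and `removeByLocalSeq` are hypothetical stand-ins, and the hand-rolled links replace the real `DoublyLinkedList` and `ListNode` types so the example runs without imports.

```typescript
// Hypothetical sketch: a hand-rolled doubly linked list plus a localSeq -> node side-map.
interface PendingNode<T> {
	localSeq: number;
	value: T;
	prev: PendingNode<T> | undefined;
	next: PendingNode<T> | undefined;
}

class PendingListSketch<T> {
	private head: PendingNode<T> | undefined = undefined;
	private tail: PendingNode<T> | undefined = undefined;
	// Side-map: lets removal find its node without scanning the list.
	private readonly index = new Map<number, PendingNode<T>>();

	public get length(): number {
		return this.index.size;
	}

	// Append at the tail and register the node in the side-map.
	public push(localSeq: number, value: T): void {
		if (this.index.has(localSeq)) {
			throw new Error("duplicate localSeq");
		}
		const node: PendingNode<T> = { localSeq, value, prev: this.tail, next: undefined };
		if (this.tail === undefined) {
			this.head = node;
		} else {
			this.tail.next = node;
		}
		this.tail = node;
		this.index.set(localSeq, node);
	}

	// Resubmit path: unlink the entry for localSeq, wherever it sits.
	public removeByLocalSeq(localSeq: number): T | undefined {
		const node = this.index.get(localSeq);
		return node === undefined ? undefined : this.unlink(node);
	}

	// ACK path: remove the oldest entry.
	public shift(): T | undefined {
		return this.head === undefined ? undefined : this.unlink(this.head);
	}

	// Rollback path: remove the newest entry.
	public pop(): T | undefined {
		return this.tail === undefined ? undefined : this.unlink(this.tail);
	}

	// Every removal goes through here, so list and side-map stay in lock-step.
	private unlink(node: PendingNode<T>): T {
		if (node.prev === undefined) {
			this.head = node.next;
		} else {
			node.prev.next = node.next;
		}
		if (node.next === undefined) {
			this.tail = node.prev;
		} else {
			node.next.prev = node.prev;
		}
		node.prev = undefined;
		node.next = undefined;
		this.index.delete(node.localSeq);
		return node.value;
	}
}

const pending = new PendingListSketch<string>();
pending.push(1, "a");
pending.push(2, "b");
pending.push(3, "c");
pending.removeByLocalSeq(2); // unlink a middle entry without a findIndex scan
pending.shift(); // drop the oldest entry
pending.pop(); // drop the newest entry
```

Each operation touches a constant number of links plus one map operation, which is where the per-op O(1) claim comes from in this simplified model.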

Notes

Snapshot/wire format unaffected (_pendingCliSeqData is unknown). Pure perf prep — no reSubmitSquashed, no squash logic, no tests touched, no api-report changes.

@github-actions
Contributor

github-actions Bot commented May 27, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (131 lines, 2 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Comment thread packages/dds/matrix/src/matrix.ts Outdated
- const change = pendingCell.local.pop();
+ const change = pendingCell.local.pop()?.data;
  assert(change?.localSeq === setMetadata.localSeq, 0xbaa /* must have change */);
+ pendingCell.localByLocalSeq.delete(setMetadata.localSeq);
Contributor Author

Deep Review: pop() here is correct — rollback is LIFO, the operation being rolled back is always the tail, and assert 0xbaa enforces the match. But on #24604 the Copilot bot flagged this exact construct ("Using pop() to remove the pending operation may remove the most recently added op rather than the one corresponding to the rollback"), and now that this PR adds the localByLocalSeq index — making targeted removal trivial and applying it in reSubmitCore — a reader will reasonably ask why rollback didn't switch too. Switching would be semantically wrong (rollback must restore the previous cell value, which requires popping the tail — see matrix.rollback.spec.ts:163-187, :232-290), so the code is right; only the intent is implicit.

A one-line comment above the pop() pre-empts the same #24604 round-trip.

Suggested change
pendingCell.localByLocalSeq.delete(setMetadata.localSeq);
// Rollback is LIFO: the operation being rolled back is always the tail. `pop()` is correct; assert 0xbaa enforces the match.
const change = pendingCell.local.pop()?.data;
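
The LIFO argument above can be illustrated with a small sketch. It is a simplified model, not the matrix code: `CellSketch`, `PendingChange`, and `rollbackNewest` are hypothetical names, and a plain array stands in for the pending list to keep the example short. The point it shows is that the value restored after a rollback is read from whatever becomes the new tail, so the entry removed has to be the old tail.

```typescript
// Hypothetical stand-ins: a plain array models the pending list.
interface PendingChange<T> {
	localSeq: number;
	value: T;
}

interface CellSketch<T> {
	consensus: T | undefined;
	local: PendingChange<T>[];
}

// Roll back the newest pending write and return the value the cell shows afterwards.
function rollbackNewest<T>(cell: CellSketch<T>, localSeq: number): T | undefined {
	const change = cell.local.pop();
	// Mirrors the role of assert 0xbaa: the rolled-back op must be the tail.
	if (change === undefined || change.localSeq !== localSeq) {
		throw new Error("must have change");
	}
	const tail = cell.local.length > 0 ? cell.local[cell.local.length - 1] : undefined;
	// With pending writes left, the new tail is visible; otherwise the consensus value is.
	return tail === undefined ? cell.consensus : tail.value;
}

const cell: CellSketch<string> = {
	consensus: "acked",
	local: [
		{ localSeq: 1, value: "first" },
		{ localSeq: 2, value: "second" },
	],
};
const afterFirstRollback = rollbackNewest(cell, 2); // the earlier pending write becomes visible
const afterSecondRollback = rollbackNewest(cell, 1); // nothing pending: falls back to consensus
```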

@anthony-murphy
Contributor Author

Deep Review

Reviewed commit 0a7f3fe on 2026-06-02.

Readiness: 8/10 — ALMOST READY

Faithful, well-scoped perf prep: PendingCellChanges<T>.local is a private PendingLocalCellChanges<T> wrapper around DoublyLinkedList + Map<localSeq, ListNode>, turning the per-cell reSubmit hot path from O(k²) to O(k). The hot-cell reconnect regression test landed at matrix.reconnect.spec.ts:234-281 (N=200) and the lock-step invariant is resolved by the wrapper-class adoption. No correctness defects. Six Tier 3 polish items remain — none individually material, but they accumulate: the last-leaks-ListNode ergonomic gap (the PR's only eslint-disable), the untagged assert at matrix.ts:250, the rollback pop()-is-LIFO-only note (open thread 3314510178), undocumented squash-prep / explicit-removal rationale on the wrapper class, a final-state-only assertion in the new reconnect test, and missing rollback-of-many-pending + FWW-with-many-pending coverage.

Path to Ready

  • Resolve inline threads
  • Amend PR description Notes to acknowledge the new regression test (matrix.reconnect.spec.ts, N=200 same-cell writes is the worst-case O(n²) path this change targets) — current Notes say "no tests touched"

Context for Reviewers

For human reviewer
  • Encapsulation taste call (Josmithr): the wrapper-class direction was already adopted at the design-owner's request (thread 3314510112, resolved). The remaining last-shape ergonomic gap is matrix-area taste; if peekLastValue() is acceptable, the inline thread's recommendation lands cleanly.
  • Perf-direction validation: the pipeline cannot measure runtime. Once a hot-cell micro-bench lands in packages/dds/matrix/src/test/matrix.bench.ts, confirm the measured improvement matches the O(k²) → O(k) claim under the stress workloads of #24604 (Matrix: Enable rollback and local server stress).
  • FWW ACK glance (jatgarg): the ACK-path conversion from array shift() to list.shift()?.data (matrix.ts:1142-1146) is mechanically equivalent but worth a domain-expert look given the pending-write sequencing history of #18709 (Update shared matrix FWW policy to immediately switch mode on switch rather than on ack).
  • Resubmit-over-deleted-row/col (Abe27342): the fix from #19862 (fix(matrix): Properly detect row/cell deletion when resubmitting a set op) sits adjacent to the new removeByLocalSeq → list.remove(node) in reSubmitCore. The local-ref-position dance is unchanged but worth a confirming glance.
  • Change-event-sequence contract (vladsud): the open inline thread on the new reconnect test asks whether the rewritten resubmit path still exposes every intermediate cellsChanged state. Domain-expert confirmation that the contract is unchanged would close the residual risk that the test gap signals.
Review history (4 prior reviews)
  • abd04d4 2026-05-27 · 8/10 — wrapper-class adoption resolved lock-step invariant; five Tier 3 polish items remained
  • 42d0e76 2026-05-27 · 7/10 — perf prep verified; four polish items, hot-cell regression test most material
  • b065a1c 2026-05-27 · 7/10 — perf prep verified; three polish items, hot-cell regression test most material
  • 09b3415 2026-05-27 · 6/10 — perf prep with a single hot-cell regression-test blocker

return this.list.length;
}

public get last(): ListNode<{ localSeq: number; value: MatrixItem<T> }> | undefined {
Contributor Author

Deep Review: PendingLocalCellChanges.last returns ListNode<{ localSeq: number; value: MatrixItem<T> }> | undefined, exposing the internal DoublyLinkedList node type as part of the wrapper's public surface. The single call site in setCellCore writes pendingCell.local.last!.data.value with // eslint-disable-next-line @typescript-eslint/no-non-null-assertion — the only such disable in this PR. Both the eslint suppression and the leaky ListNode shape come from the same accessor.

Replace get last(): ListNode<…> with peekLastValue(): MatrixItem<T> | undefined returning this.list.last?.data.value. The call site collapses to pendingCell.local.peekLastValue() ?? pendingCell.consensus, dropping the !.data.value chain and the eslint-disable, and centralising the internal-structure dependency inside the wrapper where it belongs. Josmithr's pattern on #24604 was to push back on exactly this kind of optional/ambiguity surface on the pending-cell interface family.
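
A sketch of the proposed accessor and the two possible call-site shapes follows. All names here (`LocalChangesSketch`, `NodeSketch`, `previousValueNullish`, `previousValueByLength`) are hypothetical, and an array stands in for the list. One thing that may be worth checking before adopting the `??` form: if a pending value can itself be `undefined` (treated here as an assumption about `MatrixItem<T>`, for example a write that clears a cell), then `??` falls through to the consensus value, whereas the existing length check does not.

```typescript
// Hypothetical stand-ins for the wrapper and node types discussed above.
interface NodeSketch<T> {
	data: { localSeq: number; value: T | undefined };
}

class LocalChangesSketch<T> {
	private readonly nodes: NodeSketch<T>[] = [];

	public get length(): number {
		return this.nodes.length;
	}

	public push(localSeq: number, value: T | undefined): void {
		this.nodes.push({ data: { localSeq, value } });
	}

	// Returns the newest pending value without exposing the node type to callers.
	public peekLastValue(): T | undefined {
		return this.nodes[this.nodes.length - 1]?.data.value;
	}
}

// Variant A: the call-site shape proposed in the review comment.
function previousValueNullish<T>(local: LocalChangesSketch<T>, consensus: T | undefined): T | undefined {
	return local.peekLastValue() ?? consensus;
}

// Variant B: keeps an explicit length check, so a pending undefined is still honoured.
function previousValueByLength<T>(local: LocalChangesSketch<T>, consensus: T | undefined): T | undefined {
	return local.length > 0 ? local.peekLastValue() : consensus;
}

const local = new LocalChangesSketch<string>();
local.push(1, "x");
local.push(2, undefined); // assumed case: a pending write that clears the cell
const viaNullish = previousValueNullish(local, "acked");
const viaLength = previousValueByLength(local, "acked");
```

In this model the two variants agree whenever the newest pending value is defined and differ only in the cleared-cell case, so either shape drops the `!.data.value` chain; which one is right depends on whether that case can occur.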

}

public push(localSeq: number, value: MatrixItem<T>): void {
assert(!this.index.has(localSeq), "duplicate localSeq in PendingLocalCellChanges");
Contributor Author

Deep Review: assert(!this.index.has(localSeq), "duplicate localSeq in PendingLocalCellChanges") is the only plain-string assert in the changed file. Every other assert(...) in matrix.ts carries a hex short-code tag, including the three asserts this PR modifies (0xba4–0xba7) and the adjacent 0xba8–0xbab. The convention was established by the bulk-tagging pass in #24797 (jatgarg, 2025-06-09); the release-time assert-short-code tagger converts plain strings to numeric codes. If that policy check runs in CI for this package, it will fail.

Run the assert-short-code tagger (or confirm the intent is to let the next release-time pass tag it). Don't invent a code by hand.

* (instead of a linear `findIndex` scan).
*
* The list and the index are kept in lock-step: every mutator updates both.
*/
Contributor Author

Deep Review: The class JSDoc states the purpose (O(1) lookup) but not the motivation for the side-map shape. Under today's runtime contract reSubmitCore always receives a localSeq equal to the head of pendingCell.local (because sendSetCellOp assigns a fresh nextLocalSeq() and pushes to tail, and the runtime walks pending ops FIFO). A Deque + head-shift + head-match assert would deliver identical O(1) behavior with no side-map — which is exactly the question the next reader will ask, and the question this dossier converged on.

The justification — explicit removal by localSeq (a) avoids encoding an implicit FIFO dependency in a per-cell data structure, and (b) is staged for follow-up squash work that needs arbitrary-position removal — is reasonable but lives only in PR discussion, not in code. Expand the PendingLocalCellChanges class comment to state explicitly: (1) why the side-map exists (avoids relying on FIFO/head invariant at call sites; supports upcoming squash/arbitrary-position removal), and (2) the lock-step invariant the wrapper enforces between list and index. Without this, the next reader will re-derive the "why not a Deque?" critique.
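
For comparison, the Deque-style alternative described above might look like the following sketch. `FifoPendingSketch` and `shiftExpecting` are hypothetical names, and a growing array with a head index stands in for a real deque. It shows where the implicit FIFO dependency would live: in an assert that fails as soon as a caller removes anything other than the head.

```typescript
// Hypothetical sketch of the no-side-map alternative: FIFO order is enforced by a check.
class FifoPendingSketch<T> {
	private readonly items: { localSeq: number; value: T }[] = [];
	// Advancing an index avoids the O(k) cost of Array.prototype.shift.
	// A real deque would also reclaim the consumed prefix; this sketch does not.
	private headIndex = 0;

	public get length(): number {
		return this.items.length - this.headIndex;
	}

	public push(localSeq: number, value: T): void {
		this.items.push({ localSeq, value });
	}

	// Removes the head, but only if it is the entry the caller expects.
	public shiftExpecting(localSeq: number): T {
		const head = this.headIndex < this.items.length ? this.items[this.headIndex] : undefined;
		if (head === undefined || head.localSeq !== localSeq) {
			throw new Error("resubmit order violated the FIFO assumption");
		}
		this.headIndex++;
		return head.value;
	}
}

const fifo = new FifoPendingSketch<string>();
fifo.push(1, "a");
fifo.push(2, "b");
fifo.shiftExpecting(1); // matches the head, so it succeeds
// fifo.shiftExpecting(3) would throw: arbitrary-position removal is not expressible here
```

Under these assumptions the structure is simpler and just as fast for head removal, but it cannot serve a caller that needs to remove a non-head entry, which is the capability the side-map is staged for.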


// Both clients should converge on the last-written value.
const expected = [[N - 1]];
assert.deepEqual(extract(matrix1), expected);
Contributor Author

Deep Review: "resubmits N writes to the same cell after reconnect" stages N=200 pending writes, reconnects, and asserts only assert.deepEqual(extract(matrix1), [[N - 1]]) and extract(matrix2), [[N - 1]] — final-state convergence. The test's own framing ("re-emit all N pending writes") implies a per-op regression target, but no cellsChanged observation is wired up. On #18018, vladsud explicitly objected to resubmit logic preserving final state while skipping intermediate set-cell states, because matrix exposes each intermediate state through change events. The PR adds no batching/squash path so the risk is narrowed today — but the gap remains real for the path being rewritten, and a future squash follow-up could regress this silently.

Attach a cellsChanged consumer (or a counter on the open-handler) and verify both clients observe all N values in order during the reconnect drain — not just the final one.
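
The shape of such a check can be sketched independently of the real harness. Everything below is a hypothetical stand-in: `OneCellSourceSketch` and `ConsumerSketch` only imitate a source that notifies consumers on each write, and `recordObservedValues` is an illustrative helper, not an existing test utility. The real test would attach the recorder to the matrices before the reconnect drain instead of driving writes directly.

```typescript
// Hypothetical stand-ins: a one-cell source that notifies consumers on every write.
interface ConsumerSketch {
	cellsChanged(row: number, col: number): void;
}

class OneCellSourceSketch {
	private value: number | undefined = undefined;
	private readonly consumers: ConsumerSketch[] = [];

	public open(consumer: ConsumerSketch): void {
		this.consumers.push(consumer);
	}

	public getCell(): number | undefined {
		return this.value;
	}

	public setCell(value: number): void {
		this.value = value;
		this.consumers.forEach((consumer) => consumer.cellsChanged(0, 0));
	}
}

// Records the cell value seen at each change notification, in arrival order.
function recordObservedValues(source: OneCellSourceSketch): number[] {
	const observed: number[] = [];
	source.open({
		cellsChanged: () => {
			const value = source.getCell();
			if (value !== undefined) {
				observed.push(value);
			}
		},
	});
	return observed;
}

const writeCount = 5; // the test under review uses N = 200
const source = new OneCellSourceSketch();
const observed = recordObservedValues(source);
for (let i = 0; i < writeCount; i++) {
	source.setCell(i);
}
// The stronger assertion: every intermediate value was seen, in order, not just the last.
const expectedInOrder = Array.from({ length: writeCount }, (_, i) => i);
const sawEveryValue =
	observed.length === writeCount && observed.every((v, i) => v === expectedInOrder[i]);
```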

assert.deepEqual(extract(matrix1), expected);
assert.deepEqual(extract(matrix3), expected);
});

Contributor Author

Deep Review: This PR introduces three new code paths on PendingLocalCellChanges: pop() (rollback, matrix.ts:974), last peek (FWW previous-value decision in setCellCore), and shift() (ACK). The single new test exercises only the reSubmit path. Existing tests cover single-pending rollback and single-pending FWW, but the "many pending locals to one cell → rollback / FWW conflict" combinations — exactly the paths that depend on the new last accessor and on pop's side-map cleanup — are not directly exercised.

Add two tests under the same harness:

  • A rollback test that stacks ≥3 pending writes to one cell then undoes them in sequence (exercises pop + last + side-map cleanup; pairs with the existing matrix.rollback.spec.ts patterns).
  • An FWW-conflict test where a remote winner arrives while multiple local pendings exist on the same cell (exercises the length > 0 ? local.last!.data.value : pendingCell.consensus branch).

Both fit the existing MockContainerRuntime harness.
