fix(ui): long-poll room list sync after initial request #6361

TigerInYourDream wants to merge 3 commits into main
Conversation
Reproduction context
Observed symptom

After the initial sync completed and the client was idle, the homeserver kept receiving a continuous stream of requests like `pos=<n>&timeout=0`, with the same `pos` on every request.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##              main    #6361      +/-   ##
==========================================
+ Coverage   88.93%   89.78%    +0.84%
==========================================
  Files         357      357
  Lines       99195    98827      -368
  Branches    99195    98827      -368
==========================================
+ Hits        88221    88730      +509
+ Misses       6991     6614      -377
+ Partials     3983     3483      -500

☔ View full report in Codecov by Sentry.
Thank you for your contribution. The state machine is as follows. The initial state is `Init`.

The timeout is set to 0 for the requests made in the `SettingUp` and `Recovering` states. The only place I see where we can save one request with no timeout is …

After a proof-read, I think I understand your use case: an account with e.g. 800 rooms but with very little activity. We have to paginate over all the rooms, but the server has nothing to return, so it's a sequence of quick requests and empty responses. Is it correct?

The problem is that the … I want to remove this …
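To keep this concrete, here is a minimal sketch of the state machine as just described; the state names match the service's states, while the `next` helper is only an illustration of the happy-path transitions, not the SDK's actual API.

```rust
/// A sketch of the `RoomListService` state machine described above; the
/// `next` helper is a hypothetical illustration, not the SDK's real API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    /// Initial state: the very first request is forced to `timeout=0`.
    Init,
    /// Paginating over all the rooms after the first response.
    SettingUp,
    /// Re-establishing the sync position after an error.
    Recovering,
    /// Steady state: the room list is loaded and the service idles.
    Running,
}

impl State {
    /// Happy-path transition taken after each successful sync response.
    fn next(self) -> State {
        match self {
            State::Init => State::SettingUp,
            State::SettingUp | State::Recovering | State::Running => State::Running,
        }
    }
}
```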
Per review on matrix-org#6361: long-polling in `SettingUp` / `Recovering` extends the time spent outside `Running` and falsely triggers `SyncIndicator` (which derives `Show` from how long the state machine stays in non-`Running` states). Roll those states back to `PollTimeout::Some(0)` and limit the policy change to collapsing the `Running` branch so it always long-polls regardless of `is_fully_loaded()`.

The idle-loop symptom from Robrix2 + local Palpo (MSC4186) still goes away because the loop sits in `Running` + `!is_fully_loaded` — the client repeatedly issues `pos=<n>&timeout=0` while the server has nothing to deliver for the next Growing batch.

Tests:

- Integration `test_sync_all_states`: `SettingUp -> Running` request goes back to `timeout=0`; `Running -> Running` requests stay at `timeout=30000`.
- Unit `test_long_poll_once_running` (renamed): drive 3 sync cycles, assert the `SettingUp` request still carries `timeout=0` AND the first `Running` request carries `timeout=30000`, defending the `SyncIndicator` invariant.
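Read as code, the policy in this commit message amounts to a single match on the state. A minimal sketch, assuming a hypothetical `poll_timeout_for` helper (the real SDK is structured differently), where `None` means "use the server's default long-poll timeout":

```rust
enum State { Init, SettingUp, Recovering, Running }

/// Hypothetical helper sketching the revised timeout policy; `None` maps to
/// the server's default long-poll timeout (30 000 ms in the tests above).
fn poll_timeout_for(state: &State, _is_fully_loaded: bool) -> Option<u64> {
    match state {
        // Pre-`Running` states keep `timeout=0` so that `SyncIndicator`,
        // which measures time spent outside `Running`, is not falsely shown.
        State::Init | State::SettingUp | State::Recovering => Some(0),
        // The collapsed branch: `Running` always long-polls, regardless of
        // `is_fully_loaded()`, breaking the idle `pos=<n>&timeout=0` loop.
        State::Running => None,
    }
}
```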
Thanks for the careful walkthrough — it really helps me see where my change overreaches. Let me re-anchor the reproduction with concrete details and then propose a narrower fix that respects your `SyncIndicator` concern.

Where the report came from

I hit this while testing Robrix2 against a local Palpo homeserver (an MSC4186-capable implementation). The trigger is not specifically "many rooms" — it is simply "the client has reached idle after the initial sync". After that, server logs show a non-stop stream of `pos=<n>&timeout=0` requests, all with the same `pos`.

Why it's a two-sided bug, and why the SDK side still matters

The loop is the product of two independent issues:

1. Server side: Palpo's MSC4186 fast path answers the idle re-poll immediately with an empty response instead of holding the request open.
2. Client side: the SDK keeps forcing `timeout=0` after the initial sync, so it re-sends the same `pos` the moment that empty response arrives.
Crucially: when I applied just the SDK-side fix and left the Palpo bug in place, the tight loop already stopped in our local Palpo deployment (verified in Project-Robius-China/robrix2#101). So this PR is a complete fix for the user-visible symptom on the client side, independent of the server work.

Where my original patch overreached

You're right that the loop is not specifically a … Pushing …

Narrower change I just pushed (…)
One more piece of context I should have led with — the loop is not "a few …". The reason … So the takeaway I'd flag for any future …
Thank you for your replies.

matrix-rust-sdk/crates/matrix-sdk/src/sliding_sync/list/request_generator.rs, lines 257 to 270 in 6777907

In your case, are you sure the …
The MSC4186 fast-path in sync_v5::sync_events early-returns an
empty body when the client's pos is ahead of curr_sn -- the typical
idle case where a client immediately re-polls with the pos we just
handed back. Because palpo writes pos = curr_sn + 1, that condition
fires on every quiet polling cycle, and the response degenerates to
{"pos":"200"} with no `lists` field at all.
matrix-rust-sdk's SlidingSyncList::update only calls
update_request_generator_state when the response carries a count.
With count missing, fully_loaded stays false in the Running phase,
which keeps the room-list service in pos=N&timeout=0 mode. A field
report from a 2-room test account observed 8 827 identical
pos=200&timeout=0 requests in 24h on a single client.
Fix:
* Extract a compute_active_rooms helper that runs only the filter
pipeline (is_invite / room_types / not_room_types / is_dm /
is_encrypted) and returns the surviving rooms. process_lists
keeps owning the sort + ops/range pass.
* Move room-set + DM-set loading above the since_sn > curr_sn
branch so the fast path can produce counts without re-querying.
* In the fast path, emit one SyncList { count, ops: vec![] } per
requested list. ops stays skip_serializing_if=Vec::is_empty, so
the wire shape is { "count": N } -- exactly what every other
server (synapse, conduwuit, sliding-sync proxy) emits in this
case.
is_empty_for_long_poll already ignores `lists`, so palpo#72's
empty-body short-circuit still treats count-only responses as
long-poll empty. has_list_count_changes correctly reports "no
change" when the cached count matches, so the long-poll guard
keeps firing for genuinely idle clients.
Tests:
* core: count_only_list_serializes_count_field guards the wire
shape; count_only_response_is_long_poll_empty guards palpo#72.
* server (sync_msc4186): idle_repoll_preserves_count_and_long_poll_semantics
walks the handler's full decision sequence across a two-step
scenario; count_change_after_room_join_breaks_out_of_long_poll
and newly_introduced_list_is_treated_as_count_change cover the
counter-cases.
* server (sync_v5): five compute_active_rooms unit tests for the
no-DB filter branches (no filters / is_invite ± / is_dm ±).
Full end-to-end coverage (HTTP + DB) is left to complement / sytest;
a follow-up should add a test that drives sync_events_v5 with five
consecutive incremental syncs and asserts count is present in each.
Refs: matrix-org/matrix-rust-sdk#6361
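To make the wire-shape claim above concrete, here is a standalone sketch of count-only serialization; `SyncList` and `SyncOp` follow the commit message's naming and field layout, not palpo's actual types.

```rust
// Standalone sketch of the count-only wire shape described above; these
// struct shapes are illustrative, not palpo's real response types.
use serde::Serialize;

#[derive(Serialize)]
struct SyncList {
    count: u64,
    // `skip_serializing_if` drops an empty `ops` vector from the output, so
    // the idle fast path serializes to `{"count":N}` -- the same shape that
    // synapse, conduwuit, and the sliding-sync proxy emit in this case.
    #[serde(skip_serializing_if = "Vec::is_empty")]
    ops: Vec<SyncOp>,
}

#[derive(Serialize)]
struct SyncOp {
    op: String,
    range: (u64, u64),
    room_ids: Vec<String>,
}

fn main() {
    let idle = SyncList { count: 2, ops: vec![] };
    assert_eq!(serde_json::to_string(&idle).unwrap(), r#"{"count":2}"#);
}
```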
Retracting my Round-2 explanation

You were right. What actually drives the loop (verified live with …
Evidence appendix

Posting the raw artifacts in case any of this is worth checking independently or reusing in test fixtures.

1. curl reproduction against our Palpo (count is omitted on incremental responses)

Step 1 — fresh request, no `pos`:

{
  "pos": "200",
  "lists": { "all_rooms": { "count": 2, "ops": [{ "op": "SYNC", "range": [0, 1], "room_ids": [...] }] } },
  "rooms": { ... }
}

Step 2 — same body, with `pos=200`:

{
  "pos": "200",
  "lists": { "all_rooms": {} },
  "rooms": {}
}

Same `pos` back, and `count` is gone from the list.

2. Loop signature in production logs

…

Same `pos=200&timeout=0` line, repeated back-to-back.

3. Account scale

3 users, 2 rooms total in the database. So the loop cannot be pagination pressure from a large room set.

4. Where …
Question: are you using an LLM to discuss with me, or to write the code? Please read our AI policy: https://github.com/matrix-org/matrix-rust-sdk/blob/main/CONTRIBUTING.md#ai-policy. Palpo and Salvo both show clear signs of heavy LLM usage. It makes me question our discussion.
Due to the violation of our AI policy, I'm closing this PR. Feel free to comment and ask to reopen if you believe I'm mistaken. For the moment, I'm not convinced I'm talking to a human… (what a time to be alive…).
I want to clarify that my previous two comments were written with agent assistance. The code itself was written by me, but I understand and respect the project's AI policy and the maintainers' decision to close this PR. |
Problem
After the initial room-list sync, `RoomListService` continues sending `timeout=0` requests because `SettingUp`, `Recovering`, and `Running` (before fully loaded) all forced immediate responses. This creates a tight polling loop when idle — the server returns empty responses instantly, and the client re-sends the same `pos` right away.

Fix

Only force `timeout=0` for `State::Init`. All post-init states use `PollTimeout::Default`, letting the server long-poll when idle. If the server has pending changes, it can still respond immediately regardless of the timeout value.

Test

Added a regression test verifying the second sync request carries `timeout=30000` instead of `timeout=0`.
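For readers wanting the shape of such a test, a minimal sketch using wiremock directly; the `/sync` path, the response bodies, and the hand-rolled requests standing in for driving the real `RoomListService` are all assumptions made to keep the example self-contained.

```rust
// Minimal sketch of the regression test's shape; not the SDK's test harness.
use wiremock::matchers::{method, path, query_param};
use wiremock::{Mock, MockServer, ResponseTemplate};

#[tokio::test]
async fn second_sync_request_long_polls() {
    let server = MockServer::start().await;
    let empty = serde_json::json!({ "pos": "1", "lists": {}, "rooms": {} });

    // The initial request is expected to force an immediate response.
    Mock::given(method("POST"))
        .and(path("/sync"))
        .and(query_param("timeout", "0"))
        .respond_with(ResponseTemplate::new(200).set_body_json(empty.clone()))
        .expect(1)
        .mount(&server)
        .await;

    // The second request must long-poll with the 30s default instead.
    Mock::given(method("POST"))
        .and(path("/sync"))
        .and(query_param("timeout", "30000"))
        .respond_with(ResponseTemplate::new(200).set_body_json(empty))
        .expect(1)
        .mount(&server)
        .await;

    // Stand-in for two sync cycles of the real service; in the actual test
    // the SDK chooses these timeout values itself.
    let client = reqwest::Client::new();
    for timeout in ["0", "30000"] {
        client
            .post(format!("{}/sync?timeout={timeout}", server.uri()))
            .send()
            .await
            .unwrap();
    }
    // The `expect(1)` on each mock is verified when `server` is dropped.
}
```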