fix(ui): long-poll room list sync after initial request by TigerInYourDream · Pull Request #6361 · matrix-org/matrix-rust-sdk

TigerInYourDream · 2026-03-27T08:44:53Z

Problem

After the initial room-list sync, RoomListService continues sending timeout=0 requests because SettingUp, Recovering, and Running (before fully loaded) all forced immediate responses. This creates a tight polling loop when idle — the server returns empty responses instantly, and the client re-sends the same pos right away.

Fix

Only force timeout=0 for State::Init. All post-init states use PollTimeout::Default, letting the server long-poll when idle. If the server has pending changes, it can still respond immediately regardless of the timeout value.

Test

Added a regression test verifying the second sync request carries timeout=30000 instead of timeout=0.

Public API changes documented in changelogs (optional)

TigerInYourDream · 2026-03-27T08:53:48Z

Reproduction context

Local homeserver implementing MSC4186
Robrix as the client, with matrix-sdk-ui driving the room-list sync

Observed symptom

After the initial sync completed and the client was idle, the homeserver kept receiving a continuous stream of requests like:

/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=539&timeout=0

with the same pos repeating. The loop is driven by the room-list timeout policy: all post-init states were forcing timeout=0, so the server had no choice but to respond immediately, and the client would re-send right away.

codspeed-hq · 2026-03-27T09:09:30Z

Merging this PR will not alter performance

✅ 50 untouched benchmarks

_{Comparing TigerInYourDream:fix/room-list-long-poll-after-initial-sync (a5d7247) with main (2c2e0f1)}

codecov · 2026-03-27T09:24:47Z

Codecov Report

❌ Patch coverage is 93.18182% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.78%. Comparing base (d64c990) to head (a5d7247).
⚠️ Report is 950 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
crates/matrix-sdk-ui/src/room_list_service/mod.rs	93.18%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6361      +/-   ##
==========================================
+ Coverage   88.93%   89.78%   +0.84%     
==========================================
  Files         357      357              
  Lines       99195    98827     -368     
  Branches    99195    98827     -368     
==========================================
+ Hits        88221    88730     +509     
+ Misses       6991     6614     -377     
+ Partials     3983     3483     -500

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Hywan · 2026-04-22T07:23:04Z

Thank you for your contribution.

The state machine is as follows. The initial state is Init. Then, the transitions are (not exhaustive, just the ones we are interested by):

Init -> SettingUp, a first request is made, with a SlidingSyncMode::Selective
SettingUp -> Running, a second request is made, with a SlidingSyncMode::Growing with the initial batch size of 100
Running -> Running, a third request is made, with the same growing mode, and the batch increases to 200
etc.

The timeout is set to 0 for the requests made in the SettingUp state (and Error, Recovering etc.). For the requests made in the Running state, the timeout is set to 0 as long as the list is not fully loaded, i.e. when not all rooms have been “paginated”. Put it in different words, it enters long-polling as soon as all rooms have been paginated/synced at least once.

The only place I see where we can save one request with no timeout is SettingUp -> Running. Indeed, for the server to return nothing here, we need a number of rooms smaller than the range used in Selective, with no updates. Maybe I'm misunderstand your use case. Can you give me a concrete example with the number of rooms, number of updates etc. please?

After a proof-read, I think I understand your use case: an account with e.g. 800 rooms but with very few activity: we have to paginate over all the rooms but the server has nothing to return, so it's a sequence of quick requests and empty responses. Is it correct?

The problem is that the SyncIndicator depends on the state machine. If we enter long-polling in SettingUp immediately, the system will believe there is a lag and will display the SyncIndicator.

I want to remove this SyncIndicator API entirely, and replace it with a NetworkStatus-ish API. Instead of displaying a “sync loader”, I prefer to provide a “wait on network loader”, which seems more accurate.

Per review on matrix-org#6361: long-polling in `SettingUp` / `Recovering` extends the time spent outside `Running` and falsely triggers `SyncIndicator` (which derives `Show` from how long the state machine stays in non-`Running` states). Roll those states back to `PollTimeout::Some(0)` and limit the policy change to collapsing the `Running` branch so it always long-polls regardless of `is_fully_loaded()`. The idle-loop symptom from Robrix2 + local Palpo (MSC4186) still goes away because the loop sits in `Running` + `!is_fully_loaded` — the client repeatedly issues `pos=<n>&timeout=0` while the server has nothing to deliver for the next Growing batch. Tests: - Integration `test_sync_all_states`: `SettingUp -> Running` request goes back to `timeout=0`; `Running -> Running` requests stay at `timeout=30000`. - Unit `test_long_poll_once_running` (renamed): drive 3 sync cycles, assert the `SettingUp` request still carries `timeout=0` AND the first `Running` request carries `timeout=30000`, defending the `SyncIndicator` invariant.

TigerInYourDream · 2026-04-23T06:59:22Z

Thanks for the careful walkthrough — it really helps me see where my change overreaches. Let me re-anchor the reproduction with concrete details and then propose a narrower fix that respects your SyncIndicator concern.

Where the report came from

I hit this while testing Robrix2 against a local Palpo homeserver (an MSC4186-capable implementation). The trigger is not specifically "many rooms" — it is simply "the client has reached idle after the initial sync". After that, server logs show a non-stop stream of:

/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=539&timeout=0

with the same pos repeating back-to-back. Full investigation tracked in Project-Robius-China/robrix2#30.

Why it's a two-sided bug, and why the SDK side still matters

The loop is the product of two independent issues:

SDK side (this PR): requires_timeout was forcing timeout=0 for every post-init state. The client was actively opting out of long-polling.
Server side (palpo-im/palpo#72): Palpo's MSC4186 route was treating static metadata (one-time-key counts, empty account_data / typing containers, unchanged list counts) as "non-empty updates", so even when the client did long-poll, the server short-circuited and answered immediately.

Crucially: when I applied just the SDK-side fix and left the Palpo bug in place, the tight loop already stopped in our local Palpo deployment (verified in Project-Robius-China/robrix2#101). So this PR is a complete fix for the user-visible symptom on the client side, independent of the server work.

Where my original patch overreached

You're right that the loop is not specifically a SettingUp -> Running hop. It sits in Running while request_generator.is_fully_loaded() stays false — Growing keeps wanting the next batch, the server keeps having nothing for that batch, and the client keeps spinning. That's the branch I really need to fix.

Pushing SettingUp / Recovering to PollTimeout::Default was overreach on my part: as you pointed out, SyncIndicator derives its Show decision from how long we stay outside Running (State::SettingUp | State::Error { .. }), so long-polling there would falsely trigger the loading hint. That's a real UX regression and not worth taking.

Narrower change I just pushed (`a5d7247a6`)

The policy change is now limited to the Running branch:

match observable_state.get() {
    State::Init
    | State::SettingUp
    | State::Recovering
    | State::Error { .. }
    | State::Terminated { .. } => PollTimeout::Some(0),

    // Once we're `Running`, let the server long-poll regardless of
    // whether the list is fully loaded. If the next Growing batch is
    // ready (or any update is queued) the server can still answer
    // immediately; otherwise it holds the connection instead of forcing
    // the client to spin on empty responses.
    State::Running => PollTimeout::Default,
}

Behavior change reduces to a single rule: collapse the Running branch so it always long-polls, instead of conditioning on is_fully_loaded(). By the time we are in Running, SyncIndicator is already in the Hide group, so this is invisible to that API.

Tests updated accordingly:

test_sync_all_states (integration): the SettingUp -> Running request goes back to asserting timeout=0; the three Running -> Running requests assert timeout=30000 (which previously only held after is_fully_loaded() flipped true).
Renamed the new unit test to test_long_poll_once_running, drove three sync.next().await cycles, and now assert both that the SettingUp request still carries timeout=0 and that the first Running request carries timeout=30000. That defends the SyncIndicator invariant in CI.

Happy to defer the broader SyncIndicator -> NetworkStatus rework to your planned redesign — let me know if this scope works.

TigerInYourDream · 2026-04-23T07:15:00Z

One more piece of context I should have led with — the loop is not "a few timeout=0 requests during the Growing transition", it's hundreds per minute. The original server log showed pos values up to pos=539 within a single idle session (every response carries a fresh, monotonically-incremented pos, so 539 means ~539 back-to-back requests). That's the concrete data point if you're weighing whether the Running + !is_fully_loaded window is really worth optimizing — at the rates we observed, it's a sustained state, not a brief transition.

The reason is_fully_loaded() does not flip to true and end the loop on its own is that on Palpo, every response carries device_one_time_keys_count, empty extensions.{typing,account_data} containers, and a lists.<id>.count field that can microscopically fluctuate. Each of those is spec-legal — Palpo isn't violating MSC4186, its tests pass, its implementation is internally consistent. But the combination repeatedly nudges the list state, so is_fully_loaded() never settles, and we stay in the Running + !is_fully_loaded arm indefinitely. (palpo-im/palpo#72 retroactively tightens what counts as "empty" on the server side, but that's Palpo voluntarily narrowing a permissive spec reading, not fixing a violation.)

So the takeaway I'd flag for any future NetworkStatus-style rework: MSC4186's "long-poll until there's a change" leaves enough room for divergent-but-compliant server implementations that the SDK can't rely on the server to gate empty responses correctly. This PR makes the client express its long-poll intent unconditionally once we're Running; the failure mode against a permissive server then degrades to "server returns immediately but client does not re-spin", instead of the current tight loop. Surfacing this in case it informs the broader API design you're already considering.

Hywan · 2026-04-28T08:29:19Z

Thank you for your replies.

is_fully_loaded does not mean there is nothing else to load, rather that all the rooms have been synced at least once, i.e. the “list is fully loaded” means “all the rooms have been paginated”. Put it in other words, the request.list.range covers all the rooms returned by the response. It indeed derives this value from response.list.*.count:

matrix-rust-sdk/crates/matrix-sdk/src/sliding_sync/list/request_generator.rs

Lines 257 to 270 in 6777907

    
           // Otherwise the current range has reached its maximum, we switched to `FullyLoaded` 
        
           // mode. 
        
           else { 
        
               // The number of fetched rooms is set to the maximum too. 
        
               *number_of_fetched_rooms = range_maximum; 
        
               // We update the `fully_loaded` marker. 
        
               *fully_loaded = true; 
        
               // The range is covering the entire list, from 0 to its maximum. 
        
               self.ranges = vec![0..=range_maximum]; 
        
               // Finally, let's update the list' state. 
        
               Ok(SlidingSyncListLoadingState::FullyLoaded)

In your case, are you sure the count value is “stable”? It can increase with time (for example, when the user joins a new room), but it should not be “too dynamic”. What do you observe?

The MSC4186 fast-path in sync_v5::sync_events early-returns an empty body when the client's pos is ahead of curr_sn -- the typical idle case where a client immediately re-polls with the pos we just handed back. Because palpo writes pos = curr_sn + 1, that condition fires on every quiet polling cycle, and the response degenerates to {"pos":"200"} with no `lists` field at all. matrix-rust-sdk's SlidingSyncList::update only calls update_request_generator_state when the response carries a count. With count missing, fully_loaded stays false in the Running phase, which keeps the room-list service in pos=N&timeout=0 mode. A field report from a 2-room test account observed 8 827 identical pos=200&timeout=0 requests in 24h on a single client. Fix: * Extract a compute_active_rooms helper that runs only the filter pipeline (is_invite / room_types / not_room_types / is_dm / is_encrypted) and returns the surviving rooms. process_lists keeps owning the sort + ops/range pass. * Move room-set + DM-set loading above the since_sn > curr_sn branch so the fast path can produce counts without re-querying. * In the fast path, emit one SyncList { count, ops: vec![] } per requested list. ops stays skip_serializing_if=Vec::is_empty, so the wire shape is { "count": N } -- exactly what every other server (synapse, conduwuit, sliding-sync proxy) emits in this case. is_empty_for_long_poll already ignores `lists`, so palpo#72's empty-body short-circuit still treats count-only responses as long-poll empty. has_list_count_changes correctly reports "no change" when the cached count matches, so the long-poll guard keeps firing for genuinely idle clients. Tests: * core: count_only_list_serializes_count_field guards the wire shape; count_only_response_is_long_poll_empty guards palpo#72. * server (sync_msc4186): idle_repoll_preserves_count_and_long_poll_semantics walks the handler's full decision sequence across a two-step scenario; count_change_after_room_join_breaks_out_of_long_poll and newly_introduced_list_is_treated_as_count_change cover the counter-cases. * server (sync_v5): five compute_active_rooms unit tests for the no-DB filter branches (no filters / is_invite ± / is_dm ±). Full end-to-end coverage (HTTP + DB) is left to complement / sytest; a follow-up should add a test that drives sync_events_v5 with five consecutive incremental syncs and asserts count is present in each. Refs: matrix-org/matrix-rust-sdk#6361

TigerInYourDream · 2026-04-30T06:37:36Z

Retracting my Round-2 explanation

You were right. is_fully_loaded() derives only from lists.<id>.count (request_generator.rs:243-264) — OTK count and empty extension containers don't enter that calculation. Conflating client-side is_fully_loaded with Palpo's server-side empty-response check was wrong on my part.

What actually drives the loop (verified live with `curl`)

Palpo omits count from incremental list responses: a follow-up pos=N with no per-list change returns lists.all_rooms = {} — count / ops / room_ids all absent.
SlidingSyncList::update() (mod.rs:164-174) short-circuits when count is missing: the if let Some branch is skipped, so update_request_generator_state is never called.
set_sync_mode(Growing) at SettingUp -> Running resets fully_loaded to false (request_generator.rs:106); the list-level maximum_number_of_rooms survives, but the generator's flag does not.
Server returns the same pos (nothing changed) → Running + !is_fully_loaded → PollTimeout::Some(0) → tight loop. Concrete: 8,827 identical pos=200&timeout=0 requests in 24h on a 2-room test account.

Status & this PR

Palpo merged a fix to always emit count, and the loop is gone in our deployment. A separate, deeper issue in update() (falling back to the persisted count when omitted) probably deserves its own PR — happy to open one.

The narrowed change in this PR no longer fixes a live symptom for us. It's purely defense-in-depth: capping worst-case client traffic if anything (divergent server behavior, an SDK accounting bug) incorrectly leaves is_fully_loaded false against the Running arm. Whether to carry it as a guard is your call — merge or close, both work for me.

TigerInYourDream · 2026-04-30T06:44:24Z

Evidence appendix

Posting the raw artifacts in case any of this is worth checking independently or reusing in test fixtures.

1. curl reproduction against our Palpo (count is omitted on incremental responses)

Step 1 — fresh request, no pos:

{
  "pos": "200",
  "lists": { "all_rooms": { "count": 2, "ops": [{ "op": "SYNC", "range": [0, 1], "room_ids": [...] }] } },
  "rooms": { ... }
}

Step 2 — same body, with pos=200:

{
  "pos": "200",
  "lists": { "all_rooms": {} },
  "rooms": {}
}

Same pos returned, list block is {}. Repeating step 2 produces the same response indefinitely.

2. Loop signature in production logs

8827  /sync?pos=200&timeout=0
 382  /sync?pos=200&timeout=30000
 377  /sync?pos=129&timeout=0
 124  /sync?pos=106&timeout=0
  11  /sync?pos=111&timeout=30000
   6  /sync?pos=114&timeout=30000
   3  /sync?pos=129&timeout=30000
   3  /sync?pos=111&timeout=0
   2  /sync?pos=165&timeout=30000
   2  /sync?pos=163&timeout=30000

Same pos repeating thousands of times with timeout=0, with bursts of timeout=30000 at the same pos mixed in — meaning the closure sees is_fully_loaded flipping between true and false against an unchanging server response.

3. Account scale

3 users, 2 rooms total in the database. So count = 2, well below the Growing initial batch of 100. With a working chain, is_fully_loaded should become true on the first Growing response.

4. Where `fully_loaded` resets on mode switch

crates/matrix-sdk/src/sliding_sync/list/request_generator.rs:100-108 (SlidingSyncMode::Growing arm of SlidingSyncListRequestGenerator::new):

SlidingSyncMode::Growing { batch_size, maximum_number_of_rooms_to_fetch } => Self {
    ranges: Vec::new(),
    kind: SlidingSyncListRequestGeneratorKind::Growing {
        batch_size,
        maximum_number_of_rooms_to_fetch,
        number_of_fetched_rooms: 0,
        fully_loaded: false,
        requested_end: None,
    },
},

Combined with update() short-circuiting on None (mod.rs:164-174), once set_sync_mode(Growing) runs and the very next response omits count, fully_loaded is permanently stuck at false.

Hywan · 2026-04-30T06:47:44Z

Question: are you using an LLM to discuss with me, or to write the code? Please read our AI policy: https://github.com/matrix-org/matrix-rust-sdk/blob/main/CONTRIBUTING.md#ai-policy.

Palpo and Salvo both have proofs of heavy usage of LLM. It makes me question our discussion.

Hywan · 2026-05-05T09:45:55Z

Due to our AI policy violation, I'm closing this PR. Feel free to comment and to ask to reopen if you believe I'm mistaken. For the moment, I'm not convinced I'm talking to a human… (what a time to be alive…).

TigerInYourDream · 2026-05-06T07:36:42Z

Due to our AI policy violation, I'm closing this PR. Feel free to comment and to ask to reopen if you believe I'm mistaken. For the moment, I'm not convinced I'm talking to a human… (what a time to be alive…).

I want to clarify that my previous two comments were written with agent assistance. The code itself was written by me, but I understand and respect the project's AI policy and the maintainers' decision to close this PR.
After re-checking the behavior, I agree that the better fix for my observed case is on the server side: the list response should include count according to the spec / expected response model. With count present, the SDK can update the request generator state correctly.
It's OK to keep this PR closed. Thank you for the review and clarification.

fix(ui): long-poll room list sync after initial request

c757346

TigerInYourDream requested a review from a team as a code owner March 27, 2026 08:44

TigerInYourDream requested review from stefanceriu and removed request for a team March 27, 2026 08:44

fix: update timeout assertions in integration tests to match new policy

c47fa7e

stefanceriu requested review from Hywan and removed request for stefanceriu March 27, 2026 10:07

TigerInYourDream mentioned this pull request Mar 27, 2026

Track idle sliding-sync request loop after initial sync Project-Robius-China/robrix2#30

Open

TigerInYourDream mentioned this pull request Apr 16, 2026

fix(deps): backport matrix-rust-sdk#6361 to stop idle sync loop Project-Robius-China/robrix2#101

Open

4 tasks

TigerInYourDream mentioned this pull request Apr 30, 2026

fix(sliding-sync): always emit list count on idle re-poll palpo-im/palpo#181

Merged

4 tasks

Hywan closed this May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ui): long-poll room list sync after initial request#6361

fix(ui): long-poll room list sync after initial request#6361
TigerInYourDream wants to merge 3 commits into
matrix-org:mainfrom
TigerInYourDream:fix/room-list-long-poll-after-initial-sync

TigerInYourDream commented Mar 27, 2026 •

edited

Loading

Uh oh!

TigerInYourDream commented Mar 27, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Mar 27, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Mar 27, 2026 •

edited

Loading

Uh oh!

Hywan commented Apr 22, 2026 •

edited

Loading

Uh oh!

TigerInYourDream commented Apr 23, 2026 •

edited

Loading

Uh oh!

TigerInYourDream commented Apr 23, 2026

Uh oh!

Hywan commented Apr 28, 2026 •

edited

Loading

Uh oh!

TigerInYourDream commented Apr 30, 2026

Uh oh!

TigerInYourDream commented Apr 30, 2026

Uh oh!

Hywan commented Apr 30, 2026

Uh oh!

Hywan commented May 5, 2026

Uh oh!

TigerInYourDream commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TigerInYourDream commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Test

Uh oh!

TigerInYourDream commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reproduction context

Observed symptom

Uh oh!

codspeed-hq Bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

codecov Bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Hywan commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TigerInYourDream commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Where the report came from

Why it's a two-sided bug, and why the SDK side still matters

Where my original patch overreached

Narrower change I just pushed (a5d7247a6)

Uh oh!

TigerInYourDream commented Apr 23, 2026

Uh oh!

Hywan commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TigerInYourDream commented Apr 30, 2026

Retracting my Round-2 explanation

What actually drives the loop (verified live with curl)

Status & this PR

Uh oh!

TigerInYourDream commented Apr 30, 2026

Evidence appendix

1. curl reproduction against our Palpo (count is omitted on incremental responses)

2. Loop signature in production logs

3. Account scale

4. Where fully_loaded resets on mode switch

Uh oh!

Hywan commented Apr 30, 2026

Uh oh!

Hywan commented May 5, 2026

Uh oh!

TigerInYourDream commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TigerInYourDream commented Mar 27, 2026 •

edited

Loading

TigerInYourDream commented Mar 27, 2026 •

edited

Loading

codspeed-hq Bot commented Mar 27, 2026 •

edited

Loading

codecov Bot commented Mar 27, 2026 •

edited

Loading

Hywan commented Apr 22, 2026 •

edited

Loading

TigerInYourDream commented Apr 23, 2026 •

edited

Loading

Narrower change I just pushed (`a5d7247a6`)

Hywan commented Apr 28, 2026 •

edited

Loading

What actually drives the loop (verified live with `curl`)

4. Where `fully_loaded` resets on mode switch