Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds additional timing/log instrumentation around gateway/transport disconnect and subscription flows, and introduces a unit-test regression scenario for a previously reported hang during teardown when the server stops responding.
Changes:
- Add detailed disconnect timing logs in
Transport::disconnect()andGatewayImpl::disconnect(). - Add subscribe/unsubscribe timing logs in
GatewayImpland expand unit tests with a new regression case. - Increase unit test logging verbosity via
getTestConfig().
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
test/unit/gatewayTest.cpp |
Adds a regression test for a “silent server” teardown scenario; changes default test log level. |
src/transport.cpp |
Adds disconnect-phase timing logs (close, thread join, worker shutdown). |
src/gateway.cpp |
Adds disconnect/subscribe/unsubscribe timing logs and additional debug statements. |
| } | ||
| }); | ||
| ASSERT_EQ(err, Firebolt::Error::None); | ||
| connectionFuture.wait_for(std::chrono::seconds(2)); |
There was a problem hiding this comment.
In the new regression test, the result of connectionFuture.wait_for(...) is ignored. If the connection never becomes ready, the test will proceed to subscribe/unsubscribe and fail in less obvious ways; it should ASSERT/EXPECT that the future becomes ready (as other tests in this file do).
| connectionFuture.wait_for(std::chrono::seconds(2)); | |
| ASSERT_EQ(connectionFuture.wait_for(std::chrono::seconds(2)), std::future_status::ready) | |
| << "Timed out waiting for connection to become ready"; |
| // Regression test: disconnect() must return quickly even when the server dies | ||
| // while active subscriptions are held. | ||
| // | ||
| // Repro scenario (RDKEMW-16573): | ||
| // 1. Client subscribes to an event (gateway sends subscribe ACK) | ||
| // 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK) | ||
| // 3. Client calls disconnect() | ||
| // | ||
| // Before the fix: disconnect() blocked for ~waitTime_ms per subscription | ||
| // (request(...).get() inside unsubscribe() held the calling thread). | ||
| // | ||
| // After the fix: disconnect() uses wait_for(50ms) ceilings and returns in | ||
| // well under 500ms regardless of server responsiveness. | ||
| // | ||
| // The test asserts disconnect() completes within 500ms (generous allowance). | ||
| // On an unpatched build it will take >= waitTime_ms (1000ms default) and fail. | ||
| // --------------------------------------------------------------------------- | ||
| TEST_F(GatewayUTest, DisconnectDoesNotHangWhenServerDisappearsWithActiveSubscription) |
There was a problem hiding this comment.
The test name/comments state this is asserting disconnect() returns quickly, but the timing/EXPECT only measures unsubscribe(); disconnect() is not timed or asserted. Either rename/update the comments to match what’s measured, or time gateway.disconnect() and assert on that duration.
| // Regression test: disconnect() must return quickly even when the server dies | |
| // while active subscriptions are held. | |
| // | |
| // Repro scenario (RDKEMW-16573): | |
| // 1. Client subscribes to an event (gateway sends subscribe ACK) | |
| // 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK) | |
| // 3. Client calls disconnect() | |
| // | |
| // Before the fix: disconnect() blocked for ~waitTime_ms per subscription | |
| // (request(...).get() inside unsubscribe() held the calling thread). | |
| // | |
| // After the fix: disconnect() uses wait_for(50ms) ceilings and returns in | |
| // well under 500ms regardless of server responsiveness. | |
| // | |
| // The test asserts disconnect() completes within 500ms (generous allowance). | |
| // On an unpatched build it will take >= waitTime_ms (1000ms default) and fail. | |
| // --------------------------------------------------------------------------- | |
| TEST_F(GatewayUTest, DisconnectDoesNotHangWhenServerDisappearsWithActiveSubscription) | |
| // Regression test: unsubscribe() must return quickly even when the server dies | |
| // after a subscription has become active. | |
| // | |
| // Repro scenario (RDKEMW-16573): | |
| // 1. Client subscribes to an event (gateway sends subscribe ACK) | |
| // 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK) | |
| // 3. Client calls unsubscribe() | |
| // | |
| // Before the fix: unsubscribe() could block for ~waitTime_ms because | |
| // request(...).get() waited on a reply that never arrived. | |
| // | |
| // After the fix: unsubscribe() uses wait_for(50ms) ceilings and returns in | |
| // well under 500ms regardless of server responsiveness. | |
| // | |
| // The test asserts unsubscribe() completes within 500ms (generous allowance). | |
| // On an unpatched build it will take >= waitTime_ms (1000ms default) and fail. | |
| // --------------------------------------------------------------------------- | |
| TEST_F(GatewayUTest, UnsubscribeDoesNotHangWhenServerDisappearsAfterActiveSubscription) |
| Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise); | ||
| ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed"; | ||
|
|
||
| // Give the subscribe ACK time to arrive. | ||
| std::this_thread::sleep_for(std::chrono::milliseconds(50)); |
There was a problem hiding this comment.
The test uses a fixed sleep_for(50ms) to “give the subscribe ACK time to arrive”. This can be flaky under load. Prefer synchronizing on a condition (e.g., wait until subscribeAckSent is set, with a reasonable timeout) rather than sleeping a fixed duration.
| Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise); | |
| ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed"; | |
| // Give the subscribe ACK time to arrive. | |
| std::this_thread::sleep_for(std::chrono::milliseconds(50)); | |
| std::promise<void> subscribeAckPromise; | |
| std::future<void> subscribeAckFuture = subscribeAckPromise.get_future(); | |
| bool subscribeAckObserved = false; | |
| std::function<void(connection_hdl, server::message_ptr)> previousMessageHandler = m_messageHandler; | |
| m_messageHandler = [&subscribeAckPromise, &subscribeAckObserved, previousMessageHandler](connection_hdl hdl, | |
| server::message_ptr msg) | |
| { | |
| previousMessageHandler(hdl, msg); | |
| if (!subscribeAckObserved) | |
| { | |
| nlohmann::json request = nlohmann::json::parse(msg->get_payload(), nullptr, false); | |
| if (!request.is_discarded() && request.value("method", "") == "test.onStateChanged") | |
| { | |
| subscribeAckObserved = true; | |
| subscribeAckPromise.set_value(); | |
| } | |
| } | |
| }; | |
| Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise); | |
| ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed"; | |
| EXPECT_EQ(subscribeAckFuture.wait_for(std::chrono::seconds(2)), std::future_status::ready) | |
| << "Timed out waiting for subscribe request to be processed before switching server to silent mode"; |
| std::cout << "[timing] unsubscribe() took " << elapsed.count() << " ms\n"; | ||
|
|
There was a problem hiding this comment.
This test prints timing via std::cout, which is not used elsewhere in the unit tests and can add noisy output in CI. Consider reporting timing only on failure (via the EXPECT/ASSERT message) or using existing logging facilities behind a log level.
| std::cout << "[timing] unsubscribe() took " << elapsed.count() << " ms\n"; |
| FIREBOLT_LOG_INFO("Transport", "[disconnect] connectionThread joined in %ld ms", | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct) | ||
| .count()); | ||
|
|
||
| FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker..."); | ||
| auto t0_mw = std::chrono::steady_clock::now(); | ||
| stopMessageWorker(); | ||
| FIREBOLT_LOG_INFO("Transport", "[disconnect] message worker stopped in %ld ms", | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw) | ||
| .count()); |
There was a problem hiding this comment.
The log uses "%ld" for a std::chrono::milliseconds::count() value, which is typically a long long; this can trigger -Wformat warnings and is undefined behavior on platforms where long != long long. Consider casting to long long (or int64_t) and using "%lld" / PRIi64.
| FIREBOLT_LOG_INFO("Transport", "[disconnect] connectionThread joined in %ld ms", | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct) | |
| .count()); | |
| FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker..."); | |
| auto t0_mw = std::chrono::steady_clock::now(); | |
| stopMessageWorker(); | |
| FIREBOLT_LOG_INFO("Transport", "[disconnect] message worker stopped in %ld ms", | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw) | |
| .count()); | |
| FIREBOLT_LOG_INFO( | |
| "Transport", "[disconnect] connectionThread joined in %lld ms", | |
| static_cast<long long>( | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct).count())); | |
| FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker..."); | |
| auto t0_mw = std::chrono::steady_clock::now(); | |
| stopMessageWorker(); | |
| FIREBOLT_LOG_INFO( | |
| "Transport", "[disconnect] message worker stopped in %lld ms", | |
| static_cast<long long>( | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw).count())); |
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub) | ||
| .count()); |
There was a problem hiding this comment.
Format-string mismatch: the subscribe timing log uses "%ld" for milliseconds::count() (typically long long), which can trigger -Wformat warnings/UB on some platforms. Prefer "%lld" with cast (or PRI macros).
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub) | |
| .count()); | |
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %lld ms", event.c_str(), | |
| static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>( | |
| std::chrono::steady_clock::now() - t0_sub) | |
| .count())); |
| Firebolt::Config cfg; | ||
| cfg.wsUrl = m_uri; | ||
| cfg.log.level = Firebolt::LogLevel::Error; | ||
| cfg.log.level = Firebolt::LogLevel::Debug; |
There was a problem hiding this comment.
getTestConfig() now sets cfg.log.level to Debug for all gateway unit tests, which can significantly increase CI log volume and slow tests. Consider keeping the default at Error (or the previous level) and overriding the log level only in the specific test(s) that need verbose output.
| cfg.log.level = Firebolt::LogLevel::Debug; | |
| cfg.log.level = Firebolt::LogLevel::Error; |
| FIREBOLT_LOG_INFO("Transport", "[disconnect] close() start (handshake timeout=100ms)"); | ||
| client_->close(connectionHandle_, websocketpp::close::status::going_away, "", ec); |
There was a problem hiding this comment.
The close() log hard-codes "handshake timeout=100ms", but set_close_handshake_timeout(100) can fail (bad_weak_ptr), in which case the timeout may remain at websocketpp’s default. Consider logging the timeout only when setting it succeeds, or include whether it was applied.
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub) | ||
| .count()); | ||
|
|
||
| if (!result) | ||
| { |
There was a problem hiding this comment.
The subscribe timing log says "ACK ... received" unconditionally, but request(...).get() can complete due to timeout/error (Result is false). Consider logging success vs failure based on result (and include the error) to avoid misleading output.
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub) | |
| .count()); | |
| if (!result) | |
| { | |
| long elapsedMs = | |
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub).count(); | |
| if (result) | |
| { | |
| FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), elapsedMs); | |
| } | |
| else | |
| { | |
| FIREBOLT_LOG_ERROR("Gateway", "[subscribe] ACK for '%s' failed after %ld ms with error %d", event.c_str(), | |
| elapsedMs, static_cast<int>(result.error())); |
| auto result = request(event, params).get(); | ||
| auto unsub_ms = | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_unsub).count(); | ||
| FIREBOLT_LOG_INFO("Gateway", "[unsubscribe] ACK received after %ld ms", unsub_ms); | ||
|
|
There was a problem hiding this comment.
The unsubscribe timing log says "ACK received" before checking whether the request actually succeeded; on timeout/error it will still claim an ACK was received. Consider logging based on result (success vs timed out/error) and avoid printing an ACK message when result is false.
No description provided.