Skip to content

feat: more logging#81

Open
brendanobra wants to merge 2 commits intomainfrom
topic/nojira-yet-just-logging
Open

feat: more logging#81
brendanobra wants to merge 2 commits intomainfrom
topic/nojira-yet-just-logging

Conversation

@brendanobra
Copy link
Copy Markdown
Contributor

No description provided.

Copilot AI review requested due to automatic review settings April 6, 2026 21:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds additional timing/log instrumentation around gateway/transport disconnect and subscription flows, and introduces a unit-test regression scenario for a previously reported hang during teardown when the server stops responding.

Changes:

  • Add detailed disconnect timing logs in Transport::disconnect() and GatewayImpl::disconnect().
  • Add subscribe/unsubscribe timing logs in GatewayImpl and expand unit tests with a new regression case.
  • Increase unit test logging verbosity via getTestConfig().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.

File Description
test/unit/gatewayTest.cpp Adds a regression test for a “silent server” teardown scenario; changes default test log level.
src/transport.cpp Adds disconnect-phase timing logs (close, thread join, worker shutdown).
src/gateway.cpp Adds disconnect/subscribe/unsubscribe timing logs and additional debug statements.

Comment thread test/unit/gatewayTest.cpp
}
});
ASSERT_EQ(err, Firebolt::Error::None);
connectionFuture.wait_for(std::chrono::seconds(2));
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new regression test, the result of connectionFuture.wait_for(...) is ignored. If the connection never becomes ready, the test will proceed to subscribe/unsubscribe and fail in less obvious ways; it should ASSERT/EXPECT that the future becomes ready (as other tests in this file do).

Suggested change
connectionFuture.wait_for(std::chrono::seconds(2));
ASSERT_EQ(connectionFuture.wait_for(std::chrono::seconds(2)), std::future_status::ready)
<< "Timed out waiting for connection to become ready";

Copilot uses AI. Check for mistakes.
Comment thread test/unit/gatewayTest.cpp
Comment on lines +598 to +615
// Regression test: disconnect() must return quickly even when the server dies
// while active subscriptions are held.
//
// Repro scenario (RDKEMW-16573):
// 1. Client subscribes to an event (gateway sends subscribe ACK)
// 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK)
// 3. Client calls disconnect()
//
// Before the fix: disconnect() blocked for ~waitTime_ms per subscription
// (request(...).get() inside unsubscribe() held the calling thread).
//
// After the fix: disconnect() uses wait_for(50ms) ceilings and returns in
// well under 500ms regardless of server responsiveness.
//
// The test asserts disconnect() completes within 500ms (generous allowance).
// On an unpatched build it will take >= waitTime_ms (1000ms default) and fail.
// ---------------------------------------------------------------------------
TEST_F(GatewayUTest, DisconnectDoesNotHangWhenServerDisappearsWithActiveSubscription)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name/comments state this is asserting disconnect() returns quickly, but the timing/EXPECT only measures unsubscribe(); disconnect() is not timed or asserted. Either rename/update the comments to match what’s measured, or time gateway.disconnect() and assert on that duration.

Suggested change
// Regression test: disconnect() must return quickly even when the server dies
// while active subscriptions are held.
//
// Repro scenario (RDKEMW-16573):
// 1. Client subscribes to an event (gateway sends subscribe ACK)
// 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK)
// 3. Client calls disconnect()
//
// Before the fix: disconnect() blocked for ~waitTime_ms per subscription
// (request(...).get() inside unsubscribe() held the calling thread).
//
// After the fix: disconnect() uses wait_for(50ms) ceilings and returns in
// well under 500ms regardless of server responsiveness.
//
// The test asserts disconnect() completes within 500ms (generous allowance).
// On an unpatched build it will take >= waitTime_ms (1000ms default) and fail.
// ---------------------------------------------------------------------------
TEST_F(GatewayUTest, DisconnectDoesNotHangWhenServerDisappearsWithActiveSubscription)
// Regression test: unsubscribe() must return quickly even when the server dies
// after a subscription has become active.
//
// Repro scenario (RDKEMW-16573):
// 1. Client subscribes to an event (gateway sends subscribe ACK)
// 2. Server disappears abruptly (no WS close handshake, no unsubscribe ACK)
// 3. Client calls unsubscribe()
//
// Before the fix: unsubscribe() could block for ~waitTime_ms because
// request(...).get() waited on a reply that never arrived.
//
// After the fix: unsubscribe() uses wait_for(50ms) ceilings and returns in
// well under 500ms regardless of server responsiveness.
//
// The test asserts unsubscribe() completes within 500ms (generous allowance).
// On an unpatched build it will take >= waitTime_ms (1000ms default) and fail.
// ---------------------------------------------------------------------------
TEST_F(GatewayUTest, UnsubscribeDoesNotHangWhenServerDisappearsAfterActiveSubscription)

Copilot uses AI. Check for mistakes.
Comment thread test/unit/gatewayTest.cpp
Comment on lines +671 to +675
Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise);
ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed";

// Give the subscribe ACK time to arrive.
std::this_thread::sleep_for(std::chrono::milliseconds(50));
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses a fixed sleep_for(50ms) to “give the subscribe ACK time to arrive”. This can be flaky under load. Prefer synchronizing on a condition (e.g., wait until subscribeAckSent is set, with a reasonable timeout) rather than sleeping a fixed duration.

Suggested change
Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise);
ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed";
// Give the subscribe ACK time to arrive.
std::this_thread::sleep_for(std::chrono::milliseconds(50));
std::promise<void> subscribeAckPromise;
std::future<void> subscribeAckFuture = subscribeAckPromise.get_future();
bool subscribeAckObserved = false;
std::function<void(connection_hdl, server::message_ptr)> previousMessageHandler = m_messageHandler;
m_messageHandler = [&subscribeAckPromise, &subscribeAckObserved, previousMessageHandler](connection_hdl hdl,
server::message_ptr msg)
{
previousMessageHandler(hdl, msg);
if (!subscribeAckObserved)
{
nlohmann::json request = nlohmann::json::parse(msg->get_payload(), nullptr, false);
if (!request.is_discarded() && request.value("method", "") == "test.onStateChanged")
{
subscribeAckObserved = true;
subscribeAckPromise.set_value();
}
}
};
Firebolt::Error subErr = gateway.subscribe("test.onStateChanged", onEvent, &eventPromise);
ASSERT_EQ(subErr, Firebolt::Error::None) << "subscribe() failed";
EXPECT_EQ(subscribeAckFuture.wait_for(std::chrono::seconds(2)), std::future_status::ready)
<< "Timed out waiting for subscribe request to be processed before switching server to silent mode";

Copilot uses AI. Check for mistakes.
Comment thread test/unit/gatewayTest.cpp
Comment on lines +694 to +695
std::cout << "[timing] unsubscribe() took " << elapsed.count() << " ms\n";

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test prints timing via std::cout, which is not used elsewhere in the unit tests and can add noisy output in CI. Consider reporting timing only on failure (via the EXPECT/ASSERT message) or using existing logging facilities behind a log level.

Suggested change
std::cout << "[timing] unsubscribe() took " << elapsed.count() << " ms\n";

Copilot uses AI. Check for mistakes.
Comment thread src/transport.cpp
Comment on lines +252 to +261
FIREBOLT_LOG_INFO("Transport", "[disconnect] connectionThread joined in %ld ms",
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct)
.count());

FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker...");
auto t0_mw = std::chrono::steady_clock::now();
stopMessageWorker();
FIREBOLT_LOG_INFO("Transport", "[disconnect] message worker stopped in %ld ms",
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw)
.count());
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log uses "%ld" for a std::chrono::milliseconds::count() value, which is typically a long long; this can trigger -Wformat warnings and is undefined behavior on platforms where long != long long. Consider casting to long long (or int64_t) and using "%lld" / PRIi64.

Suggested change
FIREBOLT_LOG_INFO("Transport", "[disconnect] connectionThread joined in %ld ms",
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct)
.count());
FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker...");
auto t0_mw = std::chrono::steady_clock::now();
stopMessageWorker();
FIREBOLT_LOG_INFO("Transport", "[disconnect] message worker stopped in %ld ms",
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw)
.count());
FIREBOLT_LOG_INFO(
"Transport", "[disconnect] connectionThread joined in %lld ms",
static_cast<long long>(
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_ct).count()));
FIREBOLT_LOG_INFO("Transport", "[disconnect] stopping message worker...");
auto t0_mw = std::chrono::steady_clock::now();
stopMessageWorker();
FIREBOLT_LOG_INFO(
"Transport", "[disconnect] message worker stopped in %lld ms",
static_cast<long long>(
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_mw).count()));

Copilot uses AI. Check for mistakes.
Comment thread src/gateway.cpp Outdated
Comment on lines +535 to +537
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(),
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub)
.count());
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format-string mismatch: the subscribe timing log uses "%ld" for milliseconds::count() (typically long long), which can trigger -Wformat warnings/UB on some platforms. Prefer "%lld" with cast (or PRI macros).

Suggested change
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(),
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub)
.count());
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %lld ms", event.c_str(),
static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now() - t0_sub)
.count()));

Copilot uses AI. Check for mistakes.
Comment thread test/unit/gatewayTest.cpp
Firebolt::Config cfg;
cfg.wsUrl = m_uri;
cfg.log.level = Firebolt::LogLevel::Error;
cfg.log.level = Firebolt::LogLevel::Debug;
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTestConfig() now sets cfg.log.level to Debug for all gateway unit tests, which can significantly increase CI log volume and slow tests. Consider keeping the default at Error (or the previous level) and overriding the log level only in the specific test(s) that need verbose output.

Suggested change
cfg.log.level = Firebolt::LogLevel::Debug;
cfg.log.level = Firebolt::LogLevel::Error;

Copilot uses AI. Check for mistakes.
Comment thread src/transport.cpp
Comment on lines +238 to 239
FIREBOLT_LOG_INFO("Transport", "[disconnect] close() start (handshake timeout=100ms)");
client_->close(connectionHandle_, websocketpp::close::status::going_away, "", ec);
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The close() log hard-codes "handshake timeout=100ms", but set_close_handshake_timeout(100) can fail (bad_weak_ptr), in which case the timeout may remain at websocketpp’s default. Consider logging the timeout only when setting it succeeds, or include whether it was applied.

Copilot uses AI. Check for mistakes.
Comment thread src/gateway.cpp Outdated
Comment on lines 535 to 540
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(),
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub)
.count());

if (!result)
{
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subscribe timing log says "ACK ... received" unconditionally, but request(...).get() can complete due to timeout/error (Result is false). Consider logging success vs failure based on result (and include the error) to avoid misleading output.

Suggested change
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(),
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub)
.count());
if (!result)
{
long elapsedMs =
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_sub).count();
if (result)
{
FIREBOLT_LOG_INFO("Gateway", "[subscribe] ACK for '%s' received in %ld ms", event.c_str(), elapsedMs);
}
else
{
FIREBOLT_LOG_ERROR("Gateway", "[subscribe] ACK for '%s' failed after %ld ms with error %d", event.c_str(),
elapsedMs, static_cast<int>(result.error()));

Copilot uses AI. Check for mistakes.
Comment thread src/gateway.cpp Outdated
Comment on lines 594 to 598
auto result = request(event, params).get();
auto unsub_ms =
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - t0_unsub).count();
FIREBOLT_LOG_INFO("Gateway", "[unsubscribe] ACK received after %ld ms", unsub_ms);

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsubscribe timing log says "ACK received" before checking whether the request actually succeeded; on timeout/error it will still claim an ACK was received. Consider logging based on result (success vs timed out/error) and avoid printing an ACK message when result is false.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants