Problem
FlakeIdGeneratorApiTest.testSmoke fails intermittently on the windows-64-(Release, Static, noSSL) CI configuration when running PR #1448. The failure manifests as a process crash rather than a test assertion failure.
CI Error
Exception: E06D7363.?AVoverflow_error@std@@
E06D7363 is the Windows SEH exception code for any C++ exception thrown via MSVC's __CxxThrowException. The RTTI tag ?AVoverflow_error@std@@ is MSVC's mangled name for std::overflow_error. This indicates an unhandled std::overflow_error reached Windows' unhandled-exception filter and terminated the process.
The error was observed repeatedly across multiple test invocations on the same CI run, indicating a reproducible race condition rather than a one-off fluke.
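Affected Configuration
CI configuration: windows-64-(Release, Static, noSSL)
Branch: ihsandemir/update_test_server_version (PR #1448, "Update test Hazelcast server version to 5.7.0 in start-rc scripts"); not reproducible on master
Test: FlakeIdGeneratorApiTest.testSmoke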
Root Cause Analysis
The only source of std::overflow_error in the codebase
std::overflow_error is thrown in exactly one place:
hazelcast/src/hazelcast/client/proxy.cpp — new_id_internal():
int64_t
flake_id_generator_impl::new_id_internal()
{
    auto b = block_.load();
    if (b) {
        int64_t res = b->next();
        if (res != INT64_MIN) {
            return res;
        }
    }
    throw std::overflow_error(""); // <-- sole source of the crash
}
This exception is used as a control-flow signal (not an error): "the local prefetch batch is exhausted; fetch a new one from the server." It is intended to be caught immediately in the calling function new_id():
boost::future<int64_t>
flake_id_generator_impl::new_id()
{
    try {
        return boost::make_ready_future(new_id_internal());
    } catch (std::overflow_error&) { // <-- expected catch site
        return new_id_batch(batch_size_)
          .then(boost::launch::sync, ...); // async chain begins here
    }
}
Why the exception escapes on Windows/MSVC
The batch-fetch future chain is:
    invocation_promise_.then(user_executor, id_seq_lambda) → F1
    F1.then(launch::sync, decode_lambda)                   → F2
    F2.then(launch::sync, block_callback)                  → F3
complete_call_id_sequence() in spi.cpp checks user_executor.closed() at call time. If the user executor is closed (race during client shutdown) it substitutes boost::launch::sync, making the entire chain synchronous in whatever thread fires the parent promise — potentially an IO thread.
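The effect of that fallback can be illustrated with a minimal, standard-library-only sketch. It is not the Hazelcast or Boost.Thread code: `queue_executor`, `mini_promise`, `attach` and `runs_on_completing_thread` are hypothetical names invented for illustration, and the sketch models only where the continuation runs, not how Boost captures exceptions.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the user executor: tasks wait in a queue until
// an executor thread drains them.
struct queue_executor {
    bool closed = false;
    std::deque<std::function<void()>> tasks;

    void submit(std::function<void()> task) { tasks.push_back(std::move(task)); }

    void drain() {
        while (!tasks.empty()) {
            auto task = std::move(tasks.front());
            tasks.pop_front();
            task();
        }
    }
};

// Hypothetical promise that supports a single continuation.
struct mini_promise {
    std::function<void(int)> continuation;

    // Called by whichever thread completes the invocation (e.g. an IO thread).
    void set_value(int value) {
        if (continuation) continuation(value);
    }
};

// The executor state is sampled once, at the time the continuation is attached.
void attach(mini_promise& promise, queue_executor& executor,
            std::function<void(int)> fn) {
    if (executor.closed) {
        // Fallback: the continuation runs inline on the completing thread.
        promise.continuation = std::move(fn);
    } else {
        // Normal path: the continuation is handed off to the executor.
        promise.continuation = [&executor, fn = std::move(fn)](int value) {
            executor.submit([fn, value] { fn(value); });
        };
    }
}

// Returns true when the continuation already ran inside set_value().
bool runs_on_completing_thread(bool executor_closed) {
    queue_executor executor;
    executor.closed = executor_closed;
    mini_promise promise;
    bool ran = false;
    attach(promise, executor, [&ran](int) { ran = true; });
    promise.set_value(42);
    const bool ran_inline = ran;
    executor.drain();
    return ran_inline;
}
```

With a closed executor the continuation has already run by the time `set_value()` returns, so whatever it does happens on the completing thread. With an open executor it only runs once the executor drains its queue.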
The IO threads introduced by PR #1448 are started without any exception guard:
// network.cpp:144
io_threads_.emplace_back([raw_ctx]() { raw_ctx->run(); }); // no try/catch
On Windows/MSVC with Release-build optimizations, Boost.Thread's boost::launch::sync continuation mechanism does not reliably contain exceptions within a continuation's promise when the underlying exception machinery is SEH-based. A std::overflow_error active in a parent continuation context can leak through the chain into the IO thread or user executor thread, neither of which has a handler, causing the process to crash.
The developer already observed overflow_error escaping from invocation_promise_.set_exception() and added broader catch clauses in ClientInvocation::set_exception() (commits 809bb5dc9, 09ec6a5a8), but crashes persist because additional escape paths exist through the raw_ctx->run() loop.
Why this is a regression vs. master
master uses a single IO thread model; PR #1448 introduces multiple IO threads and the associated race with user-executor shutdown. The interaction between the new launch::sync-fallback path and the unguarded raw_ctx->run() loop is what allows the exception to terminate a thread.
Frequency
The test generates ~4,000 batch-exhaustion events per run (400,000 IDs ÷ default prefetch batch of 100). Each event triggers the throw. With many throws per run, even a low-probability escape path becomes a near-certainty.
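The arithmetic behind this can be sketched as follows. The per-event escape probability is not known from the CI data. The value used in the note after the code is purely an assumed figure for illustration, and the formula additionally assumes the events are independent.

```cpp
#include <cmath>

// Number of times the local batch runs out: one exhaustion per full batch.
long exhaustion_events(long total_ids, long batch_size) {
    return total_ids / batch_size;
}

// Probability that at least one of n independent events escapes, given an
// assumed per-event escape probability p: 1 - (1 - p)^n.
double escape_probability(double p, long n) {
    return 1.0 - std::pow(1.0 - p, static_cast<double>(n));
}
```

`exhaustion_events(400000, 100)` gives 4000. With an assumed escape probability of 0.1% per throw, `escape_probability(0.001, 4000)` is roughly 0.98, so even a rare escape is very likely to occur at least once in a run.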
Proposed Fix
Two complementary changes:
1. Eliminate std::overflow_error as control flow (primary fix)
Replace the throw/catch pattern in new_id_internal() / new_id() with the sentinel return value INT64_MIN, identical to the pattern already used by Block::next(). This removes the exception from the codebase entirely, making the crash structurally impossible.
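A minimal sketch of the sentinel-based shape, assuming a simplified `block` class and a synchronous `fetch_batch` callback. Both are hypothetical stand-ins: the real code holds the block in an atomic member and returns a boost::future from the batch path, neither of which is modelled here.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>
#include <memory>

// Sentinel meaning "local batch exhausted"; same value as INT64_MIN.
constexpr std::int64_t exhausted = std::numeric_limits<std::int64_t>::min();

// Hypothetical stand-in for the prefetched block; only next() is modelled.
class block {
public:
    block(std::int64_t base, std::int64_t count) : base_(base), count_(count) {}

    // Returns the next ID, or the sentinel once the block is used up.
    std::int64_t next() {
        const std::int64_t index = used_.fetch_add(1);
        return index < count_ ? base_ + index : exhausted;
    }

private:
    std::int64_t base_;
    std::int64_t count_;
    std::atomic<std::int64_t> used_{0};
};

// Sentinel-returning variant: nothing is thrown when the batch runs out.
std::int64_t new_id_internal(const std::shared_ptr<block>& b) {
    return b ? b->next() : exhausted;
}

// The caller branches on the sentinel instead of catching an exception.
// fetch_batch is a hypothetical placeholder for the server round trip.
template <typename FetchBatch>
std::int64_t new_id(const std::shared_ptr<block>& b, FetchBatch fetch_batch) {
    const std::int64_t res = new_id_internal(b);
    return res != exhausted ? res : fetch_batch();
}
```

The design point is that batch exhaustion becomes an ordinary return value, so no exception object exists that could cross a continuation or thread boundary.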
2. Guard IO thread loops against uncaught exceptions (defensive fix)
Wrap raw_ctx->run() in a try-catch so that any future misbehaving handler cannot silently crash an IO thread.
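One possible shape for such a guard is sketched below. `run_guarded` and the `report` callback are hypothetical names, not existing client code, and whether the loop should be re-entered after an exception is a separate design decision that the sketch leaves open.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Runs an event loop and turns any escaping exception into a report instead
// of letting it leave the thread function (which would terminate the process).
// Returns true if the loop exited normally.
//
// Possible use at the thread start site (assumed, not the actual code):
//   io_threads_.emplace_back([raw_ctx, report]() {
//       run_guarded([raw_ctx] { raw_ctx->run(); }, report);
//   });
template <typename RunLoop>
bool run_guarded(RunLoop&& run_loop,
                 const std::function<void(const std::string&)>& report) {
    try {
        run_loop();
        return true;
    } catch (const std::exception& e) {
        report(e.what());
    } catch (...) {
        report("unknown exception");
    }
    return false;
}
```

Reporting through a callback rather than swallowing the exception keeps the failure visible in logs while the process stays alive.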
A detailed design document and fix are provided in the associated PR.