FlakeIdGeneratorApiTest.testSmoke crashes with std::overflow_error on Windows (PR #1448) [API-2380] #1450

@ihsandemir

Problem

FlakeIdGeneratorApiTest.testSmoke fails intermittently on the windows-64-(Release, Static, noSSL) CI configuration when running PR #1448. The failure is a process crash, not a test assertion failure.

CI Error

Exception: E06D7363.?AVoverflow_error@std@@

E06D7363 is the Windows SEH exception code that MSVC uses for every C++ exception raised through _CxxThrowException. The RTTI tag ?AVoverflow_error@std@@ is MSVC's mangled name for std::overflow_error. Together they show that an unhandled std::overflow_error reached Windows' unhandled-exception filter and terminated the process.
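As a small aid for reading crash logs like the one above, the sketch below decodes the exception code. The three low bytes of 0xE06D7363 are the ASCII characters "msc"; the helper names are hypothetical and exist only for illustration.

```cpp
#include <cstdint>
#include <string>

// Exception code MSVC passes to RaiseException for every C++ throw.
constexpr std::uint32_t kMsvcCppExceptionCode = 0xE06D7363u;

// Returns the three low bytes of the code as ASCII, most significant first.
std::string low_bytes_as_ascii(std::uint32_t code)
{
    std::string out;
    for (int shift = 16; shift >= 0; shift -= 8) {
        out.push_back(static_cast<char>((code >> shift) & 0xFFu));
    }
    return out;
}

// True when an SEH code denotes an MSVC C++ exception rather than, say,
// an access violation (0xC0000005).
bool is_msvc_cpp_exception(std::uint32_t code)
{
    return code == kMsvcCppExceptionCode;
}
```

The code alone only says "some C++ exception"; the RTTI tag that follows it in the log is what identifies the type as std::overflow_error.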

The error was observed repeatedly across multiple test invocations on the same CI run, indicating a reproducible race condition rather than a one-off fluke.

Affected Configuration

Root Cause Analysis

The only source of std::overflow_error in the codebase

std::overflow_error is thrown in exactly one place:

hazelcast/src/hazelcast/client/proxy.cpp, in new_id_internal():

int64_t
flake_id_generator_impl::new_id_internal()
{
    auto b = block_.load();
    if (b) {
        int64_t res = b->next();
        if (res != INT64_MIN) {
            return res;
        }
    }
    throw std::overflow_error("");   // <-- sole source of the crash
}

This exception is used as a control-flow signal (not an error): "the local prefetch batch is exhausted; fetch a new one from the server." It is intended to be caught immediately in the calling function new_id():

boost::future<int64_t>
flake_id_generator_impl::new_id()
{
    try {
        return boost::make_ready_future(new_id_internal());
    } catch (std::overflow_error&) {          // <-- expected catch site
        return new_id_batch(batch_size_)
          .then(boost::launch::sync, ...);    // async chain begins here
    }
}

Why the exception escapes on Windows/MSVC

The batch-fetch future chain is:

invocation_promise_  .then(user_executor,    id_seq_lambda)  → F1
F1                   .then(launch::sync,     decode_lambda)  → F2
F2                   .then(launch::sync,     block_callback) → F3

complete_call_id_sequence() in spi.cpp checks user_executor.closed() at call time. If the user executor is closed (race during client shutdown) it substitutes boost::launch::sync, making the entire chain synchronous in whatever thread fires the parent promise — potentially an IO thread.
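The fallback decision described above can be modelled without Boost. This is a minimal sketch, not the code in spi.cpp: toy_executor and dispatch are hypothetical names, and the executor is reduced to a task queue so the difference between "queued for later" and "run inline by the caller" is visible.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for the user executor; not the Hazelcast or Boost type.
struct toy_executor
{
    bool is_closed = false;
    std::queue<std::function<void()>> pending;

    bool closed() const { return is_closed; }

    // An open executor only stores the task; its own thread would run it later.
    void submit(std::function<void()> task) { pending.push(std::move(task)); }

    // Runs everything that was queued, in order.
    void drain()
    {
        while (!pending.empty()) {
            auto task = std::move(pending.front());
            pending.pop();
            task();
        }
    }
};

// Models the described check: a closed executor means the continuation runs
// inline on whichever thread completes the promise. Returns true in that case.
bool dispatch(toy_executor& executor, std::function<void()> continuation)
{
    if (executor.closed()) {
        continuation();
        return true;
    }
    executor.submit(std::move(continuation));
    return false;
}
```

In the inline branch, anything the continuation throws propagates straight into the caller of dispatch, which in the scenario above would be an IO thread.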

The IO threads introduced by PR #1448 are started without any exception guard:

// network.cpp:144
io_threads_.emplace_back([raw_ctx]() { raw_ctx->run(); });  // no try/catch

On Windows/MSVC with Release-build optimizations, Boost.Thread's boost::launch::sync continuation mechanism does not reliably contain exceptions within a continuation's promise when the underlying exception machinery is SEH-based. A std::overflow_error that is active in a parent continuation context can leak through the chain into the IO thread or the user executor thread. Neither thread has a handler, so the process crashes.

The developer already observed overflow_error escaping from invocation_promise_.set_exception() and added broader catch clauses in ClientInvocation::set_exception() (commits 809bb5dc9, 09ec6a5a8), but crashes persist because additional escape paths exist through the raw_ctx->run() loop.

Why this is a regression vs. master

The master branch uses a single-IO-thread model. PR #1448 introduces multiple IO threads and, with them, a race against user-executor shutdown. The combination of the new launch::sync fallback path and the unguarded raw_ctx->run() loop is what allows the exception to terminate a thread.

Frequency

The test generates ~4,000 batch-exhaustion events per run (400,000 IDs ÷ default prefetch batch of 100). Each event triggers the throw. With many throws per run, even a low-probability escape path becomes a near-certainty.
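The arithmetic behind this can be sketched as follows. The per-throw escape probability is not measured anywhere in this issue; the value 0.001 used in the note below is a purely hypothetical figure, and the formula assumes the throws are independent.

```cpp
#include <cmath>

// Number of batch-exhaustion events: total ids divided by ids per batch.
constexpr long exhaustion_events(long total_ids, long batch_size)
{
    return total_ids / batch_size;
}

// Probability that at least one of `events` independent throws escapes,
// given an assumed per-throw escape probability `per_throw`.
double escape_probability(double per_throw, long events)
{
    return 1.0 - std::pow(1.0 - per_throw, static_cast<double>(events));
}
```

With exhaustion_events(400000, 100) == 4000 and an assumed per-throw probability of 0.001, escape_probability returns about 0.98, which is why a rare escape path still shows up on most runs.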

Proposed Fix

Two complementary changes:

1. Eliminate std::overflow_error as control flow (primary fix)

Replace the throw/catch pattern in new_id_internal() / new_id() with the sentinel return value INT64_MIN, identical to the pattern already used by Block::next(). This removes the exception from the codebase entirely, making the crash structurally impossible.
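A minimal sketch of the sentinel approach, under simplifying assumptions: toy_block is a hypothetical stand-in for the real block type, the functions are free functions rather than members of flake_id_generator_impl, and the batch fetch is modelled as a synchronous callable instead of the boost::future chain.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

// Hypothetical simplified block: hands out `count` consecutive ids,
// then returns INT64_MIN to signal exhaustion.
class toy_block
{
public:
    toy_block(std::int64_t base, std::int64_t count)
      : base_(base), count_(count) {}

    std::int64_t next()
    {
        const std::int64_t index = used_.fetch_add(1);
        return index < count_ ? base_ + index : INT64_MIN;
    }

private:
    std::int64_t base_;
    std::int64_t count_;
    std::atomic<std::int64_t> used_{0};
};

// Sentinel-returning variant: exhaustion is reported as INT64_MIN, no throw.
std::int64_t new_id_internal(const std::shared_ptr<toy_block>& block)
{
    return block ? block->next() : INT64_MIN;
}

// The caller branches on the sentinel instead of catching std::overflow_error.
// `fetch_batch` stands in for the asynchronous new_id_batch() round trip.
template<typename FetchBatch>
std::int64_t new_id(std::shared_ptr<toy_block>& block, FetchBatch fetch_batch)
{
    const std::int64_t id = new_id_internal(block);
    if (id != INT64_MIN) {
        return id;
    }
    block = fetch_batch();
    return new_id_internal(block);
}
```

Because no exception is ever in flight on the exhaustion path, there is nothing for a continuation or IO thread to leak.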

2. Guard IO thread loops against uncaught exceptions (defensive fix)

Wrap raw_ctx->run() in a try-catch so that an exception escaping from any handler is caught and reported in the IO thread instead of terminating the process.

A detailed design document and fix are provided in the associated PR.

Metadata

Labels

Type: Defect, to-jira
