[BUG] Cosmos's lazy class initialization causes deadlocks #48585

@ashbeitz

Describe the bug
The Azure Cosmos Java SDK has circular class initialization (<clinit>) dependencies that cause a permanent, unrecoverable JVM-level deadlock when two threads concurrently trigger Cosmos SDK class loading. The deadlock occurs between JsonSerializable, CosmosItemRequestOptions, CosmosAsyncClient, and related classes, all through ImplementationBridgeHelpers.initializeAllAccessors().

A similar issue was previously identified and fixed specifically for Kafka connectors in PR #46378: fixKafkaConnectorStuckIssue, but that fix was not comprehensive. The same circular pattern exists in the core Cosmos SDK classes and can be triggered by any application that concurrently initializes Cosmos SDK classes from multiple threads.

JVM class initialization monitors have no timeout — this deadlock is permanent and unrecoverable without killing the process.

Exception or Stack Trace
Thread dump captured from a production JVM (~336 seconds uptime, 200/200 Tomcat request threads stuck permanently):

Thread A (exec-31, elapsed=221.74s):
   Thread.State: RUNNABLE
   - new SqlParameter() → JsonSerializable.<clinit>()
     → ImplementationBridgeHelpers.initializeAllAccessors()
          → CosmosItemRequestOptions.<clinit>()
               → CosmosDiagnosticsThresholdsHelper → FeedResponse.<clinit>()
                    → CosmosPagedFluxDefaultImpl.<clinit>()
                         → CosmosAsyncContainer.<clinit>()
                              → BridgeInternal.initializeAllAccessors()
                                   → ⛔ WAITS on CosmosAsyncClient class init (owned by Thread B)

Thread B (exec-13, elapsed=228.03s):
   Thread.State: RUNNABLE
   - CosmosClientBuilder.buildClient() → CosmosAsyncClient.<clinit>()
     → ImplementationBridgeHelpers.initializeAllAccessors()
          → ModelBridgeInternal.initializeAllAccessors()
               → ⛔ WAITS on CosmosItemRequestOptions class init (part of Thread A's chain)

Result: Classic AB/BA deadlock on JVM class initialization monitors.

  • 192 threads blocked waiting on JsonSerializable class init monitor (held by Thread A / exec-31)
  • 6 threads blocked on Guava cache waitForLoadingValue (waiting for Thread B / exec-13 to finish buildClient())
  • 1 thread (exec-31) deadlocked in JsonSerializable.<clinit>() chain
  • 1 thread (exec-13) deadlocked in CosmosAsyncClient.<clinit>() chain
  • 200/200 request threads permanently stuck — process is completely unresponsive

All 192 threads blocked on JsonSerializable show the same stack:

at com.azure.cosmos.models.SqlParameter.<init>(SqlParameter.java:41)
- waiting on the Class initialization monitor for com.azure.cosmos.implementation.JsonSerializable
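Note that both deadlocked threads report Thread.State: RUNNABLE, so filtering a dump for BLOCKED threads will miss them. As a rough diagnostic aid, the sketch below scans live threads for a <clinit> frame anywhere on the stack. The class and method names (ClinitStackScanner, insideStaticInitializer, threadsInsideStaticInitializers) are hypothetical, and the scan only finds threads that are inside an initializer chain (like exec-31 and exec-13), not the threads merely waiting for one to finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: lists threads that are currently inside a static initializer. */
final class ClinitStackScanner {

    /** True if any frame in the given stack is a static initializer (<clinit>). */
    static boolean insideStaticInitializer(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if ("<clinit>".equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Names of live threads that have a <clinit> frame on their stack right now. */
    static List<String> threadsInsideStaticInitializers() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            if (insideStaticInitializer(entry.getValue())) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A thread that shows up here across several samples taken seconds apart
        // is a candidate for a stuck class initialization.
        System.out.println(threadsInsideStaticInitializers());
    }
}
```

A single sample proves nothing, since class initialization is normally brief; the signal is the same thread appearing in repeated samples.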

To Reproduce
The deadlock is a non-deterministic race condition triggered when two threads concurrently initiate Cosmos SDK class loading for the first time:

  1. Thread A creates a new SqlParameter() — triggers JsonSerializable.<clinit>()
  2. Thread B calls CosmosClientBuilder.buildClient() — triggers CosmosAsyncClient.<clinit>()
  3. Both <clinit> methods call ImplementationBridgeHelpers.initializeAllAccessors(), which eagerly initializes multiple SDK classes
  4. The initialization chains create circular dependencies:
    • JsonSerializable.<clinit>() chain eventually needs CosmosAsyncClient to be initialized
    • CosmosAsyncClient.<clinit>() chain eventually needs CosmosItemRequestOptions to be initialized (part of JsonSerializable's chain)
  5. Permanent deadlock — neither thread can ever make progress

The race window exists during application startup or any time Cosmos SDK classes are loaded for the first time. Pods/processes where all Cosmos SDK classes happen to be loaded by a single thread (or sequentially) are unaffected.
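The same AB/BA shape can be shown without the SDK. The sketch below is only an illustration of the mechanism, not Cosmos code: classes A and B, the latch, and the timeouts are hypothetical. Each static initializer waits until the other has started (to make the race deterministic) and then touches the other class, so each thread holds one class's initialization lock while waiting for the other's.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Minimal, SDK-independent sketch of an AB/BA class-initialization deadlock. */
final class ClinitDeadlockDemo {

    // Reaches zero once both threads are inside their own static initializer.
    static final CountDownLatch BOTH_INITIALIZING = new CountDownLatch(2);

    static void rendezvous() {
        BOTH_INITIALIZING.countDown();
        try {
            BOTH_INITIALIZING.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static final class A {
        static {
            rendezvous();  // A's initialization is now in progress on this thread
            B.touch();     // requires B to be initialized
        }
        static void touch() { }
    }

    static final class B {
        static {
            rendezvous();  // B's initialization is now in progress on this thread
            A.touch();     // requires A to be initialized
        }
        static void touch() { }
    }

    /** Returns true if both initializing threads are still alive after the timeouts. */
    static boolean deadlocks(long timeoutMillis) {
        // Lambdas (not method references) so A and B are first touched on the new threads.
        Thread t1 = new Thread(() -> A.touch(), "init-A");
        Thread t2 = new Thread(() -> B.touch(), "init-B");
        t1.setDaemon(true);  // daemon threads let the JVM exit despite the deadlock
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(timeoutMillis);
            t2.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked = " + deadlocks(500));
    }
}
```

In the SDK the cycle runs through several intermediate classes and has no rendezvous, which is why the real deadlock is intermittent rather than guaranteed.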

Code Snippet
The root cause is in ImplementationBridgeHelpers.initializeAllAccessors(), which is called from multiple class <clinit> methods. This creates the following circular initialization dependency graph:

JsonSerializable.<clinit>()
  └→ ImplementationBridgeHelpers.initializeAllAccessors()
       └→ CosmosItemRequestOptions.<clinit>()
            └→ ... → FeedResponse.<clinit>()
                 └→ CosmosPagedFluxDefaultImpl.<clinit>()
                      └→ CosmosAsyncContainer.<clinit>()
                           └→ BridgeInternal.initializeAllAccessors()
                                └→ needs CosmosAsyncClient initialized ← CIRCULAR

CosmosAsyncClient.<clinit>()
  └→ ImplementationBridgeHelpers.initializeAllAccessors()
       └→ ModelBridgeInternal.initializeAllAccessors()
            └→ needs CosmosItemRequestOptions initialized ← CIRCULAR
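To make the cycle explicit, the two chains can be merged into one directed graph and searched for a back edge. The sketch below is a simplified model built only from the class names in this report: each edge means "initialization eventually triggers", the elided intermediate steps are collapsed, and InitCycleFinder and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: depth-first search for a cycle in a class-initialization graph. */
final class InitCycleFinder {

    /** Simplified edges taken from the two chains in this report. */
    static Map<String, List<String>> reportedGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("JsonSerializable", List.of("CosmosItemRequestOptions"));
        g.put("CosmosItemRequestOptions", List.of("FeedResponse"));
        g.put("FeedResponse", List.of("CosmosPagedFluxDefaultImpl"));
        g.put("CosmosPagedFluxDefaultImpl", List.of("CosmosAsyncContainer"));
        g.put("CosmosAsyncContainer", List.of("CosmosAsyncClient"));
        g.put("CosmosAsyncClient", List.of("CosmosItemRequestOptions"));
        return g;
    }

    /** Returns one cycle reachable from start (first node repeated at the end), or an empty list. */
    static List<String> findCycle(Map<String, List<String>> graph, String start) {
        return dfs(graph, start, new LinkedHashSet<>(), new HashSet<>());
    }

    private static List<String> dfs(Map<String, List<String>> graph, String node,
                                    LinkedHashSet<String> path, Set<String> done) {
        if (path.contains(node)) {
            // Back edge: the cycle is the part of the current path starting at node.
            List<String> cycle = new ArrayList<>();
            boolean inCycle = false;
            for (String p : path) {
                if (p.equals(node)) {
                    inCycle = true;
                }
                if (inCycle) {
                    cycle.add(p);
                }
            }
            cycle.add(node);
            return cycle;
        }
        if (done.contains(node)) {
            return List.of();  // already fully explored, no cycle through here
        }
        path.add(node);
        for (String next : graph.getOrDefault(node, List.of())) {
            List<String> cycle = dfs(graph, next, path, done);
            if (!cycle.isEmpty()) {
                return cycle;
            }
        }
        path.remove(node);
        done.add(node);
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", findCycle(reportedGraph(), "JsonSerializable")));
    }
}
```

In this model JsonSerializable is only the entry point; the cycle itself runs from CosmosItemRequestOptions through CosmosAsyncClient and back.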

Expected behavior
Cosmos SDK class initialization should be safe under concurrent class loading. ImplementationBridgeHelpers.initializeAllAccessors() should not create circular <clinit> dependencies that can deadlock. The fix applied in PR #46378 for the Kafka connector addressed one instance of this pattern, but a comprehensive fix is needed across all Cosmos SDK classes that call initializeAllAccessors() from their <clinit> methods.

Screenshots
N/A — diagnosed via JVM thread dump analysis.

Setup (please complete the following information):

  • OS: Linux (Kubernetes)
  • IDE: VS Code
  • Library/Libraries: com.azure:azure-sdk-bom:1.3.4
  • Java version: 17
  • App Server/Environment: Tomcat
  • Frameworks: Spring Boot

Additional context
Workaround: Force eager single-threaded initialization of all Cosmos SDK accessor bridges at application startup, before concurrent access is possible:

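// Assumes Spring Boot 3.x; on Spring Boot 2.x use javax.annotation.PostConstruct instead.
import jakarta.annotation.PostConstruct;

import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

import com.azure.cosmos.implementation.ImplementationBridgeHelpers;

@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class CosmosClassInitializer {
    /**
     * Eagerly initializes Cosmos SDK accessor bridges on the main thread
     * during application startup, before any concurrent access is possible.
     */
    @PostConstruct
    public void initializeCosmosClasses() {
        try {
            ImplementationBridgeHelpers.initializeAllAccessors();
        } catch (Exception e) {
            throw new IllegalStateException(
                "Cosmos SDK initialization failed, causing application startup failure.", e);
        }
    }
}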

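This works because @PostConstruct runs on the main thread during Spring context startup, before Tomcat begins accepting requests. Calling ImplementationBridgeHelpers.initializeAllAccessors() on a single thread forces all Cosmos SDK classes in the circular dependency chain to complete their <clinit> sequentially, eliminating the concurrent class loading race window. One caveat: @Order(Ordered.HIGHEST_PRECEDENCE) affects ordering among injected or sorted components, not singleton creation order, so it does not by itself guarantee that this bean initializes first. If another bean starts threads that use Cosmos classes during startup, make it depend on this initializer explicitly (for example with @DependsOn).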
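For applications that do not use Spring, the same idea can be approximated by initializing the relevant classes on one thread before any worker threads start. The sketch below is a generic helper, not an SDK API: SequentialClassInitializer and initializeInOrder are hypothetical names, and it relies only on Class.forName with initialize set to true.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: forces class initialization sequentially on the calling thread. */
final class SequentialClassInitializer {

    /**
     * Loads and initializes each named class in order on the current thread.
     * Returns the names that could not be loaded or initialized.
     */
    static List<String> initializeInOrder(String... classNames) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = SequentialClassInitializer.class.getClassLoader();
        }
        List<String> failed = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize = true runs the class's static initializer now, if not already run.
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Call this before creating thread pools or accepting requests.
        List<String> failed = initializeInOrder(args);
        if (!failed.isEmpty()) {
            throw new IllegalStateException("Could not initialize: " + failed);
        }
    }
}
```

For the scenario in this report, one candidate argument is com.azure.cosmos.implementation.JsonSerializable, the class named in the thread dump. Whether initializing that one class covers the whole chain is an assumption to verify against the SDK version in use.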

Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Labels

  • Client: This issue points to a problem in the data-plane of the library.
  • Cosmos
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
