Description
Describe the bug
The Azure Cosmos Java SDK has circular class initialization (<clinit>) dependencies that cause a permanent, unrecoverable JVM-level deadlock when two threads concurrently trigger Cosmos SDK class loading. The deadlock occurs between JsonSerializable, CosmosItemRequestOptions, CosmosAsyncClient, and related classes, all through ImplementationBridgeHelpers.initializeAllAccessors().
A similar issue was previously identified and fixed specifically for Kafka connectors in PR #46378: fixKafkaConnectorStuckIssue, but that fix was not comprehensive. The same circular pattern exists in the core Cosmos SDK classes and can be triggered by any application that concurrently initializes Cosmos SDK classes from multiple threads.
JVM class initialization monitors have no timeout — this deadlock is permanent and unrecoverable without killing the process.
Exception or Stack Trace
Thread dump captured from a production JVM (~336 seconds uptime, 200/200 Tomcat request threads stuck permanently):
Thread A (exec-31, elapsed=221.74s):
  Thread.State: RUNNABLE
  - new SqlParameter() → JsonSerializable.<clinit>()
    → ImplementationBridgeHelpers.initializeAllAccessors()
    → CosmosItemRequestOptions.<clinit>()
    → CosmosDiagnosticsThresholdsHelper → FeedResponse.<clinit>()
    → CosmosPagedFluxDefaultImpl.<clinit>()
    → CosmosAsyncContainer.<clinit>()
    → BridgeInternal.initializeAllAccessors()
    → ⛔ WAITS on CosmosAsyncClient class init (owned by Thread B)
Thread B (exec-13, elapsed=228.03s):
  Thread.State: RUNNABLE
  - CosmosClientBuilder.buildClient() → CosmosAsyncClient.<clinit>()
    → ImplementationBridgeHelpers.initializeAllAccessors()
    → ModelBridgeInternal.initializeAllAccessors()
    → ⛔ WAITS on CosmosItemRequestOptions class init (part of Thread A's chain)
Result: Classic AB/BA deadlock on JVM class initialization monitors.
- 192 threads blocked waiting on the JsonSerializable class init monitor (held by Thread A / exec-31)
- 6 threads blocked on Guava cache waitForLoadingValue (waiting for Thread B / exec-13 to finish buildClient())
- 1 thread (exec-31) deadlocked in the JsonSerializable.<clinit>() chain
- 1 thread (exec-13) deadlocked in the CosmosAsyncClient.<clinit>() chain
- 200/200 request threads permanently stuck: the process is completely unresponsive
All 192 threads blocked on JsonSerializable show the same stack:
at com.azure.cosmos.models.SqlParameter.<init>(SqlParameter.java:41)
- waiting on the Class initialization monitor for com.azure.cosmos.implementation.JsonSerializable
To Reproduce
The deadlock is a non-deterministic race condition triggered when two threads concurrently initiate Cosmos SDK class loading for the first time:
- Thread A creates a new SqlParameter(), which triggers JsonSerializable.<clinit>()
- Thread B calls CosmosClientBuilder.buildClient(), which triggers CosmosAsyncClient.<clinit>()
- Both <clinit> methods call ImplementationBridgeHelpers.initializeAllAccessors(), which eagerly initializes multiple SDK classes
- The initialization chains create circular dependencies:
  - The JsonSerializable.<clinit>() chain eventually needs CosmosAsyncClient to be initialized
  - The CosmosAsyncClient.<clinit>() chain eventually needs CosmosItemRequestOptions to be initialized (part of JsonSerializable's chain)
- Permanent deadlock: neither thread can ever make progress
The race window exists during application startup or any time Cosmos SDK classes are loaded for the first time. Pods/processes where all Cosmos SDK classes happen to be loaded by a single thread (or sequentially) are unaffected.
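The mechanism can be illustrated without the SDK. The following is a minimal sketch, not SDK code: the classes First and Second and the latch are hypothetical stand-ins for the two initialization chains, and the latch only exists to force the overlap that is a timing race in production. Daemon threads and a join timeout let the demo terminate and report the result instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ClinitDeadlockDemo {
    // Opens once both threads are inside their static initializers,
    // which makes the (normally rare) overlap happen every time.
    private static final CountDownLatch BOTH_INSIDE = new CountDownLatch(2);

    static void rendezvous() {
        BOTH_INSIDE.countDown();
        try {
            BOTH_INSIDE.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for the JsonSerializable chain.
    static final class First {
        static {
            rendezvous();
            Second.touch(); // needs Second initialized while First is still in progress
        }
        static void touch() { }
    }

    // Hypothetical stand-in for the CosmosAsyncClient chain.
    static final class Second {
        static {
            rendezvous();
            First.touch(); // needs First initialized while Second is still in progress
        }
        static void touch() { }
    }

    /** Returns true if both initializing threads are still stuck after the timeout. */
    static boolean reproducesDeadlock() {
        Thread a = new Thread(() -> First.touch(), "init-first");
        Thread b = new Thread(() -> Second.touch(), "init-second");
        a.setDaemon(true); // daemon threads so the JVM can still exit
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(1500);
            b.join(1500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + reproducesDeadlock());
    }
}
```

Each thread holds one class's in-progress initialization and waits for the other's, which is the same shape as the exec-31 / exec-13 pair in the thread dump.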
Code Snippet
The root cause is in ImplementationBridgeHelpers.initializeAllAccessors(), which is called from multiple class <clinit> methods. This creates the following circular initialization dependency graph:
JsonSerializable.<clinit>()
└→ ImplementationBridgeHelpers.initializeAllAccessors()
   └→ CosmosItemRequestOptions.<clinit>()
      └→ ... → FeedResponse.<clinit>()
         └→ CosmosPagedFluxDefaultImpl.<clinit>()
            └→ CosmosAsyncContainer.<clinit>()
               └→ BridgeInternal.initializeAllAccessors()
                  └→ needs CosmosAsyncClient initialized ← CIRCULAR

CosmosAsyncClient.<clinit>()
└→ ImplementationBridgeHelpers.initializeAllAccessors()
   └→ ModelBridgeInternal.initializeAllAccessors()
      └→ needs CosmosItemRequestOptions initialized ← CIRCULAR
Expected behavior
Cosmos SDK class initialization should be safe under concurrent class loading. ImplementationBridgeHelpers.initializeAllAccessors() should not create circular <clinit> dependencies that can deadlock. The fix applied in PR #46378 for the Kafka connector addressed one instance of this pattern, but a comprehensive fix is needed across all Cosmos SDK classes that call initializeAllAccessors() from their <clinit> methods.
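As an illustration of the expected property only, and not a proposal for how the SDK should be restructured, the sketch below shows two hypothetical classes (First and Second) that still reference each other but never do so from a static initializer. The latch again forces both initializers to overlap. Because neither <clinit> waits on the other class, both threads complete.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class AcyclicInitDemo {
    // Forces both static initializers to run at the same time.
    private static final CountDownLatch BOTH_INSIDE = new CountDownLatch(2);

    static void rendezvous() {
        BOTH_INSIDE.countDown();
        try {
            BOTH_INSIDE.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static final class First {
        static {
            rendezvous(); // no reference to Second during initialization
        }
        static String name() { return "First"; }
        // The cross-class reference is deferred until after First.<clinit> has finished.
        static String peerName() { return Second.name(); }
    }

    static final class Second {
        static {
            rendezvous(); // no reference to First during initialization
        }
        static String name() { return "Second"; }
        static String peerName() { return First.name(); }
    }

    /** Returns true if both threads finish despite overlapping initialization. */
    static boolean completesConcurrently() {
        Thread a = new Thread(() -> First.peerName(), "use-first");
        Thread b = new Thread(() -> Second.peerName(), "use-second");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(1500);
            b.join(1500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + completesConcurrently());
    }
}
```

The only difference from a deadlocking arrangement is where the cross-class call sits: inside a method invoked after initialization rather than inside the static initializer itself.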
Screenshots
N/A — diagnosed via JVM thread dump analysis.
Setup (please complete the following information):
- OS: Linux (Kubernetes)
- IDE: VS Code
- Library/Libraries: com.azure:azure-sdk-bom:1.3.4
- Java version: 17
- App Server/Environment: Tomcat
- Frameworks: Spring Boot
Additional context
Workaround: Force eager single-threaded initialization of all Cosmos SDK accessor bridges at application startup, before concurrent access is possible:
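    import com.azure.cosmos.implementation.ImplementationBridgeHelpers;
    import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2.x
    import org.springframework.core.Ordered;
    import org.springframework.core.annotation.Order;
    import org.springframework.stereotype.Component;

    @Component
    @Order(Ordered.HIGHEST_PRECEDENCE)
    public class CosmosClassInitializer {
        /**
         * Eagerly initializes Cosmos SDK accessor bridges on the main thread
         * during application startup, before any concurrent access is possible.
         */
        @PostConstruct
        public void initializeCosmosClasses() {
            try {
                ImplementationBridgeHelpers.initializeAllAccessors();
            } catch (Exception e) {
                throw new IllegalStateException(
                    "Cosmos SDK initialization failed, causing application startup failure.", e);
            }
        }
    }

This works because @PostConstruct runs on the main thread during Spring context startup, before Tomcat begins accepting requests. Calling ImplementationBridgeHelpers.initializeAllAccessors() on a single thread forces all Cosmos SDK classes in the circular dependency chain to complete their <clinit> sequentially, eliminating the concurrent class loading race window. Note that @Order(Ordered.HIGHEST_PRECEDENCE) does not by itself control the order in which Spring creates singleton beans (that is determined by dependency relationships and @DependsOn), so the protection comes from running before any request thread exists rather than from being the first bean created.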
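For applications that do not use Spring, the same idea can be sketched with plain JDK calls: initialize the relevant classes one after another on a single thread before any worker threads start. This is a hedged sketch, not a verified fix. SequentialClassPreloader and preload are hypothetical names, and which class names to pass is an assumption; the thread dump suggests com.azure.cosmos.implementation.JsonSerializable and com.azure.cosmos.CosmosAsyncClient as starting points.

```java
import java.util.ArrayList;
import java.util.List;

final class SequentialClassPreloader {
    /**
     * Loads and initializes each named class on the calling thread, in order.
     * Returns the names that could not be found so the caller can log them.
     */
    static List<String> preload(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = SequentialClassPreloader.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize = true runs the static initializer now, on this thread
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as arguments, before starting any worker threads.
        List<String> missing = preload(List.of(args));
        System.out.println("not found: " + missing);
    }
}
```

Calling preload from main() before starting a thread pool or server keeps every <clinit> in the chain on one thread, which is the property the Spring workaround relies on.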
Information Checklist
Kindly make sure that you have added all the following information above and check off the required fields; otherwise we will treat the issue as an incomplete report
- Bug Description Added
- Repro Steps Added
- Setup information Added