Skip to content

Reduce global lock contention in Factory resolution#365

Draft
FKREISE wants to merge 2 commits into
hmlongco:mainfrom
FKREISE:feature/reduce-global-lock-contention
Draft

Reduce global lock contention in Factory resolution#365
FKREISE wants to merge 2 commits into
hmlongco:mainfrom
FKREISE:feature/reduce-global-lock-contention

Conversation

@FKREISE
Copy link
Copy Markdown

@FKREISE FKREISE commented May 20, 2026

Summary

This PR eliminates the global recursive lock as a bottleneck during factory resolution. The current implementation holds globalRecursiveLock for the entire duration of resolve() — including while factory closures execute. This serializes all resolution across all threads, even when factories are independent and cached values are available.

The new design splits resolution into a fast path (cache hits, no global lock) and a slow path (first resolve only, minimal lock duration), achieving significant throughput improvement on concurrent workloads with realistic factory costs.

Motivation

In apps with 100+ services resolving at launch across multiple threads (e.g., app startup on modern multi-core devices), the global lock creates a convoy effect: threads queue up waiting for unrelated factories to complete. Every microsecond a factory closure spends reading a config, opening a database, or decoding JSON blocks every other thread in the process from resolving anything.

Design

Resolution Plan Cache

On first resolve, a lightweight ResolutionPlan struct is built and cached per-key on the ContainerManager. Subsequent resolves read the plan under a CrossPlatformLock (nanoseconds) and go directly to the scope cache — bypassing the global lock, options dictionary lookups, and registration checks entirely.

Plans are invalidated on any write operation: .register {}, .reset(), .decorator(), .push()/.pop(), scope changes, and context modifications. A monotonic generation counter prevents stale deferred plan stores from racing with concurrent mutations.

Lock Hierarchy

Fast path:   planLock (CrossPlatformLock) → cache.lock (CrossPlatformLock) → done
Slow path:   globalRecursiveLock → planLock → cache.lock → per-key MutexLock (if singleton/cached)

No inversion possible — fast path never touches the global lock, slow path acquires in consistent order.

Per-Key Creation Locks

Singleton and cached scopes use a per-key MutexLock (sleeping, non-recursive) with double-checked locking to guarantee factory-called-once semantics without holding the global lock during factory execution. Creation locks are auto-pruned after use.

Thread-Local State

  • Graph scope: Depth counter and graph cache moved to TLS — concurrent resolution cycles no longer interfere.
  • DEBUG state: Circular dependency tracking and trace resolutions moved to TLS — no contention on debug-only globals.

@MainActorLazyInjected

New lock-free property wrapper for @MainActor-isolated types. Omits all locking since main actor isolation already guarantees single-threaded access — zero overhead for the common case of view models and UI services.

This is a pragmatic compromise. Ideally, the @LazyInjected macro could inspect the enclosing declaration's actor isolation at compile time and expand to the correct code automatically — no wrapper choice, no lock allocation, no per-access lock overhead. However, Swift macros depend on SwiftSyntax, which significantly increases build times and still has reliability issues with prebuilt binary support. Until the macro ecosystem matures, a separate property wrapper is the simplest correct solution.

New Lock Backends (prepared, not yet active)

The codebase includes conditional implementations for modern lock primitives behind compiler flags (SynchronizationLock, AllocatedLock). These are not enabled in this PR — SPM package traits cannot conditionally raise the deployment target, so enabling them would either break consumers on older OS versions or require a hard minimum bump. They're ready to activate when Factory's deployment target is raised in a future release.

  • SynchronizationLock — Uses Swift 6 Mutex (zero-cost inline lock, requires iOS 18+ / macOS 15+)
  • AllocatedLock — Uses OSAllocatedUnfairLock (single allocation, requires iOS 16+ / macOS 13+)
  • Default (active)os_unfair_lock via UnsafeMutablePointer (all OS versions)

Profiling (Instruments)

Measured with os_signpost intervals around resolve() and global lock acquisition in a real app resolving ~1,000 factories during launch.

Baseline (main)

factory_baseline
Metric Count Duration Avg Max
Resolve 1,079 1.49 s 1.38 ms 51.67 ms
Lock Acquire 1,079 667.90 ms 619.00 µs 51.67 ms

This PR

final_results
Metric Count Duration Avg Max
Resolve 1,053 1.04 s 984.79 µs 30.94 ms
Lock Acquire 519 11.19 ms 21.57 µs 1.95 ms

Key Improvements

  • Lock contention eliminated: Average lock wait dropped from 619 µs → 21.57 µs (~29× reduction)
  • Tail latency crushed: Max lock wait dropped from 51.67 ms → 1.95 ms (~26× reduction)
  • Fast path working: 52% fewer lock acquisitions — cache hits bypass the global lock entirely
  • Total time in locks: 667.90 ms → 11.19 ms (~60× less time blocked)

What's Changed

  • Registrations.swift — Fast path / slow path split in resolve(), ResolutionPlan struct, generation-guarded deferred plan store for circular dependency safety
  • Scopes.swiftScope.Cache is now internally synchronized (CrossPlatformLock + per-key MutexLock for exclusive creation with auto-prune), Graph scope uses TLS
  • Containers.swift — Plan cache on ContainerManager with generation counter, invalidation calls on reset/push/pop/decorator/defaultScope
  • Globals.swiftThreadLocalDebugState replacing global mutable vars
  • Locking.swiftCrossPlatformLock (renamed from SpinLock) with conditional backends (prepared, not active), new MutexLock type
  • MainActorInjections.swift — New @MainActorLazyInjected property wrapper (lock-free)
  • Package.swift — Removed .dynamic from FactoryTesting
  • Tests — Concurrency stress tests

Breaking Changes

None. Public API is unchanged. Internal behavior is semantically equivalent with the documented eventual-consistency caveat on concurrent registration (which was already racy by design in the original — the global lock only prevented crashes, not deterministic ordering).

Risks

  • Scope and complexity of change: This PR touches the core resolution engine. At ~1,000 lines across 8 files, it replaces the single-lock-guards-everything model with a multi-lock design involving plan caches, generation counters, per-key creation locks, and TLS state. Any subtle bug here (missed invalidation, lock ordering issue, race window) could manifest as wrong instances, deadlocks, or crashes in production apps — potentially under load patterns that unit tests don't cover. The existing test suite passes, and new stress tests cover the key invariants, but the nature of concurrency bugs is that they hide.
  • Eventual consistency on concurrent registration: A resolve() in flight during a .register {} may return the old value. This was already true before (the global lock prevented crashes but not deterministic ordering of concurrent read/write). Now it's documented explicitly.
  • Circular dependency deadlock in release builds: Scoped factories (singleton/cached) use non-recursive per-key creation locks. A circular dependency (A → B → A, all singletons) will deadlock the thread. In DEBUG builds, circular dependencies are detected before the per-key lock is acquired. This is acceptable — circular dependencies are always programmer errors, and DEBUG detection covers the development cycle.
  • Plan cache staleness: If a plan is cached and an external mutation path is missed (i.e., doesn't call invalidatePlan/invalidateAllPlans), the fast path could serve stale data. Mitigated by invalidating on all known write paths and by the generation counter on deferred stores.

Testing

  • All existing tests pass
  • New stress tests verify:
    • Singleton returns same instance across 100 threads
    • Singleton factory executes exactly once under contention
    • Cached scope returns stable instances under contention
    • Graph scope isolates per-thread resolution cycles
    • Decorators added after first resolve fire correctly (plan invalidation)
    • Container-level decorators invalidate plans

Declaration

The program was tested solely for our own use cases, which might differ from yours.

Link to provider information

https://github.com/mercedes-benz

Split resolve() into a fast path (plan cache + scope cache, no global
lock) and a slow path (first resolve only, minimal lock hold time).
Add per-key creation locks for singleton/cached scope, move graph
scope and DEBUG state to TLS, and thread-safe @LazyInjected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@FKREISE
Copy link
Copy Markdown
Author

FKREISE commented May 20, 2026

@hmlongco What are your thoughts? This is quite a big change, but we noticed the lock contention due to Factory resolutions in our app and this brings it down significantly with no changes on the consumer side.

- Replace CFAbsoluteTimeGetCurrent (Darwin-only, wall clock) with a
  cross-platform currentTimestamp() helper using CLOCK_MONOTONIC
- Add makePthreadKey(destructor:) to handle UnsafeMutableRawPointer
  optionality difference between Darwin and Linux
- Monotonic clock is correct for TTL: immune to NTP and clock adjustments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hmlongco
Copy link
Copy Markdown
Owner

hmlongco commented May 20, 2026

Looking at it. Not sure of ramifications yet, especially in regard to cross-container evaluations and other race conditions.

Just how many cross-threaded resolutions are you doing on launch, anyway?

On a side note, the property wrapper mentions "until Factory supports macros".

Are you aware of: https://github.com/hmlongco/Factory/tree/macros#factory-macros

@hmlongco
Copy link
Copy Markdown
Owner

hmlongco commented May 22, 2026

Attempted the following since it breaks fewer contracts, but multiple locking performance is even worse. The globalRecursiveLock is extremely fast, and the only way I could get performance to be worse in the stress test is to add delay loops to the initializers (duplicating a ton of heavy work done on init). Which shouldn't be occurring in the first place.

https://github.com/hmlongco/Factory/tree/locks

@FKREISE
Copy link
Copy Markdown
Author

FKREISE commented May 26, 2026

Attempted the following since it breaks fewer contracts, but multiple locking performance is even worse. The globalRecursiveLock is extremely fast, and the only way I could get performance to be worse in the stress test is to add delay loops to the initializers (duplicating a ton of heavy work done on init). Which shouldn't be occurring in the first place.

https://github.com/hmlongco/Factory/tree/locks

Thanks for your time. I'll check out your branch and give it a try.

The main problem on our end is how we use Factory. Some of our registrations can be pretty expensive — file I/O, waiting on the keychain and so on. Currently, it's all contended by the global lock. I understand the design philosophy though and that initializers should be slim.

I'm torn between fixing the contention in Factory itself or slimming down our initializers. Unfortunately, keeping init fast and deferring setup lazily is non-trivial in some of our cases.

@hmlongco
Copy link
Copy Markdown
Owner

Updates locks branch

@FKREISE
Copy link
Copy Markdown
Author

FKREISE commented May 27, 2026

Your locks branch is very promising! It's much simpler but as all our slow registrations are .cached or at least .shared they can now skip the global lock and the lock wait time is not an issue anymore.

feature/reduce-global-lock-contention

contentionrefactoring

locks

locksbranch

Review

I noticed an issue we need to address, though. With .singleton (and .cached) it's reasonable for consumers to assume that the init is only called once and only one instance is created (until reset is called for .cached). However, singletons can now be created more than once if multiple threads hit a cache miss. In Scope.resolve:

 Thread A: cache.value(key) → nil (miss)
 Thread B: cache.value(key) → nil (miss)
 Thread A: instance = factory()  → creates Instance #1
 Thread B: instance = factory()  → creates Instance #2
 Thread A: cache.set(key, #1)
 Thread B: cache.set(key, #2)    → overwrites #1

Result: two "singletons" are created. Thread A receives #1 but is not cached. #2 is cached for everyone else. If the singleton has side effects on init (or keeps state), there is a bug.
On this branch there is the exclusiveCreation flag in order to address this issue.

Claude also found two other issues which look legit.

Graph scope depth is process-global (cross-thread interference)

Location: Scopes.swift:125–150

internal func enter() {
    lock.withLock { depth += 1 }
}
internal func leave() {
    lock.withLock {
        depth -= 1
        if depth == 0 { cache.reset() }
    }
}
public private(set) var depth: Int = 0    // ← single Int, shared by all threads
internal var cache = Cache()               // ← single Cache, shared by all threads

Every resolve() calls enter()/leave() regardless of scope. Depth is one integer for the entire process.

Failure scenario:

Thread A: resolving root (graph-scoped deps)
  enter() → depth = 1
  resolves ServiceX → cached in graph.cache

Thread B: resolving unrelated service (no graph scope)
  enter() → depth = 2

Thread A: finishes
  leave() → depth = 1  ← NOT zero! Cache doesn't reset!

Thread C: resolving same root
  enter() → depth = 2
  resolves ServiceX → gets STALE cached instance from A's cycle ← BUG

push()/pop() bypasses the cache's own lock

Location: Containers.swift:392–404

public func push() {
    lock.withLock {                            // ← manager.lock
        stack.append((options, cache.cache, state))
                                      //     ↑ reads cache.cache (raw dictionary)
    }
}
public func pop() {
    lock.withLock {                            // ← manager.lock
        if let values = stack.popLast() {
            cache.cache = values.1             // ← writes cache.cache directly
        }
    }
}

Normal cache operations go through cache.lock (ReadWriteLock). But push()/pop() access cache.cache directly under manager.lock — a different lock entirely.

Failure scenario:

Thread A (test tearDown): pop()
  manager.lock.lock()
  cache.cache = savedSnapshot         // raw dictionary write

Thread B (resolve in progress):
  cache.value(forKey: key)
    cache.lock.withReadLock { ... }    // reading same dictionary under different lock

Data race on a Swift Dictionary — undefined behavior.

@hmlongco
Copy link
Copy Markdown
Owner

Updated locks branch to fix race on cache miss

@hmlongco
Copy link
Copy Markdown
Owner

Locks is now in develop as 3.1

@FKREISE
Copy link
Copy Markdown
Author

FKREISE commented May 28, 2026

Locks is now in develop as 3.1

Very cool! I have more time next week to look into this and test it in a big app.
The automated code review found a couple minor correctness issues, but I haven't looked at it more closely. If you have time, feel free to check it out:
locks-branch-review-plan.md

The decorator change you made is not quite clear to me. It's unrelated to the locking, but now the (default) decorator doesn't run anymore when a decorator is set explicitly?

Oh and the swift-atomics dependency I'm not sure if it's worth adding that just for the one file. A regular lock would work, too. However, most apps will include swift-atomics anyway as it's very widely used nowadays. 🤷

@hmlongco
Copy link
Copy Markdown
Owner

If you set a default scope on a container that scope is used unless a given factory overrides it.

If you set a default decorator on a container that decorator is used unless a given factory overrides it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants