Reduce global lock contention in Factory resolution by FKREISE · Pull Request #365 · hmlongco/Factory

FKREISE · 2026-05-20T14:25:01Z

Summary

This PR eliminates the global recursive lock as a bottleneck during factory resolution. The current implementation holds globalRecursiveLock for the entire duration of resolve() — including while factory closures execute. This serializes all resolution across all threads, even when factories are independent and cached values are available.

The new design splits resolution into a fast path (cache hits, no global lock) and a slow path (first resolve only, minimal lock duration), achieving significant throughput improvement on concurrent workloads with realistic factory costs.

Motivation

In apps with 100+ services resolving at launch across multiple threads (e.g., app startup on modern multi-core devices), the global lock creates a convoy effect: threads queue up waiting for unrelated factories to complete. Every microsecond a factory closure spends reading a config, opening a database, or decoding JSON blocks every other thread in the process from resolving anything.

Design

Resolution Plan Cache

On first resolve, a lightweight ResolutionPlan struct is built and cached per-key on the ContainerManager. Subsequent resolves read the plan under a CrossPlatformLock (nanoseconds) and go directly to the scope cache — bypassing the global lock, options dictionary lookups, and registration checks entirely.

Plans are invalidated on any write operation: .register {}, .reset(), .decorator(), .push()/.pop(), scope changes, and context modifications. A monotonic generation counter prevents stale deferred plan stores from racing with concurrent mutations.

Lock Hierarchy

Fast path:   planLock (CrossPlatformLock) → cache.lock (CrossPlatformLock) → done
Slow path:   globalRecursiveLock → planLock → cache.lock → per-key MutexLock (if singleton/cached)

No inversion possible — fast path never touches the global lock, slow path acquires in consistent order.

Per-Key Creation Locks

Singleton and cached scopes use a per-key MutexLock (sleeping, non-recursive) with double-checked locking to guarantee factory-called-once semantics without holding the global lock during factory execution. Creation locks are auto-pruned after use.

Thread-Local State

Graph scope: Depth counter and graph cache moved to TLS — concurrent resolution cycles no longer interfere.
DEBUG state: Circular dependency tracking and trace resolutions moved to TLS — no contention on debug-only globals.

`@MainActorLazyInjected`

New lock-free property wrapper for @MainActor-isolated types. Omits all locking since main actor isolation already guarantees single-threaded access — zero overhead for the common case of view models and UI services.

This is a pragmatic compromise. Ideally, the @LazyInjected macro could inspect the enclosing declaration's actor isolation at compile time and expand to the correct code automatically — no wrapper choice, no lock allocation, no per-access lock overhead. However, Swift macros depend on SwiftSyntax, which significantly increases build times and still has reliability issues with prebuilt binary support. Until the macro ecosystem matures, a separate property wrapper is the simplest correct solution.

New Lock Backends (prepared, not yet active)

The codebase includes conditional implementations for modern lock primitives behind compiler flags (SynchronizationLock, AllocatedLock). These are not enabled in this PR — SPM package traits cannot conditionally raise the deployment target, so enabling them would either break consumers on older OS versions or require a hard minimum bump. They're ready to activate when Factory's deployment target is raised in a future release.

SynchronizationLock — Uses Swift 6 Mutex (zero-cost inline lock, requires iOS 18+ / macOS 15+)
AllocatedLock — Uses OSAllocatedUnfairLock (single allocation, requires iOS 16+ / macOS 13+)
Default (active) — os_unfair_lock via UnsafeMutablePointer (all OS versions)

Profiling (Instruments)

Measured with os_signpost intervals around resolve() and global lock acquisition in a real app resolving ~1,000 factories during launch.

Baseline (`main`)

Metric	Count	Duration	Avg	Max
Resolve	1,079	1.49 s	1.38 ms	51.67 ms
Lock Acquire	1,079	667.90 ms	619.00 µs	51.67 ms

This PR

Metric	Count	Duration	Avg	Max
Resolve	1,053	1.04 s	984.79 µs	30.94 ms
Lock Acquire	519	11.19 ms	21.57 µs	1.95 ms

Key Improvements

Lock contention eliminated: Average lock wait dropped from 619 µs → 21.57 µs (~29× reduction)
Tail latency crushed: Max lock wait dropped from 51.67 ms → 1.95 ms (~26× reduction)
Fast path working: 52% fewer lock acquisitions — cache hits bypass the global lock entirely
Total time in locks: 667.90 ms → 11.19 ms (~60× less time blocked)

What's Changed

Registrations.swift — Fast path / slow path split in resolve(), ResolutionPlan struct, generation-guarded deferred plan store for circular dependency safety
Scopes.swift — Scope.Cache is now internally synchronized (CrossPlatformLock + per-key MutexLock for exclusive creation with auto-prune), Graph scope uses TLS
Containers.swift — Plan cache on ContainerManager with generation counter, invalidation calls on reset/push/pop/decorator/defaultScope
Globals.swift — ThreadLocalDebugState replacing global mutable vars
Locking.swift — CrossPlatformLock (renamed from SpinLock) with conditional backends (prepared, not active), new MutexLock type
MainActorInjections.swift — New @MainActorLazyInjected property wrapper (lock-free)
Package.swift — Removed .dynamic from FactoryTesting
Tests — Concurrency stress tests

Breaking Changes

None. Public API is unchanged. Internal behavior is semantically equivalent with the documented eventual-consistency caveat on concurrent registration (which was already racy by design in the original — the global lock only prevented crashes, not deterministic ordering).

Risks

Scope and complexity of change: This PR touches the core resolution engine. At ~1,000 lines across 8 files, it replaces the single-lock-guards-everything model with a multi-lock design involving plan caches, generation counters, per-key creation locks, and TLS state. Any subtle bug here (missed invalidation, lock ordering issue, race window) could manifest as wrong instances, deadlocks, or crashes in production apps — potentially under load patterns that unit tests don't cover. The existing test suite passes, and new stress tests cover the key invariants, but the nature of concurrency bugs is that they hide.
Eventual consistency on concurrent registration: A resolve() in flight during a .register {} may return the old value. This was already true before (the global lock prevented crashes but not deterministic ordering of concurrent read/write). Now it's documented explicitly.
Circular dependency deadlock in release builds: Scoped factories (singleton/cached) use non-recursive per-key creation locks. A circular dependency (A → B → A, all singletons) will deadlock the thread. In DEBUG builds, circular dependencies are detected before the per-key lock is acquired. This is acceptable — circular dependencies are always programmer errors, and DEBUG detection covers the development cycle.
Plan cache staleness: If a plan is cached and an external mutation path is missed (i.e., doesn't call invalidatePlan/invalidateAllPlans), the fast path could serve stale data. Mitigated by invalidating on all known write paths and by the generation counter on deferred stores.

Testing

All existing tests pass
New stress tests verify:
- Singleton returns same instance across 100 threads
- Singleton factory executes exactly once under contention
- Cached scope returns stable instances under contention
- Graph scope isolates per-thread resolution cycles
- Decorators added after first resolve fire correctly (plan invalidation)
- Container-level decorators invalidate plans

Declaration

The program was tested solely for our own use cases, which might differ from yours.

Link to provider information

https://github.com/mercedes-benz

Split resolve() into a fast path (plan cache + scope cache, no global lock) and a slow path (first resolve only, minimal lock hold time). Add per-key creation locks for singleton/cached scope, move graph scope and DEBUG state to TLS, and thread-safe @LazyInjected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FKREISE · 2026-05-20T14:44:21Z

@hmlongco What are your thoughts? This is quite a big change, but we noticed the lock contention due to Factory resolutions in our app and this brings it down significantly with no changes on the consumer side.

- Replace CFAbsoluteTimeGetCurrent (Darwin-only, wall clock) with a cross-platform currentTimestamp() helper using CLOCK_MONOTONIC - Add makePthreadKey(destructor:) to handle UnsafeMutableRawPointer optionality difference between Darwin and Linux - Monotonic clock is correct for TTL: immune to NTP and clock adjustments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

hmlongco · 2026-05-20T22:42:52Z

Looking at it. Not sure of ramifications yet, especially in regard to cross-container evaluations and other race conditions.

Just how many cross-threaded resolutions are you doing on launch, anyway?

On a side note, the property wrapper mentions "until Factory supports macros".

Are you aware of: https://github.com/hmlongco/Factory/tree/macros#factory-macros

hmlongco · 2026-05-22T18:17:13Z

Attempted the following since it breaks fewer contracts, but multiple locking performance is even worse. The globalRecursiveLock is extremely fast, and the only way I could get performance to be worse in the stress test is to add delay loops to the initializers (duplicating a ton of heavy work done on init). Which shouldn't be occurring in the first place.

https://github.com/hmlongco/Factory/tree/locks

FKREISE · 2026-05-26T11:18:39Z

Attempted the following since it breaks fewer contracts, but multiple locking performance is even worse. The globalRecursiveLock is extremely fast, and the only way I could get performance to be worse in the stress test is to add delay loops to the initializers (duplicating a ton of heavy work done on init). Which shouldn't be occurring in the first place.

https://github.com/hmlongco/Factory/tree/locks

Thanks for your time. I'll check out your branch and give it a try.

The main problem on our end is how we use Factory. Some of our registrations can be pretty expensive — file I/O, waiting on the keychain and so on. Currently, it's all contended by the global lock. I understand the design philosophy though and that initializers should be slim.

I'm torn between fixing the contention in Factory itself or slimming down our initializers. Unfortunately, keeping init fast and deferring setup lazily is non-trivial in some of our cases.

hmlongco · 2026-05-27T05:48:19Z

Updates locks branch

FKREISE · 2026-05-27T07:45:16Z

Your locks branch is very promising! It's much simpler but as all our slow registrations are .cached or at least .shared they can now skip the global lock and the lock wait time is not an issue anymore.

feature/reduce-global-lock-contention

locks

Review

I noticed an issue we need to address, though. With .singleton (and .cached) it's reasonable for consumers to assume that the init is only called once and only one instance is created (until reset is called for .cached). However, singletons can now be created more than once if multiple threads hit a cache miss. In Scope.resolve:

 Thread A: cache.value(key) → nil (miss)
 Thread B: cache.value(key) → nil (miss)
 Thread A: instance = factory()  → creates Instance #1
 Thread B: instance = factory()  → creates Instance #2
 Thread A: cache.set(key, #1)
 Thread B: cache.set(key, #2)    → overwrites #1

Result: two "singletons" are created. Thread A receives #1 but is not cached. #2 is cached for everyone else. If the singleton has side effects on init (or keeps state), there is a bug.
On this branch there is the exclusiveCreation flag in order to address this issue.

Claude also found two other issues which look legit.

Graph scope depth is process-global (cross-thread interference)

Location: Scopes.swift:125–150

internal func enter() {
    lock.withLock { depth += 1 }
}
internal func leave() {
    lock.withLock {
        depth -= 1
        if depth == 0 { cache.reset() }
    }
}
public private(set) var depth: Int = 0    // ← single Int, shared by all threads
internal var cache = Cache()               // ← single Cache, shared by all threads

Every resolve() calls enter()/leave() regardless of scope. Depth is one integer for the entire process.

Failure scenario:

Thread A: resolving root (graph-scoped deps)
  enter() → depth = 1
  resolves ServiceX → cached in graph.cache

Thread B: resolving unrelated service (no graph scope)
  enter() → depth = 2

Thread A: finishes
  leave() → depth = 1  ← NOT zero! Cache doesn't reset!

Thread C: resolving same root
  enter() → depth = 2
  resolves ServiceX → gets STALE cached instance from A's cycle ← BUG

`push()`/`pop()` bypasses the cache's own lock

Location: Containers.swift:392–404

public func push() {
    lock.withLock {                            // ← manager.lock
        stack.append((options, cache.cache, state))
                                      //     ↑ reads cache.cache (raw dictionary)
    }
}
public func pop() {
    lock.withLock {                            // ← manager.lock
        if let values = stack.popLast() {
            cache.cache = values.1             // ← writes cache.cache directly
        }
    }
}

Normal cache operations go through cache.lock (ReadWriteLock). But push()/pop() access cache.cache directly under manager.lock — a different lock entirely.

Failure scenario:

Thread A (test tearDown): pop()
  manager.lock.lock()
  cache.cache = savedSnapshot         // raw dictionary write

Thread B (resolve in progress):
  cache.value(forKey: key)
    cache.lock.withReadLock { ... }    // reading same dictionary under different lock

Data race on a Swift Dictionary — undefined behavior.

hmlongco · 2026-05-27T20:57:39Z

Updated locks branch to fix race on cache miss

hmlongco · 2026-05-28T16:53:50Z

Locks is now in develop as 3.1

FKREISE · 2026-05-28T17:51:02Z

Locks is now in develop as 3.1

Very cool! I have more time next week to look into this and test it in a big app.
The automated code review found a couple minor correctness issues, but I haven't looked at it more closely. If you have time, feel free to check it out:
locks-branch-review-plan.md

The decorator change you made is not quite clear to me. It's unrelated to the locking, but now the (default) decorator doesn't run anymore when a decorator is set explicitly?

Oh and the swift-atomics dependency I'm not sure if it's worth adding that just for the one file. A regular lock would work, too. However, most apps will include swift-atomics anyway as it's very widely used nowadays. 🤷

hmlongco · 2026-05-29T00:37:44Z

If you set a default scope on a container that scope is used unless a given factory overrides it.

If you set a default decorator on a container that decorator is used unless a given factory overrides it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce global lock contention in Factory resolution#365

Reduce global lock contention in Factory resolution#365
FKREISE wants to merge 2 commits into
hmlongco:mainfrom
FKREISE:feature/reduce-global-lock-contention

FKREISE commented May 20, 2026

Uh oh!

FKREISE commented May 20, 2026

Uh oh!

hmlongco commented May 20, 2026 •

edited

Loading

Uh oh!

hmlongco commented May 22, 2026 •

edited

Loading

Uh oh!

FKREISE commented May 26, 2026

Uh oh!

hmlongco commented May 27, 2026

Uh oh!

FKREISE commented May 27, 2026

Uh oh!

hmlongco commented May 27, 2026

Uh oh!

hmlongco commented May 28, 2026

Uh oh!

FKREISE commented May 28, 2026 •

edited

Loading

Uh oh!

hmlongco commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

FKREISE commented May 20, 2026

Summary

Motivation

Design

Resolution Plan Cache

Lock Hierarchy

Per-Key Creation Locks

Thread-Local State

@MainActorLazyInjected

New Lock Backends (prepared, not yet active)

Profiling (Instruments)

Baseline (main)

This PR

Key Improvements

What's Changed

Breaking Changes

Risks

Testing

Declaration

Link to provider information

Uh oh!

FKREISE commented May 20, 2026

Uh oh!

hmlongco commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hmlongco commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FKREISE commented May 26, 2026

Uh oh!

hmlongco commented May 27, 2026

Uh oh!

FKREISE commented May 27, 2026

feature/reduce-global-lock-contention

locks

Review

Graph scope depth is process-global (cross-thread interference)

push()/pop() bypasses the cache's own lock

Uh oh!

hmlongco commented May 27, 2026

Uh oh!

hmlongco commented May 28, 2026

Uh oh!

FKREISE commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hmlongco commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`@MainActorLazyInjected`

Baseline (`main`)

hmlongco commented May 20, 2026 •

edited

Loading

hmlongco commented May 22, 2026 •

edited

Loading

`push()`/`pop()` bypasses the cache's own lock

FKREISE commented May 28, 2026 •

edited

Loading