
Improved symbol handling #1942

Closed
threeseed wants to merge 1 commit into sonos:main from threeseed:sym

Conversation

@threeseed

Based on benchmarks, this change yields the following performance improvements:

- LinearClassifier / 100,000 runs: 94.7%
- LinearClassifier / 1,000,000 runs: 13.7%
- LinearRegressor / 100,000 runs: 48.3%
- LinearRegressor / 1,000,000 runs: 7.07%

@kali
Collaborator

kali commented Jan 12, 2026

If I get this right, you "just" replaced the locking structure, right? The previous iteration was adding an extra cache, but this PR does not, correct?

@threeseed
Author

> If I get this right, you "just" replaced the locking structure, right? The previous iteration was adding an extra cache, but this PR does not, correct?

Yes. We did have a PR for a symbol cache, which we are still looking into.

But this one is about locking and the string-interning micro-optimisation.

@kali
Collaborator

kali commented Jan 12, 2026

Please make sure clippy is happy, and we'll merge this.

@kali
Collaborator

kali commented Jan 14, 2026

I think we still have a problem though: some expressions will trigger a deadlock (which put the CI in trouble, but that's on me, I need to make it more resilient, I guess).

In `tract/data`, `cargo test dim::assertion::tests::low_bound_1` shows the deadlock (a couple of other tests in data do as well).

kali added a commit that referenced this pull request Apr 23, 2026
prove_positive_or_zero_inner was filter_map'ing as_known_positive over
every assertion on each call. The result is stable for base assertions;
cache the positive form at add_assertion time. Only the scenario-extra
assertions are still filtered per-call.

Note: as_known_positive() recurses through TDim simplify() which
re-enters the scope lock, so it must be called before the lock is
acquired.

Cherry-picked from PR #1942.
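The pattern in this commit message can be sketched with std types only. All names here (`Scope`, `positives_cache`, the string-based stand-in for `as_known_positive`) are illustrative, not tract's real API; the point is moving the conversion cost from every proof query to the one-time `add_assertion` call:

```rust
// Sketch: cache the positive form of each base assertion at insertion
// time instead of filter_map'ing over all assertions on every query.
struct Scope {
    assertions: Vec<String>,
    positives_cache: Vec<String>, // positive forms, computed once
}

impl Scope {
    fn new() -> Self {
        Scope { assertions: vec![], positives_cache: vec![] }
    }

    // Hypothetical stand-in for as_known_positive(); in tract this must
    // run *before* taking the scope lock because it re-enters it.
    fn as_known_positive(a: &str) -> Option<String> {
        a.strip_suffix(" >= 0").map(|e| e.to_string())
    }

    fn add_assertion(&mut self, a: &str) {
        if let Some(p) = Self::as_known_positive(a) {
            self.positives_cache.push(p); // pay the conversion cost here
        }
        self.assertions.push(a.to_string());
    }

    fn prove_positive(&self, expr: &str) -> bool {
        // Hot path: scan the precomputed cache, no per-call conversion.
        self.positives_cache.iter().any(|p| p == expr)
    }
}

fn main() {
    let mut s = Scope::new();
    s.add_assertion("n >= 0");
    s.add_assertion("n == m");
    assert!(s.prove_positive("n"));
    assert!(!s.prove_positive("m"));
}
```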
kali added a commit that referenced this pull request Apr 23, 2026
SymbolValues::get is called on every TDim::eval; hashing a Symbol (which
hashes its narrow SymbolU32 id) is dominant. FxHashMap's faster hash
pays off in the hot path.

Cherry-picked from PR #1942.
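Swapping the hasher on a hot map is a one-line type change. A minimal sketch using std only, with a hypothetical multiply-based hasher standing in for fxhash's `FxHasher` (not the real implementation) and a plain `u32` standing in for the narrow `SymbolU32` id:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Illustrative fast hasher for tiny keys; NOT the real FxHasher.
#[derive(Default)]
struct TinyFxHasher(u64);

impl Hasher for TinyFxHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            // rotate-xor-multiply: cheap mixing, good enough for small ids
            self.0 = (self.0.rotate_left(5) ^ b as u64)
                .wrapping_mul(0x517c_c1b7_2722_0a95);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Same HashMap API, different (cheaper) hash in the hot lookup path.
type FastMap<V> = HashMap<u32, V, BuildHasherDefault<TinyFxHasher>>;

fn main() {
    let mut values: FastMap<i64> = FastMap::default();
    values.insert(7, 100_000); // symbol id -> concrete value
    assert_eq!(values.get(&7), Some(&100_000));
    assert_eq!(values.get(&8), None);
}
```

With the fxhash or rustc-hash crate the type alias is all that changes; SipHash's DoS resistance is not needed for internal symbol ids.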
kali added a commit that referenced this pull request Apr 23, 2026
Previously cloned into a fresh Vec on every call; with the proof cache
hitting all_assertions from multiple places during inference, Arc's
shared ownership avoids repeated heap churn.

Cherry-picked from PR #1942.
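The ownership change can be sketched as follows; `Scope`, `Assertion`, and `all_assertions` are illustrative stand-ins for the tract types named in the message. Cloning an `Arc<[T]>` bumps a refcount instead of copying the slice:

```rust
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq)]
struct Assertion(String); // stand-in for tract's Assertion

struct Scope {
    assertions: Arc<[Assertion]>,
}

impl Scope {
    // Previously this returned a fresh Vec (heap copy) on every call;
    // Arc<[T]>::clone is a refcount increment sharing one allocation.
    fn all_assertions(&self) -> Arc<[Assertion]> {
        self.assertions.clone()
    }
}

fn main() {
    let scope = Scope {
        assertions: vec![Assertion("n >= 0".into())].into(),
    };
    let a = scope.all_assertions();
    let b = scope.all_assertions();
    // Both handles point at the same allocation: no heap churn per call.
    assert!(Arc::ptr_eq(&a, &b));
}
```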
kali added a commit that referenced this pull request Apr 23, 2026
Tiny allocation reduction: the retry loop no longer builds a fresh
String on every iteration.

Cherry-picked from PR #1942.
kali added a commit that referenced this pull request Apr 23, 2026
SimplePlan::resolve calls guess_scenario on every set_input; models
without scenarios (the common case) still paid a full scope lock just
to observe that scenarios is empty. Under 32-way parallel inference
that acquire/release was the dominant bottleneck (100% of scope-lock
acquisitions counted during a LinearClassifier benchmark).

Wrap the scope's Arc<ReentrantMutex<...>> in a SymbolScopeInner that
also carries a lock-free `has_scenarios` AtomicBool, updated by
add_scenario / add_scenario_assertion. guess_scenario checks it first
and returns immediately when the scope has no scenarios. Deref on the
wrapper keeps the `scope.0.lock()` call pattern working unchanged.

LinearClassifier @ 100K runs, 32 threads: 100ms → 26ms.
Now beats the PR #1942 RwLock approach (36ms) without the deadlock,
since no lock at all beats a concurrent-reader lock.
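The fast path described above can be sketched with std types; `std::sync::Mutex` stands in for the real `ReentrantMutex`, and the struct mirrors the commit message's names but is otherwise illustrative (the real wrapper also implements `Deref`, omitted here):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct SymbolScopeInner {
    scenarios: Mutex<Vec<String>>,
    has_scenarios: AtomicBool, // lock-free fast-path flag
}

impl SymbolScopeInner {
    fn add_scenario(&self, s: &str) {
        self.scenarios.lock().unwrap().push(s.to_string());
        // Set only after the scenario is visible under the lock.
        self.has_scenarios.store(true, Ordering::Release);
    }

    fn guess_scenario(&self) -> Option<String> {
        // Common case: no scenarios. Return without touching the lock.
        if !self.has_scenarios.load(Ordering::Acquire) {
            return None;
        }
        self.scenarios.lock().unwrap().first().cloned()
    }
}

fn main() {
    let scope = SymbolScopeInner {
        scenarios: Mutex::new(vec![]),
        has_scenarios: AtomicBool::new(false),
    };
    assert_eq!(scope.guess_scenario(), None); // no lock acquired here
    scope.add_scenario("streaming");
    assert_eq!(scope.guess_scenario(), Some("streaming".to_string()));
}
```

The flag only ever goes from false to true, so a stale false is impossible once `add_scenario` has returned, and a relaxed read would likely also be safe; Acquire/Release keeps the ordering argument simple.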
@kali
Collaborator

kali commented Apr 23, 2026

Thanks for this PR and the accompanying benchmark — the bench itself was invaluable and is what made the underlying contention tractable to track down. I'm going to close this in favour of a narrower fix, but wanted to share what we landed and why.

Root cause. On the LinearClassifier bench at high thread counts, every set_input was hitting two locked paths in SymbolScope:

1. `TDim::to_i64()` was being used inside `plan.rs::resolve` as an "is this concrete?" probe. On a symbolic TDim it constructed an error containing `self.to_string()`, which formats through `Symbol::Display` and takes the scope lock. (This was regressed by 106161a in January; swapping to the already-existing `as_i64() -> Option` avoids any allocation and any lock.)
2. `guess_scenario()` acquired the scope lock just to observe that scenarios is empty — the common case.
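The first probe swap can be illustrated with a toy `TDim` (two variants only, a stand-in for tract's real type): `to_i64()` pays for an error `String` on every symbolic dim, while `as_i64()` just peeks.

```rust
// Illustrative TDim stand-in: to_i64() builds an error string (which in
// tract formats symbols under the scope lock); as_i64() only inspects.
#[derive(Debug)]
enum TDim {
    Val(i64),
    Sym(&'static str),
}

impl TDim {
    fn as_i64(&self) -> Option<i64> {
        match self {
            TDim::Val(v) => Some(*v),
            TDim::Sym(_) => None,
        }
    }
    fn to_i64(&self) -> Result<i64, String> {
        // Allocates (and, in tract, locks) just to say "not concrete".
        self.as_i64()
            .ok_or_else(|| format!("{:?} is not a concrete dim", self))
    }
}

fn main() {
    let dims = [TDim::Val(4), TDim::Sym("N")];
    // "Is this concrete?" probe: as_i64 costs nothing either way.
    let concrete: Vec<i64> = dims.iter().filter_map(TDim::as_i64).collect();
    assert_eq!(concrete, vec![4]);
    assert!(dims[1].to_i64().is_err()); // this path paid for a String
}
```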

Numbers (100K parallel runs, 32 threads, Ryzen 9 9950X3D):

| Configuration | ms |
|---|---|
| main today | 197 |
| this PR if it didn't deadlock | 36 |
| with the two fixes above | 26 |

The lock-free fast path actually edges out the RwLock approach — no lock beats a concurrent-reader lock when the contended work is trivial.

Why not this PR directly.

- The `ReentrantMutex<RefCell<_>>` → `parking_lot::RwLock` swap deadlocks on any path that re-enters the scope lock while a guard is held (`dim::assertion::tests::low_bound_1` is the reproducer). RwLock isn't reentrant, and several hot paths — TDim arithmetic calling `simplify()`, `Symbol::Display`, the proof-cache path — re-enter transitively. That's fundamental to the approach rather than a localised bug.
- For the rest of the PR I benched each change individually against the bench suite included here. `DefaultStringInterner` → `StringInterner` is a no-op (both resolve to the same type in string-interner 0.19 and 0.15). FxHashMap on SymbolValues, the positives_cache, `Arc<[Assertion]>`, and the `String::with_capacity` buffer reuse all stayed within ±1–3 ms on every thread count. They may well pay off in other workloads, but I didn't want to carry them in without a workload where they pull their weight.
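The non-reentrancy at the heart of the deadlock can be demonstrated with `std::sync::RwLock` (parking_lot's behaves the same way on this point): a thread holding a guard can never acquire the same lock again, which `try_read`/`try_write` make observable without actually hanging.

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0i32);

    // A reentrant mutex lets the same thread lock twice; RwLock does not.
    // While the write guard is held, a second acquisition from the same
    // thread cannot succeed: try_* reports WouldBlock, and a blocking
    // read() here would deadlock (or panic, depending on the platform).
    let guard = lock.write().unwrap();
    assert!(lock.try_read().is_err());
    assert!(lock.try_write().is_err());
    drop(guard);

    // Once the guard is released, acquisition works again.
    assert!(lock.try_read().is_ok());
}
```

This is why any path that re-enters the scope lock transitively (e.g. formatting a symbol while holding a guard) hangs under the RwLock swap.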

The replacement is three commits on #2151:

  • plan: use as_i64 probe instead of to_i64 in shape resolve
  • deps: workspace-ize parking_lot
  • sym: lock-free fast path for guess_scenario on empty scope

Thanks again — the problem this PR exposed was real and worth fixing.

@kali kali closed this Apr 23, 2026