Improved symbol handling #1942
Conversation
If I get this right, you "just" replaced the locking structure, right? The previous iteration was adding an extra cache, but this PR does not, correct?
Yes. We did have a PR for a symbol cache, which we are still looking into. But this one is about locking and the string-interning micro-optimisation.
Please make sure clippy is happy, and we'll merge this.
I think we still have a problem, though: some expressions will trigger a deadlock (which put the CI in trouble, but that's on me, I need to make it more resilient, I guess). In tract/data,
prove_positive_or_zero_inner was filter_map'ing as_known_positive over every assertion on each call. The result is stable for base assertions; cache the positive form at add_assertion time. Only the scenario-extra assertions are still filtered per-call. Note: as_known_positive() recurses through TDim simplify() which re-enters the scope lock, so it must be called before the lock is acquired. Cherry-picked from PR #1942.
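The caching move described above can be sketched as follows. This is a minimal, dependency-free sketch, not tract's actual code: `Assertion`, the string-prefix stand-in for `as_known_positive()`, and this `Scope` struct are all hypothetical. The point it illustrates is computing the positive form before the lock is taken (since the real `as_known_positive()` re-enters the scope lock via `simplify()`) and storing it alongside the assertion so lookups no longer filter-map per call.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for a TDim assertion.
#[derive(Clone, Debug, PartialEq)]
struct Assertion(String);

// Stand-in for as_known_positive(): returns Some(positive form) when derivable.
fn as_known_positive(a: &Assertion) -> Option<Assertion> {
    a.0.strip_prefix(">=0:").map(|s| Assertion(s.to_string()))
}

struct Scope {
    // Each base assertion stored with its cached positive form, computed once
    // at add time instead of on every proof attempt.
    assertions: Mutex<Vec<(Assertion, Option<Assertion>)>>,
}

impl Scope {
    fn new() -> Self {
        Scope { assertions: Mutex::new(Vec::new()) }
    }

    fn add_assertion(&self, a: Assertion) {
        // Compute the positive form BEFORE acquiring the lock: in tract,
        // as_known_positive() recurses through simplify(), which re-enters
        // the scope lock.
        let positive = as_known_positive(&a);
        self.assertions.lock().unwrap().push((a, positive));
    }

    // The proof path now just reads the cached forms.
    fn known_positives(&self) -> Vec<Assertion> {
        self.assertions
            .lock()
            .unwrap()
            .iter()
            .filter_map(|(_, p)| p.clone())
            .collect()
    }
}

fn main() {
    let s = Scope::new();
    s.add_assertion(Assertion(">=0:batch".to_string()));
    s.add_assertion(Assertion("free".to_string()));
    println!("{:?}", s.known_positives());
}
```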
SymbolValues::get is called on every TDim::eval; hashing a Symbol (which hashes its narrow SymbolU32 id) is dominant. FxHashMap's faster hash pays off in the hot path. Cherry-picked from PR #1942.
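The map swap amounts to changing one type alias; call sites are untouched. The real change uses `FxHashMap` from the `rustc-hash`/`fxhash` crate; to keep this sketch dependency-free it substitutes a hand-rolled Fx-style multiply hasher, and `FastMap` plus the `u32` key standing in for a `SymbolU32` id are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Minimal Fx-style hasher (rotate, xor, multiply by a large odd constant),
// standing in for the rustc-hash crate. For small integer-like keys this is
// much cheaper than SipHash, the std HashMap default.
#[derive(Default)]
struct FxLikeHasher(u64);

impl Hasher for FxLikeHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0.rotate_left(5) ^ b as u64)
                .wrapping_mul(0x517c_c1b7_2722_0a95);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Swapping the map type is the whole change.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FxLikeHasher>>;

fn main() {
    // Hypothetical stand-in for SymbolValues: symbol id -> concrete value.
    let mut values: FastMap<u32, i64> = FastMap::default();
    values.insert(7, 42);
    println!("{:?}", values.get(&7));
}
```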
Previously cloned into a fresh Vec on every call; with the proof cache hitting all_assertions from multiple places during inference, Arc's shared ownership avoids repeated heap churn. Cherry-picked from PR #1942.
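The shared-ownership shape can be sketched like this, assuming a hypothetical `Scope` holding its assertions behind `Mutex<Arc<Vec<_>>>` (the real field types in tract differ). Readers bump a refcount instead of cloning the `Vec`; the rare writer swaps in a fresh `Arc`, leaving existing readers valid.

```rust
use std::sync::{Arc, Mutex};

struct Scope {
    // Assertions kept behind an Arc so readers share one allocation
    // instead of cloning the Vec on every call.
    assertions: Mutex<Arc<Vec<String>>>,
}

impl Scope {
    fn new(initial: Vec<String>) -> Self {
        Scope { assertions: Mutex::new(Arc::new(initial)) }
    }

    // Before: cloning the Vec meant a fresh heap allocation per call.
    // Now: a refcount bump.
    fn all_assertions(&self) -> Arc<Vec<String>> {
        Arc::clone(&self.assertions.lock().unwrap())
    }

    // Writers are rare; clone-on-write keeps outstanding readers untouched.
    fn add_assertion(&self, a: String) {
        let mut guard = self.assertions.lock().unwrap();
        let mut new = (**guard).clone();
        new.push(a);
        *guard = Arc::new(new);
    }
}

fn main() {
    let s = Scope::new(vec!["n >= 0".to_string()]);
    let a = s.all_assertions();
    let b = s.all_assertions();
    println!("same allocation: {}", Arc::ptr_eq(&a, &b));
}
```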
Tiny allocation reduction: the retry loop no longer builds a fresh String on every iteration. Cherry-picked from PR #1942.
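The pattern is hoisting the `String` out of the loop and reusing its buffer. A hypothetical `fresh_name` retry loop (not tract's code; the function and its naming scheme are assumptions for illustration):

```rust
use std::collections::HashSet;
use std::fmt::Write;

// Find an unused name by appending an increasing suffix. One String buffer
// is allocated up front and cleared per iteration, instead of building a
// fresh String every time through the loop.
fn fresh_name(taken: &HashSet<String>, prefix: &str) -> String {
    let mut candidate = String::with_capacity(prefix.len() + 4);
    let mut i = 0u32;
    loop {
        candidate.clear(); // reuse the buffer, keeping its capacity
        write!(candidate, "{}_{}", prefix, i).unwrap();
        if !taken.contains(&candidate) {
            return candidate;
        }
        i += 1;
    }
}

fn main() {
    let taken: HashSet<String> = ["s_0", "s_1"].iter().map(|s| s.to_string()).collect();
    println!("{}", fresh_name(&taken, "s"));
}
```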
SimplePlan::resolve calls guess_scenario on every set_input; models without scenarios (the common case) still paid a full scope lock just to observe that scenarios is empty. Under 32-way parallel inference that acquire/release was the dominant bottleneck (100% of scope-lock acquisitions counted during a LinearClassifier benchmark). Wrap the scope's Arc<ReentrantMutex<...>> in a SymbolScopeInner that also carries a lock-free `has_scenarios` AtomicBool, updated by add_scenario / add_scenario_assertion. guess_scenario checks it first and returns immediately when the scope has no scenarios. Deref on the wrapper keeps the `scope.0.lock()` call pattern working unchanged. LinearClassifier @ 100K runs, 32 threads: 100ms → 26ms. Now beats the PR #1942 RwLock approach (36ms) without the deadlock, since no lock at all beats a concurrent-reader lock.
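The lock-free fast path can be sketched as below. This is a simplified model, not tract's code: a plain `Mutex` stands in for the `ReentrantMutex`, and `ScopeData` and the method bodies are assumptions. The two load-bearing ideas from the description above are there: an `AtomicBool` mirroring "scope has scenarios", published under the lock and checked before it, and a `Deref` impl that keeps existing call patterns on the wrapper compiling unchanged.

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct ScopeData {
    scenarios: Vec<String>,
}

#[derive(Default)]
struct SymbolScopeInner {
    data: Mutex<ScopeData>,
    // Lock-free flag mirroring `!scenarios.is_empty()`; only ever flipped
    // to true, and only while holding the lock.
    has_scenarios: AtomicBool,
}

#[derive(Clone, Default)]
struct SymbolScope(Arc<SymbolScopeInner>);

// Deref keeps the existing call pattern on the wrapper working unchanged.
impl Deref for SymbolScope {
    type Target = SymbolScopeInner;
    fn deref(&self) -> &SymbolScopeInner {
        &self.0
    }
}

impl SymbolScope {
    fn add_scenario(&self, s: &str) {
        let mut data = self.data.lock().unwrap();
        data.scenarios.push(s.to_string());
        // Publish only after the scenario is in place under the lock.
        self.has_scenarios.store(true, Ordering::Release);
    }

    fn guess_scenario(&self) -> Option<String> {
        // Fast path: no scenarios (the common case) means no lock at all.
        if !self.has_scenarios.load(Ordering::Acquire) {
            return None;
        }
        self.data.lock().unwrap().scenarios.first().cloned()
    }
}

fn main() {
    let scope = SymbolScope::default();
    println!("{:?}", scope.guess_scenario());
    scope.add_scenario("batch_1");
    println!("{:?}", scope.guess_scenario());
}
```

The `Release`/`Acquire` pairing ensures that a thread seeing the flag set will also see the scenario data once it takes the lock; a stale `false` is harmless here because the flag is only set while no reader can have observed the scenario yet.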
Thanks for this PR and the accompanying benchmark; the bench itself was invaluable and is what made the underlying contention tractable to track down. I'm going to close this PR in favor of a replacement. Root cause: on the LinearClassifier bench at high thread counts, every set_input was hitting locked paths in SymbolScope.
Numbers (100K parallel runs, 32 threads, Ryzen 9 9950X3D): the lock-free fast path actually edges out the RwLock approach; no lock beats a concurrent-reader lock when the contended work is trivial. Why not this PR directly: as noted above, the RwLock restructuring can deadlock on some expressions, and it is slower than skipping the lock entirely (36ms vs 26ms).
Replacement is three commits on #2151:
Thanks again; the problem this PR exposed was real and worth fixing.
Benchmarks show the following performance improvements:
LinearClassifier / 100,000 - 94.7%
LinearClassifier / 1,000,000 - 13.7%
LinearRegressor / 100,000 - 48.3%
LinearRegressor / 1,000,000 - 7.07%