Some fixes for threading:
Previously, the chunk boundary was computed as:
Since `thread_idx / num_threads` performed integer division, this always evaluated to 0 for `thread_idx < num_threads`. I also removed the `sqrt` since, as far as I can see, wherever `end_chunk_idx` is used it covers "linear" work, not triangular/quadratically growing work.

Additionally, I encountered segfaults on large problem sizes (many determinants), which I eventually tracked down to a heap-use-after-free from `pyci/src/hci.cpp:277`. I'm not entirely sure why this happens, but maybe the order in which threads are joined is not guaranteed to correspond to the order of the `v_wfns`? The proposed change first joins all threads, then adds the determinants:
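As an illustration of the general pattern (not the diff itself), here is a minimal C++ sketch of both fixes. The names `chunk_boundary`, `do_chunk` and `run_threaded` are hypothetical and are not taken from `pyci/src/hci.cpp`. The boundary multiplies before dividing, so integer division no longer truncates the fraction to 0. The chunks are equal-sized because the work is assumed to be linear; a `sqrt`-based split only balances triangular workloads. Each thread writes to its own pre-allocated buffer, and every thread is joined before any buffer is read and merged.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not pyci's code): start index of the chunk owned by
// `thread_idx` when `n_items` units of linear work are split evenly across
// `n_threads` threads. Thread t owns [chunk_boundary(t), chunk_boundary(t + 1)).
// Multiplying before dividing matters: n_items * (thread_idx / n_threads)
// would be 0 for every thread_idx < n_threads because of integer division.
std::size_t chunk_boundary(std::size_t thread_idx, std::size_t n_items,
                           std::size_t n_threads) {
    return n_items * thread_idx / n_threads;
}

// Hypothetical per-thread work: each thread appends only to its own vector.
void do_chunk(std::size_t start, std::size_t end, std::vector<long>& out) {
    for (std::size_t i = start; i < end; ++i) {
        out.push_back(static_cast<long>(2 * i));
    }
}

// Spawn one thread per chunk, join them all, then merge in thread order.
std::vector<long> run_threaded(std::size_t n_items, std::size_t n_threads) {
    assert(n_threads >= 1);
    // Sized up front so the references handed to the threads stay valid.
    std::vector<std::vector<long>> partial(n_threads);
    std::vector<std::thread> threads;
    threads.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        threads.emplace_back(do_chunk,
                             chunk_boundary(t, n_items, n_threads),
                             chunk_boundary(t + 1, n_items, n_threads),
                             std::ref(partial[t]));
    }
    // Join every thread before reading any partial result.
    for (std::thread& th : threads) {
        th.join();
    }
    // Merge afterwards, in a fixed order independent of thread timing.
    std::vector<long> result;
    for (const std::vector<long>& p : partial) {
        result.insert(result.end(), p.begin(), p.end());
    }
    return result;
}
```

For example, `run_threaded(10, 4)` splits the items into chunks `[0, 2)`, `[2, 5)`, `[5, 7)` and `[7, 10)`, and returns the merged results in index order. Sizing `partial` before spawning any thread keeps the references passed via `std::ref` valid in this sketch; growing that vector while threads are still running would reallocate it and leave them writing to freed memory.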