# Performance Optimizations for `TokenTextChunker` #8
## Description

This PR introduces performance optimizations to `TokenTextChunker` that significantly reduce memory allocations while maintaining correctness. The optimizations target the hot path of document chunking in indexing pipelines.

## Problem Description
The original `Chunk` method had several allocation-heavy patterns:

- `List<T>.GetRange()` allocates a new `List` plus a backing array on every chunk iteration.
- `new int[]` for token values allocates per chunk and immediately becomes garbage.
- `.Select().Distinct().ToArray()` creates multiple intermediate allocations.

## Solution
### Optimizations Applied

| Before | After |
| --- | --- |
| `GetRange()` (allocates a `List`) | `CollectionsMarshal.AsSpan().Slice()` (zero-alloc view) |
| `new int[n]` per chunk | `ArrayPool<int>.Shared.Rent`/`Return` |
| `.Distinct().ToArray()` | `HashSet<string>` with `Clear()` |

### Proposed Changes
Reduce allocations by working with the `flattened` token list without per-chunk allocations.

## Benchmark Results
### Scenario 1: Thread-Safe (New List Per Call, implementation in this PR)

- Overall Improvement
- Small Documents
- Medium Documents
- Large Documents
### Scenario 2: Singleton Pattern (Reused List, not implemented)

- Overall Improvement
- Small Documents
- Medium Documents
- Large Documents
## Checklist

## Additional Notes
**NOTE:** In scenario 2, with a reusable `flattened` list, the allocation improvements are huge, but the implication is that `TokenTextChunker` is not thread-safe. That is a design choice I don't feel comfortable making, given my limited knowledge of the intended use and tradeoffs. Implementing it would imply:
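For illustration only, a hypothetical singleton-style variant (names invented here, not taken from the PR) would hoist the `flattened` list into instance state, which is exactly what makes it unsafe under concurrent calls:

```csharp
using System.Collections.Generic;

class SingletonChunkerSketch
{
    // Instance-level buffer reused across calls: removes the per-call list
    // allocation, but makes the type stateful and NOT thread-safe.
    private readonly List<int> _flattened = new();

    public int Chunk(IEnumerable<int[]> tokenizedLines)
    {
        _flattened.Clear();                  // reuse the existing backing storage
        foreach (var line in tokenizedLines)
            _flattened.AddRange(line);
        // ...the rest of the chunking would read from _flattened...
        return _flattened.Count;
    }
}
```

Two concurrent `Chunk` calls (for example from `Parallel.ForEach` over documents) would interleave `Clear()` and `AddRange()` on the same `_flattened` list and corrupt both results, which is why this PR keeps the per-call list of scenario 1.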