@gmanvel gmanvel commented Dec 8, 2025

Description

This PR introduces performance optimizations to TokenTextChunker that significantly reduce memory allocations while maintaining correctness. The optimizations target the hot path of document chunking during indexing pipelines.

Problem Description

The original Chunk method had several allocation-heavy patterns:

  1. List<T>.GetRange() - Allocates a new List + backing array per chunk iteration
  2. new int[] for token values - Allocates per chunk, immediately becomes garbage
  3. LINQ chain for document IDs - .Select().Distinct().ToArray() creates multiple intermediate allocations
  4. No capacity pre-allocation - Lists grow via repeated reallocation
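
The four patterns above can be illustrated with a minimal, self-contained sketch (hypothetical names; this is not the actual TokenTextChunker source):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the allocation-heavy baseline described above.
public static class BaselineSketch
{
    public static (int[] Tokens, string[] DocIds) ChunkWindow(
        List<(int SliceIndex, int Token)> flattened, int start, int count)
    {
        // 1) GetRange allocates a new List<T> plus a backing array per chunk.
        var window = flattened.GetRange(start, count);

        // 2) A fresh int[] per chunk that becomes garbage right after use.
        var tokens = new int[window.Count];
        for (int i = 0; i < window.Count; i++) tokens[i] = window[i].Token;

        // 3) LINQ chain allocating iterators, an intermediate set, and an array.
        var docIds = window.Select(t => "doc" + t.SliceIndex).Distinct().ToArray();
        return (tokens, docIds);
    }

    public static void Main()
    {
        var flattened = new List<(int SliceIndex, int Token)>();
        for (int i = 0; i < 8; i++) flattened.Add((i / 4, i)); // 4) grows by reallocation
        var (tokens, docIds) = ChunkWindow(flattened, 0, 4);
        Console.WriteLine(string.Join(",", tokens)); // 0,1,2,3
        Console.WriteLine(string.Join(",", docIds)); // doc0
    }
}
```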

Solution

Optimizations Applied
| Optimization | Before | After |
|---|---|---|
| Chunk token access | `GetRange()` (allocates `List`) | `CollectionsMarshal.AsSpan().Slice()` (zero-alloc view) |
| Token value buffer | `new int[n]` per chunk | `ArrayPool<int>.Shared.Rent`/`Return` |
| Document ID collection | LINQ `.Distinct().ToArray()` | Reusable `HashSet<string>` with `Clear()` |
| Document ID iteration | Check every token | Check only on slice-boundary transitions |
| Results list | Default capacity | Pre-calculated capacity estimate |
| Flattened list | New per call | Option to reuse across calls (singleton scenario, see Additional Notes) |
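
The span and array-pool rows above can be sketched as follows (a minimal illustration under assumed shapes, not the PR's exact code):

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Sketch of the CollectionsMarshal + ArrayPool replacements.
public static class SpanPoolSketch
{
    public static int[] CopyTokens(List<(int SliceIndex, int Token)> flattened, int start, int count)
    {
        // Zero-allocation view over the list's backing array; only valid as
        // long as the list is not resized while the span is in use.
        ReadOnlySpan<(int SliceIndex, int Token)> window =
            CollectionsMarshal.AsSpan(flattened).Slice(start, count);

        // Rent a buffer instead of allocating new int[count] per chunk.
        // Note: Rent may return a larger array than requested.
        int[] buffer = ArrayPool<int>.Shared.Rent(count);
        try
        {
            for (int i = 0; i < count; i++) buffer[i] = window[i].Token;
            return buffer[..count]; // copy out for the demo; real code would consume in place
        }
        finally
        {
            ArrayPool<int>.Shared.Return(buffer);
        }
    }

    public static void Main()
    {
        var flattened = new List<(int SliceIndex, int Token)>();
        for (int i = 0; i < 10; i++) flattened.Add((i / 4, i));
        Console.WriteLine(string.Join(",", CopyTokens(flattened, 4, 4))); // 4,5,6,7
    }
}
```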

Proposed Changes

Reduce allocations by:

  • Using a Span-based API to process chunks of the flattened list without allocations
  • Using a pooled array for token values instead of allocating a new one per chunk
  • Removing LINQ from the document ID lookup to cut intermediate allocations
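
The LINQ-free document ID lookup can be sketched like this (hypothetical names; the set is an instance field in the real chunker, static here only for the demo):

```csharp
using System;
using System.Collections.Generic;

// Sketch: reuse a HashSet via Clear() instead of .Select().Distinct().ToArray(),
// and only test the ID when the slice index changes, not on every token.
public static class DocIdSketch
{
    private static readonly HashSet<string> _seen = new();
    private static readonly List<string> _result = new();

    public static string[] CollectDocIds((int SliceIndex, string DocId)[] window)
    {
        _seen.Clear();   // Clear keeps the backing storage, no reallocation
        _result.Clear();
        int lastSlice = -1;
        foreach (var (sliceIndex, docId) in window)
        {
            if (sliceIndex == lastSlice) continue; // only on boundary transitions
            lastSlice = sliceIndex;
            if (_seen.Add(docId)) _result.Add(docId);
        }
        return _result.ToArray();
    }

    public static void Main()
    {
        var window = new[] { (0, "a"), (0, "a"), (1, "b"), (2, "a") };
        Console.WriteLine(string.Join(",", CollectDocIds(window))); // a,b
    }
}
```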

Benchmark Results

Scenario 1: Thread-Safe (New List Per Call, implementation in this PR)

Overall Improvement
| Document Size | Avg Perf Improvement | Avg Allocation Reduction |
|---|---|---|
| Small | ~7.5% faster | ~24.5% less alloc |
| Medium | ~7.0% faster | ~20.7% less alloc |
| Large | ~9.0% faster | ~23.5% less alloc |

Small Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 14.83 | 13.75 | 7.3% | 11.84 | 8.95 | 24.4% |
| 512/64 | 14.93 | 13.69 | 8.3% | 11.84 | 8.95 | 24.4% |
| 512/128 | 14.95 | 13.84 | 7.4% | 11.84 | 8.95 | 24.4% |
| 1024/0 | 15.02 | 13.78 | 8.2% | 11.84 | 8.95 | 24.4% |
| 1024/64 | 15.06 | 13.88 | 7.8% | 11.84 | 8.95 | 24.4% |
| 1024/128 | 14.92 | 13.71 | 8.1% | 11.84 | 8.95 | 24.4% |
| 2048/0 | 14.88 | 13.71 | 7.9% | 11.84 | 8.95 | 24.4% |
| 2048/64 | 14.92 | 13.61 | 8.8% | 11.84 | 8.95 | 24.4% |
| 2048/128 | 15.06 | 13.74 | 8.8% | 11.84 | 8.95 | 24.4% |

Medium Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 1423.50 | 1345.73 | 5.4% | 1229.23 | 972.81 | 20.8% |
| 512/64 | 1482.25 | 1366.19 | 7.8% | 1294.57 | 1001.84 | 22.6% |
| 512/128 | 1539.59 | 1405.99 | 8.7% | 1379.28 | 1039.42 | 24.7% |
| 1024/0 | 1416.45 | 1331.83 | 6.0% | 1218.56 | 968.75 | 20.5% |
| 1024/64 | 1482.40 | 1339.31 | 9.7% | 1247.44 | 981.60 | 21.3% |
| 1024/128 | 1452.28 | 1348.64 | 7.1% | 1279.12 | 995.73 | 22.1% |
| 2048/0 | 1419.49 | 1326.13 | 6.6% | 1213.16 | 966.71 | 20.3% |
| 2048/64 | 1426.52 | 1331.34 | 6.7% | 1226.86 | 972.87 | 20.7% |
| 2048/128 | 1438.68 | 1350.76 | 6.1% | 1240.25 | 978.79 | 21.1% |

Large Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 14431.18 | 13195.54 | 8.6% | 10740.19 | 8181.02 | 23.8% |
| 512/64 | 14857.46 | 13865.31 | 6.7% | 11394.85 | 8471.49 | 25.7% |
| 512/128 | 15668.28 | 14183.00 | 9.5% | 12275.82 | 8858.92 | 27.9% |
| 1024/0 | 14254.90 | 13182.59 | 7.5% | 10633.63 | 8139.95 | 23.5% |
| 1024/64 | 14326.83 | 13505.65 | 5.7% | 10931.39 | 8272.53 | 24.3% |
| 1024/128 | 14614.92 | 13603.63 | 6.9% | 11273.66 | 8424.56 | 25.3% |
| 2048/0 | 14105.07 | 13348.39 | 5.4% | 10580.09 | 8119.37 | 23.3% |
| 2048/64 | 14231.75 | 13290.38 | 6.6% | 10722.42 | 8182.49 | 23.7% |
| 2048/128 | 14467.38 | 13382.53 | 7.5% | 10873.30 | 8249.97 | 24.2% |

Scenario 2: Singleton Pattern (Reused List, not implemented)

Overall Improvement
| Document Size | Avg Perf Improvement | Avg Allocation Reduction |
|---|---|---|
| Small | ~4–6% faster | ~59–60% less alloc |
| Medium | ~6–8% faster | ~62–65% less alloc |
| Large | ~5–8% faster | ~61–63% less alloc |

Small Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 14.74 | 14.33 | 2.8% | 11.84 | 4.79 | 59.5% |
| 512/64 | 15.01 | 14.34 | 4.5% | 11.84 | 4.79 | 59.5% |
| 512/128 | 15.01 | 14.07 | 6.3% | 11.84 | 4.79 | 59.5% |
| 1024/0 | 15.25 | 14.28 | 6.3% | 11.84 | 4.79 | 59.5% |
| 1024/64 | 14.97 | 14.30 | 4.5% | 11.84 | 4.79 | 59.5% |
| 1024/128 | 14.96 | 14.34 | 4.1% | 11.84 | 4.79 | 59.5% |
| 2048/0 | 14.89 | 14.28 | 4.1% | 11.84 | 4.79 | 59.5% |
| 2048/64 | 14.94 | 14.20 | 5.0% | 11.84 | 4.79 | 59.5% |
| 2048/128 | 15.09 | 14.33 | 5.0% | 11.84 | 4.79 | 59.5% |

Medium Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 1433.14 | 1341.80 | 6.4% | 1229.23 | 460.47 | 62.5% |
| 512/64 | 1470.74 | 1375.65 | 6.5% | 1294.57 | 489.50 | 62.2% |
| 512/128 | 1529.06 | 1406.22 | 8.0% | 1379.28 | 527.08 | 61.8% |
| 1024/0 | 1411.96 | 1341.78 | 5.0% | 1218.55 | 456.41 | 62.6% |
| 1024/64 | 1441.19 | 1363.08 | 5.4% | 1247.44 | 469.25 | 62.4% |
| 1024/128 | 1445.92 | 1375.79 | 4.8% | 1279.12 | 483.38 | 62.2% |
| 2048/0 | 1410.96 | 1342.18 | 4.9% | 1213.17 | 454.36 | 62.5% |
| 2048/64 | 1420.30 | 1357.43 | 4.4% | 1226.86 | 460.53 | 62.5% |
| 2048/128 | 1438.00 | 1364.08 | 5.1% | 1240.25 | 466.45 | 62.4% |

Large Documents
| Chunk | Baseline (µs) | Optimized (µs) | Perf Gain | Baseline (KB) | Optimized (KB) | Alloc Reduction |
|---|---|---|---|---|---|---|
| 512/0 | 14333.60 | 13579.96 | 5.2% | 10740.19 | 4084.48 | 62.0% |
| 512/64 | 14889.53 | 13956.63 | 6.3% | 11394.80 | 4374.99 | 61.6% |
| 512/128 | 15610.82 | 14532.23 | 6.9% | 12275.76 | 4762.45 | 61.2% |
| 1024/0 | 14293.71 | 13700.90 | 4.2% | 10633.63 | 4043.39 | 62.0% |
| 1024/64 | 14435.81 | 13786.02 | 4.5% | 10931.39 | 4175.95 | 61.8% |
| 1024/128 | 14579.98 | 13913.99 | 4.6% | 11273.66 | 4328.09 | 61.6% |
| 2048/0 | 13971.48 | 13534.53 | 3.1% | 10580.09 | 4022.77 | 62.0% |
| 2048/64 | 14157.79 | 13762.70 | 2.8% | 10722.42 | 4085.99 | 61.9% |
| 2048/128 | 14464.31 | 13927.94 | 3.7% | 10873.30 | 4153.47 | 61.8% |

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

NOTE: In scenario 2, with a reusable flattened list, the allocation improvements are huge, but the implication is that TokenTextChunker would no longer be thread-safe. That is a design choice I don't feel comfortable making, given my limited knowledge of the intended use and its trade-offs.

The implementation would look like this:

```csharp
private List<(int SliceIndex, int Token)> _flattened;

public IReadOnlyList<TextChunk> Chunk(IReadOnlyList<ChunkSlice> slices, ChunkingConfig config)
{
    ...
    _flattened ??= new List<(int SliceIndex, int Token)>(capacity: 4096);
    _flattened.Clear();
}
```
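
For what it's worth, a hypothetical middle ground (not proposed in this PR, purely a sketch with assumed names) would be per-thread reuse via `[ThreadStatic]`, which keeps the chunker thread-safe while still amortizing the list allocation per thread:

```csharp
using System;
using System.Collections.Generic;

// Sketch only: each thread gets its own reusable buffer, so concurrent
// Chunk calls cannot corrupt shared state, and Clear() keeps the backing
// array so no reallocation happens after the first call per thread.
public static class PerThreadBufferSketch
{
    [ThreadStatic]
    private static List<(int SliceIndex, int Token)>? _flattened;

    public static List<(int SliceIndex, int Token)> GetFlattened()
    {
        var list = _flattened ??= new List<(int SliceIndex, int Token)>(capacity: 4096);
        list.Clear();
        return list;
    }

    public static void Main()
    {
        var a = GetFlattened();
        a.Add((0, 42));
        var b = GetFlattened(); // same instance on this thread, cleared
        Console.WriteLine(ReferenceEquals(a, b) && b.Count == 0); // True
    }
}
```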

codecov bot commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.46%. Comparing base (feb6294) to head (22e277d).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   75.24%   75.46%   +0.22%     
==========================================
  Files         115      115              
  Lines        4751     4757       +6     
  Branches      798      798              
==========================================
+ Hits         3575     3590      +15     
+ Misses        861      854       -7     
+ Partials      315      313       -2     

@KSemenenko
Member

As usual, amazing PR
thanks a lot!

@KSemenenko KSemenenko merged commit 45b370f into managedcode:main Dec 9, 2025
5 checks passed