Single-pass writeString fast path for short strings in ByteBuffersDataOutput#16280
Open
neoremind wants to merge 2 commits into
Open
Single-pass writeString fast path for short strings in ByteBuffersDataOutput#16280neoremind wants to merge 2 commits into
neoremind wants to merge 2 commits into
Conversation
…teBuffersDataOutput
Contributor
Author
|
@dweiss would you mind taking a look? It's a small change that improves performance. Many thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
In #13863,
ByteBuffersDataOutput.writeString()was optimized to avoid allocatingBytesRefand copying bytes to the dest buffer, instead it encoded directly in place. Indeed, it requires two passes over the input string chars: firstcalcUTF16toUTF8Lengthto get the VInt length prefix, thenUTF16toUTF8for the utf8 encoding. The opportunity is: for short strings, we can save that first pass.What this PR does
This PR adds a single-pass fast path for short strings (charCount <= 42) where the max UTF-8 byte length is
42 * 3 = 126, it always fits as 1-byte VInt. So we know the VInt prefix size without needing to go over the string chars upfront. Reserve 1 byte, encode directly into the dest buffer, then backfill the length. For strings that don't hit the shortcut, fall to existing logic.To my understanding, this could benefit stored fields writes of short strings like business related keywords, IDs, titles, etc. Plus short strings like field infos, codec metadata, segment names, etc.
Benchmarks
I added a JMH benchmark comparing the new impl against the current across ASCII, CJK, and Latin-extended strings at various lengths, see here for keeping the current impl to do apple-to-apple compare. Target written byte size matches stored fields chunk sizes: 80KB (BEST_SPEED default), 480KB (BEST_COMPRESSION default), and 2MB (imagine customized larger chunk in store fields .fdt). The benchmark uses a resettable
ByteBuffersDataOutputstarting with 1KB blocks to mimic real-world workload.Results show notable gains on short strings with no regressions on medium/long/very large strings (only acceptable jitter as I saw) which fall to the unchanged logic.
Throughput in ops/s. Each run writes target written byte size into the buffer. Measured on EC2 m5.2xlarge.
See detailed results
80KB target (BEST_SPEED chunk size)
480KB target (BEST_COMPRESSION chunk size)
2MB target (larger workload)
More thoughts
I initially attempted a more aggressive approach: adding a second fast path also for 2-byte VInt (charCount 128–5461) and a
calcVIntSizeForUTF8Lengthutility method with early-exit scanning for ambiguous ranges. This showed strong wins for almost all setups but for configurations with larger block sizes or larger target written size (enlarged docs per chunk or chunk size). But for the default settings (80KB chunk / 1024 docs), there is one ~5% regression onascii_medium, plus it introduced extra branches, more complex logic. So I kept it simple: only the 1-byte VInt fast path. The code is straightforward, easy to read, and no regressions for all cases.