Perf: Replace UTF8Data's `Vec<Vec<u8>>` with `Vec<Utf8Char>` #6948

jacinta-stacks wants to merge 12 commits into stacks-network:develop from chore/utf8data-fixed-array-repr

Conversation
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Codecov Report

❌ Patch coverage is … — additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #6948   +/-   ##
===========================================
- Coverage    77.73%   70.46%    -7.27%
===========================================
  Files          412      412
  Lines       218667   218696      +29
  Branches       338      338
===========================================
- Hits        169981   154115   -15866
- Misses       48686    64581   +15895
```

... and 239 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
jcnelson
left a comment
I think the approach overall LGTM. I just had one question about how we construct `Utf8Char`.
benjamin-stacks
left a comment
Awesome work! Looks great, just left a few small notes.
I have to ask, though: Why aren't we just using built-in Rust strings or `char` vectors for all this? As far as I can tell, the only advantage over normal strings is random access in O(1) (which I don't think we really need?), and even that advantage disappears when using a `char` vector (i.e. essentially UTF-32).
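For context on the trade-off being asked about, a minimal check (string contents are just illustrative):

```rust
fn main() {
    // Rust's `char` is a 4-byte Unicode scalar value, so Vec<char> is
    // essentially UTF-32 and gives O(1) indexing by character position:
    assert_eq!(std::mem::size_of::<char>(), 4);

    let s = "héllo";
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(chars[1], 'é'); // O(1)

    // &str is UTF-8: indexing by character position needs an O(n)
    // scan over the bytes
    assert_eq!(s.chars().nth(1), Some('é'));
}
```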
```diff
  if bytes.len() > 1 {
      // We escape extended charset
-     result.push_str(&format!("\\u{{{}}}", hash::to_hex(&c[..])));
+     result.push_str(&format!("\\u{{{}}}", hash::to_hex(bytes)));
```
You didn't add this here, but I need to point out that this is wrong -- the number inside `\u{...}` isn't the Unicode code point, but the UTF-8 encoding.
```rust
let mut b = [0u8; 4];
let c = '🦄';
let u = c.encode_utf8(&mut b).as_bytes();
let esc = format!("{}", c.escape_default());
let dsp = format!("\\u{{{}}}", hash::to_hex(u));
// proper unicode escape
assert_eq!(esc, "\\u{1f984}");
// the logic from `UTF8Data::fmt`
assert_eq!(dsp, "\\u{f09fa684}");
```

Should we fix that? I think (hope) this shouldn't break consensus because I assume it's only used in developer-mode print and in other dev tools like REPLs?
My suggestion would be to add a to_char() method to Utf8Char, then all this could just be replaced with a single call to char::escape_default.
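A rough sketch of that suggestion (`to_char` is the reviewer's proposed method, written here as a free function over the stored bytes, not existing API):

```rust
// Decode the zero-padding-stripped UTF-8 bytes of one character back
// into a `char` (sketch of the proposed `Utf8Char::to_char`).
fn to_char(bytes: &[u8]) -> Option<char> {
    std::str::from_utf8(bytes).ok()?.chars().next()
}

fn main() {
    let c = '🦄';
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf).as_bytes();

    let decoded = to_char(encoded).unwrap();
    // escape_default emits the code point, i.e. the proper escape
    assert_eq!(decoded.escape_default().to_string(), "\\u{1f984}");
}
```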
Also, my understanding is that this formatting is not consensus-breaking and is only used for debugging and error messages (on the error side, it’s possible that some consensus tests may break due to snapshot recording, but those can be safely updated if needed)
```rust
// This first InvalidUTF8Encoding is logically unreachable: the escape regex rejects non-hex digits,
// so from_str_radix only sees valid hex and never errors here.
```
Not your code, but I don't think this comment is right -- the regex accepts any length, so from_str_radix can absolutely return an error from overflowing a u32.
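Concretely, `u32::from_str_radix` fails once the hex value exceeds `u32::MAX`, so a long-enough escape body does hit the error path:

```rust
fn main() {
    // 8 hex digits still fit in a u32...
    assert!(u32::from_str_radix("ffffffff", 16).is_ok());
    // ...but the regex doesn't bound the length, and 9 digits overflow
    assert!(u32::from_str_radix("fffffffff", 16).is_err());
}
```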
```diff
  } else {
-     let ascii_char = window[0..1].to_string().into_bytes();
-     data.push(ascii_char);
+     data.push(Utf8Char::from_char(
```
This change makes it even less obvious than it already was why the cursor += 1 isn't a giant vulnerability that allows you to crash all the nodes with a simple contract deploy transaction.
The answer (as you probably know, but I needed to do some goose chasing first) is that the lexer guarantees that even unicode string literals only contain printable ASCII characters (and anything outside that set has to be encoded).
Therefore, the first char in `window` (just like all chars in `window`) is only a single byte as UTF-8, and thus `cursor += 1` is safe.
The old code here explicitly called it `ascii_char` (and sliced the string right there, which would panic if it weren't one), so that at least made the assumption a little clearer.
Long story short, would you mind adding a comment like this for the next Ben?
```suggestion
// unicode string literals are guaranteed by the lexer to only contain
// ASCII characters, so we know that this character takes a single byte
// and advancing the cursor by 1 is safe
data.push(Utf8Char::from_char(
```
```rust
/// Returns the raw UTF-8 bytes with zero-padding stripped.
pub fn to_utf8_bytes(&self) -> Result<Vec<u8>, ClarityTypeError> {
    self.data.iter().try_fold(Vec::new(), |mut acc, c| {
        acc.extend_from_slice(c.as_bytes()?);
```
Super minor nit (feel free to leave as is): I think this can be just `extend` since `u8`s are `Copy`.
```diff
- acc.extend_from_slice(c.as_bytes()?);
+ acc.extend(c.as_bytes()?);
```
| } | ||
| Sequence(SequenceData::String(UTF8(value))) => { | ||
| let total_len: u32 = value.data.iter().fold(0u32, |len, c| len + c.len() as u32); | ||
| let total_len: u32 = value.data.iter().try_fold( |
Nit: This could be a method on UTF8Data, reducing the noise in this serialization function. Not sure what the name for that function would be, the concept of "length" is seriously overloaded in this context 😅
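A sketch of that suggestion (the method name `total_byte_len` and the simplified types here are invented for illustration, not the PR's actual API):

```rust
// Hypothetical helper hoisting the length computation out of the
// serializer. "Length" here means encoded UTF-8 bytes, not characters.
struct Utf8Char([u8; 4]); // zero-padded UTF-8 bytes, as in the PR

impl Utf8Char {
    fn byte_len(&self) -> usize {
        // simplified: count non-zero bytes (real code would derive the
        // width from the leading byte and handle U+0000 with care)
        self.0.iter().take_while(|&&b| b != 0).count()
    }
}

struct UTF8Data {
    data: Vec<Utf8Char>,
}

impl UTF8Data {
    /// Total encoded UTF-8 byte length of the whole string.
    fn total_byte_len(&self) -> u32 {
        self.data.iter().map(|c| c.byte_len() as u32).sum()
    }
}

fn main() {
    let s = UTF8Data {
        data: vec![
            Utf8Char([0x61, 0, 0, 0]),          // 'a', 1 byte
            Utf8Char([0xf0, 0x9f, 0xa6, 0x84]), // '🦄', 4 bytes
        ],
    };
    assert_eq!(s.total_byte_len(), 5);
}
```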
This builds on top of Jacinta's excellent work in stacks-network#6948 (and sits on top of that PR's branch), but it changes the representation of characters in `UTF8Data::data` to be native `char`s instead of four-byte arrays with pre-encoded UTF-8. The memory footprint is exactly the same; both the now-removed `Utf8Char` and the built-in UTF-32 `char` are four bytes. The advantages are:

- It requires less custom code for things that are essentially part of the Rust standard library
- It requires less defensive checking -- a `Utf8Char` could in theory contain invalid data, which required extra checks, while a Rust `char` is guaranteed to be a valid Unicode scalar value
- It's easier to read and understand
- Postponing UTF-8 encoding until it's actually needed might give a slight performance improvement, although that'll likely be negligible in practice, especially in light of the major perf improvements from Jacinta's original PR (re-running the benchmarks was inconclusive)
| } | ||
|
|
||
| /// Number of significant UTF-8 bytes (1–4). | ||
| pub fn byte_len(&self) -> Result<usize, ClarityTypeError> { |
With the introduction of new_unchecked() gated behind test flags (see #6948 (comment)), we could make byte_len() infallible by construction and simplify its return type to just usize.
| } | ||
|
|
||
| /// The significant bytes of this character. | ||
| pub fn as_bytes(&self) -> Result<&[u8], ClarityTypeError> { |
Making `byte_len` infallible would allow making `as_bytes` infallible as well and simplifying its return type to `&[u8]`.
@benjamin-stacks please remove @jcnelson as reviewer.
`UTF8Data` previously stored each Unicode codepoint as a separate `Vec<u8>`. Since UTF-8 encodes codepoints in 1–4 bytes, each character required its own heap allocation (24-byte `Vec` header + allocator overhead + 1–4 bytes of payload). A 1000-character string meant 1000+ heap allocations.

By storing each codepoint in a `Utf8Char` newtype (a `#[repr(transparent)]` wrapper around `[u8; 4]`, zero-padded), the entire string becomes a single contiguous `Vec<Utf8Char>`: one allocation, one memcpy to clone, and cache-friendly iteration. Zero-padding preserves lexicographic comparison semantics because UTF-8 byte ordering equals Unicode codepoint ordering. `Utf8Char` is `Copy`, so cloning a `Vec<Utf8Char>` is a single `memcpy`.

Changes:

- `Utf8Char` newtype wrapping `[u8; 4]` with `from_char`, `byte_len`, `as_bytes`, and `leading_byte` methods
- Replaced the per-character heap allocations (`Vec<Vec<u8>>`) with a single contiguous allocation (`Vec<Utf8Char>`) in `UTF8Data`, eliminating N heap allocations per UTF-8 string
- … (`string_utf8_from_bytes()`)
- … `SerializationError::SerializationFailure(e.to_string())` mapping for the underlying `ClarityTypeError`, since it seems unnecessary, but can revert this if desired

Benchmark results: run `cargo bench --bench utf8_data -p clarity-types` to reproduce.

Clone

Construction

Full bytes→data pipeline (UTF-8 validate + decode + collect)

Memory improvement

TLDR: UTF-8 strings now use one contiguous allocation instead of one heap allocation per character. Cloning is 69–294x faster, construction is 75–346x faster, and the full decode pipeline is 5–22x faster. Memory usage drops ~10x. Not 100% sure what the effect will be on block processing, but in theory contracts with heavy UTF-8 string manipulation should see a measurable execution speedup of maybe a couple percent.
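The representation described above can be sketched as follows (a minimal illustration based on the PR description; the derives and method body mirror the text, not the actual implementation):

```rust
// Sketch of the Utf8Char representation from the PR description.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Utf8Char([u8; 4]); // UTF-8 bytes of one codepoint, zero-padded

impl Utf8Char {
    fn from_char(c: char) -> Self {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf); // writes 1-4 bytes, the rest stays zero
        Utf8Char(buf)
    }
}

fn main() {
    // UTF-8 byte order matches codepoint order, and the 0x00 padding
    // never outranks a real byte, so the derived Ord preserves the
    // character ordering:
    let a = Utf8Char::from_char('a');  // [0x61, 0, 0, 0]
    let e = Utf8Char::from_char('é');  // [0xc3, 0xa9, 0, 0]
    let u = Utf8Char::from_char('🦄'); // [0xf0, 0x9f, 0xa6, 0x84]
    assert!(a < e && e < u);

    // Copy + 4-byte size is what makes cloning Vec<Utf8Char> one memcpy.
    assert_eq!(std::mem::size_of::<Utf8Char>(), 4);
}
```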