
Perf: Replace UTF8Data's Vec<Vec<u8>> with Vec<Utf8Char> #6948

Open
jacinta-stacks wants to merge 12 commits into stacks-network:develop from jacinta-stacks:chore/utf8data-fixed-array-repr

Conversation

@jacinta-stacks
Contributor

@jacinta-stacks jacinta-stacks commented Mar 2, 2026

UTF8Data previously stored each Unicode codepoint as a separate Vec<u8>. Since UTF-8 encodes codepoints in 1–4 bytes, each character required its own heap allocation (24-byte Vec header + allocator overhead + 1–4 bytes of payload). A 1000-character string meant 1000+ heap allocations.

By storing each codepoint in a Utf8Char newtype (a #[repr(transparent)] wrapper around [u8; 4], zero-padded), the entire string becomes a single contiguous Vec<Utf8Char>: one allocation, one memcpy to clone, and cache-friendly iteration.

Zero-padding preserves lexicographic comparison semantics because UTF-8 byte ordering equals Unicode codepoint ordering. Utf8Char is Copy, so cloning a Vec<Utf8Char> is a single memcpy.

Changes:

  • Added Utf8Char newtype wrapping [u8; 4] with from_char, byte_len, as_bytes, and leading_byte methods
  • Replaced per-codepoint heap allocation (Vec<Vec<u8>>) with a single contiguous allocation (Vec<Utf8Char>) in UTF8Data, eliminating N heap allocations per UTF-8 string
  • Added custom Serialize/Deserialize impls to preserve backward-compatible JSON/serde format
  • Updated consensus serialization write path to extract significant bytes from zero-padded arrays (read path unchanged — goes through string_utf8_from_bytes())
  • Also, I couldn't help sneaking in what I think is a sensible cleanup: removed the SerializationError::SerializationFailure(e.to_string()) mapping for the underlying ClarityTypeError, since it seemed unnecessary. Happy to revert if desired.
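For illustration, the newtype described above might look roughly like this. This is a sketch, not the PR's exact code — in particular, the PR's byte_len and as_bytes return Result, while this shows the infallible variant discussed later in review:

```rust
/// Sketch of the Utf8Char newtype: a zero-padded, pre-encoded UTF-8
/// character. #[repr(transparent)] keeps it layout-identical to [u8; 4].
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Utf8Char([u8; 4]);

impl Utf8Char {
    /// Encode a char into a zero-padded four-byte array.
    pub fn from_char(c: char) -> Self {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf);
        Utf8Char(buf)
    }

    /// The leading byte of the encoding.
    pub fn leading_byte(&self) -> u8 {
        self.0[0]
    }

    /// Number of significant UTF-8 bytes (1-4), read off the leading byte.
    pub fn byte_len(&self) -> usize {
        match self.0[0] {
            b if b < 0x80 => 1,
            b if b < 0xE0 => 2,
            b if b < 0xF0 => 3,
            _ => 4,
        }
    }

    /// The significant bytes, with zero-padding stripped.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0[..self.byte_len()]
    }
}

fn main() {
    // One contiguous allocation for the whole string; cloning is a memcpy.
    let data: Vec<Utf8Char> = "Stx🦄".chars().map(Utf8Char::from_char).collect();
    assert_eq!(data.len(), 4);
    assert_eq!(data[3].byte_len(), 4);
    // Zero-padding preserves ordering: UTF-8 byte order == code point order.
    assert!(Utf8Char::from_char('a') < Utf8Char::from_char('é'));
}
```

Because the derived ordering compares the zero-padded bytes lexicographically, and UTF-8 byte order matches code point order, sorting `Utf8Char` values agrees with sorting the decoded characters.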

Benchmark results (run `cargo bench --bench utf8_data -p clarity-types` to reproduce):

Clone

| Operation | Old (`Vec<Vec<u8>>`) | New (`Vec<Utf8Char>`) | Speedup |
| --- | --- | --- | --- |
| clone ascii 100 | 1,569 ns | 23 ns | 70x |
| clone multibyte 100 | 1,569 ns | 23 ns | 69x |
| clone ascii 1000 | 14,958 ns | 52 ns | 285x |
| clone multibyte 1000 | 15,103 ns | 51 ns | 294x |

Construction

| Operation | Old (`Vec<Vec<u8>>`) | New (`Vec<Utf8Char>`) | Speedup |
| --- | --- | --- | --- |
| construct ascii 100 | 1,432 ns | 19 ns | 76x |
| construct multibyte 100 | 1,420 ns | 19 ns | 75x |
| construct ascii 1000 | 13,610 ns | 42 ns | 328x |
| construct multibyte 1000 | 13,828 ns | 40 ns | 346x |

Full bytes→data pipeline (UTF-8 validate + decode + collect)

| Operation | Old (`Vec<Vec<u8>>`) | New (`Vec<Utf8Char>`) | Speedup |
| --- | --- | --- | --- |
| value ascii 100 | 1,663 ns | 130 ns | 13x |
| value multibyte 100 | 1,830 ns | 338 ns | 5.4x |
| value ascii 1000 | 15,075 ns | 671 ns | 22x |
| value multibyte 1000 | 17,625 ns | 2,961 ns | 6.0x |

Memory improvement

| String length | Old heap usage | New heap usage |
| --- | --- | --- |
| 100 chars | ~100 allocations (~4.1 KB) | 1 allocation (400 B) |
| 1000 chars | ~1000 allocations (~41 KB) | 1 allocation (4 KB) |

TLDR: UTF-8 strings now use one contiguous allocation instead of one heap allocation per character. Cloning is 69–294x faster, construction is 75–346x faster, and the full decode pipeline is 5–22x faster. Heap usage drops ~10x. I'm not certain of the effect on block processing, but in theory contracts with heavy UTF-8 string manipulation should see a measurable (if small) execution speedup.

Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
@codecov

codecov Bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 96.55172% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.46%. Comparing base (58545bc) to head (a0b7669).

Files with missing lines Patch % Lines
clarity-types/src/types/mod.rs 98.55% 1 Missing ⚠️
clarity-types/src/types/serialization.rs 91.66% 1 Missing ⚠️
clarity/src/vm/functions/arithmetic.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6948      +/-   ##
===========================================
- Coverage    77.73%   70.46%   -7.27%     
===========================================
  Files          412      412              
  Lines       218667   218696      +29     
  Branches       338      338              
===========================================
- Hits        169981   154115   -15866     
- Misses       48686    64581   +15895     
Files with missing lines Coverage Δ
clarity/src/vm/ast/parser/v2/mod.rs 58.37% <100.00%> (-1.94%) ⬇️
clarity/src/vm/functions/conversions.rs 90.22% <100.00%> (-2.73%) ⬇️
clarity/src/vm/types/mod.rs 80.48% <ø> (ø)
clarity-types/src/types/mod.rs 85.95% <98.55%> (-3.07%) ⬇️
clarity-types/src/types/serialization.rs 92.06% <91.66%> (-0.96%) ⬇️
clarity/src/vm/functions/arithmetic.rs 87.85% <0.00%> (ø)

... and 239 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58545bc...a0b7669. Read the comment docs.


brice-stacks
brice-stacks previously approved these changes Mar 11, 2026
Contributor

@brice-stacks brice-stacks left a comment


nice!

Comment thread clarity-types/src/types/mod.rs Outdated
Comment thread clarity-types/src/types/serialization.rs Outdated
Member

@jcnelson jcnelson left a comment


I think the approach overall LGTM. I just had one question about how we construct Utf8Char

Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
brice-stacks
brice-stacks previously approved these changes Mar 16, 2026
Contributor

@brice-stacks brice-stacks left a comment


lgtm

Contributor

@benjamin-stacks benjamin-stacks left a comment


Awesome work! Looks great, just left a few small notes.

I have to ask, though: Why aren't we just using built-in Rust strings or char vectors for all this? As far as I can tell, the only advantage over normal strings is O(1) random access (which I don't think we really need?), and even that advantage disappears when using a char vector (i.e. essentially UTF-32).

```diff
 if bytes.len() > 1 {
     // We escape extended charset
-    result.push_str(&format!("\\u{{{}}}", hash::to_hex(&c[..])));
+    result.push_str(&format!("\\u{{{}}}", hash::to_hex(bytes)));
```
Contributor


You didn't add this here, but I need to point out that this is wrong -- the number inside \u{...} isn't the unicode code point, but the utf-8 encoding.

```rust
let mut b = [0u8; 4];
let c = '🦄';
let u = c.encode_utf8(&mut b).as_bytes();

let esc = format!("{}", c.escape_default());
let dsp = format!("\\u{{{}}}", hash::to_hex(u));

// proper unicode escape
assert_eq!(esc, "\\u{1f984}");

// the logic from `UTF8Data::fmt`
assert_eq!(dsp, "\\u{f09fa684}");
```

Should we fix that? I think (hope) this shouldn't break consensus because I assume it's only used in developer-mode print and in other dev tools like REPLs?

My suggestion would be to add a to_char() method to Utf8Char, then all this could just be replaced with a single call to char::escape_default.
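For illustration, the suggested fix could look like this (`to_char` is sketched here as a free function over the significant bytes; the review proposes it as a method on `Utf8Char`):

```rust
// Hypothetical sketch of the reviewer's suggestion: decode the significant
// UTF-8 bytes back to a `char`, then let the standard library produce the
// proper `\u{...}` escape of the code point.
fn to_char(bytes: &[u8]) -> Option<char> {
    std::str::from_utf8(bytes).ok()?.chars().next()
}

fn main() {
    let c = to_char("🦄".as_bytes()).expect("valid UTF-8");
    // `char::escape_default` yields the real code point, not the UTF-8 bytes.
    assert_eq!(c.escape_default().to_string(), "\\u{1f984}");
}
```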

Contributor


Also, my understanding is that this formatting is not consensus-breaking and is only used for debugging and error messages (on the error side, it’s possible that some consensus tests may break due to snapshot recording, but those can be safely updated if needed)

Comment on lines 1158 to 1159
```rust
// This first InvalidUTF8Encoding is logically unreachable: the escape regex rejects non-hex digits,
// so from_str_radix only sees valid hex and never errors here.
```
Contributor


Not your code, but I don't think this comment is right -- the regex accepts any length, so from_str_radix can absolutely return an error from overflowing a u32.
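This is easy to confirm: the escape regex does not bound the digit count, and `u32::from_str_radix` returns an overflow error once the hex value exceeds `u32::MAX`:

```rust
fn main() {
    // 8 hex digits fit in a u32...
    assert!(u32::from_str_radix("ffffffff", 16).is_ok());
    // ...but 9 hex digits overflow, so from_str_radix can indeed error here.
    assert!(u32::from_str_radix("1ffffffff", 16).is_err());
}
```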

```diff
 } else {
-    let ascii_char = window[0..1].to_string().into_bytes();
-    data.push(ascii_char);
+    data.push(Utf8Char::from_char(
```
Contributor


This change makes it even less obvious than it already was why the cursor += 1 isn't a giant vulnerability that allows you to crash all the nodes with a simple contract deploy transaction.

The answer (as you probably know, but I needed to do some goose chasing first) is that the lexer guarantees that even unicode string literals only contain printable ASCII characters (and anything outside that set has to be encoded).

Therefore, the first char in window (just like all chars in window) is only a single byte as UTF-8, and thus cursor += 1 is safe.

The old code here explicitly called it ascii_char (and sliced the string right there, which would panic if it weren't one), so that at least made the assumption a little clearer.

Long story short, would you mind adding a comment like this for the next Ben?

Suggested change

```diff
-data.push(Utf8Char::from_char(
+// unicode string literals are guaranteed by the lexer to only contain
+// ASCII characters, so we know that this character takes a single byte
+// and advancing the cursor by 1 is safe
+data.push(Utf8Char::from_char(
```

```rust
/// Returns the raw UTF-8 bytes with zero-padding stripped.
pub fn to_utf8_bytes(&self) -> Result<Vec<u8>, ClarityTypeError> {
    self.data.iter().try_fold(Vec::new(), |mut acc, c| {
        acc.extend_from_slice(c.as_bytes()?);
```
Contributor


Super minor nit (feel free to leave as is): I think this can be just extend, since u8 is Copy.

Suggested change

```diff
-acc.extend_from_slice(c.as_bytes()?);
+acc.extend(c.as_bytes()?);
```

```diff
 }
 Sequence(SequenceData::String(UTF8(value))) => {
-    let total_len: u32 = value.data.iter().fold(0u32, |len, c| len + c.len() as u32);
+    let total_len: u32 = value.data.iter().try_fold(
```
Contributor


Nit: This could be a method on UTF8Data, reducing the noise in this serialization function. Not sure what the name for that function would be, the concept of "length" is seriously overloaded in this context 😅
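For illustration, such a helper might look like this (names and the checked-add behavior are assumptions, not the PR's code):

```rust
/// Significant-byte count of a UTF-8 character from its leading byte.
/// Assumes the encoding is valid (the PR's Utf8Char invariant).
fn utf8_byte_len(leading: u8) -> u32 {
    match leading {
        b if b < 0x80 => 1,
        b if b < 0xE0 => 2,
        b if b < 0xF0 => 3,
        _ => 4,
    }
}

/// Hypothetical UTF8Data helper: total encoded byte length of the
/// zero-padded character array, with overflow checking.
fn total_utf8_byte_len(data: &[[u8; 4]]) -> Option<u32> {
    data.iter()
        .try_fold(0u32, |len, c| len.checked_add(utf8_byte_len(c[0])))
}

fn main() {
    let mut unicorn = [0u8; 4];
    '🦄'.encode_utf8(&mut unicorn);
    let mut ascii = [0u8; 4];
    'a'.encode_utf8(&mut ascii);
    // "a🦄" encodes to 1 + 4 = 5 UTF-8 bytes.
    assert_eq!(total_utf8_byte_len(&[ascii, unicorn]), Some(5));
}
```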

benjamin-stacks added a commit to benjamin-stacks/stacks-core that referenced this pull request Mar 23, 2026
This builds on top of Jacinta's excellent work in stacks-network#6948 (and sits on top
of that PR's branch), but it changes the representation of characters in
`UTF8Data::data` to be native `char`s instead of four-byte arrays with
pre-encoded UTF-8.

The memory footprint is exactly the same; both the now-removed
`Utf8Char` and the built-in UTF-32 `char` occupy four bytes.

The advantages are:
- It requires less custom code for things that are essentially part of
  the Rust standard library
- It requires less defensive checking -- a `Utf8Char` could in theory
  contain invalid data, which required extra checks, while a Rust `char`
  is guaranteed to be valid unicode
- It's easier to read and understand
- Postponing UTF-8 encoding until it's actually needed might give a
  slight performance improvement, although that'll likely be negligible
  in practice, especially in light of the major perf improvements from
  Jacinta's original PR (re-running the benchmarks was inconclusive)
```rust
}

/// Number of significant UTF-8 bytes (1–4).
pub fn byte_len(&self) -> Result<usize, ClarityTypeError> {
```
Contributor

@federico-stacks federico-stacks Mar 26, 2026


With the introduction of new_unchecked() gated behind test flags (see #6948 (comment)), we could make byte_len() infallible by construction and simplify its return type to just usize.

```rust
}

/// The significant bytes of this character.
pub fn as_bytes(&self) -> Result<&[u8], ClarityTypeError> {
```
Contributor


Making byte_len infallible would also allow making as_bytes infallible, simplifying its return type to &[u8].

@dhaney-stacks
Contributor

@benjamin-stacks please remove @jcnelson as reviewer.

@benjamin-stacks benjamin-stacks removed the request for review from jcnelson April 20, 2026 07:09