Perf: Replace UTF8Data's `Vec<Vec<u8>>` with `Vec<Utf8Char>` #6948

jacinta-stacks wants to merge 12 commits into stacks-network:develop from chore/utf8data-fixed-array-repr

Conversation
Signed-off-by: Jacinta Ferrant <236437600+jacinta-stacks@users.noreply.github.com>
Codecov Report

❌ Patch coverage is … — additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #6948   +/-   ##
===========================================
- Coverage    77.73%   70.46%    -7.27%
===========================================
  Files          412      412
  Lines       218667   218696      +29
  Branches       338      338
===========================================
- Hits        169981   154115   -15866
- Misses       48686    64581   +15895
```

... and 239 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
jcnelson
left a comment
I think the approach overall LGTM. I just had one question about how we construct `Utf8Char`.
benjamin-stacks
left a comment
Awesome work! Looks great, just left a few small notes.
I have to ask, though: Why aren't we just using built-in Rust strings or `char` vectors for all this? As far as I can tell, the only advantage over normal strings is random access in O(1) (which I don't think we really need?), and even that advantage disappears when using a `char` vector (i.e. essentially UTF-32).
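For context on the trade-off being asked about, a minimal check (string contents are just illustrative):

```rust
fn main() {
    // Rust's `char` is a 4-byte Unicode scalar value, so Vec<char> is
    // essentially UTF-32 and gives O(1) indexing by character position:
    assert_eq!(std::mem::size_of::<char>(), 4);

    let s = "héllo";
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(chars[1], 'é'); // O(1)

    // &str is UTF-8: indexing by character position needs an O(n)
    // scan over the bytes
    assert_eq!(s.chars().nth(1), Some('é'));
}
```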
```diff
  if bytes.len() > 1 {
      // We escape extended charset
-     result.push_str(&format!("\\u{{{}}}", hash::to_hex(&c[..])));
+     result.push_str(&format!("\\u{{{}}}", hash::to_hex(bytes)));
```
You didn't add this here, but I need to point out that this is wrong -- the number inside `\u{...}` isn't the Unicode code point, but the UTF-8 encoding.
```rust
let mut b = [0u8; 4];
let c = '🦄';
let u = c.encode_utf8(&mut b).as_bytes();
let esc = format!("{}", c.escape_default());
let dsp = format!("\\u{{{}}}", hash::to_hex(u));
// proper unicode escape
assert_eq!(esc, "\\u{1f984}");
// the logic from `UTF8Data::fmt`
assert_eq!(dsp, "\\u{f09fa684}");
```

Should we fix that? I think (hope) this shouldn't break consensus because I assume it's only used in developer-mode print and in other dev tools like REPLs?
My suggestion would be to add a to_char() method to Utf8Char, then all this could just be replaced with a single call to char::escape_default.
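A rough sketch of that suggestion (`to_char` is the reviewer's proposed method, written here as a free function over the stored bytes, not existing API):

```rust
// Decode the zero-padding-stripped UTF-8 bytes of one character back
// into a `char` (sketch of the proposed `Utf8Char::to_char`).
fn to_char(bytes: &[u8]) -> Option<char> {
    std::str::from_utf8(bytes).ok()?.chars().next()
}

fn main() {
    let c = '🦄';
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf).as_bytes();

    let decoded = to_char(encoded).unwrap();
    // escape_default emits the code point, i.e. the proper escape
    assert_eq!(decoded.escape_default().to_string(), "\\u{1f984}");
}
```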
Also, my understanding is that this formatting is not consensus-breaking and is only used for debugging and error messages (on the error side, it’s possible that some consensus tests may break due to snapshot recording, but those can be safely updated if needed)
```rust
// This first InvalidUTF8Encoding is logically unreachable: the escape regex rejects non-hex digits,
// so from_str_radix only sees valid hex and never errors here.
```
Not your code, but I don't think this comment is right -- the regex accepts any length, so from_str_radix can absolutely return an error from overflowing a u32.
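Concretely, `u32::from_str_radix` fails once the hex value exceeds `u32::MAX`, so a long-enough escape body does hit the error path:

```rust
fn main() {
    // 8 hex digits still fit in a u32...
    assert!(u32::from_str_radix("ffffffff", 16).is_ok());
    // ...but the regex doesn't bound the length, and 9 digits overflow
    assert!(u32::from_str_radix("fffffffff", 16).is_err());
}
```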
```diff
  } else {
-     let ascii_char = window[0..1].to_string().into_bytes();
-     data.push(ascii_char);
+     data.push(Utf8Char::from_char(
```
This change makes it even less obvious than it already was why the cursor += 1 isn't a giant vulnerability that allows you to crash all the nodes with a simple contract deploy transaction.
The answer (as you probably know, but I needed to do some goose chasing first) is that the lexer guarantees that even unicode string literals only contain printable ASCII characters (and anything outside that set has to be encoded).
Therefore, the first char in `window` (just like all chars in `window`) is only a single byte as UTF-8, and thus `cursor += 1` is safe.
The old code here explicitly called it `ascii_char` (and sliced the string right there, which would panic if it weren't one), so that at least made the assumption a little clearer.
Long story short, would you mind adding a comment like this for the next Ben?
```suggestion
// unicode string literals are guaranteed by the lexer to only contain
// ASCII characters, so we know that this character takes a single byte
// and advancing the cursor by 1 is safe
data.push(Utf8Char::from_char(
```
```rust
/// Returns the raw UTF-8 bytes with zero-padding stripped.
pub fn to_utf8_bytes(&self) -> Result<Vec<u8>, ClarityTypeError> {
    self.data.iter().try_fold(Vec::new(), |mut acc, c| {
        acc.extend_from_slice(c.as_bytes()?);
```
Super minor nit (feel free to leave as is): I think this can be just `extend` since `u8`s are `Copy`.
```diff
- acc.extend_from_slice(c.as_bytes()?);
+ acc.extend(c.as_bytes()?);
```
| } | ||
| Sequence(SequenceData::String(UTF8(value))) => { | ||
| let total_len: u32 = value.data.iter().fold(0u32, |len, c| len + c.len() as u32); | ||
| let total_len: u32 = value.data.iter().try_fold( |
Nit: This could be a method on UTF8Data, reducing the noise in this serialization function. Not sure what the name for that function would be, the concept of "length" is seriously overloaded in this context 😅
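A sketch of that suggestion (the method name `total_byte_len` and the simplified types here are invented for illustration, not the PR's actual API):

```rust
// Hypothetical helper hoisting the length computation out of the
// serializer. "Length" here means encoded UTF-8 bytes, not characters.
struct Utf8Char([u8; 4]); // zero-padded UTF-8 bytes, as in the PR

impl Utf8Char {
    fn byte_len(&self) -> usize {
        // simplified: count non-zero bytes (real code would derive the
        // width from the leading byte and handle U+0000 with care)
        self.0.iter().take_while(|&&b| b != 0).count()
    }
}

struct UTF8Data {
    data: Vec<Utf8Char>,
}

impl UTF8Data {
    /// Total encoded UTF-8 byte length of the whole string.
    fn total_byte_len(&self) -> u32 {
        self.data.iter().map(|c| c.byte_len() as u32).sum()
    }
}

fn main() {
    let s = UTF8Data {
        data: vec![
            Utf8Char([0x61, 0, 0, 0]),          // 'a', 1 byte
            Utf8Char([0xf0, 0x9f, 0xa6, 0x84]), // '🦄', 4 bytes
        ],
    };
    assert_eq!(s.total_byte_len(), 5);
}
```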
This builds on top of Jacinta's excellent work in stacks-network#6948 (and sits on top of that PR's branch), but it changes the representation of characters in `UTF8Data::data` to be native `char`s instead of four-byte arrays with pre-encoded UTF-8. The memory footprint is exactly the same; both the now-removed `Utf8Char` and the built-in UTF-32 `char` are four bytes. The advantages are:

- It requires less custom code for things that are essentially part of the Rust standard library
- It requires less defensive checking -- a `Utf8Char` could in theory contain invalid data, which required extra checks, while a Rust `char` is guaranteed to be a valid Unicode scalar value
- It's easier to read and understand
- Postponing UTF-8 encoding until it's actually needed might give a slight performance improvement, although that'll likely be negligible in practice, especially in light of the major perf improvements from Jacinta's original PR (re-running the benchmarks was inconclusive)
| } | ||
|
|
||
| /// Number of significant UTF-8 bytes (1–4). | ||
| pub fn byte_len(&self) -> Result<usize, ClarityTypeError> { |
With the introduction of new_unchecked() gated behind test flags (see #6948 (comment)), we could make byte_len() infallible by construction and simplify its return type to just usize.
| } | ||
|
|
||
| /// The significant bytes of this character. | ||
| pub fn as_bytes(&self) -> Result<&[u8], ClarityTypeError> { |
Making `byte_len` infallible would allow making `as_bytes` infallible as well and simplifying its return type to `&[u8]`.
@benjamin-stacks please remove @jcnelson as reviewer.
`UTF8Data` previously stored each Unicode codepoint as a separate `Vec<u8>`. Since UTF-8 encodes codepoints in 1–4 bytes, each character required its own heap allocation (24-byte `Vec` header + allocator overhead + 1–4 bytes of payload). A 1000-character string meant 1000+ heap allocations.

By storing each codepoint in a `Utf8Char` newtype (a `#[repr(transparent)]` wrapper around `[u8; 4]`, zero-padded), the entire string becomes a single contiguous `Vec<Utf8Char>`: one allocation, one memcpy to clone, and cache-friendly iteration. Zero-padding preserves lexicographic comparison semantics because UTF-8 byte ordering equals Unicode codepoint ordering. `Utf8Char` is `Copy`, so cloning a `Vec<Utf8Char>` is a single `memcpy`.

Changes:

- `Utf8Char` newtype wrapping `[u8; 4]` with `from_char`, `byte_len`, `as_bytes`, and `leading_byte` methods
- Replaced the per-character heap allocations (`Vec<Vec<u8>>`) with a single contiguous allocation (`Vec<Utf8Char>`) in `UTF8Data`, eliminating N heap allocations per UTF-8 string
- … (`string_utf8_from_bytes()`)
- … `SerializationError::SerializationFailure(e.to_string())` mapping for the underlying `ClarityTypeError`, since it seems unnecessary, but can revert this if desired

Benchmark results: run `cargo bench --bench utf8_data -p clarity-types` to reproduce.

Clone

Construction

Full bytes→data pipeline (UTF-8 validate + decode + collect)

Memory improvement

TLDR: UTF-8 strings now use one contiguous allocation instead of one heap allocation per character. Cloning is 69–294x faster, construction is 75–346x faster, and the full decode pipeline is 5–22x faster. Memory usage drops ~10x. Not 100% sure what the effect will be on block processing, but in theory contracts with heavy UTF-8 string manipulation should see a measurable execution speedup of maybe a couple percent.
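The representation described above can be sketched as follows (a minimal illustration based on the PR description; the derives and method body mirror the text, not the actual implementation):

```rust
// Sketch of the Utf8Char representation from the PR description.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Utf8Char([u8; 4]); // UTF-8 bytes of one codepoint, zero-padded

impl Utf8Char {
    fn from_char(c: char) -> Self {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf); // writes 1-4 bytes, the rest stays zero
        Utf8Char(buf)
    }
}

fn main() {
    // UTF-8 byte order matches codepoint order, and the 0x00 padding
    // never outranks a real byte, so the derived Ord preserves the
    // character ordering:
    let a = Utf8Char::from_char('a');  // [0x61, 0, 0, 0]
    let e = Utf8Char::from_char('é');  // [0xc3, 0xa9, 0, 0]
    let u = Utf8Char::from_char('🦄'); // [0xf0, 0x9f, 0xa6, 0x84]
    assert!(a < e && e < u);

    // Copy + 4-byte size is what makes cloning Vec<Utf8Char> one memcpy.
    assert_eq!(std::mem::size_of::<Utf8Char>(), 4);
}
```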