
Provide access to a wrapper that validates that bytes being read are valid UTF-8 #947

Draft
dralley wants to merge 8 commits into tafia:master from dralley:assume-utf8

Conversation

Collaborator

@dralley dralley commented Feb 22, 2026

This is a small stepping stone towards #158

This adds 4 new constructors under #[cfg(not(feature = "encoding"))]:

  • Reader::from_reader_validating()
  • Reader::from_file_validating()
  • NsReader::from_reader_validating()
  • NsReader::from_file_validating()

These wrap the reader with Utf8BytesReader. Utf8BytesReader is currently only implemented for #[cfg(not(feature = "encoding"))], as transparent decoding is out of scope for this PR. If encoding is enabled, it is just a pass-through shim implementation; the goal is to eventually implement that side with decoding functionality instead of validation.

This PR does not, and will not, make any additional public API changes. Those will only take place once the underpinnings are in place and confirmed sound.

Obviously the long-term plan is to not need these separate constructors, but given the scope of the work I thought it prudent to touch the current API surface as little as possible.

fn from(error: IoError) -> Error {
Self::Io(Arc::new(error))
match error.kind() {
IoErrorKind::InvalidData => Self::Encoding(error.downcast::<EncodingError>().expect(
Collaborator Author

io::Error::downcast() was apparently only stabilized in Rust 1.79: https://doc.rust-lang.org/std/io/struct.Error.html#method.downcast

So we would need to bump the MSRV.
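A minimal sketch of the downcasting idea, for illustration only. The `EncodingError` struct and the reduced `Error` enum here are hypothetical stand-ins, not the crate's real types (the real `Io` variant wraps the error in an `Arc`). Unlike the `.expect(...)` in the diff, this sketch falls back to `Error::Io` when the payload is some other kind of `InvalidData`.

```rust
use std::error::Error as StdError;
use std::fmt;
use std::io;

/// Hypothetical stand-in for the crate's `EncodingError`.
#[derive(Debug, PartialEq)]
pub struct EncodingError {
    pub valid_up_to: usize,
}

impl fmt::Display for EncodingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid UTF-8 after {} valid bytes", self.valid_up_to)
    }
}

impl StdError for EncodingError {}

/// Hypothetical, reduced version of the crate-level error enum.
#[derive(Debug)]
pub enum Error {
    Io(io::Error),
    Encoding(EncodingError),
}

impl From<io::Error> for Error {
    fn from(error: io::Error) -> Self {
        match error.kind() {
            // `io::Error::downcast` requires Rust 1.79 or newer.
            io::ErrorKind::InvalidData => match error.downcast::<EncodingError>() {
                Ok(encoding) => Error::Encoding(encoding),
                // Some other `InvalidData` payload: keep it as a plain I/O error.
                Err(other) => Error::Io(other),
            },
            _ => Error::Io(error),
        }
    }
}
```

A reader can then report a validation failure as `io::Error::new(io::ErrorKind::InvalidData, EncodingError { .. })` and the conversion recovers the typed error on the other side.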

}
}

impl<R: Read> Read for Utf8ValidatingReader<R> {
Collaborator Author

@dralley dralley Feb 22, 2026

I need to do more testing and review on this implementation myself as well; I don't trust it yet.


codecov-commenter commented Feb 22, 2026


Codecov Report

❌ Patch coverage is 92.99848% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.68%. Comparing base (2b21d40) to head (9fbf135).
⚠️ Report is 26 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/encoding.rs | 94.72% | 33 Missing ⚠️ |
| src/reader/ns_reader.rs | 0.00% | 6 Missing ⚠️ |
| src/reader/buffered_reader.rs | 60.00% | 4 Missing ⚠️ |
| src/events/attributes.rs | 0.00% | 2 Missing ⚠️ |
| src/errors.rs | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #947      +/-   ##
==========================================
+ Coverage   55.00%   57.68%   +2.68%     
==========================================
  Files          44       44              
  Lines       16816    18207    +1391     
==========================================
+ Hits         9249    10503    +1254     
- Misses       7567     7704     +137     
Flag Coverage Δ
unittests 57.68% <92.99%> (+2.68%) ⬆️


///
/// [`Utf8BytesReader`]: crate::encoding::Utf8BytesReader
#[cfg(not(feature = "encoding"))]
pub fn from_reader_validating(reader: R) -> Self {
Collaborator Author

@dralley dralley Feb 22, 2026

New constructors vs. an opt-in feature is an open question; I have no particularly strong feelings. I leaned towards this route because Reader::from_reader() is frequently used with slices, and in that case we would be throwing a buffer into the mix, which would require changing signatures, etc. I wanted to avoid changing the public API as much as possible for now.

Collaborator

@Mingun Mingun left a comment

It seems to me that, rather than implementing our own decoding stream, we could use encoding_rs_io or a similar library to do the conversion.

I also just created #948 to show you where we could change the reader from one that guesses the encoding to one that transparently decodes. It seems to me that honest decoding should be introduced at the level of a slightly higher-level XmlReader.


// Read more data from the underlying reader
let read_size = buf.len().max(64); // Read at least 64 bytes for efficiency
let mut temp = vec![0u8; read_size];
Collaborator

Creating a new Vec inside the loop in a hot path will ruin performance. You already have self.buffer, so why not read into it? Or at least read into a fixed-size buffer. Or, even better, require BufRead and get buffers from it.

Collaborator Author

@dralley dralley Feb 22, 2026

Yes, that is correct. As mentioned, this is really a proof-of-concept-level implementation; it needs some iteration before being production-ready. But I can address this in the next push.
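For illustration, a rough sketch of the "read into a buffer you already own" suggestion, under stated assumptions. `ReusingValidator` and its fields are hypothetical and this is not the PR's `Utf8ValidatingReader`; the 8 KiB buffer size is arbitrary. The only allocation happens in `new`, and a character split across two reads is kept at the front of the same buffer until its remaining bytes arrive.

```rust
use std::io::{self, Read};
use std::str;

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Hypothetical validating wrapper; not the PR's `Utf8ValidatingReader`.
pub struct ReusingValidator<R> {
    inner: R,
    buf: Box<[u8]>, // allocated once in `new`, reused by every `read`
    start: usize,   // next validated byte to hand out
    valid: usize,   // end of the validated region
    filled: usize,  // end of the bytes read so far; `valid..filled` is a split character
    bad: bool,      // definitely invalid bytes follow the validated region
}

impl<R: Read> ReusingValidator<R> {
    pub fn new(inner: R) -> Self {
        let buf = vec![0u8; 8 * 1024].into_boxed_slice();
        Self { inner, buf, start: 0, valid: 0, filled: 0, bad: false }
    }

    /// Refills and validates the buffer. Returns `Ok(false)` on a clean end of input.
    fn refill(&mut self) -> io::Result<bool> {
        if self.bad {
            return Err(invalid("invalid UTF-8 sequence"));
        }
        // Move the unvalidated tail (at most 3 bytes) to the front of the buffer.
        self.buf.copy_within(self.valid..self.filled, 0);
        self.filled -= self.valid;
        self.start = 0;
        self.valid = 0;
        loop {
            let n = self.inner.read(&mut self.buf[self.filled..])?;
            if n == 0 {
                return if self.filled == 0 {
                    Ok(false)
                } else {
                    Err(invalid("input ends inside a UTF-8 sequence"))
                };
            }
            self.filled += n;
            if let Err(e) = str::from_utf8(&self.buf[..self.filled]) {
                self.valid = e.valid_up_to();
                // `error_len()` is `None` only when the data ends mid-character.
                self.bad = e.error_len().is_some();
            } else {
                self.valid = self.filled;
            }
            if self.valid > 0 {
                return Ok(true);
            }
            if self.bad {
                return Err(invalid("invalid UTF-8 sequence"));
            }
            // Only part of one character so far: keep reading.
        }
    }
}

impl<R: Read> Read for ReusingValidator<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        if self.start == self.valid && !self.refill()? {
            return Ok(0);
        }
        let n = out.len().min(self.valid - self.start);
        out[..n].copy_from_slice(&self.buf[self.start..self.start + n]);
        self.start += n;
        Ok(n)
    }
}
```

Valid bytes that precede an invalid sequence are still handed out first; the error surfaces on the following `read` call. Requiring `BufRead` instead would avoid the extra copy into `buf`, at the cost of a small carry buffer for split characters.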

if let Some(encoding) = $reader.detect_encoding() $(.$await)? ? {
if $self.state.encoding.can_be_refined() {
$self.state.encoding = crate::reader::EncodingRef::BomDetected(encoding);
$self.state.encoding = crate::reader::EncodingRef::BomDetected(encoding.encoder());
Collaborator Author

@dralley dralley Feb 22, 2026

I'm not sure if this was ever strictly correct: detect_encoding() would return Some((encoder, bom_len)) even when there was no BOM per se. So for an encoding that is similar to or compatible with UTF-8 but is not UTF-8, I would think this provides false certainty about BOM presence.

edit: to be clear, if it was wrong before, it's still wrong.

It does not currently do any decoding, but this provides a place where
the validating and decoding functionality can be abstracted.
The goal is to adopt this functionality into the standard constructors,
but backwards compatibility is tricky - this gives more room to
experiment first.

Reader::from_reader_validating()
Reader::from_file_validating()
NsReader::from_reader_validating()
NsReader::from_file_validating()

(when "encoding" feature is not enabled)

Also, add some tests for validation errors being bubbled up through the
reader.
Do all of the plumbing necessary to return EncodingError directly from
Utf8ValidatingReader using IoError::InvalidData + error downcasting.

The Utf8 variant of EncodingError now holds an error enum, as we cannot
create instances of Utf8Error ourselves.
/// |`3C 3F 78 6D`|UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
#[cfg(feature = "encoding")]
pub fn detect_encoding(bytes: &[u8]) -> Option<(&'static Encoding, usize)> {
pub fn detect_encoding(bytes: &[u8]) -> Option<DetectedEncoding> {
Collaborator Author

Implements #158 (comment)
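As a rough illustration of why a richer return type helps, here is a sketch that keeps "a BOM was present" separate from "the encoding was inferred from the first bytes". The `Detected` and `Family` types and the `detect` function are hypothetical; the PR's `DetectedEncoding` may be shaped differently. The byte patterns follow the autodetection table quoted in the doc comment above, with the UCS-4 rows left out.

```rust
/// Hypothetical encoding families; the real code would carry an encoding handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Family {
    Utf8,
    Utf16Le,
    Utf16Be,
    /// Any encoding that keeps ASCII in place; the XML declaration must be read.
    AsciiCompatible,
}

/// Hypothetical result type; not the PR's `DetectedEncoding`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Detected {
    /// A byte order mark was present; `bom_len` bytes can be skipped.
    Bom { encoding: Family, bom_len: usize },
    /// No BOM; the encoding was inferred from how `<?` is laid out.
    Inferred(Family),
}

/// Looks only at the first bytes of the document (UCS-4 patterns are ignored).
pub fn detect(bytes: &[u8]) -> Option<Detected> {
    match bytes {
        [0xEF, 0xBB, 0xBF, ..] => Some(Detected::Bom { encoding: Family::Utf8, bom_len: 3 }),
        [0xFE, 0xFF, ..] => Some(Detected::Bom { encoding: Family::Utf16Be, bom_len: 2 }),
        [0xFF, 0xFE, ..] => Some(Detected::Bom { encoding: Family::Utf16Le, bom_len: 2 }),
        [0x00, 0x3C, 0x00, 0x3F, ..] => Some(Detected::Inferred(Family::Utf16Be)),
        [0x3C, 0x00, 0x3F, 0x00, ..] => Some(Detected::Inferred(Family::Utf16Le)),
        [0x3C, 0x3F, 0x78, 0x6D, ..] => Some(Detected::Inferred(Family::AsciiCompatible)),
        _ => None,
    }
}
```

With a shape like this, a caller could record BOM-detected state only for the `Bom` variant instead of for every `Some(..)` result.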

@dralley dralley force-pushed the assume-utf8 branch 2 times, most recently from c2dcb4b to a6fe9e5 Compare February 22, 2026 18:59
let event = BytesDecl::from_start(BytesStart::wrap(content, 3, self.decoder()));

// TODO: once we can assume that the parser is operating on UTF-8, then we can throw
// an error here if we see a non-UTF-8 encoding... if encoding/decoding is not enabled.
Collaborator Author

We can't assume this yet because it's still possible to create non-validating readers under `#[cfg(not(feature = "encoding"))]`.

@dralley dralley force-pushed the assume-utf8 branch 3 times, most recently from f302282 to 2a1bbb0 Compare February 22, 2026 19:46
Collaborator Author

dralley commented Feb 22, 2026

@Mingun Ignore for the time being the exact details of the Utf8ValidatingReader implementation - I already mentioned it still needs some additional work.

Architecturally, do you have any objections to the direction of this work? The general approach, how it's layered, etc.

Or the (long-term) goal of taking all decoding, encoding detection and/or validation, and BOM detection and/or stripping (and maybe EOL normalization) OUT of the parser itself and into a pre-processor, provided doing so either improves performance or has minimal cost?

It seems to me that, rather than implementing our own decoding stream, we could use encoding_rs_io or a similar library to do the conversion.

Yes, I agree, that's the path I was going down in the initial implementation. But assuming we want to keep a separate encoding feature, the non-encoding implementation can still provide the same guarantees with validation, and that then simplifies everything written on top. You could get rid of the API divergence between encoding and not(encoding), safely move to a str / String based API, etc.

edit: probably this path could also be implemented with encoding_rs, it would just require carrying that dependency always. I am flexible on the details

I also just created #948 to show you where we could change the reader from one that guesses the encoding to one that transparently decodes. It seems to me that honest decoding should be introduced at the level of a slightly higher-level XmlReader.

I still think the better approach is to push all preprocessing down underneath the parsing code and take advantage of the simplifications that makes possible. It would also give us the ability to parse UTF-16.

Yes maybe you can skip over performing some validation that way, but there is also a cost to running the validation (and maybe allocations) many times over small buffers instead of once over a large buffer. Not to mention the additional internal complexity of Decoder infecting so many different object types (which also increases the size of all of the structs), even if it was otherwise hidden from the user.

Have it return a generic enum so that it can be used with or without the
encoding feature.
In cases where the input is sufficiently short and doesn't contain
invalid sequences, Utf8ValidatingReader was unable to detect the input
as being not-UTF-8

We now call detect_encoding() during the first read() so that it can
more effectively raise the appropriate errors. Doing this (and BOM
stripping) upstream of the parser makes it possible to eliminate this
responsibility from the parser, once it can be relied upon on all code
paths.
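A minimal sketch of doing the check on the first read(), upstream of the parser. This is an illustration under assumptions, not the PR's code: `BomStripper` is a hypothetical name, it requires `BufRead`, and it assumes the first `fill_buf()` call exposes at least the first four bytes when the input has that many. It strips a UTF-8 BOM and rejects UTF-16 input, including BOM-less UTF-16 that would otherwise pass UTF-8 validation because it contains only ASCII bytes and NULs.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical pre-processor that runs before any parsing happens.
pub struct BomStripper<R> {
    inner: R,
    checked: bool,
}

impl<R: BufRead> BomStripper<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, checked: false }
    }
}

impl<R: BufRead> Read for BomStripper<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if !self.checked {
            // Assumption: the first `fill_buf()` returns enough bytes to inspect.
            let head = self.inner.fill_buf()?;
            let skip = match head {
                // UTF-8 BOM: skip it so the parser never sees it.
                [0xEF, 0xBB, 0xBF, ..] => 3,
                // UTF-16 with a BOM, or `<?` laid out as UTF-16 without one.
                [0xFE, 0xFF, ..]
                | [0xFF, 0xFE, ..]
                | [0x00, 0x3C, 0x00, 0x3F, ..]
                | [0x3C, 0x00, 0x3F, 0x00, ..] => {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "input looks like UTF-16, but only UTF-8 is supported",
                    ));
                }
                _ => 0,
            };
            self.inner.consume(skip);
            self.checked = true;
        }
        self.inner.read(out)
    }
}
```

Because the check sits in front of the parser, the parser itself would no longer need to know about BOMs once every code path goes through a wrapper like this.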
No functional changes to the tests, or additions/removals
Collaborator Author

dralley commented Feb 22, 2026

Feature-wise, this PR is now complete. It's just a matter of improving the quality of Utf8ValidatingReader and deciding whether the new constructors are how we want it to be exposed.

Global decoding, improving APIs, etc. -- all of that is for future PRs. It should be able to be bolted onto this infrastructure though.


/// A reader wrapper that ensures only valid UTF-8 bytes are read.
///
/// This reader uses [`str::from_utf8()`] and [`Utf8Error::valid_up_to()`] to validate
Collaborator Author

@dralley dralley Feb 22, 2026

I'd like to also try simdutf8

Probably this could also be done with encoding_rs. If you are open to requiring encoding_rs always, then we could ditch this implementation completely. I just figured that we probably wouldn't want to do that.
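For reference, a small sketch of how `str::from_utf8()` together with `Utf8Error::valid_up_to()` and `Utf8Error::error_len()` can classify a single chunk. The `Chunk` enum and `classify` function are hypothetical names used only for this illustration. The key distinction is that `error_len()` returns `None` when the chunk merely ends in the middle of a character, so the tail can be carried over to the next read rather than reported as an error.

```rust
use std::str;

/// Outcome of validating one chunk of input (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Chunk<'a> {
    /// The whole chunk is valid UTF-8.
    Valid(&'a str),
    /// The chunk ends mid-character; `tail` must be retried with more data.
    Incomplete { valid: &'a str, tail: &'a [u8] },
    /// `bad_len` bytes after the valid prefix can never be valid UTF-8.
    Invalid { valid: &'a str, bad_len: usize },
}

pub fn classify(chunk: &[u8]) -> Chunk<'_> {
    match str::from_utf8(chunk) {
        Ok(s) => Chunk::Valid(s),
        Err(e) => {
            let (head, rest) = chunk.split_at(e.valid_up_to());
            // The prefix was just reported as valid, so this cannot fail.
            let valid = str::from_utf8(head).expect("prefix is valid UTF-8");
            match e.error_len() {
                None => Chunk::Incomplete { valid, tail: rest },
                Some(bad_len) => Chunk::Invalid { valid, bad_len },
            }
        }
    }
}
```

A SIMD-based validator could replace the `str::from_utf8()` call here as long as it reports the same two pieces of information.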
