Proposal
Problem statement
There's an inconsistency in the standard library's methods for converting a byte slice into a string slice:
| Type |
Unchecked Conversion |
Checked Conversion |
&str |
str::from_utf8_unchecked |
str::from_utf8 |
&mut str |
str::from_utf8_unchecked_mut |
str::from_utf8_mut |
Box<str> |
str::from_boxed_utf8_unchecked |
none |
While workarounds are possible in both safe and unsafe code, the inconsistency can be a source of confusion, and the workarounds are clunkier than they need to be.
The pull request that added str::from_boxed_utf8_unchecked originally was going to add str::from_boxed_utf8 as well, but did not, as it would drop the allocation. With that concern in mind, instead of returning Utf8Error, str::from_boxed_utf8 could return a FromUtf8Error, as that preserves the allocation.
Motivating examples or use cases
I recently was trying to implement a function to load a struct containing a Box<str> from a file, and was able to read it as a Box<[u8]>. In order to convert the byte slice into a string slice without rolling my own UTF-8 validation, I was trying to find a checked equivalent to str::from_boxed_utf8_unchecked, and was surprised that I didn't find one.
My workaround was a bit clunky:
let label_text = if str::from_utf8(&raw_label_text).is_ok() {
// SAFETY: Already validated
unsafe { std::str::from_boxed_utf8_unchecked(raw_label_text) }
} else {
return Err(Error::NonUtf8Label(raw_label_text));
};
Solution sketch
Add fn from_boxed_utf8() -> Result<Box<str>, {ERROR}> to alloc::str, where {ERROR} is either core::str::Utf8Error, or alloc::string::FromUtf8Error. My test implementation used the former type, but that was before I saw the concern about the dropped allocation.
If core::str::validations::run_utf8_validation were re-exported by core::str as part of str_internals, it could be a two-line function returning a Utf8Error, and only slightly more complex if returning a FromUtf8Error.
The Utf8Error version:
pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, Utf8Error> {
core::str::run_utf8_validation(&v)?;
// SAFETY: validation succeeded.
Ok(unsafe { from_boxed_utf8_unchecked(v) })
}
A FromUtf8Error version that avoids discarding the allocation could look like this, with the current FromUtf8Error definition:
pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, FromUtf8Error> {
if let Err(error) = core::str::run_utf8_validation(&v) {
return Err(FromUtf8Error { bytes: Vec::from(v), error });
}
// SAFETY: validation succeeded.
Ok(unsafe { from_boxed_utf8_unchecked(v) })
}
Given that #151822 is working on making FromUtf8Error support other bytes types (see quote in Alternatives), it might make sense to wait for that work to complete, in order to avoid the need to convert the Box<[u8]> into a Vec<u8> for the error if this approach is taken.
Alternatives
The first alternative I can think of is simply leaving things as they are. There are at least 2 relatively simple ways to convert a Box<[u8]> into a Box<str> with UTF-8 validation, without API changes, though one requires unsafe:
// returns `FromUtf8Error` for invalid bytes
String::from_utf8(Vec::from(boxed)).map(String::into_boxed_str)
// returns `Utf8Error` for invalid bytes
str::from_utf8(&boxed)?;
Ok(unsafe { str::from_boxed_utf8_unchecked(boxed) })
The second alternative was mentioned by @programmerjake on Zulip:
actually, #151822 is busy making FromUtf8Error generic over the bytes type, we could easily add:
Box: TryFrom<Box[u8], Error = FromUtf8Error<Box<[u8]>>>
Edit: To be clear, I think that the inconsistency is worth addressing for the sake of a more consistent API. Implementing TryFrom would certainly make for a more ergonomic safe conversion, and the current approaches I mentioned make the conversion possible already, but it's still an unexpected (minor) source of potential confusion.
Links and related work
GitHub issue comment explaining why str::from_boxed_utf8 wasn't added originally.
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
Proposal
Problem statement
There's an inconsistency in the standard library's methods for converting a byte slice into a string slice:
&strstr::from_utf8_uncheckedstr::from_utf8&mut strstr::from_utf8_unchecked_mutstr::from_utf8_mutBox<str>str::from_boxed_utf8_uncheckedWhile workarounds are possible in both safe and unsafe code, the inconsistency can be a source of confusion, and the workarounds are clunkier than they need to be.
The pull request that added
str::from_boxed_utf8_uncheckedoriginally was going to addstr::from_boxed_utf8as well, but did not, as it would drop the allocation. With that concern in mind, instead of returningUtf8Error,str::from_boxed_utf8could return aFromUtf8Error, as that preserves the allocation.Motivating examples or use cases
I recently was trying to implement a function to load a struct containing a
Box<str>from a file, and was able to read it as aBox<[u8]>. In order to convert the byte slice into a string slice without rolling my own UTF-8 validation, I was trying to find a checked equivalent tostr::from_boxed_utf8_unchecked, and was surprised that I didn't find one.My workaround was a bit clunky:
Solution sketch
Add
fn from_boxed_utf8() -> Result<Box<str>, {ERROR}>toalloc::str, where{ERROR}is eithercore::str::Utf8Error, oralloc::string::FromUtf8Error. My test implementation used the former type, but that was before I saw the concern about the dropped allocation.If
core::str::validations::run_utf8_validationwere re-exported bycore::stras part ofstr_internals, it could be a two-line function returning aUtf8Error, and only slightly more complex if returning aFromUtf8Error.The
Utf8Errorversion:A
FromUtf8Errorversion that avoids discarding the allocation could look like this, with the currentFromUtf8Errordefinition:Given that #151822 is working on making
FromUtf8Errorsupport otherbytestypes (see quote in Alternatives), it might make sense to wait for that work to complete, in order to avoid the need to convert theBox<[u8]>into aVec<u8>for the error if this approach is taken.Alternatives
The first alternative I can think of is simply leaving things as they are. There are at least 2 relatively simple ways to convert a
Box<[u8]>into aBox<str>with UTF-8 validation, without API changes, though one requiresunsafe:The second alternative was mentioned by @programmerjake on Zulip:
Edit: To be clear, I think that the inconsistency is worth addressing for the sake of a more consistent API. Implementing
TryFromwould certainly make for a more ergonomic safe conversion, and the current approaches I mentioned make the conversion possible already, but it's still an unexpected (minor) source of potential confusion.Links and related work
GitHub issue comment explaining why
str::from_boxed_utf8wasn't added originally.What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: