Pre-RFC: Add len_utf8_at method to str

hnakamur · September 19, 2020, 1:46pm

I'd like to slice str by char boundaries. I read Slice a string containing Unicode chars - Stack Overflow, but I am not satisfied with the solution.

I know str has char_indices method, but I don't want to pay extra cost for decoding characters. So how about adding len_utf8_at method to str?

/// Returns the number of bytes for the encoded UTF-8 character
/// at the index.
///
/// That number of bytes is always between 1 and 4, inclusive.
///
/// Panics if `index` is not on a UTF-8 code point boundary, or if it is
/// past the end of the last code point of the string slice.
///
/// # Examples
///
/// Basic usage:
///
/// ```
/// let s = "\u{0024}\u{00A2}\u{20AC}\u{10348}";
/// assert_eq!(s.len_utf8_at(0), 1);
/// assert_eq!(s.len_utf8_at(1), 2);
/// assert_eq!(s.len_utf8_at(3), 3);
/// assert_eq!(s.len_utf8_at(6), 4);
/// ```
#[unstable(feature = "len_utf8_at", issue="none")]
#[inline]
pub fn len_utf8_at(&self, index: usize) -> usize {
    let b = self.as_bytes()[index];
    if b & 0b1000_0000 == 0 {
        1
    } else if b & 0b1110_0000u8 == 0b1100_0000u8 {
        2
    } else if b & 0b1111_0000u8 == 0b1110_0000u8 {
        3
    } else if b & 0b1111_1000u8 == 0b1111_0000u8 {
        4
    } else {
        panic!("index not at char boundary: {}", index);
    }
}

Eh2406 · September 19, 2020, 3:03pm

Can you show a use case that is clearer with this api? Can you describe how this is better than with str[index..].chars().next().unwrap().len_utf8()?

Tom-Phinney · September 19, 2020, 3:14pm

One advantage is that it's a lot easier to remember and write, just as a type alias is much easier to write than the long sequence that is aliased. I feel that the proposed len_utf8_at makes the language more approachable to those who are relatively new to Rust and UTF-8, giving them an easy way to slice a String at character boundaries, since direct indexing of characters as in C/++ is not defined on Strings.

steffahn · September 19, 2020, 5:10pm

One thing to note is that, since str is transparent about having a byte buffer internally, this kind of method can also be library provided through an extension trait, like this:

pub trait StrLenUtf8Ext {
    fn len_utf8_at(&self, n: usize) -> usize;
}
impl StrLenUtf8Ext for str {
    fn len_utf8_at(&self, n: usize) -> usize {
        const MAX_ONE_BYTES:   u8 = 0b10000000;
        const MAX_NONLEADING:  u8 = 0b11000000;
        const MAX_TWO_BYTES:   u8 = 0b11100000;
        const MAX_THREE_BYTES: u8 = 0b11110000;

        let b = self.as_bytes()[n];
        if b < MAX_ONE_BYTES {
            1
        } else if b < MAX_NONLEADING {
            panic!("not at byte boundary") // FIXME: make me more fancy
        } else if b < MAX_TWO_BYTES {
            2
        } else if b < MAX_THREE_BYTES {
            3
        } else {
            4
        }
    }
}

@Eh2406 One advantage is that the char does not need to be decoded. I guess the compiler would be free to optimize the other call to the same thing, too, and I haven’t benchmarked anything yet or tried to understand the assembly, so there is the potential for this kind of direct implementation to be faster.

Another thing: I’m not sure if panicking on non-char-boundaries is the best API. Perhaps returning an Option<u8> is cleaner. Kind-of being a generalization of is_char_boundary then (but handling the out-of-range case differently).

Back to performance: @hnakamur Looking at the source code of char_indices makes me believe there’s a very good chance of the char decoding being skipped entirely if the code gets inlined, as the values for the indices do not depend on the decoded characters at all. I haven’t tested this hypothesis, but for the purpose of finding the 𝑛th codepoint (i.e. char) in a String or slicing by those kinds of indices, you may get optimal or near-optimal code already using char_indices.

Finally, this thread is missing the obligatory “codepoint indices are rarely useful in practice” kind of remark. The SO question kind-of had an implicit “Python can do it, so how does it work in Rust” argument for why it would be useful to do something like this, and Python seems to have taken the “it’s slightly less confusing to beginners to offer indexing by code-points, so let’s do use a complicated (but abstracted so it’s hidden), and potentially high-overhead (in case you have any 32bit chars in your string) string representation in our programming language rather than UTF8” approach with the effect that codepoint-indices are the only way to index strings in Python and they’re constant time, so you’d also use them in all the cases where byte-indices are used in Rust.

H2CO3 · September 19, 2020, 5:48pm

Is there a benchmark you can show us that demonstrates that this is indeed a problem? I'd expect the optimizer to be smart enough to completely eliminate the character decoding if you throw the result away.

Furthermore, char_indices() has the advantage that it never panics because you can't abuse it by starting at a non-UTF-8-boundary index.

hnakamur · September 20, 2020, 4:07am

Thanks for your comment! I confirmed you are correct about the char decoding skipped entirely.

So I withdraw this pre-proposal. Thank you all for your comments!

And I learned how to use cargo-asm to view the assembler output with rust code interleaved. Nice!

Here is what I confirmed. I wrote the following two functions.

pub fn calc_with_char_indices_print(s: &str) -> usize {
    s.char_indices().fold(0, |acc, (len, x)| {
        println!("x={}", x);
        acc + len
    })
}

pub fn calc_with_char_indices(s: &str) -> usize {
    s.char_indices().fold(0, |acc, (len, _)| acc + len)
}

Then I followed https://stackoverflow.com/a/54287770/1391518 to see assembler output.

cargo install cargo-asm
cargo build --release
cargo asm --no-color --rust len_utf8_at_benchmark::calc_with_char_indices_print > calc_with_char_indices_print.asm
cargo asm --no-color --rust len_utf8_at_benchmark::calc_with_char_indices > calc_with_char_indices.asm

In the standard library,

CharIndices.next() calls Chars.next() at library/core/src/str/mod.rs#L693
And Chars.next() calls next_code_point() at library/core/src/str/mod.rs#L589
next_code_point() is defined at library/core/src/str/mod.rs#L516-L548

/// Reads the next code point out of a byte iterator (assuming a
/// UTF-8-like encoding).
#[unstable(feature = "str_internals", issue = "none")]
#[inline]
pub fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    let y = unwrap_or_0(bytes.next());
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        let z = unwrap_or_0(bytes.next());
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            let w = unwrap_or_0(bytes.next());
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}

utf8_acc_cont_byte() is defined at library/core/src/str/mod.rs#L495-L499

/// Returns the value of `ch` updated with continuation byte `byte`.
#[inline]
fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | (byte & CONT_MASK) as u32
}

The part after if x >= 0xF0 in calc_with_char_indices_print() has assembler for ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);

     if x >= 0xF0 { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:539)
     cmp     dl, -16
     if x >= 0xF0 { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:539)
     jae     .LBB1_15
.LBB1_14:
     shl     esi, 12
.LBB1_9:
     or      ecx, esi
     mov     edx, ecx
     jmp     .LBB1_20
.LBB1_16:
     xor     edx, edx
.LBB1_18:
     ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w); (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:543)
     and     esi, 7
     shl     esi, 18
     (ch << 6) | (byte & CONT_MASK) as u32 (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:498)
     shl     ecx, 6
     or      ecx, esi
     ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w); (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:543)
     or      ecx, edx
     mov     edx, ecx
     None => None, (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:694)
     cmp     ecx, 1114112
     jne     .LBB1_20
     jmp     .LBB1_21
.LBB1_1:
     xor     r12d, r12d
.LBB1_21:
 }

On the other hand, the part after if x >= 0xF0 in calc_with_char_indices() does NOT have assembler for ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);

     if x >= 0xF0 { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:539)
     cmp     r9b, -16
     if x >= 0xF0 { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/mod.rs:539)
     jb      .LBB0_16
.LBB0_12:
     if is_empty!(self) { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/mod.rs:3750)
     cmp     rdi, rsi
     if is_empty!(self) { (/home/hnakamur/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/mod.rs:3750)
     jne     .LBB0_14
     xor     ebx, ebx
     jmp     .LBB0_15
.LBB0_17:
 }

steffahn · September 20, 2020, 5:03am

I see. Indeed, looking at exactly that code for next_code_point that you quoted was what made me believe that the decoding part should be optimized away. It’s pretty late for me at the moment, so I’m not going to be reading through that assembly in detail right now—but it’s good to hear you could confirm something being optimized away!

I’d still be curious to see a benchmark of whatever codepoint-index slicing scenarios you had in mind, implemented using char_indices vs. implemented with a len_utf8_at. And perhaps also compared to a completely manual implementation of a slicing function directly looping over an iterator on the underlying u8 slice.

However, I don’t feel like coding this up myself as I don’t see too much value in this kind of slicing function in the first place (and also because I’m lazy). I just wanted to point out that a benchmark would probably give the most accurate evaluation of how good what kind of code performs.

hnakamur · September 21, 2020, 8:59am

Thanks for your comment.

I ran benchmarks using criterion - crates.io: Rust Package Registry for my usecase.

My usecase is LimitedWriter which writes input to the inner writer up to the specified limit of number of bytes.

I implemented two version, LimitedWriter using len_utf8_at, LimitedWriter2 using char_indices.

Here is the report built by benchmarks.

try #1 LimitedWriter Summary - Criterion.rs
try #2 LimitedWriter Summary - Criterion.rs

By looking at these results, I think two implementations are about the same speed. To tell the truth, this is my first time to run benchmarks using criterion crate and I don't know my benchmark code is appropriate. I feel distributions in Violin Plot are so different between try #1 and try #2. Maybe this is too micro benchmark and I'd better increase amount of work in benchmark target code, right?

system · December 20, 2020, 8:59am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
`str` method for slicing code-point (i.e. `char`) ranges libs	23	2953	March 25, 2019
What if strings were Code Point aware? language design	19	1964	March 30, 2023
Wild idea: deprecating APIs that conflate str and [u8] libs	59	3614	November 12, 2020
Pre-RFC - allow usage of str::utf8_char_width libs	2	656	March 8, 2022
Pre-RFC: String from ASCII (not allowing UTF-8) libs	16	2368	August 8, 2021

Pre-RFC: Add len_utf8_at method to str

Related topics