Tracking feature `core::str::next_code_point`

I assume there is no tracking issue for `core::str::next_code_point` because it was only meant to be used internally by `str::Chars`, but then why is it exposed at all? This function is very useful for byte-based parsing functions where one doesn't want to decode chars (`str.chars()`) except for specific parts, such as a lexer that accepts non-ASCII idents. You can work with bytes most of the time, but then you need a way to get the next UTF-8 code point, and obviously decoding the rest of the string defeats the point. I've had to copy it into my own projects twice now :slight_smile: Looking around, it seems boa's lexer has done so as well. I think it's worth exposing.

Likely due to the core/std split: alloc and std want to use something defined in core. That, or it used to be that way and it's just still exposed for legacy reasons.

If you know your input is valid UTF-8, and you always have the whole input in a contiguous slice, you can work with `&str` the whole time. For a `&str`, `next_code_point` is equivalent to `.chars().next()`; `Chars`, like every other std iterator, is lazy and doesn't do any work until asked to (by requesting output).
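A quick sketch of that laziness (the string here is made up for illustration): `.chars().next()` decodes only the first code point, and `char_indices` gives the byte offset where you can resume byte-based scanning.

```rust
fn main() {
    let s = "héllo";

    // `chars()` is lazy: this decodes only the first code point,
    // not the rest of the string.
    assert_eq!(s.chars().next(), Some('h'));

    // `char_indices` also yields the byte offset of each decoded
    // char, so you can switch back to byte-based scanning.
    let mut it = s.char_indices();
    assert_eq!(it.next(), Some((0, 'h')));
    assert_eq!(it.next(), Some((1, 'é'))); // 'é' occupies bytes 1..3
    println!("ok");
}
```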

I believe they're looking for something like `bstr::decode_utf8`, so they only pay the cost of validating UTF-8 in idents. Another approach with otherwise-ASCII data would be to scan for the next non-ident byte and validate what's in between as UTF-8 (convert only that `&[u8]` into a `&str`).
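A minimal sketch of that scan-then-validate approach (the function name `scan_ident` and the ident-byte rule are made up for illustration, not from any real lexer):

```rust
// Walk bytes while they look like ASCII ident chars or non-ASCII
// bytes, then validate only that candidate slice as UTF-8.
fn scan_ident(input: &[u8]) -> Option<&str> {
    let end = input
        .iter()
        .position(|&b| !(b.is_ascii_alphanumeric() || b == b'_' || b >= 0x80))
        .unwrap_or(input.len());
    // UTF-8 validation cost is paid only for the ident bytes,
    // not the whole input.
    std::str::from_utf8(&input[..end]).ok()
}

fn main() {
    // "café = 1" as raw bytes; only "café" is validated.
    assert_eq!(scan_ident(b"caf\xc3\xa9 = 1"), Some("café"));
    // Invalid UTF-8 inside the candidate ident is rejected.
    assert_eq!(scan_ident(b"\xff\xfe"), None);
    println!("ok");
}
```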

Note that `next_code_point` doesn't validate UTF-8; it assumes it, which is an argument against stabilizing it: it will give you misleading results for invalid input.


That is exactly what I was looking for. I guess I misunderstood what next_code_point actually does, thanks!


It's also worth noting that Go has `DecodeRuneInString` and `DecodeRune` in its standard library.

We have `decode_utf16`, so why is there no `decode_utf8`?
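For reference, the stable `char::decode_utf16` handles lone surrogates by yielding errors per unit, which you can map to the replacement character (the input pair here is just an example):

```rust
fn main() {
    // A surrogate pair encoding U+1F600 (😀).
    let units = [0xD83Du16, 0xDE00];
    let decoded: Vec<char> = char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();
    assert_eq!(decoded, ['😀']);
    println!("ok");
}
```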

EDIT: I see it was deprecated: Tracking issue: UTF-8 decoder in libcore · Issue #33906 · rust-lang/rust · GitHub