Tracking feature `core::str::next_code_point`

I assume there is no tracking issue for `core::str::next_code_point` because it was only meant to be used internally by `str::Chars`, but then why is it exposed at all? This function is very useful for byte-based parsing functions where one doesn't want to decode chars (`str.chars()`) except for specific parts, such as a lexer that accepts non-ASCII idents. You can work with bytes most of the time, but then you need a way to get the next UTF-8 code point, and obviously decoding the rest of the string defeats the point. I've had to copy it into my own projects twice now :slight_smile: Looking around, it seems boa's lexer has done so as well. I think it's worth exposing.

Likely due to the core/std split: alloc and std want to use something defined in core. That, or it used to be that way and it's just still exposed for legacy reasons.

If you know your input is valid UTF-8, and you always have the whole input in a contiguous slice, you can work with `&str` the whole time. For a `&str`, `next_code_point` is equivalent to `.chars().next()`; `Chars`, like every other std iterator, is lazy and doesn't do any work until asked to (by requesting output).
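A quick sketch of that laziness (the string here is made up for illustration): `.chars().next()` decodes only the first code point, and `char_indices` gives the byte offset where you can resume byte-based scanning.

```rust
fn main() {
    let s = "héllo";

    // `chars()` is lazy: this decodes only the first code point,
    // not the rest of the string.
    assert_eq!(s.chars().next(), Some('h'));

    // `char_indices` also yields the byte offset of each decoded
    // char, so you can switch back to byte-based scanning.
    let mut it = s.char_indices();
    assert_eq!(it.next(), Some((0, 'h')));
    assert_eq!(it.next(), Some((1, 'é'))); // 'é' occupies bytes 1..3
    println!("ok");
}
```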

I believe they're looking for something like `bstr::decode_utf8`, so they only pay the cost of validating UTF-8 in idents. Another approach with otherwise-ASCII data would be to scan for the next non-ident byte and validate what's in between as UTF-8 (convert only that `&[u8]` into a `&str`).
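A minimal sketch of that scan-then-validate approach (the function name `scan_ident` and the ident-byte rule are made up for illustration, not from any real lexer):

```rust
// Walk bytes while they look like ASCII ident chars or non-ASCII
// bytes, then validate only that candidate slice as UTF-8.
fn scan_ident(input: &[u8]) -> Option<&str> {
    let end = input
        .iter()
        .position(|&b| !(b.is_ascii_alphanumeric() || b == b'_' || b >= 0x80))
        .unwrap_or(input.len());
    // UTF-8 validation cost is paid only for the ident bytes,
    // not the whole input.
    std::str::from_utf8(&input[..end]).ok()
}

fn main() {
    // "café = 1" as raw bytes; only "café" is validated.
    assert_eq!(scan_ident(b"caf\xc3\xa9 = 1"), Some("café"));
    // Invalid UTF-8 inside the candidate ident is rejected.
    assert_eq!(scan_ident(b"\xff\xfe"), None);
    println!("ok");
}
```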

Note that `next_code_point` doesn't validate UTF-8; it assumes it, which is an argument against stabilizing it: it will give you misleading results for invalid input.


That is exactly what I was looking for. I guess I misunderstood what next_code_point actually does, thanks!


It's also worth noting that Go has `DecodeRuneInString` and `DecodeRune` in its standard library.

We have `decode_utf16`, so why is there no `decode_utf8`?
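For reference, the stable `char::decode_utf16` handles lone surrogates by yielding errors per unit, which you can map to the replacement character (the input pair here is just an example):

```rust
fn main() {
    // A surrogate pair encoding U+1F600 (😀).
    let units = [0xD83Du16, 0xDE00];
    let decoded: Vec<char> = char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();
    assert_eq!(decoded, ['😀']);
    println!("ok");
}
```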

EDIT: I see it was deprecated: Tracking issue: UTF-8 decoder in libcore · Issue #33906 · rust-lang/rust · GitHub