Iterating over Range<char>?

Today I found myself, for advent-related reasons, wanting to iterate over 'a'..='z'. Turns out this isn’t possible since char doesn’t implement Step, and looking at it, a lot of Step doesn’t make too much sense for char:

  • replace_zero, replace_one (both of which appear unused in the rustc codebase though)
  • add_one and sub_one: they don’t seem to allow for failure and just overflow? (also, how to handle the gap with surrogates?)

It also seems that Step isn’t going to last: should I wait with this until that time?


In some cases you solve the problem with:

(b'a' ..= b'z').map(char::from)
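For reference, a minimal sketch of that workaround as a complete function. The name `ascii_lowercase` is made up for illustration; it only relies on `u8` ranges being iterable and on `char::from(u8)`.

```rust
/// Collects the lowercase ASCII letters by iterating over a byte range
/// and converting each byte to a `char`.
/// (`ascii_lowercase` is a hypothetical helper name, not a std API.)
fn ascii_lowercase() -> Vec<char> {
    (b'a'..=b'z').map(char::from).collect()
}
```

This only works for ranges that fit in a `u8`, so it covers ASCII and Latin-1 but nothing beyond U+00FF.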


That’s what I did, essentially. It feels like I should be able to do this with chars directly, though.

I made unic-char-range a while back for this (in the context of the unic group of libraries), which iirc offers both a custom Iterator and a from_u32 map, for whichever of the two optimizes better in a given case.

I'd be happy to see Range<char> "just work", though. (2023 update, since I got a "popular link" notification for this post: Range<char> does just work now! (Hopefully I didn't necrobump the thread...) And I did the impl work for that to be the case :blush:)
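As a small sketch of what "just works" looks like today, assuming a reasonably recent stable toolchain (the function name `lowercase_letters` is invented for the example):

```rust
/// Iterates directly over an inclusive range of `char`s and collects
/// the result into a `String`.
/// (`lowercase_letters` is a hypothetical helper name.)
fn lowercase_letters() -> String {
    ('a'..='z').collect()
}
```

Calling `lowercase_letters()` gives the 26 lowercase ASCII letters in order, with no detour through bytes.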

(I'd also be willing to make a standalone char-range and make the unic one just re-export it, since unic is on an indefinite hiatus)


I think iterating over a range of chars should be possible, and chars should just skip the gap if a range would iterate past it. chars have a defined order and a defined set of valid values, so iterating through a set of consecutive values should be possible (and it shouldn’t fail).
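A short sketch of the "skip the gap" behaviour described here, written against the char ranges that current stable Rust supports (not the situation at the time of this post). The helper name `around_surrogate_gap` is made up for illustration.

```rust
/// Iterates over a `char` range whose endpoints sit on either side of the
/// surrogate block (U+D800..=U+DFFF) and returns the code points visited.
/// (`around_surrogate_gap` is a hypothetical helper name.)
fn around_surrogate_gap() -> Vec<u32> {
    ('\u{D7FF}'..='\u{E000}').map(u32::from).collect()
}
```

The range yields only the two endpoints: the 2048 surrogate code points in between are not valid `char` values, so the iterator steps over them rather than failing.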

replace_zero and replace_one seem like very odd methods; it doesn’t seem correct to assume that anything that can be “stepped” has a conceptual one and zero.


Rust char permits any Unicode scalar value, “0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.” But the vast majority of these values are unassigned, which makes ranges less useful IMO.
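The scalar-value definition quoted above can be checked with `char::from_u32`, which returns `None` for anything outside those two ranges. A minimal sketch, with `is_scalar_value` as an invented helper name:

```rust
/// Returns true if `code_point` is a Unicode scalar value, i.e. it can be
/// represented as a Rust `char`.
/// (`is_scalar_value` is a hypothetical helper name.)
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}
```

Note that this says nothing about whether the value is assigned to a character, only that it is a valid `char`.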


The main use unic gets out of character ranges is doing a “for all codepoints” test, which is useful for text processing libraries to catch low-hanging fruit errors.
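A rough sketch of what such a "for all codepoints" test can look like, assuming a toolchain where char ranges are iterable. This is not taken from unic; the property checked (a UTF-8 encode/decode round trip) and the function name are chosen purely for illustration.

```rust
/// Checks a simple property for every `char`: encoding it as UTF-8 and
/// decoding the result yields the same `char`, and the encoded length
/// matches `len_utf8`.
/// (`utf8_round_trips_for_all_chars` is a hypothetical test helper.)
fn utf8_round_trips_for_all_chars() -> bool {
    let mut buf = [0u8; 4];
    ('\0'..=char::MAX).all(|c| {
        // Encode into the scratch buffer, then read the string back.
        let encoded: &str = c.encode_utf8(&mut buf);
        encoded.len() == c.len_utf8() && encoded.chars().eq(std::iter::once(c))
    })
}
```

The same loop shape works for any per-character invariant a text processing library wants to hold across the whole scalar value space.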

For the general crowd, though, I agree; what is actually useful (if anything) is iteration over the set of codepoints with a specific General_Category or other binary property. (That is, the set of codepoints matched by a \p{something} regex.)


regex also uses it: https://github.com/rust-lang/regex/blob/d4b9419ed41907d5e8b43166ce7aef77e6fb93d9/regex-syntax/src/hir/mod.rs#L866

I have wanted it more than once in tests.

In general though, I don’t need this particular type of iteration very often. I did similarly want it in today’s AoC puzzle, and I did actually write 'A'..='Z' first!


@birkenfeld: Today I found myself, for advent-related reasons, wanting to iterate over 'a'..='z' . Turns out this isn’t possible since char doesn’t implement Step

@CAD97: I’d be happy to see Range<char> “just work”, though.

@withoutboats: I think iterating over a range of chars should be possible, and chars should just skip the gap if a range would iterate past it. chars have a defined order and a defined set of valid values, so iterating through a set of consecutive values should be possible (and it shouldn’t fail).

Iterating over chars in terms of letters is completely different from iterating over chars in terms of Unicode code-points. Say you want to iterate over 'a'..='z': how many chars will that be?

The naïve response is 26, but that is only if you iterate over the English alphabet. If the program is written on an English computer but run on a non-English computer, the result would be different. For example, the Icelandic alphabet doesn't even contain z anymore, so it would imply undefined behaviour. And even if the alphabet did contain z (which it once did), the Icelandic alphabet contains 32 letters: 'a'..='ö'

The letter ö, by the way, has code-point U+00F6, quite far from z. Not to mention the fact that the letters between a and d are á (code-point U+00E1) and b, but no c. Another interesting thing is that prior to 2006, w was not a letter in itself in the Swedish alphabet. Instead it was sorted together with v.
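To make the gap between the two views concrete, here is a small sketch that counts what a code-point range from a to ö actually contains, assuming a toolchain where char ranges are iterable. The function name is invented for the example.

```rust
/// Counts the `char`s in the code-point range from 'a' (U+0061)
/// to 'ö' (U+00F6), inclusive.
/// (`code_points_from_a_to_o_umlaut` is a hypothetical helper name.)
fn code_points_from_a_to_o_umlaut() -> usize {
    ('a'..='\u{F6}').count()
}
```

The count comes from code-point arithmetic alone (0xF6 - 0x61 + 1), so the range also sweeps in punctuation such as '{', control characters, and letters that are not part of the Icelandic alphabet, rather than its 32 letters.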

All those examples are trivial - it gets even worse when we leave the Unicode Latin-1 block and get code-points above U+00FF ...

So, for a properly working char iterator - treating chars as letters, rather than code-points - it must take the locale into account, and also handle the issues when the user's computer doesn't even contain the letters representing the start and end of the set.

So, all in all, a char iterator is really more tied to the concept of locales, rather than unicode.


A char is defined as a Unicode scalar value and the set of all Unicode scalar values defines an ordering without regard to locale. That ordering is, in and of itself, useful. Locale specific orderings are also useful in their own context. Rust’s standard library does not do any locale specific tailoring that is described in Unicode, and instead leaves that to ecosystem crates. I would think the same principle should apply here.

