Mini RFC: Make Range<char> work

After the Step trait was reworked, it's now possible (and fairly practical) to implement Step for char, allowing RangeInclusive<char> (and the other range types) to just work. I've already implemented character ranges twice before, because I've worked with workloads that benefit from having nice range types.

Does this sound like something we should add to the standard library? It would be insta-stable. This would be the first type usable in Range with a noncontiguous domain, since char skips the surrogate code points U+D800 through U+DFFF.

const SURROGATE_RANGE: ops::Range<u32> = 0xD800..0xE000;

unsafe impl Step for char {
    fn steps_between(&l: &Self, &r: &Self) -> Option<usize> {
        let l = l as u32;
        let r = r as u32;
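        // Only l <= r has a step count; this also keeps `r - l` from underflowing.
        if l <= r {
            let diff = r - l;
            // A span that crosses the surrogate gap must not count the surrogates.
            if l < SURROGATE_RANGE.start && SURROGATE_RANGE.end <= r {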
                Some(diff as usize - SURROGATE_RANGE.len())
            } else {
                Some(diff as usize)
            }
        } else {
            None
        }
    }

    fn forward_checked(c: Self, n: usize) -> Option<Self> {
        let c = c as u32;
        let mut r = Step::forward_checked(c, n)?;
        if c < SURROGATE_RANGE.start && SURROGATE_RANGE.start <= r {
            r = Step::forward_checked(r, SURROGATE_RANGE.len())?;
        }
        if r <= char::MAX as u32 {
            // SAFETY: `r` is at most char::MAX and was moved past the surrogate gap above.
            Some(unsafe { char::from_u32_unchecked(r) })
        } else {
            None
        }
    }

    fn backward_checked(c: Self, n: usize) -> Option<Self> {
        let c = c as u32;
        let mut r = Step::backward_checked(c, n)?;
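        // Starting at or above U+E000 and landing below it means the gap was crossed.
        if c >= SURROGATE_RANGE.end && SURROGATE_RANGE.end > r {
            r = Step::backward_checked(r, SURROGATE_RANGE.len())?;
        }
        // SAFETY: `r` is at most `c` and has been moved out of the surrogate gap.
        Some(unsafe { char::from_u32_unchecked(r) })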
    }
}
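
To make the intended behaviour concrete, here is a minimal sketch of what iterating across the surrogate gap should look like. It assumes a toolchain where char ranges are iterable (in released Rust that is 1.45 and later); the helper names chars_in and scalar_value_count are hypothetical and only exist for this illustration.

```rust
// Sketch only: relies on the standard library's `Step for char`
// (assumed available, i.e. Rust 1.45 or later). Helper names are hypothetical.

/// Collects every `char` in the inclusive range `lo..=hi`.
fn chars_in(lo: char, hi: char) -> Vec<char> {
    (lo..=hi).collect()
}

/// Counts every `char` by iterating the full range from U+0000 to char::MAX.
fn scalar_value_count() -> usize {
    ('\u{0}'..=char::MAX).count()
}
```

Under those assumptions, chars_in('\u{D7FE}', '\u{E001}') should step straight from U+D7FF to U+E000 and contain four chars, and scalar_value_count() should be 0x110000 - 0x800, since the 0x800 surrogates are never produced.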

godbolt (EDIT: actually linked the godbolt example of this code)

It seems like a good idea to me. I know I've written code that ranges over codepoints, although not that frequently and it's probably pretty niche.

I suspect the most compelling argument against this would be that it can be misused to do something that ends up being logically incorrect. But I can't quite come up with an example off the top of my head.

I went ahead and filed this as PR#72413. Any concerns should probably be raised there.

Note that char ranges are already supported in patterns, so allowing Range<char> as well makes sense, even if it is rarely used. I don't see any downsides.

I'd definitely like to see this.

I guess the one weird thing that comes to mind is that .. ranges are generally preferred, but you can't do things like '\u{0}'..'\u{D800}' because the latter's not a USV (Unicode scalar value). That said, it's a compilation error and can totally be worked around, so there's not really a reason to prevent ranges from existing.

This is a really good point. If I can matches!(x, 'a'..='z'), then I might as well be able to ('a'..='z').contains(&x).
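
As a rough illustration of that equivalence, the two spellings can be put side by side. The function names are hypothetical. Note that contains only needs PartialOrd on the bound type, so this particular sketch should compile whether or not char implements Step.

```rust
// Sketch only: both helpers are hypothetical names for this illustration.

/// Tests membership with a range pattern.
fn is_lower_by_pattern(x: char) -> bool {
    matches!(x, 'a'..='z')
}

/// Tests membership with a range value; `contains` takes its argument by reference.
fn is_lower_by_range(x: char) -> bool {
    ('a'..='z').contains(&x)
}
```

Both functions should agree for every char, including the boundary cases 'a' and 'z' and their immediate neighbours '`' and '{'.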

I'm not sure what workaround you had in mind, but '\u{0}'..'\u{E000}' should work.

I think what @scottmcm meant was '\u{0}'..'\u{D800}'. Note that \u{D7FF} is a valid char, but \u{D800} is not. Possible workarounds are:

'\u{0}'..='\u{D7FF}'  // inclusive range
'\u{0}'..'\u{E000}'   // next valid `char`

EDIT: I misread your comment, you're correct.

I think we're on the same page -- U+E000 is the proper exclusive terminator given the surrogate gap.
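
For reference, a small sketch of the two workarounds discussed above, assuming char ranges are iterable (Rust 1.45 or later in released toolchains). The helper names are hypothetical.

```rust
// Sketch only: assumes `Step for char` is available. Helper names are hypothetical.

/// Everything below the surrogates, written with the next valid `char` as the
/// exclusive end.
fn below_surrogates_exclusive() -> Vec<char> {
    ('\u{0}'..'\u{E000}').collect()
}

/// The same set, written as an inclusive range ending at U+D7FF.
fn below_surrogates_inclusive() -> Vec<char> {
    ('\u{0}'..='\u{D7FF}').collect()
}

/// Reports whether a code point is a Unicode scalar value, i.e. a valid `char`.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}
```

Under those assumptions both ranges should produce the same 0xD800 chars, while is_scalar_value(0xD800) is false, which is why '\u{D800}' cannot be written as a literal end point.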