Representation of generalized ranges

Rust currently has stable support for inclusive-exclusive ranges a .. b (usually called "ranges") and unstable support for inclusive-inclusive ranges a ... b (usually called "inclusive ranges"), added by RFC 1192 and tracked at #28237.

The main complaint about inclusive ranges has been that the syntax is confusing (unfortunately it's already used stably in pattern syntax, but we could deprecate that). I want to write an RFC to change the syntax to a p..q b, where p and q are < or = to specify exclusive/inclusive (resp.) behavior at that endpoint (for compatibility with the current syntax, p defaults to = and q defaults to <). This was originally proposed by @glaebhoerl.

Anyway, this thread isn't about syntax -- that was just the motivation. The question is, supposing it's a good idea to fill out the range types so that inclusive and exclusive bounds are supported on both ends, how should this be represented in the libraries? Currently we have

pub struct Range<Idx> {
    pub start: Idx,
    pub end: Idx,
}

pub struct RangeFrom<Idx> {
    pub start: Idx,
}

pub struct RangeTo<Idx> {
    pub end: Idx,
}

pub struct RangeFull;

which are stable, and

pub enum RangeInclusive<Idx> {
    Empty {
        at: Idx,
    },
    NonEmpty {
        start: Idx,
        end: Idx,
    },
}

pub struct RangeToInclusive<Idx> {
    pub end: Idx,
}

which are unstable.

For backwards compatibility, facts like ".. syntax creates a std::ops::Range" must be preserved:

let r: Range = a .. b;

Below is a list of the ideas I've had so far.

  1. The most straightforward approach is to create a combinatorial explosion of structs: one for each combination of left-inclusivity, right-inclusivity, and endedness.

  2. Going the other way, we could collapse everything into a single type. Quoting from correspondence with @aturon:

The main reason we had separate structs in the first place was to allow static distinctions about which kind of range you support (rather than, e.g. having to panic on an unsupported style). In hindsight, I think this wasn't worth it. So perhaps another option is to introduce an all-encompassing range type (which, sadly, can't be called Range...), and have the new forms generate this type directly, with the existing structs implementing Into<GenericRange> or whatever. Then most APIs would take R: Into<GenericRange>.

  1. We could introduce a new type, enum Bound<T> { Inclusive(T), Exclusive(T) } and use Range<Bound<usize>> instead of Range<usize>, etc. The downsides to this approach are that Range<Bound<usize>> is larger than Range<usize> due to the discriminants, and that it poses a (possibly solvable) problem with the backwards compatibility requirement mentioned above.

Thoughts?

2 Likes

I think #2 is probably the best compromise. I’d expect Bound to most likely still exist for interface reasons (i.e. GenericRange::new(Bound::Inclusive(0), Bound::Exclusive(10))), not necessarily for the internal representation (fusing the enums would get us down to one tag).

To clarify, it really doesn’t make sense for a function to care about what kind of range it’s being called with… all it likely cares about is knowing what the first/last values are, or iterating over said range. Maybe a few range/set operations, but those shouldn’t really depend on specifically what kind of range it has.

(Well, I mean, unless it does matter, but that’s hopefully unusual enough that having a specialised type would make sense anyway…)

I’m usually all for being maximally specific with types… but with the current type system, that feels like it’d be a huge pain to work with, and it’s probably not worth it.

It also gives us something of an out: I’d like to believe that .. covers ~80% of ranges people will work with, ... covers ~15%, and GenericRange gives us the last 5%, without having to come up with actual new syntax.

Of course, we could come up with syntax for GenericRange and promote that over the “older” forms. I think what would look nice would be *strangles self to prevent bikeshedding*

1 Like

I think a nice type representation of generalized range is important, especially because I am against a generalized range syntax. All syntax proposed were hurting readability too much. It don’t worth introducing an alien syntax, only to cover edge cases. That’s why there should be a nice types to use for the cases not covered by the current syntax.

The solution #2 seems the way to go but I am not sure it is doable with a syntax easy enough to replace the .. syntax. We could get it nicer using a tuple struct instead of regular struct. And if we get constant type parameters we could have something like : struct GenRange<T,const inclusive_start=true, const inclusive_end=false>(Idx,Idx)

So we can do :

for i in GenRange::<i32,false,false>(10,20) {
    ...
}

Boolean generic parameters can be emulated pretty easily with void types, e.g. enum Open {} enum Closed {}, with the benefit of names instead of boolean soup.

Also note that we can add generalized syntax without removing or changing the meaning of current syntax (obviously, we can’t consider a proposal that removes stable syntax).

From these two replies I’m getting the sense that “generalized range representation” and “generalized range syntax” should be separate RFCs.

2 Likes

Has it been considered whether some kind of bound makes sense? I guess back-compat blocks a hard requirement, but it feels weird to me that let x = f64::NAN..f64::NAN; is totally fine. What does that range even mean? (The docs just refer to contains, which that value can't instantiate.) Part of my brain wants it to be a "valid range" in an Elements of Programming iterator range sense, so it's not PartialOrd. Maybe it's some kind of often-only-checked-in-debug SameSubdomain? A place to hook a "yes, they're both linked list iterators, but they're into different lists" assert would be nice.

FWIW, here's Dijkstra 831 on range conventions recommends only using what we have as ..:

The programming language Mesa, developed at Xerox PARC, has special notations for intervals of integers in all four conventions. Extensive experience with Mesa has shown that the use of the other three conventions has been a constant source of clumsiness and mistakes.

Using that to say "no inclusive ranges" is probably overstating it, especially given the parallels to patterns. But I'm definitely not convinced that 0<..=10 is something people would realistically use. (Though maybe having it internally as a better way to reverse would be useful? It's a nice description of how you'd produce 10, 9, ..., 1.)

One API I'd like to see, at least as a thought experiment, is split. I think for our current Range, it's fairly obvious, and makes sense in a context like using a Range for binary search bounds: (a..b).split(k) => (a..k, k..b) (Maybe with some kind of check that k ∈ [a, b]?)

I think that does fit best for most things (merge sort, binary search, ...) since it gives full coverage with no overlap. I assume that's the root of the EWD statement. (Of course this is also true for half-closed ranges, but we're using [0,n) as our indexing strategy. If this was Lua, Fortran, ALGOL 68, etc, then half-closed (0,n] could arguably be the "best" choice.)

The others are a bit odd from a divide-and-conquer perspective. Fully-open (a, b) => (a, k) (k, c) means that k isn't in either range. The only algorithm I can come up with that kind of step is single-pivot binary-comparison partition, where one of the elements is in the right place and you don't need to look at it again. (Of course the good quicksorts all use a more complex partition...)

Then fully-closed [a, b] => [a,k] [k, c] means that k is in both sides. I struggle to think of an algorithm that wants to look at the item again on both sides. The best I could come up with is splitting curves into smaller curves, but I think that's actually a property of curves usually being given ends of a dense number type. If I make a complete butchery of maths, [0i32, 1i32) and [0f64, 1f64) both have "size" 1, but [0i32, 0i32] and [0f64, 0f64] have "sizes" 1 and 0 respectively. (Relatedly, the range [f64::consts::PI, f64::consts::PI] doesn't actually contain π, but does contain something not-π. Either PI..succ(PI) or pred(PI)..PI--I'm not sure which--does at least contain π. Or course, making that kind of range useful leads to interval arithmetic, so I'm going way off course. Unless that's the only rigorous way to decide how to handle a new Range?)

As a bit of a probably-not-actionable aside, I'm not really a fan of many of the "there's no value outside the range" examples I've seen. MON...FRI seems reasonable initially, but then you try FRI...MON for the long weekend, and it silently does what you obviously didn't want. (See trait bounds above.) Anything using 'a'...'z' also feels like a unicode-support antipattern. On the other hand, I can totally understand wanting to be able to iterate the full range of u8...

(Sorry that this thread is a bit old; I got here from an issue and am not sure where is best to comment.)

I was thinking about a nice syntax for generalized range. This one seems nice to me :

pub enum Bound<T> {
    In(T),
    Out(T),
    None,
}

pub struct Span<T> (Bound<T>, Bound<T>);

for i in Span(Out(10), None){} //from 11 to infinity and beyond
for i in Span(Out(10), Out(20)){} //from 11 to 19

But, it is just a idea about a nice syntax to me, I don’t know if it is doable. Especially, I’m not sure it is possible to do a Zero Cost iterators with an enum as a Bound.

That would make sense to me as an interval, so you can do things like [0,10] - [1,10) => (-10, 9]. The In/Out probably aren’t any harder to make iterators from than iterating a current (half-open) range backwards. The None, though, would prevent the type system from saying “you can’t iterate unless there’s a lower bound” like it does today.

Hmm, RangeFrom has an iterator, but it doesn’t stop on overflow. One way to solve the “I want to iterate all the u8s” problem would be to just require that the iterator stop at overflow, then “all positive i8s” is iterable with “0i8…”. I suppose that adds the downside of overflow checks, but the termination might actually make the loop easier for the compiler to reason about…

I’ve been preferring #3; I agree that more range structs is not a good solution.(*)

Custom endpoint types are an API design opportunity, and libraries/consumers are fine to invent their own protocols (The endpoints of a std::ops::Range<Foo> can be inclusive, exclusive, fuzzy-matching or whatever fits the library).

For example, BTreeMap could use Range<Bound<&K>> to represent its key interval, since it allows expressing inclusion or exclusion at either endpoint. The enum type could be something like enum Bound<K> { Include(K), Exclude(K) }.

(*) I have a corner case use where it’s useful to statically ensure only ranges with at most one point are accepted in an API: only .. and a.. and ..b are statically trusted to be valid slicing ranges, if a and b are trusted indices (from the indexing project).

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.