The term "niche" when referring to disallowed bit patterns and reusing them in option-like enums

Apologies for further derailing — this thread was separated from another discussion by a moderator — but what’s the etymology for “niche” as used by Rust? Is it a term used outside of the Rust community (e.g., general programming language theory)?

The etymology is "@eddyb had to pick a name when implementing this". :wink: It's a Rusty term. I do not know of another language having a name for the same concept.

6 Likes

Let's seriously decide on something that's closer to being self-explanatory, like "invalid values" and maybe "invalid (value) reuse"?

3 Likes

As a native speaker of American English and the responsible editor for over 15.000 pages of IEC computer communications standards, and as a once mathematician, I prefer "niche".

Computer representations of information are all bit strings, usually of fixed length for a given elementary item of information. With analogy to other basic Rust types, we can denote an N-bit representation bit string as bN. Various semantic mapping functions are used to map from the bN representation domain to various semantic ranges such as uN, iN and fpN.

The semantic mapping from a bN representation to uN is unipartite (i.e., one-part).

The semantic mapping from a bN representation to iN is usually unipartite, but it is possible to map the distinguished bit string 0b10…0, usually representing -2N-1, to a distinguished value, such as OVERFLOW, in which case the mapping would be bipartite (i.e., two part). In that case we could say that a niche has been carved out of the traditional two's-complement mapping to create a representation for that OVERFLOW value.

Similarly the Rust Option<T> type, when applied to a NonZeroN<U> base type, reserves the representation domain bN value 0b0…0 for None while the other representation bN values are used to represent the NonZeroN<U> base type. This is another bipartite semantic mapping.

The semantic mapping from a bN representation to IEEE floating-point fpN is five-part: normalized values, denormalized values, ±∞, quiet NaNs, and signaling NaNs. In this case the niche is not just for a singleton representation-domain value like the previous OVERFLOW or None, but entire subranges of the representation domain for the denormalized values and the two classes of NaNs.

@eddyb Please note that, by the above analysis, none of the values in any representation domain bN are "invalid values"; that designation is logically incorrect. Rather, the representation domain has been partitioned into a multi-part semantic mapping to disjoint ranges of semantic concepts. To come back to use of the word "niche", I find it to be an appropriate shorthand way to refer to the existence of such multi-partite mappings, and in particular to the portions of the representation domain that have been carved out to map to the secondary (in some sense) semantic ranges.

9 Likes

@Tom-Phinney Some of what you are saying reminds be of what I wrote up recently in this PR, though our terminology is not entirely aligned. I think your "semantic mapping function" corresponds to my "value representation relation". A difference is that I define one semantic domain of values and then have partial relations for each type whereas you are carving out the semantic domain covered by each type separately.

I chose a relation as opposed to a function because sometimes you start with a representation and want a semantic value, and sometimes you need to go the other way. I wanted to avoid making one of the two directions the "primary" one, the way a function does. A relation is also strictly more general, e.g. I am not convinced that for unions in C, the mapping from representations to semantic values actually is functional. I sure hope we can make it functional in Rust, but I think it is still useful to have the terminology to talk about the more general case as well.

Also I think the relation is somewhat easier to define (see my PR for a few examples); using a function would IMO only make sense if we define it in a way that actually computes the semantic value from the concrete representation. In contrast, I am stating the relation in a more propositional, less computational style, making it shorter and easier to read and understand. But of course then one has to be careful not to accidentally relation one representation to two values.

If the relation is functional (in the representation -> value direction), it could equivalently be viewed as a partial function from representations to semantic values, and that's your semantic mapping function. And then if you consider the subset of the domain of all values that can actually be hit by this function, that's your semantic range for a particular type.

I am not familiar with the "unipartite"/"bipartite"/"N-part" terminology you are using. I know about bipartite graphs, but I don't think "unipartite graphs" are a thing and anyway you are talking about a function or relation, not a graph.

I agree though that "invalid value" is a bad choice, but for a different reason: "value" refers to the "high-level" view whereas niches are inherently about the "low-level" view -- so, in the context of my PR, a better name would be "invalid byte lists", aka byte lists that are not related to any value, or "invalid representations" if one wants to be less concrete.

1 Like

I used the term "mapping" in a mathematical sense, because the projection from the representation domain to the semantic range, or potentially set of disjoint semantic ranges, is neither 1-to-1 nor onto. This is particularly apparent when considering mappings between bN and the disjoint semantic range of fpN, which I view as the real number field ℝ augmented by +∞, -∞, and the sets of quiet NANs and signaling NaNs. The mapping is not 1-to-1 because there are two bN representations for zero, +0 (0b00…0) and -0 (0b10…0). It is not onto because there are an infinite number of values in the real number field ℝ that have no exact representation in bN.

I consider the semantic ranges disjoint because the sets of quiet NaNs and signaling NaNs and the values +∞ and -∞ are not part of the real number field ℝ. While the values +∞ and -∞ can be considered to be "extensions" of that field that are not, in some sense, "disjoint", that logic cannot be applied to the sets of quiet NaNs and signaling NaNs. The fact that fpN in Rust does not satisfy the trait PartialOrd, because NaNs cannot be compared in an ordinal way, clearly demonstrates their disjointness from the real number field ℝ even when it is augmented by ±∞.

I feel that it is necessary to reach some degree of common understanding here before bikeshedding specific terminology such as relation. Also please note that I used the term "function" only once, to assist in the explanation of my "semantic mapping" concept. Had I known that word would derail understanding I would have found a different word.

1 Like

Okay, so it is neither surjective nor injective. (I have never heard the term "onto" before, probably because I learned all of this in German and surjective is more commonly used in research. And WTF, why does "1-to-1" mean "injective"? I was sure it would mean "bijective".)

But, that still makes it a function in the mathematical sense (for every one input there is only one possible output). That also seems to be the most common way to refer to this in English, at least Wikipedia talks about "(non-)injective functions" and "(non-)surjective functions", not "mappings".

So, if I may rephrase, what you are saying is that the semantic range is in some cases actually defined as the disjoint union of several pieces. Sure, but that's a detail of how this particular semantic range is defined; the general notion of a semantic range need not be concerned with that.

And there is not even a unique way to talk about the "partitions" of a "semantic range", e.g. the integers can be seen as "one thing" or as a disjoint union of three sets (strictly positive numbers, 0, strictly negative numbers). I see this as a meaningless distinction, what we care about is that there's a set of all integers and we can map lists of bytes into it; we do not care about the internal structure of that set.

And I fail to see how this relates to niches.

It seems I was unclear, I was not confused by the "function" part at all. I deal with functions all the time (though usually in type theory when I formalize my proofs in Coq :wink: ), but I rarely encounter the term "mapping" as a noun (I do speak about a function "mapping" things from somewhere to somewhere; the verb is fine, just the noun is weird for me). So just to be clear, I mean this notion of function, which is actually defined as a particular kind of relation.


So let me give another shot at saying what I was already trying to say last time.

Where our approaches differ is in how we view the connection between values and their representation:

  • I view it primary as a relation answering the question "are this semantic value and this concrete representation related". So I assume to be given both of them and then I give a yes/no answer.
  • If I understood correctly, you view it primarily as a function answering the question "given a concrete representation, tell me the value it represents". Implicitly, your approach assumes there is a unique such value (otherwise it wouldn't be a function). And I suppose you also (implicitly) meant the function to be partial because not all representations represent a value (such as 0b11111111 when thinking of bool).

The information content is the same in both cases (at least when the value-representation-connection can be cast as a function; non-functional relations are conceivable here and they would not fit your view).

2 Likes

I believe that we are in agreement here. I wrote my initial post above in between afternoon concerts at a music festival, so did so without much attention to nomenclature. I don't actually care what terminology we end up choosing to describe these partitioned mappings. I avoided the term "function" (or at least tried to) because in computers functions are one-way, whereas mappings can be two-way. Since most readers of this forum are presumably not mathematicians, I hoped my avoidance of "function" would avoid confusion.

My post was triggered in large part by eddyb's statement about "invalid values". I thought that clarity was needed for this topic, and that in my view the issue was one of recognizing that "niches" resulted from partitioning both the domain and the corresponding range of the underlying mathematical "function".

injective simply requires that distinct elements in the domain map to distinct elements of the range. bijective requires that there be a reverse mapping from the range to the domain and that both be injective. For the integers the mapping f(x) = x * 2 is injective but not bijective, because the range does not include the odd integers even though the domain includes all the integers.

1 Like

I think "invalid value" is a bit too broad, because (for example) 42u32 is a perfectly valid u32 but potentially an "invalid" value in certain contexts. It would be nice if the term conveyed this is more about bit representations rather than high-level invariants.

How about "bit-restricted"? So you have bit-restricted types and bit-restricted values. And the compiler/optimizer can exploit these bit-restrictions.

It's longer than "niche" and "invalid" but I think it's more self explanatory.

2 Likes

I still don't get this angle. A partition of a set (and many other things) is sorting all its elements into disjoint subsets that cover the whole set. But for the purposes we use the term "niche" for in Rust, the important bit is not that you can partition the elements which are there, but that some elements "aren't there", i.e. there is a "hole" in the set of representations -- bit strings which don't correspond to any mathematical value.

In your first post, you give the example of (I'll specialize for concretness) Option<NonZeroU32>, partitioning the set of length-32 bit strings into 00...00 (mapped to None) and all the rest (mapped to Some(n) for various n). But niches, or whatever we want to call them, exist and are of interest no matter how or whether they are used elsewhere.

NonZeroU32 has, on its own, 00...00 available as a niche (because it is not a valid bit string for that type). This is true even if nothing in the whole universe ever uses that niche -- for a more realistic example, it's quite unlikely anyone has ever compiled a Rust program that exhausts the entire niche of char, which encompasses over four billion distinct bit strings.

Furthermore, it would be wrong to call 00...00 a niche for the type Option<NonZeroU32>. That type has no niches, every length-32 bit string represents a value of that type. When a previously invalid bit pattern is consumed for layout optimization, it ceases to be a niche/invalid (in the context of the type where it is used that way).

1 Like

Your above points demonstrate why I tried to present this issue as one of partitioning the representation space bN into a disjoint mapping – a multi-entry match statement or enum, if you wish. Perhaps others can explain/explore this subject better than I can.

I agree with this -- the niche is a property of the type before it has been wrapped by Option/whatever to fill that niche. This fits well with the classic definition of niche as a recess in a wall for ornamental objects. Useful synonyms in this sense might be "gap" or "hole", though less colorful.

4 Likes

0000…0 and 1000…0 are not two bit representations of the same value; +0 and −0 are actually distinct floating-point values, even though they compare equal. There are API-surface-level operations you can perform on them that expose this difference, like (+1) ÷ (−0) = −∞. (If you insist, you can consider floating-point numbers to form a bug-eyed line.)

The graph of the successor/predecessor relation provides you with a notion of connectedness that makes the set of all integers 'a single piece', but the set of non-zero integers 'two pieces', even though the set itself is discrete. A topology on a set (like one on ℝ ∪ { −∞, +∞ }) can likewise create a meaningful distinction between connected and non-connected subsets. And let's not forget coarse geometry, which can generalise connectedness as defined on graphs.

However, I agree that talking about the topology of bit representations is a distraction that only serves to perpetuate the confusion between bitstrings and values they represent. It's better to just consider them arbitrary, opaque labels with no structure other than identity. Besides, the successor relation of integers is not the only one or even the most natural one that might be imposed on bitstrings. One might just as well consider the Hamming-distance graph (edge between any two bitstrings of Hamming distance 1), or interpret a bitstring of 2n bits as a pair of n-bit integers and consider the adjacency relation on a 2D coordinate grid; both of which keep the set 'connected' after removing the all-zero bitstring.

Also, sometimes the abstract semantic space has no topology to speak of either. Consider this relation: 00:diamonds:, 01:heart:, 10:clubs:, 11:spades:. Is it a 'unipartite' or a 'quadripartite' mapping? Would it be if I shuffled the suits?

1 Like

I see. With my type theory hat on I would say functions are one-way even in mathematics, but relations can be viewed both ways... but we are digressing. :wink:

This is the point I am still not getting. Or maybe it's just terminology again. The way I view it, your function will simply not be defined on the bit pattern 0b11111111 for type bool -- so the niche would be characterized as being outside the domain of the function. (This is the same point @hanna-kruppe made.) And I don't think the function's range has anything to do with it.

The maximal possible niche consists of those bit patterns that do not represent a "semantic value". The niche used by the compiler is some approximation of that (since exactly representing the maximal possible niche can be tricky). Usually the latter is what we refer to as "niche", and the former as "invalid values" or "values not satisfying the validity invariant".

Oh, so maybe that is the confusion here? @Tom-Phinney did you consider 0b00000000 a niche of Option<NonZeroU8>? That would be wrong, this is not what niche means. 0b00000000 is a valid bitstring for that type. That type has no niche.

This is related to our ongoing struggle of finding good names for the "safety invariant" and the "validity invariant", as I called them in this blog post.

We could go deep into topology here but I don't see how that has any connection to Rust any more. :wink: The successor/predecessor relation is a rather arbitrary choice here. (So basically, I agree with what you are saying later.)

What is "them" here, the bit representations or the values or both?

I meant the former, but sometimes it's the right choice for values themselves as well (like in the card suits example). There's a reason Ord is not an auto trait.

They are distinct in the wierd space defined by the fpN encodings, but not in the extended real number field ℝ ∪ { −∞, +∞ } that they are supposed to represent.

This is why I used the term "multi-partite". If we consider my "mapping" as a Rust match expression on the value of the bN representation, your "function" corresponds to exactly one arm of the match while my "mapping" is for the entire multi-arm match, where the domain of the match spans the entire range of possible values of the bN representation.

0b00000000 is the only niche in the representation of NonZeroU8. Option<NonZeroU8> exploits that niche to provide an encoding for None. Thus the representation of Option<NonZeroU8> has no niches.

I'm a bit late to this party, but it's not clear to my why "niche" is a big deal? It provides good motivation for why we care about such representations (i.e., we want to stick data into an unused bit) and is, ultimately, specialized vocabulary that only the small minority that writes unsafe code will ultimately care about.

It's vocabulary, similar to the sense that "definition" and "declaration" are frequently used interchangeably, but are distinct terms in C++ standardsese. Of all the Rust vocabulary out there, I argue that this is a case where it does not matter.

3 Likes

But there is no value represented by 0b11111111 as bool. So what does the mapping output for that input?

With analogy to IEEE Floating Point, I classify all such representation values as "NotAValue" (NaV). For NonZeroU8 the b8 representation 0b00000000 is NaV. For bool all b8 representation values except 0b00000000 and 0b00000001 are NaV. In my view, a representation value that is NaV implies instant UB.

An enum that incorporates the base type in one variant should be able to use one or more of those representation values that are NaV in the base type as niches to provide representations for disjoint variants. In that case the enum representation could be the same size as the representation of the base type; it simply would have fewer representation values that are NaV than the base type has. Option is the most common enum type in Rust that exploits such a niche to achieve same-size representation.

Not necessarily, you'd also want to use this to explain to beginner-to-intermediate users why e.g. Option<&T> is a zero-cost abstraction.