Pre-RFC: Custom Literals via Traits

mcy · July 24, 2018, 6:46pm

This is a formal collection of my counter proposal from https://internals.rust-lang.org/t/pre-rfc-custom-suffixes-for-integer-and-float-literals/.

Summary

This RFC proposes syntax sugar for custom literals, implemented as std::ops traits. A custom literal is a suffix added to a numeric, string, or character literal to produce a user-defined type. For example:

Complex numbers: 1 + 2i
Time literals: 42ms + 120ns
Typechecked units: 12m / 55s
Compile-time checks of simple embedded languages:
- "foo(.+)"Regex
- "18.4.0.1"IPv4
- "yyyy-mm-dd"DateFormat
Literals for custom integers: 192346712347913279231461927356_BigInt, 1.23f16
Literals for binary blobs: "c29tZSBiaW5hcnkgZGF0YQ=="base64
Literals for dealing with non-utf8: "Hello, World!"utf16, "Hello, World!"latin1

This proposal attempts to define custom literals in the least ad-hoc way possible, in analogy with C++'s operator "" and Rust’s operator overloading.

An explicit non-goal of this RFC is suffixes that are not valid Rust identifiers.

Motivation

The above examples provide ample motivation: shortening calls like Complex::new(1, 2) to 1 + 2i, where the syntax is sufficiently well-known that the shortened sugary form is “generally known”, or for introducing literal syntax for new kinds of compile-time constructs.

In the small cases where the short form is common outside of the context of the defining library, this is an ergonomic win and a readability win.

However, custom literals have the potential (like other overloadable operators) to produce horribly unreadable code. This proposal does not try to prevent misuse of custom literals, and leaves that up to the users’ judgement.

Note: I am in principle against the use of custom literals, since they can be easily abused. However, I think it is inevitable that Rust will get them, since their niche uses are justifiable.

Guide-level explanation

A custom literal is an expression consisting of a numeric (123, 42.3, 12e-5), string ("foo", b"bar"), or character ('k', b'\0') literal followed by a path. For example,

let _ = 10i32;      // explicit i32 literal
let _ = 2.45f32;    // explicit f32 literal
let _ = 5i;         // imaginary number literal
let _ = 102ms;      // millisecond duration literal
let _ = ".+"regex;  // regular expression literal
let _ = '?'char16;  // UTF-16 codepoint literal

let _ = 0xff_ff_ff_ff_ab_cd_ef_00m8x8; // SIMD mask literal
// though this last one does not strike me as a very good idea...

Custom literals are defined by implementing a trait, like the following example from core:

impl IntLit for i32 {
    type Output = i32;
    const fn int_lit(lit: i__) -> Self::Output {
        lit as i32
    }
}

(Note: i__ is a “literal type” described here).

The Self type of the impl is the type used in the literal expression. Thus, it can be chosen to be a dummy type that only exists to provide a symbol:

enum ms {}
impl IntLit for ms {
    type Output = Duration;
    // ..
}

Thus, all custom literals have a simple desugaring:

let _ = 123i32;
// becomes
let _ = <i32 as IntLit>::int_lit(123);

let _ = "foo"regex;
// becomes
let _ = <regex as StrLit>::str_lit("foo");
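The desugared form can already be written by hand in today's Rust, which gives a feel for what the sugar would expand to. The following is only a minimal sketch under stated assumptions: `u64` stands in for the hypothetical `i__` literal type, the trait methods are plain `fn`s rather than `const fn`s, and the `ms` and `IPv4` symbol types are illustrative names, not part of any existing library. The string check also happens at run time here, not at compile time.

```rust
use std::net::Ipv4Addr;
use std::time::Duration;

// Stand-ins for the proposed traits (assumption: `u64` replaces `i__`,
// and the methods are ordinary `fn`s instead of `const fn`s).
pub trait IntLit {
    type Output;
    fn int_lit(lit: u64) -> Self::Output;
}

pub trait StrLit {
    type Output;
    fn str_lit(lit: &'static str) -> Self::Output;
}

// Symbol types: uninhabited, they exist only to name a suffix.
#[allow(non_camel_case_types)]
pub enum ms {}
pub enum IPv4 {}

impl IntLit for ms {
    type Output = Duration;
    fn int_lit(lit: u64) -> Duration {
        Duration::from_millis(lit)
    }
}

impl StrLit for IPv4 {
    type Output = Ipv4Addr;
    fn str_lit(lit: &'static str) -> Ipv4Addr {
        // Panics at run time on a malformed address in this sketch.
        lit.parse().expect("invalid IPv4 literal")
    }
}

fn main() {
    // What `23ms` would desugar to under the proposal.
    let millis = <ms as IntLit>::int_lit(23);
    // What `"18.4.0.1"IPv4` would desugar to.
    let addr = <IPv4 as StrLit>::str_lit("18.4.0.1");
    println!("{:?} {}", millis, addr);
}
```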

To use a literal, you’ll need to import the “symbol type” into scope. For example,

use std::time::literals::ms;

let millis = 23ms;
let nanos = 45ns; // ERROR: can't find type `ns`

It’s even possible to use the whole path,

let millis = 23std::time::literals::ms;

or rename them with imports or type aliases

use std::time::literals::{ ms, ns as nanos };
type millis = ms;

let millis = 32millis;
let nanos = 45nanos;
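Since the suffix is an ordinary type, renaming it needs nothing new. Here is a rough sketch of the import and alias story using the hand-written desugaring; the `literals` module is a hypothetical stand-in for `std::time::literals`, and `u64` again stands in for `i__`.

```rust
pub trait IntLit {
    type Output;
    fn int_lit(lit: u64) -> Self::Output;
}

// Hypothetical module standing in for `std::time::literals`.
pub mod literals {
    use super::IntLit;
    use std::time::Duration;

    #[allow(non_camel_case_types)]
    pub enum ms {}
    #[allow(non_camel_case_types)]
    pub enum ns {}

    impl IntLit for ms {
        type Output = Duration;
        fn int_lit(lit: u64) -> Duration {
            Duration::from_millis(lit)
        }
    }

    impl IntLit for ns {
        type Output = Duration;
        fn int_lit(lit: u64) -> Duration {
            Duration::from_nanos(lit)
        }
    }
}

// Renaming works through the usual `use ... as` and `type` mechanisms.
use literals::{ms, ns as nanos};
#[allow(non_camel_case_types)]
type millis = ms;

fn main() {
    let a = <millis as IntLit>::int_lit(32); // `32millis`
    let b = <nanos as IntLit>::int_lit(45); // `45nanos`
    println!("{:?}", a + b);
}
```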

Note that there are some parsing ambiguities. 10e100 always parses as a single float literal. Thus, the impl

impl FloatLit for e100 { .. }

will generate a lint warning. It is still possible to call it with UFCS.

We can teach this by comparison with C++'s operator "". Custom literals are intended to be defined analogously with Rust’s usual operator overloading.

Reference-level explanation

This RFC changes the grammar as follows:

Literal := (IntLit | FloatLit | StrLit | CharLit | ByteStrLit | ByteCharLit) Path?

Whenever a Literal which includes a path after it is encountered, it is desugared to

<$path as FooLit>::foo_lit($lit)

where FooLit is the relevant literal trait for that literal type. These traits are as follows, defined in core::ops. Each one comes with an attendant lang item.

#[lang = "int_literal"]
pub trait IntLit {
    type Output;
    const fn int_lit(lit: i__) -> Self::Output;
}

#[lang = "float_literal"]
pub trait FloatLit: IntLit {
    const fn float_lit(lit: f__) -> Self::Output;
}

#[lang = "string_literal"]
pub trait StrLit {
    type Output;
    const fn str_lit(lit: &'static str) -> Self::Output;
}

#[lang = "char_literal"]
pub trait CharLit {
    type Output;
    const fn char_lit(lit: char) -> Self::Output;
}

#[lang = "byte_string_literal"]
pub trait ByteStrLit {
    type Output;
    const fn byte_str_lit(lit: &'static [u8]) -> Self::Output;
}

#[lang = "byte_char_literal"]
pub trait ByteCharLit {
    type Output;
    const fn byte_char_lit(lit: u8) -> Self::Output;
}

Furthermore, the “obvious” implementations for the 42i32 et al. literals will be added to core.
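One way the "obvious" primitive implementations could be generated is with a small macro. This is a sketch, not what core would necessarily do: `u128` stands in for `i__`, the methods are not `const`, the macro name `impl_int_lit` is made up for illustration, and the body mirrors the `lit as i32` cast from the guide-level example, so out-of-range values wrap instead of being rejected.

```rust
pub trait IntLit {
    type Output;
    fn int_lit(lit: u128) -> Self::Output;
}

// Generates `impl IntLit for $t` for each listed primitive type.
// The `as` cast truncates out-of-range values in this sketch.
macro_rules! impl_int_lit {
    ($($t:ty),*) => {
        $(
            impl IntLit for $t {
                type Output = $t;
                fn int_lit(lit: u128) -> $t {
                    lit as $t
                }
            }
        )*
    };
}

impl_int_lit!(i8, i16, i32, i64, i128, isize, u8, u16, u32, u64, u128, usize);

fn main() {
    let a = <i32 as IntLit>::int_lit(123); // `123i32`
    let b = <u8 as IntLit>::int_lit(255); // `255u8`
    println!("{} {}", a, b);
}
```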

All of the usual import and name-shadowing rules apply, as corollaries of this being implemented as a trait.

The following impls generate a lint (yet unnamed, please bikeshed):

impl IntLit for e<numbers> { .. }
impl IntLit for E<numbers> { .. }

impl FloatLit for e<numbers> { .. }
impl FloatLit for E<numbers> { .. }

This is to point out a parse ambiguity. Float literals in scientific notation are always lexed as literals, since having 10e100 possibly parse as a custom literal is extremely confusing.
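To make the ambiguity concrete, the sketch below defines a symbol type whose name looks like an exponent. It assumes a simplified `FloatLit` (no `IntLit` supertrait, `f64` in place of `f__`), and the `e2` type and `Tagged` wrapper are hypothetical. The literal `10e2` is still an ordinary float, and the impl stays reachable only through the fully qualified call.

```rust
// Simplified stand-in for the proposed trait (assumption: `f64` replaces
// `f__`, and the `IntLit` supertrait is dropped for brevity).
pub trait FloatLit {
    type Output;
    fn float_lit(lit: f64) -> Self::Output;
}

// Hypothetical symbol type whose name collides with exponent syntax.
#[allow(non_camel_case_types)]
pub enum e2 {}

// Wrapper so the custom result is distinguishable from a plain f64.
#[derive(Debug, PartialEq)]
pub struct Tagged(pub f64);

impl FloatLit for e2 {
    type Output = Tagged;
    fn float_lit(lit: f64) -> Tagged {
        Tagged(lit)
    }
}

fn main() {
    // Lexed as one float literal in scientific notation: 10 * 10^2.
    let plain = 10e2;
    // The impl can only be invoked through the explicit form.
    let custom = <e2 as FloatLit>::float_lit(10.0);
    println!("{} {:?}", plain, custom);
}
```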

One problem this solves is having the same symbol for different literals. Consider, for example, complex numbers and quaternions:

let z = 1 + 2i;
let q = 3 + 4i + 5j + 6k;

The fact that literals must be imported, and have a unique implementation, completely sidesteps this confusion. To use the first syntax you might need to use num::complex::i;, but for the second, you might need use my_gfx::quat::literals::*;. Now, if somehow num::complex::i and my_gfx::quat::literals::j are in the same scope, we get a type error:

let q = 4i + 5j;
        ^^   ^^
        |    |
        |    of type Quat<i32>
        of type Complex<i32>
= Cannot find impl Add<Quat<i32>> for Complex<i32>

Thus, the contents of the scope determine the type output of a particular literal. Also, we avoid the bikeshed of “do we use mathematics or EE notation for complex numbers?” We can stick with i (which num already uses); our electrical engineer brethren can just use num::complex::i as j;! At the end of the day, this boils down to the usual use name-clash problems and solutions we’re all used to.
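The scoping argument can be tried with the hand-written desugaring. In this sketch the `complex` and `quat` modules are hypothetical stand-ins for `num::complex` and `my_gfx::quat::literals`, `i32` stands in for `i__`, and the `Complex` and `Quat` structs are simplified to integer fields.

```rust
pub trait IntLit {
    type Output;
    fn int_lit(lit: i32) -> Self::Output;
}

#[derive(Debug, PartialEq)]
pub struct Complex { pub re: i32, pub im: i32 }

#[derive(Debug, PartialEq)]
pub struct Quat { pub w: i32, pub i: i32, pub j: i32, pub k: i32 }

// Hypothetical stand-in for `num::complex`.
pub mod complex {
    #[allow(non_camel_case_types)]
    pub enum i {}
    impl super::IntLit for i {
        type Output = super::Complex;
        fn int_lit(lit: i32) -> super::Complex {
            super::Complex { re: 0, im: lit }
        }
    }
}

// Hypothetical stand-in for `my_gfx::quat::literals`.
pub mod quat {
    #[allow(non_camel_case_types)]
    pub enum i {}
    impl super::IntLit for i {
        type Output = super::Quat;
        fn int_lit(lit: i32) -> super::Quat {
            super::Quat { w: 0, i: lit, j: 0, k: 0 }
        }
    }
}

pub fn in_complex_scope() -> Complex {
    use complex::i;
    <i as IntLit>::int_lit(2) // `2i`
}

pub fn in_quat_scope() -> Quat {
    use quat::i;
    <i as IntLit>::int_lit(2) // the same `2i`, but a different type
}

pub fn ee_style() -> Complex {
    use complex::i as j; // the EE spelling is just a rename
    <j as IntLit>::int_lit(2) // `2j`
}

fn main() {
    println!("{:?}", in_complex_scope());
    println!("{:?}", in_quat_scope());
    println!("{:?}", ee_style());
}
```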

Drawbacks

Custom literals open the door to write-only code. For example, introducing literals that have meaning unique to a crate will confuse readers. Luckily, the fact that literals need to be imported by name (or by glob) makes it somewhat easier to track them down.

I argue that compiler complexity is not a drawback: it actually makes a particular parser rule somewhat simpler, since it no longer has to care about a list of primitive types, and can let the type system deal with it. The only other compiler addition is six lang items and a simple desugaring rule.

Rationale and alternatives

I think this is the best design because it makes use of the trait system. Not only are literals first class types, but it’s also possible to write them as trait bounds. The import and shadowing story are both already part of Rust and thus familiar to both current users and new users learning about advanced operators.
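As a sketch of what "literals as trait bounds" might look like, a function can be made generic over the suffix type. The names here (`timeout`, `ms`, `s`) are illustrative assumptions, `u64` stands in for `i__`, and the methods are plain `fn`s.

```rust
use std::time::Duration;

pub trait IntLit {
    type Output;
    fn int_lit(lit: u64) -> Self::Output;
}

#[allow(non_camel_case_types)]
pub enum ms {}
#[allow(non_camel_case_types)]
pub enum s {}

impl IntLit for ms {
    type Output = Duration;
    fn int_lit(lit: u64) -> Duration {
        Duration::from_millis(lit)
    }
}

impl IntLit for s {
    type Output = Duration;
    fn int_lit(lit: u64) -> Duration {
        Duration::from_secs(lit)
    }
}

// Generic over the suffix: `L` is constrained like any other type.
pub fn timeout<L>(n: u64) -> Duration
where
    L: IntLit<Output = Duration>,
{
    L::int_lit(n)
}

fn main() {
    println!("{:?} {:?}", timeout::<ms>(250), timeout::<s>(2));
}
```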

We could define the traits instead as

trait IntLit {
    const fn int_lit(lit: i__) -> Self { .. }
}

This would go a long way to make custom literals less confusing (and still works with 0i32 and friends!). However, unless we want to write type aliases like type i = Complex; or define a type that coerces via addition into Complex, we lose out on a large class of useful literals.
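For comparison, here is roughly how the `-> Self` variant plays out when written by hand. It is a sketch assuming `i64` in place of `i__` and a simplified, non-generic `Complex`. The primitive case stays trivial, while the short suffix `i` only works through a type alias.

```rust
// The alternative trait shape: no `Output`, the literal produces `Self`.
pub trait IntLit: Sized {
    fn int_lit(lit: i64) -> Self;
}

impl IntLit for i32 {
    fn int_lit(lit: i64) -> i32 {
        lit as i32
    }
}

#[derive(Debug, PartialEq)]
pub struct Complex { pub re: i64, pub im: i64 }

impl IntLit for Complex {
    fn int_lit(lit: i64) -> Complex {
        Complex { re: 0, im: lit }
    }
}

// Without an `Output` type, the suffix must be an alias of the real type.
#[allow(non_camel_case_types)]
pub type i = Complex;

fn main() {
    let a = <i32 as IntLit>::int_lit(0); // `0i32` still works
    let z = <i as IntLit>::int_lit(2); // `2i`, via the alias
    println!("{} {:?}", a, z);
}
```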

The alternative proposed design (which this RFC started as a counter proposal to) is by attribute. For (an abridged) example:

#[int_literal("s", "ms", "ns")]
fn time_lits(lit: i64, suffix: &'static str) -> Duration { .. }

This is a problematic proposal for a few reasons:

Attributes do not appear in documentation, which makes them hard to document and not discoverable (unlike std::ops).
Attributes are pretty magical, so we need bespoke importing, shadowing, and naming rules, and this can’t be used as a type constraint.
Using strings instead of identifiers invites use of non-identifiers, which will make parsing more difficult and code generally more confusing.
It can’t leverage existing use syntax for renames without a lot of magic name generation.

Another alternative is a macro approach, either making the literal call a procedural macro (which is overkill for most uses) or a postfix macro.

We could also just not do this at all and rely on extension traits:

trait DurationLit {
    fn s(self) -> Duration;
    fn ms(self) -> Duration;
    fn ns(self) -> Duration;
}

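impl DurationLit for u64 {
    fn s(self) -> Duration { Duration::from_secs(self) }
    fn ms(self) -> Duration { Duration::from_millis(self) }
    fn ns(self) -> Duration { Duration::from_nanos(self) }
}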

let _ = 34.ms() + 65.ns();

I think that custom literals, if specified carefully, can make this code slightly more natural (I’m also slightly opposed to doing it like this anyways, since extension traits are a bit [eyebrow raise] in my opinion).

Prior art

The sole language with custom literals that I know of is C++. They are defined as follows:

Duration operator ""_ms(unsigned long long lit) { .. }

C++ does not seem to have a good import story for custom literals, or a shadowing story.

C++ also requires user-defined custom literals to start with an underscore, which is to avoid parsing ambiguity. The STL is, however, allowed to define things like

constexpr complex<double> operator ""i(long double arg) { .. }

This is not good for Rust for two reasons:

Rust does not have this parsing issue.
Rust allows underscores to appear anywhere in numeric literals, which mostly defeats this STL/user code distinction. See 0xdeadbeef_u64.
For readability purposes, this is about as useful as Hungarian notation. At the point that custom literals are in play, it is clear which are std's: the ones that are named after primitive types (though these may themselves be shadowed, since the primitive types' names are not reserved).

Unresolved questions

It is not clear what the type of lit should be for IntLit and FloatLit. See my literal types proposal. It is tempting to use u128/i128 and f64 in place of i__ and f__, but it leaves things like a big integer type, that can consume arbitrary-size literals, a bit in the dark.

We need a name for the scientific notation parse lint.

The enum ms {} idiom emits a lint; I don’t think we should encourage this. It would be nice to be able to write something like

impl IntLit for newtype ms { .. }

to indicate that ms is a single-use uninhabited type. We could also special-case the linter to ignore non_camel_case_types if the sole use of the type is a FooLit trait, but this seems too baroque.

It is not clear if the following should be valid syntax for invoking a literal:

let _ = 42 ms;

I don’t think the grammar minds, and it would probably make things simpler. Am I correct in this assumption? Should we allow it and warn by default?

This RFC explicitly does not consider the following issues:

Literals that are not Rust identifiers. E.g. µs and m/s**2.
Macros like 42.si![m].
Generic literals. In principle I could imagine 12foo::<T>, but for now we should probably not allow things like impl<T> IntLit for ... (I’d suggest a warning and a “may be added in the future”).

varkor · July 24, 2018, 8:05pm

Using uninhabited types as units is very confusing — you’re effectively overloading what types mean just because they coincidentally fit into the rest of your framework. The type itself is useless apart from for providing the impl.

Also, you mention “type-checked units” in the summary, but then don’t tackle one of the most important features of units of measure: equality of units. In this proposal, if we want to compute 2m / 1s, we need three types: Distance, Duration and DistanceByDuration. We can use const generics, 1m, 1m^2, 1m^3, etc. to represent powers of a single unit, but we’re stuck when it comes to products of different units.

Literal suffixes and units of measure are very much related, I think any proposal tackling one needs to provide a solution (or a reason not to) for the other.

(Also, Complex should probably be Imaginary, because in this proposal, only numeric literals suffixed with i are complex.)

mcy · July 24, 2018, 8:16pm

This is not a new pattern. See byteorder. I only suggest it this way because it seems like the simplest solution to the problem of introducing custom literals by way of traits. I'd like to know if someone comes up with a more natural solution.

I think there's already a crate that handles most of this work? Typechecked units are not a use case I'm interested in, so I'd like to see how users of existing units crates would want this to fit into their framework. For example, the units crate seems totally agnostic to what the units actually are. Extending this with my proposal seems straightforward.

I disagree. These problems should not be tightly coupled; the only relation is that custom literals are an ergonomics and readability win that has applications outside of units of measure. I also argue that units of measure are the rarest of the three motivations I suggested, but I don't have data to back it up.

C++ seems to make this distinction, but I think it's unnecessary; Complex addition of literals can be optimized away to compile time. I suppose the only issue is that T: !Add<Complex<T>> where T: Add<T>, for coherence reasons. You could do something like 1r + 2i but that's pretty horrible imo.
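To illustrate the coherence point, here is a sketch of what `1 + 2i` needs when `i` produces a `Complex` directly. The `Complex<T>` struct is a simplified assumption, `i32` stands in for `i__`, and the literal is written in its hand-desugared form.

```rust
use std::ops::Add;

#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Complex<T> { pub re: T, pub im: T }

pub trait IntLit {
    type Output;
    fn int_lit(lit: i32) -> Self::Output;
}

#[allow(non_camel_case_types)]
pub enum i {}

impl IntLit for i {
    type Output = Complex<i32>;
    fn int_lit(lit: i32) -> Complex<i32> {
        Complex { re: 0, im: lit }
    }
}

// Allowed: `Complex` is a local type and the left-hand side is concrete.
// The blanket form `impl<T> Add<Complex<T>> for T` is rejected by the
// orphan rules, so each scalar type needs its own impl like this one.
impl Add<Complex<i32>> for i32 {
    type Output = Complex<i32>;
    fn add(self, rhs: Complex<i32>) -> Complex<i32> {
        Complex { re: self + rhs.re, im: rhs.im }
    }
}

fn main() {
    let z = 1 + <i as IntLit>::int_lit(2); // `1 + 2i`
    println!("{:?}", z);
}
```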

Centril · July 24, 2018, 8:24pm

Note that const fns are not permitted in traits or their implementations at the moment.

I think you should; I think the way you can write 1.0<m/s> in F# is quite nice. At least discuss why it is reasonable to ignore this.

mcy · July 24, 2018, 8:40pm

I'm aware. IIRC, this is a planned feature though, and I'm fine with this RFC blocking on it being specified.

I'm going to discuss two orthogonal issues that my RFC is not interested in resolving:

Identifiers that are not alphanumeric, like µs. In my experience, languages that do allow these end up with people going overkill with non-Latin symbols. I think that any situation which requires keyboard chording or copy-pasting from a Unicode table (or elsewhere in the code) is a papercut, and not one which we should allow. Yes, it is possible to include alternatives like us, but in my experience, setting the precedent that such identifiers are ok is a road that ends in madness.
Suffixes that are a compound of symbols and operators feel... ad-hoc, to say the least. I think I can accept that they're somewhat readable, but I don't know how I feel allowing a full DSL as first-class; if anything, it means we need to carve out a bigger parse rule. Also, what folks seem to want is already achievable, though the declaration is nasty (not that I think declaration ergonomics are something worth arguing too much about; code is used far more than it is declared). Here's a quick sketch:

#![allow(all_the_names)]
struct m;
struct s;

impl IntLit for m { 
    type Output = Quantity<m>;
}

impl IntLit for s { .. }

impl Unit for m { .. }
impl Unit for s { .. }
// note that Unit: Mul<Quantity<U>> + Div<Quantity<U>>
// and that Unit: BitXor<i64, Output = UnitPower<Self>>

let _ = 1.0m/s^2;
// desugars to
let _ = <m as FloatLit>::float_lit(1.0).div(s.bitxor(2));

Alternatively, 42.0m / 1.0s. And, as you mention elsewhere, postfix macros give us 42.0.si![m/s], which feels more... right.
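The operator-overloading trick above can be approximated in today's Rust with the literal hand-desugared. This is only a sketch: `Quantity`, `UnitPower`, and `Per` are hypothetical types, `f64` stands in for `f__`, and the `Unit` trait from the quick sketch is left out. Note also that in current Rust `/` binds tighter than `^`, so the power is parenthesized explicitly here.

```rust
use std::marker::PhantomData;
use std::ops::{BitXor, Div};

// Unit symbols are unit structs here so they can also appear as values.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct m;
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct s;

pub trait FloatLit {
    type Output;
    fn float_lit(lit: f64) -> Self::Output;
}

// A value tagged with a unit type.
#[derive(Debug, PartialEq)]
pub struct Quantity<U> { pub value: f64, pub unit: PhantomData<U> }

// A unit raised to an integer power, e.g. `s ^ 2`.
#[derive(Debug, PartialEq)]
pub struct UnitPower<U> { pub exp: i64, pub unit: PhantomData<U> }

// A quantity in `N` divided by a power of `D`, e.g. m / s^2.
#[derive(Debug, PartialEq)]
pub struct Per<N, D> { pub value: f64, pub denom_exp: i64, pub units: PhantomData<(N, D)> }

impl FloatLit for m {
    type Output = Quantity<m>;
    fn float_lit(lit: f64) -> Quantity<m> {
        Quantity { value: lit, unit: PhantomData }
    }
}

impl BitXor<i64> for s {
    type Output = UnitPower<s>;
    fn bitxor(self, exp: i64) -> UnitPower<s> {
        UnitPower { exp, unit: PhantomData }
    }
}

impl<N, D> Div<UnitPower<D>> for Quantity<N> {
    type Output = Per<N, D>;
    fn div(self, rhs: UnitPower<D>) -> Per<N, D> {
        Per { value: self.value, denom_exp: rhs.exp, units: PhantomData }
    }
}

fn main() {
    // Hand-desugared `1.0m / (s ^ 2)`.
    let accel = <m as FloatLit>::float_lit(1.0) / (s ^ 2);
    println!("{:?}", accel);
}
```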

tl;dr I don't want to deal with what I think (though I could be convinced otherwise) are rare use-cases, and would prefer this RFC leaves the syntax open enough for extensions.

varkor · July 24, 2018, 8:45pm

Your three motivations are:

Complex numbers
Units of time
Type-checked units (in general)

Units of time are type-checked units. For complex numbers: how does i differ from a unit of measure?

I think unless there's a good reason not to consider all your motivations as units of measure, they're a primary use-case. If they are a primary use-case, then even though the syntax and type-system considerations are orthogonal, it probably helps to consider them together.

But you're still requiring an Add impl, right? Or is 1 + 2i a single value? If that represents the addition of 1 and 2i, which seems most straightforward, then Imaginary is more precise. (Constant-folding/propagation is by no means guaranteed.)

kornel · July 24, 2018, 8:52pm

Of course ideally they should be orthogonal, but they should also work together nicely. It'd be a shame if Rust added suffixes, and then it was discovered that they're unusable for e.g. the uom crate.

mcy · July 24, 2018, 8:55pm

I think they're ostensibly the same, but semantically different.

1 + 2i is not how units of measure behave. Dimensional analysis (at least, as I was taught in freshman physics) forbids adding together different units. Plus, i**2 is -1, not a new unit. This relation makes them very much not units of measure in my mind.
When folks discuss typechecked units, it's usually for the purpose of making the compiler deal with dimensional analysis for you. I think that grouping time units with general dimensional analysis is unifying two use cases that feel quite different to me.

Also, I added more possible uses of custom literals to my RFC. I came up with them as I was writing it and never revised the summary or motivation sections. I have updated it accordingly. But hey, this is what this discussion is all about!

This RFC rests on a number of other ponies that have been discussed, like my i__ and const fn in traits. I imagine this will be fixed by the eventual addition of ConstAdd. I would, however, expect that the compiler can flatten an #[inline] impl of Add and primitive constant addition.

Like I said, I've never used these crates, so I'd like their current users to weigh in on whether this declaration style can be worked into the existing framework.

Centril · July 24, 2018, 9:07pm

Well, it was in my RFC https://github.com/rust-lang/rfcs/pull/2237, but that has since been closed since I wasn't happy with it.

Ad-hoc as compared to what? Today we have a few suffixes which happen to work as paths; but using only paths does not cover, in my view, important use cases in the units-of-measure domain. Also, sometimes ad-hoc solutions are OK if they cover the domain much better than a non-ad-hoc solution (cf. extern type, which is ad-hoc, but useful...).

If you want to use a library or DSL you'll have to understand how it works, and then the declaration becomes important. The way you got let _ = 1.0m/s^2 to work does not seem terribly intuitive and more like a hack.

It works fine syntactically; but as I said, using a post-fix macro is not composable so I think it is less apt a solution than what you've proposed here.

I agree with @varkor that it seems a prime use-case to deal with things like 1m + 1[m/s] * 300s

mcy · July 24, 2018, 9:21pm

Ok, ad-hoc is totally the wrong word. I think the reason I'm so opposed to it in the first place is that my time working with Scala (and other languages of similar flavor, Haskell included) has made me develop a strong allergy to DSLs and operator overloading in general which I can't give a good word for. (This has, among other things, turned me off from proof-checkers like Coq and Idris.) shrug

I think I just fundamentally disagree with this, but I think that's because most of the software I write is compilers and internet services, so I don't see a use for it. I'm simply afraid of carving out a big part of the grammar for a use case whose prevalence I have no data on.

I agree. This is how a lot of DSLs are defined in Scala and it's appalling. I am not suggesting people should actually do this. I'm only pointing out that it's not only possible, but that this syntax is unavailable!

I think we should probably back-and-forth on a sync platform instead of here-- would you mind hopping over to Discord? Ping me if that sounds good to you.

Centril · July 24, 2018, 9:51pm

I on the other hand think EDSLs are a fantastic thing that really speaks to the expressiveness of the languages you've enumerated. Sure, you can take it overboard if you throw in Agda mix-fix operators and unicode and use them all over the place (which incidentally led Conor McBride to tweet: https://mobile.twitter.com/pigworker/status/764410137884909568), but most features can be overused and abused. We Swedish have a saying "Lagom" which I think is really fitting here.

I don't think these domains (esp. compilers) are the main beneficiaries of units of measure. I expect scientific computing, game-devs (physics engine and such things) and folks who need to measure quantities in either the real world or some made-up world will need this.

CC me there (I use the same nick); for language design, hop into #design.

newpavlov · July 24, 2018, 10:50pm

I think this proposal has the following problems:

Use of marker types feels quite ad-hoc to me, and no, byteorder and other crates which use marker types are not good examples, because m and s will not be used as type parameters anywhere.
It breaks Rust naming conventions, which will inevitably lead to confusion. What does i represent in use foo::bar::i;? My first thought would be a “value”, next a “macro”, and a type would be the very last.
How will const fns work in traits? Will the const be in the trait definition? Or in the implementation?
You will not be able to return BigNum from a const fn; the same goes for regex engines and other heap-backed structures. You may return &'static T, but will that result in ergonomic code, or will you have to clone?
Compound units are not supported. I know we have a different opinion on this, but I personally think that 1.2m_per_s2 and other longer suffixes will not be a good thing for readability of Rust.
String prefixes are not included (in many cases they will be a better choice than string suffixes), though they can be described with the proposed approach.

Regarding tying custom prefixes to the full-blown units system, I am not sure. Yes, ideally they should work together, but AFAIK we don’t even have a Pre-RFC for it, no?

mcy · July 24, 2018, 11:19pm

The whole point is to allow for them to be used in type parameters. I think Rust should allow constraining by pretty much any behavior, including literal processing.

This is still an open problem.

There have been RFCs floating about that would give this meaning. The intention is to require all implementations of a custom literal to be const, so that they can be used in constant position like built-in literals.

Actually, MIR evaluation will support heap allocations, but I don't have details beyond that.

This is explicitly not a question the RFC wants to consider at this point.

I don't want to mess with prefixes in an already large RFC.

The unit system should not be part of std, so there's no need to write an RFC for it.

newpavlov · July 24, 2018, 11:36pm

Your proposal does not include any examples where unit marker types are used as type parameters outside of the desugaring for literals. And, no, that does not count as a reason why marker types have to be used here.

AFAIK it will support heap allocations inside const fn, but will not allow return of an owned heap-backed value.

But IMO it should be included in drawbacks section (which I think is too slim right now) with other voiced issues.

No, not if we want the unit system to handle unification (see @Centril's posts) properly, e.g. to teach the type system that m/s*m is equivalent to m*m/s, and ideally to auto-derive unit types, so that you do not have to define every possible combination of basic SI types, like KgMeter2PerSecond2 (otherwise known as Joule). If we want such a system, the necessary tools will have to be part of the language (but note, not necessarily the SI units themselves).

ExpHP · July 24, 2018, 11:43pm

I hate seeing these discussions about suffixes being turned into discussions about units of measurement. There are other use cases of suffixes. And as people on both sides of the argument apparently agree, suffixes don’t do crap to solve the problems of UoM.

mcy · July 24, 2018, 11:44pm

The marker type is convenient because it gives a unique point of reference to the trait system for the desugaring, and places the literal into the module system. I think not having a trivial desugaring, like with every other operator, is a bad idea.

Huh. I do want some way of being able to make some implementations const though. @Centril, do you have anything useful to say on this? I know you wrote a const fns in the trait system RFC.

The fact that the RFC limits its scope is not something that's to be listed in the drawbacks of the proposed feature.

This feature is orthogonal to the proposal.

newpavlov · July 24, 2018, 11:45pm

The same argument can be made for a macro-based system (which was roughly described here), which would have a trivial desugaring as well.

CAD97 · July 24, 2018, 11:46pm

As I understand it, miri can use mutable allocation internally, and const can return statically borrowed heap memory, but not owned or mutable.

(as it has to be .rodata or similar, I don't know those names)

Centril · July 24, 2018, 11:56pm

cc @eddyb

It is; but there will still be interactions; and we should ensure that these work well. For example, say that we wanted to introduce the syntax 1 [m/s] at some later stage... how do we deal with this then and will we have two systems? We should always consider forward-compatibility in lang design.

mcy · July 24, 2018, 11:59pm

For sure! @kennytm already pointed out that this syntax already has a meaning, though, so we'll need to think of something else. I kind of like the angle brackets. Maybe 123some::path is sugar for 123::<some::path>. We can then make the syntax be $lit::<$some_limited_expr>. My proposal is intentionally light on the grammar, in view that most obvious things (like $lit$expr) can be forced to have meaning. Thus:

let _ = 80m;
let _ = 1.23::<m/s>;
