Pre-RFC: Custom Literals via Traits

I propose the following solution to the i__ input type problem:

// marker
trait LiteralInput {}
impl LiteralInput for i32 {}
//impl LiteralInput for u32, i64, ..., f32, f64, &'static str, ...
// possibly add big integer literals in some form in the future

trait Literal<T: LiteralInput> {
    type Output;
    /*const*/ fn apply_literal(input: T) -> Self::Output;
}

struct Meter(i32);
enum m {}
impl Literal<i32> for m {
    type Output = Meter;
    /*const*/ fn apply_literal(input: i32) -> Meter {
        Meter(input)
    }
}

With this, a custom literal can use any of the available input types, and it's possible to support more input types (e.g. i256 or [u8; N]) in the future if needed. If the custom literal requires i32, the compiler will try to interpret the preceding integer literal as i32 (and fail if it's not a valid i32 literal).

It will also be possible to implement the trait for multiple input types, e.g. impl Literal<i32> for m and impl Literal<i64> for m. This is useful if multiple input types make sense for this literal (for example, it may even produce a generic Meter<T: Num>). The compiler will pick a suitable implementation for invocation. This can work in a very similar way to how operator traits like Add<Rhs> or Index<Idx> work now.

If there are multiple implementations and the supplied literal can be interpreted as any of them, it's not clear which one the compiler chooses. That's not very good, but it's the same situation we already have with normal integer literals: you may not know what type is selected for vec![1] if the context doesn't place additional constraints. The type inference should do its magic. If the context requires a Meter<i64>, the impl Literal<i64> for m will be chosen over impl Literal<i32> for m.
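To make the multiple-impl case concrete, here is a minimal sketch that compiles on stable Rust. It assumes a generic Meter<T> (the post only hints at one), drops the const qualifier, and picks the impl by hand with fully qualified syntax, standing in for the choice the proposal would leave to inference. The helper names meters_i32 and meters_i64 are made up for illustration.

```rust
// Marker trait for the types a literal may be handed over as.
trait LiteralInput {}
impl LiteralInput for i32 {}
impl LiteralInput for i64 {}

trait Literal<T: LiteralInput> {
    type Output;
    fn apply_literal(input: T) -> Self::Output;
}

// Assumption: a generic quantity type, so each input type keeps its width.
#[derive(Debug, PartialEq)]
struct Meter<T>(T);

// The suffix itself is an uninhabited type that is only used as a name.
#[allow(non_camel_case_types)]
enum m {}

impl Literal<i32> for m {
    type Output = Meter<i32>;
    fn apply_literal(input: i32) -> Meter<i32> {
        Meter(input)
    }
}

impl Literal<i64> for m {
    type Output = Meter<i64>;
    fn apply_literal(input: i64) -> Meter<i64> {
        Meter(input)
    }
}

// What `5m` might desugar to when the context asks for Meter<i32>.
fn meters_i32() -> Meter<i32> {
    <m as Literal<i32>>::apply_literal(5)
}

// What `5_000_000_000m` might desugar to when the context asks for Meter<i64>.
fn meters_i64() -> Meter<i64> {
    <m as Literal<i64>>::apply_literal(5_000_000_000)
}
```

Both impls coexist the same way Add<i32> and Add<i64> can coexist on one type; only the selection step is something the proposal would have to define.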

If this one is linted, there should also be lints for:

  1. impl IntLit for ** where the name ** starts with any of [a-fA-F], since 0x7fff_EiB (intended for exbibyte) will be interpreted as 0x7fffE with suffix iB.

    • This could be problematic as it covers a lot of units of measure (A, cd, Bq, C, F, au, day, dB, eV, ft, fl_oz, prefixes exa-/deca-/deci-/centi-/femto-/atto-)
    • We could avoid this if there were separate HexIntLit/OctIntLit/BinIntLit traits, though that means u256 will need to implement 4 traits (should be fine?).
    • Speaking of which, should we also lint prefix x and o because of 0x6? I suggest no since 0_x6 is unambiguously parsed as 0 with suffix x6.
  2. impl *Lit for ** where the name ** starts with _. 1234____s is equivalent to 1234s so there shouldn't be any leading underscores.

Or maybe these don't need to be linted at declaration site since you could unambiguously call it like 10::<e100> or 0x7fff::<EiB> (I'm not sure if using turbofish for this syntax is a good idea though).
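The hex ambiguity is easy to reproduce with a toy splitter. This is only a sketch of the greedy ("parse digits as far as they go") rule described above, not the real lexer; split_hex_literal is a hypothetical helper name.

```rust
/// Splits a hex literal token into (digits without underscores, suffix),
/// consuming hex digits and underscores greedily before the suffix starts.
/// Returns None if the token does not start with `0x`.
fn split_hex_literal(token: &str) -> Option<(String, &str)> {
    let body = token.strip_prefix("0x")?;
    // The digit run ends at the first character that is neither a hex digit
    // nor an underscore; everything after that is the suffix.
    let end = body
        .find(|c: char| !(c.is_ascii_hexdigit() || c == '_'))
        .unwrap_or(body.len());
    let digits: String = body[..end].chars().filter(|&c| c != '_').collect();
    Some((digits, &body[end..]))
}
```

With this rule, split_hex_literal("0x7fff_EiB") yields the digits "7fffE" and the suffix "iB", which is exactly the surprise the lint would warn about.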

In C++, milliseconds is a type: std::chrono::duration - cppreference.com (As is milli, but that's a different discussion.)

That solves all these problems, and as a bonus allows more efficient uses in places that don't want or need 1ns resolution (like sleeping a thread, which can sleep neither for 1ns nor for 136 years).

(Of course, it needs a bunch of stuff that Rust doesn't have yet, both to avoid a ton of manual grunt work creating all these types and to avoid the need for .into() all over the place to actually use them.)

Yeah... this is something of a mess. I'm actually in favor of HexIntLit/OctIntLit/BinIntLit, all of which have IntLit as a supertrait. This lets us lint a little bit less aggressively. core certainly won't have to put up with it, since the implementations for the primitive literals will get macro-generated anyways. Unit literals really shouldn't be given by hex, or octal.

10::[e100]? I think further abusing :: for parsing disambiguation is fine. Even in the presence of a disambiguation syntax, we should still lint, because people will be really confused when 10e100 doesn't do what they want and it takes them a second to remember about scientific notation literals.

Other than the type-checked SI units proposal, which currently seems to have a bunch of other open questions, couldn't these use cases be supported by either:

A (proc?) macro:

  • time!(2s + 3m)
  • complex!(3 + 4i)
  • quaternion!(2i + 3j + 4k)

A normal const fn:

  • Regex::from_static_str("foo(.+)")
  • Utf16String::from_static_str("Hello, World!")

Or, a const fn + literal types:

  • BigInt::from_literal(192346712347913279231461927356)
  • HalfFloat::from_literal(1.23)
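As a rough illustration of the const fn route, here is a sketch with a made-up Millis type (none of these names come from the thread). The constructors play the role of the s and m suffixes, and the whole expression is evaluated at compile time.

```rust
/// Hypothetical duration type, stored as whole milliseconds.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Millis(u64);

impl Millis {
    /// Stands in for a `2s`-style literal.
    const fn from_secs(secs: u64) -> Millis {
        Millis(secs * 1_000)
    }

    /// Stands in for a `3m`-style literal.
    const fn from_mins(mins: u64) -> Millis {
        Millis(mins * 60_000)
    }

    /// Const-friendly addition; the `+` operator goes through the Add trait,
    /// which is not usable in const contexts for user types on stable.
    const fn plus(self, other: Millis) -> Millis {
        Millis(self.0 + other.0)
    }
}

// Roughly what `time!(2s + 3m)` would be expected to produce.
const TIMEOUT: Millis = Millis::from_secs(2).plus(Millis::from_mins(3));
```

It works today, at the cost of being noticeably wordier than a suffix would be.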

Sure, this is the implicit "or we could just, not do this" in any RFC. Also, things like time!(2s + 3m) are hilariously hard to implement, because + can't be used as a separator currently.

I have to ask, what about 1e10e10?

Parses as 1e10::[e10] and desugars to <e10 as FloatLit>::float_lit(1e10);. I think I pointed this out elsewhere.

or should it be e10e10::int_lit(1)?

The RFC is very clear on this. We always parse a literal as far as it will go before parsing the suffix, because doing otherwise will confuse users. See @kennytm's replies.
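For anyone who wants to poke at the rule, here is a simplified sketch of "parse the literal as far as it will go, then treat the rest as the suffix" for decimal literals. It is an approximation for illustration (ASCII only, no radix prefixes) and split_decimal_literal is a hypothetical name, not anything in rustc.

```rust
/// Splits a decimal literal token into (numeric part, suffix), greedily
/// consuming an integer part, an optional fraction, and an optional exponent.
fn split_decimal_literal(token: &str) -> (&str, &str) {
    let bytes = token.as_bytes();
    let is_digit = |b: u8| b.is_ascii_digit() || b == b'_';
    let mut i = 0;

    // Integer part: digits and underscores.
    while i < bytes.len() && is_digit(bytes[i]) {
        i += 1;
    }

    // Optional fraction: a '.' only counts if a digit follows it.
    if i + 1 < bytes.len() && bytes[i] == b'.' && bytes[i + 1].is_ascii_digit() {
        i += 1;
        while i < bytes.len() && is_digit(bytes[i]) {
            i += 1;
        }
    }

    // Optional exponent: 'e' or 'E', an optional sign, then at least one digit.
    // If no digit follows, the 'e' is left for the suffix.
    if i < bytes.len() && (bytes[i] == b'e' || bytes[i] == b'E') {
        let mut j = i + 1;
        if j < bytes.len() && (bytes[j] == b'+' || bytes[j] == b'-') {
            j += 1;
        }
        if j < bytes.len() && bytes[j].is_ascii_digit() {
            while j < bytes.len() && is_digit(bytes[j]) {
                j += 1;
            }
            i = j;
        }
    }

    token.split_at(i)
}
```

Under this rule 1e10e10 splits into the literal 1e10 and the suffix e10, and 10e100 is consumed entirely as a literal with an empty suffix.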

So we have...

and if we have:

1.23 < m / s >

(or some other two set of tokens for disambiguation) we desugar to:

<<m as Div<s>>::Output as IntLit>::int_lit(1.23)

I guess that makes sense... Basically, we interpret < m / s > as a type. However, translating / to Div this way is a bit weird.
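The type-level reading of < m / s > can be tried out on stable today. In this sketch the literal trait takes a plain f64 because i__ and f__ do not exist yet, and MetersPerSecond, Velocity and speed are names invented for the example.

```rust
use std::ops::Div;

// Unit suffixes as zero-sized types.
#[allow(non_camel_case_types)]
struct m;
#[allow(non_camel_case_types)]
struct s;

/// Hypothetical marker type naming the quotient unit.
struct MetersPerSecond;

// `m / s` at the type level: only the associated Output type matters here.
impl Div<s> for m {
    type Output = MetersPerSecond;
    fn div(self, _rhs: s) -> MetersPerSecond {
        MetersPerSecond
    }
}

// Stand-in for the proposed literal trait, taking f64 instead of f__.
trait FloatLit {
    type Output;
    fn float_lit(lit: f64) -> Self::Output;
}

#[derive(Debug, PartialEq)]
struct Velocity(f64);

impl FloatLit for MetersPerSecond {
    type Output = Velocity;
    fn float_lit(lit: f64) -> Velocity {
        Velocity(lit)
    }
}

// What `1.23 < m / s >` might desugar to.
fn speed() -> Velocity {
    <<m as Div<s>>::Output as FloatLit>::float_lit(1.23)
}
```

This shows the weird part as well: Div is being used purely as a type-level function, and its div method is never called.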

Wild speculation: Another mechanism could be to interpret m / s as a const expression and so you get something like Unit<{ m / s }>. However, then you may get into coherence problems wrt. the type constructor Unit and so it might need to be #[fundamental] so that you can write something like:

impl IntLit for Unit<{m / s}> {
    type Output = Qty<Quot<Meter, Second>>;
    const fn int_lit(lit: i__) -> Self::Output {
        ...
    }
}

Such a Unit type is probably not definable in Rust.

struct Unit<const suffix: ?????> { ... }

BTW,

/ has higher precedence than ^, this would be desugared to

(<m as IntLit>::int_lit(1.0).div(s)).bitxor(2);

you'll need an exponential operator for this to be ordered properly (which cannot be ** since x**y already means x*(*y)).
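Both precedence points can be checked with plain integers on stable Rust. The function names below are only for the example.

```rust
/// `/` binds tighter than `^`, so this is (8 / 2) ^ 3, and `^` is XOR.
fn div_then_xor() -> i32 {
    8 / 2 ^ 3
}

/// The grouping one would need for "divide by a power", still with XOR.
fn xor_then_div() -> i32 {
    8 / (2 ^ 3)
}

/// `x ** y` is not a power operator: it parses as x * (*y), a multiplication
/// by a dereferenced reference.
fn star_star(x: i32, y: &i32) -> i32 {
    x ** y
}
```

So 8 / 2 ^ 3 is 4 XOR 3, which is 7, while 8 / (2 ^ 3) is 8 / 1, which is 8, and star_star(2, &3) is just 2 * 3.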

Oh dear... Hmm... the only thing I can think of is:

struct Unit<const suffix: TypeId> { ... }

and so you translate to Unit<{ make_type_id( m / s ) }> but at this stage you might be better off translating to:

Unit<typeof(m / s)>

or directly via:

<typeof(m / s) as IntLit>::int_lit(1.23)

instead.

EDIT: actually, this last snippet is not so bad?

I haven't been able to review this proposal, or the custom suffixes proposal, to the depth I would like. I apologize for anything already covered or discussed.

In my biased opinion (I'm the author of uom), units of measure are one of the more important uses of custom suffixes and the RFC shouldn't discount that. Using abbreviations works for simple units when required to be a valid Rust identifier. What about more complex abbreviations? m/s? m/s^2? m/s²? Å? Fall back to using full descriptions written as valid identifiers? What about supporting quantities with no explicitly defined units? Can existing units be combined (kg m/s^2)?

My other concern is around handling of literals. All of the examples except the first three would require code running at compile time. E.g. in the BigInt example, when the number can't be represented by the largest built-in integer type, how do you convert it into an instance of BigInt at compile time? If the conversion is delayed until run time, is 12345..._BigInt really that big of an improvement over "12345...".bigint() or BigInt::new("12345..."), which are already supported today?

You could ignore the problem and just include the built-in literal type in the trait definition:

pub trait CustomLiteral<T> {
    type Output;
    const fn custom_literal(literal: T) -> Self::Output;
}

impl CustomLiteral<i32> for meter {
    type Output = Meter;
    const fn custom_literal(literal: i32) -> Meter { Meter::new(literal) }
}

Ah, excellent! I was going to try to track you down!

The discussion so far seems to have settled on picking some hitherto unused syntax that simple literals desugar to. The current syntax strawman is 42i32 -> 42::[i32]. It is unclear what the grammar inside ::[] (which one might be tempted to call a "turbolit", as in "turbofish") is going to be, but the idea is that expressions like 123::[m/s] will be desugared into something like m::int_lit(123).lit_div::<s>(). I think that this is far more than I want to try to figure out in an initial RFC, so I'm punting on any literals that aren't Rust identifiers. (Also, as nice as having a symbol for angstroms sounds, I'm not going to be in favor of allowing non-ASCII outside of str/char lits.)

Right, this is a problem my proposal is blocked on. In order to happen, we need to sort out my literal types proposal (which I linked in the OP), which adds the unsized types i__ and f__. In principle it might be ok to have &'static i__ around at runtime, but it'd need to be cast to i32 or whatever first (which would be mediated by a compiler-provided shim, leaving the representation of such pointers unspecified).

Your alternate proposal is interesting, but has a few problems. First, we'd probably want this:

// core::marker
pub unsafe trait FromLiteral {}
unsafe impl FromLiteral for i32 {}
unsafe impl FromLiteral for &'static str {}

to have as a bound for T in your trait. There's also the problem that you need to implement it all at once for all of the number types if you want reasonable behavior, and there's no good way to enforce this. We also can't enforce that you need e.g. T = i64 if you want T = f64. My FloatLit has IntLit as a supertrait.
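A sketch of the difference, using i64 and f64 as stand-ins for the literal types and dropping const so it compiles on stable. The types meter, second, Meter and Second are placeholders; only the trait shapes follow the posts above.

```rust
// Marker bound: only types the compiler could hand over as a literal.
unsafe trait FromLiteral {}
unsafe impl FromLiteral for i64 {}
unsafe impl FromLiteral for f64 {}

// Generic-parameter design: each input type is a separate, optional impl.
trait CustomLiteral<T: FromLiteral> {
    type Output;
    fn custom_literal(literal: T) -> Self::Output;
}

// Supertrait design: a float-capable suffix must also accept integers.
trait IntLit {
    type Output;
    fn int_lit(lit: i64) -> Self::Output;
}
trait FloatLit: IntLit {
    fn float_lit(lit: f64) -> <Self as IntLit>::Output;
}

#[derive(Debug, PartialEq)]
struct Meter(f64);
#[allow(non_camel_case_types)]
enum meter {}

impl IntLit for meter {
    type Output = Meter;
    fn int_lit(lit: i64) -> Meter {
        Meter(lit as f64)
    }
}
// Removing the IntLit impl above would make this impl a compile error.
impl FloatLit for meter {
    fn float_lit(lit: f64) -> Meter {
        Meter(lit)
    }
}

#[derive(Debug, PartialEq)]
struct Second(f64);
#[allow(non_camel_case_types)]
enum second {}

// Nothing stops a float-only suffix here, so an integer literal with this
// suffix would have no impl to use.
impl CustomLiteral<f64> for second {
    type Output = Second;
    fn custom_literal(literal: f64) -> Second {
        Second(literal)
    }
}
```

The marker bound keeps arbitrary types out of T, but only the supertrait version ties the float impl to the integer one.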

There isn't. The rationale for adding this is the same rationale as for any operator overloading: ergonomics and readability. (Of course, like all operators, it's a massive readability footgun if overused, but that ship sailed with the rest of the ops traits.)

After mulling it over for a few days, I'm going to postpone this proposal for now. There are just too many unresolved questions, and I'm also kind of waiting for @varkor to come up with an opinion on exactly what a prefix is in the type system, as well as figuring out the story with arbitrary-size compile-time literals.

That's not going to be a major issue when the Allow non-ASCII identifiers (#2457) RFC finally gets implemented. (An earlier version is already implemented in nightly.) Many Unicode scientific symbols will become generally available at that point.

Well, that's certainly not my problem. I disagree with that proposal (and, in fact, was unaware of it!), but if Rust gets non-ASCII idents, this proposal will get them as a corollary.

use .. as ..; can be used to rename them to something that doesn't require finger-hurting chording, in any case!

Nothing requires use of non-ASCII characters. However, since the limited ASCII subset of the Roman alphabet is not the native alphabet for most people in the world, at some point Rust needs to be inclusive and not relegate everyone else to secondary-citizen status.

It should be possible to write variable and type names in one's own native language, even when it has umlauts or cedillas or accents on vowels (and that's just for people native to western Europe). Non-Roman alphabets depart even more from ASCII, and are used by many more people worldwide than those who use variants of the Roman alphabet.

Note that Rust already supports Unicode identifiers as a nightly feature. So it's more than a proposal; it's already implemented and (somewhat) in use. However the final implementation is expected to be somewhat different, which is being discussed in that linked RFC.

Are you aware that the Rust characters {, }, [, ] require chording on essentially all non-English keyboards in Western Europe (e.g., Danish, Swedish, Finnish, Portuguese, Spanish, French, German, Italian), and that many of those keyboards also require chording of |, @, #, and ^? In some cases the ` character itself is unavailable even with chording, so it must be copied or entered by other arcane keyboard magic.

In its choice of often-used glyphs Rust has already mandated chording for people with non-English keyboards. What's a tiny bit of similar inconvenience for someone who wishes to write µ or Å on an English keyboard (both of which characters are easily entered from most non-English Western European keyboards)?
