Pre-RFC: Custom suffixes for integer and float literals

Since the parallel is drawn to C++, I’d like to draw attention to the fact that, to avoid conflicts between standard and user-defined literal suffixes, the C++ Standard mandates that user-defined literal suffixes begin with _.

Since similar issues could occur in Rust, it might be worth applying the same rule:

use std::time::duration_literals;
let dt1: Duration = 1_s + 200_ms; // or just 1.2s, see further
let dt2: Duration = 10_us;
let dt3: Duration = 10.2_µs; // µs and us are equivalent, proposal can support both

use simple_units::literals::*;
let distance1: Meter = 1_m;
let distance2: Meter = 5_nm;
let accel1: MeterPerSecond2 = 2.5_g;
let accel2: MeterPerSecond2 = 3[m/s^2]; // see further on how square brackets work

So, the difference really is 1_s vs 1.s().
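
For comparison, the `1.s()` side of that difference can already be approximated in stable Rust with an extension trait. This is only a sketch: the trait name `DurationExt` and its methods are made up for illustration and are not part of `std` or of the proposal.

```rust
use std::time::Duration;

/// Hypothetical extension trait; the name and methods are illustrative only.
trait DurationExt {
    fn s(self) -> Duration;
    fn ms(self) -> Duration;
}

impl DurationExt for u64 {
    fn s(self) -> Duration {
        Duration::from_secs(self)
    }
    fn ms(self) -> Duration {
        Duration::from_millis(self)
    }
}

/// Method-call spelling of the `1_s + 200_ms` line from the example.
fn one_point_two_seconds() -> Duration {
    // The literals carry an explicit `u64` suffix so method lookup is unambiguous.
    1_u64.s() + 200_u64.ms()
}
```

The explicit `_u64` is a conservative choice for the sketch; it keeps the receiver type fixed regardless of how many integer types implement the trait.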


Disclaimer: the following is a shot in the dark; it is unclear to me whether the proposed syntax could conceivably cause ambiguities. I’d expect not, since methods are available on primitives and data-members are available on expressions, but I have not proved it.

The noisy () could potentially be reduced by considering the addition of Properties to the language. Specifically, by making it so that 1.s evaluates to 1.s() when s is a property-able method.

This would let us rewrite the example above as:

use std::time::TimeProperties;

let dt1: Duration = 1.s + 200.ms; // or just 1.2s, see further
let dt2: Duration = 10.us;
let dt3: Duration = 10.2.µs; // µs and us are equivalent, proposal can support both

use si::DistanceProperties;

let distance1: Meter = 1.m;
let distance2: Meter = 5.nm;
let accel1: MeterPerSecond2 = 2.5.g;
let accel2: MeterPerSecond2 = 3.m / (1.s * 1.s); // not the greatest, I'll admit.

This does not support compound units, but it should be familiar to most users:

  • It uses method calls on primitives, which are widely available, either through monkey-patching or method extensions.
  • It uses properties, which are widely available.
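
The last line of the example, `3.m / (1.s * 1.s)`, depends on operators being defined between unit types. A minimal sketch of that part, assuming hypothetical newtypes (`Meter`, `Second`, `Second2`, `MeterPerSecond2`) and a hypothetical `UnitExt` trait, and using ordinary method calls because properties do not exist in Rust:

```rust
use std::ops::{Div, Mul};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Meter(f64);
#[derive(Debug, Clone, Copy, PartialEq)]
struct Second(f64);
#[derive(Debug, Clone, Copy, PartialEq)]
struct Second2(f64);
#[derive(Debug, Clone, Copy, PartialEq)]
struct MeterPerSecond2(f64);

// s * s = s^2
impl Mul for Second {
    type Output = Second2;
    fn mul(self, rhs: Second) -> Second2 {
        Second2(self.0 * rhs.0)
    }
}

// m / s^2 = acceleration
impl Div<Second2> for Meter {
    type Output = MeterPerSecond2;
    fn div(self, rhs: Second2) -> MeterPerSecond2 {
        MeterPerSecond2(self.0 / rhs.0)
    }
}

/// Hypothetical extension trait standing in for the "properties".
trait UnitExt {
    fn m(self) -> Meter;
    fn s(self) -> Second;
}

impl UnitExt for f64 {
    fn m(self) -> Meter {
        Meter(self)
    }
    fn s(self) -> Second {
        Second(self)
    }
}

/// Method-call version of `3.m / (1.s * 1.s)`.
fn accel() -> MeterPerSecond2 {
    3.0_f64.m() / (1.0_f64.s() * 1.0_f64.s())
}
```

Every unit combination needs its own operator impl in this style, which is presumably part of why the compound-unit case is called "not the greatest" above.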

I would like to voice support for the leading underscore _ in any such custom suffix to a literal. The SI standard mandates that a space separate a quantity from the corresponding SI unit. The _ provides an equivalent to the mandated space while still lexically binding the suffix to the preceding numeric quantity. I find 5_ns completely acceptable, whereas for me 5ns is a typo rather than an SI-based measure.

Disclosure: I’ve personally edited over 15k pages of IEC standards, so I am more sensitive to this issue than most people (though not more so than the central office editors who work for ISO, IEC, ECMA, CCITT, etc).

Edits: For those in the States who find references to multinational standards organizations not so compelling, I’ve also edited over 3k pages of IEEE and ISA standards, including the first edition of IEEE 802.11, the WiFi standard. Their central office editors impose the same requirements.


If we are content with a macro-based solution, then post-fix macros would make this slightly nicer perhaps:

let a = 3.si![m/s^2];

@iliekturtles That syntax appears to expand naturally to annotating compile-time and run-time expressions with units.

I like the _ prefix as well, but would like to note that it can’t serve to distinguish builtin/user suffixes, since 42_u16 is already valid Rust.

C++ can do it this way because their literal grouping character is the apostrophe, if I’m not mistaken.


Of course it can't be used to make that distinction. I always write long u32, u64, and u128 literals that way, as a _-separated suffix. (E.g., 0x_dead_beef_u32.) To me the advantage of requiring the _ prefix on non-standard suffixes is that it calls reader attention to the suffix. That's particularly true when the suffix starts with a confusable character such as lower-case l.
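
Both spellings mentioned in this exchange are accepted by the compiler today, which is easy to check. The function name below is arbitrary:

```rust
/// Returns two literals written with an underscore before the builtin suffix.
fn literal_forms() -> (u16, u32) {
    // An underscore may separate the digits from a builtin suffix.
    let small = 42_u16;
    // Underscores may also group hex digits, including right after `0x`.
    let wide = 0x_dead_beef_u32;
    (small, wide)
}
```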

Honestly, I think that the C++ convention is silly. AFAICT, it’s predicated on two things:

  • C++'s disastrous modularization story (e.g. #include order affecting compilation), which means the standard has a legitimate need to reserve the best suffix real-estate for itself.
  • The idea that the leading underscore lets readers determine if a literal is standard or not. In Rust, standard literals are all names of primitive types, so at the point that you are using custom literals, this is not a readability win and just Hungarian noise. It’s considered good style (AFAIK) to write 0xmylonglonglonghex_u64 anyways.

Rust should treat integer conversion literals and custom literals the same. If we went with my trait-and-opaque-type proposal, we’d add e.g. the following to one of the numeric modules in core:

impl IntLit for i32 {
    type Output = i32;
    const fn int_lit(x: i__) -> i32 {
        x as i32 // explicit cast, though technically
                 // unnecessary according to my proposal
                 // for i__ and friends
    }
}

(Of course, this exists for much the same reason that Add and such are implemented for i32 and friends. It’s only a formality to make the trait system happy, since i32 comes with + as a builtin and not by virtue of any trait.)
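
To make the shape of this concrete, here is a rough approximation that compiles on stable Rust. It is a sketch under assumptions, not the proposal itself: `u128` stands in for the opaque `i__` type, the `const` qualifier is dropped, and the suffix type `s` is a hypothetical example.

```rust
use std::time::Duration;

/// Stand-in for the proposed trait. `u128` substitutes for the opaque
/// literal type `i__`, which does not exist in Rust today.
trait IntLit {
    type Output;
    fn int_lit(x: u128) -> Self::Output;
}

impl IntLit for i32 {
    type Output = i32;
    fn int_lit(x: u128) -> i32 {
        x as i32
    }
}

/// Hypothetical suffix type for whole seconds.
#[allow(non_camel_case_types)]
struct s;

impl IntLit for s {
    type Output = Duration;
    fn int_lit(x: u128) -> Duration {
        Duration::from_secs(x as u64)
    }
}

/// What `5s` and `7i32` might desugar to under this sketch.
fn desugared() -> (Duration, i32) {
    (<s as IntLit>::int_lit(5), <i32 as IntLit>::int_lit(7))
}
```

Because the suffix is just a type name here, importing or shadowing it would follow the ordinary rules for the type namespace.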

Since the primitive types are already introduced into all scopes (though not as part of the core prelude), it is natural to expect that you could shadow their literals with your own (though I think clippy should throw a fit over this), just as you can shadow all of the primitive types. C++'s distinction is, after all, at the grammar level, since you’re allowed to write e.g. operator "" k, which will get you a warning about how this function can’t be called.

Not only does it put that syntax on equal footing with all other literals, as opposed to enshrining primitive types as more special than they practically need to be, but it lets you do hilarious things like pretend you’re writing pre-1.0 Rust:

type uint = usize;
let k = 0uint;

The strength of Rust's imports is that you can explicitly see all the names you import with a use statement, unless you use *. So I would expect use std::time::duration_literals; to only bring duration_literals into scope. (It's unclear if it actually does that in your proposal, though.)

Unfortunately, we can't just write use std::time::duration_literals::{s, ms};, as that could mean we import normal identifiers, and we actually want to import a suffix identifier. Maybe something like use suffix std::time::duration_literals::{s, ms}; can work.

But it works correctly if the "suffix" identifiers start with an initial underscore (_), as described in prior posts and which provides SI-unit suffixes that visually comply with the SI standards and the expectations of the millions of people worldwide who are used to working in the SI system. So just give up on concatenating letters to digit strings and instead use the leading underscore to indicate that it's a suffix.

SI standards do not apply to programming languages. As you said, the standard mandates a space separator. In my opinion, 5_ns and 5ns are equally close to 5 ns (although 5_ns is slightly more readable). But I don’t think this is too important. If it’s decided that 5ns is allowed, there is no reason to disallow 5_ns, so 5_ns will likely be allowed either way.

However, using the leading underscore as a distinguishing method between normal identifiers and suffixes can be problematic. First, you can have normal identifiers that start with an underscore. It’s completely allowed and doesn’t trigger compiler warnings. So imagine we imported a _s. Is it a function or a suffix?

We can just determine that by usage. So _s() will only work if _s is not a suffix, and 5_s will only work if _s is a suffix. That’s actually not too bad because it’s similar to how it currently works with imported types and methods, for example (you can only use them in an appropriate context). There is also a restriction that you can’t import a _s function and a _s suffix in the same scope but it’s also OK.

Next, imagine we imported _s and __s suffixes and want to use them. Does that mean that we can only write 5_s and 5__s? What if the user writes 5___s? 5____i32 works, so why would custom suffixes be more restrictive? If the underscore is not a part of the identifier, you can write 5___s and be sure that you used the s suffix, not something else. On the other hand, that would mean that suffixes are not allowed to start with an underscore.
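
The two facts this argument rests on can be checked against today's compiler: an identifier may start with an underscore, and a builtin suffix may be preceded by any number of underscores. A small sketch (the function names are arbitrary):

```rust
/// A perfectly ordinary function whose name starts with an underscore.
fn _s() -> u32 {
    1
}

fn underscores() -> (u32, i32) {
    // Any number of underscores may precede a builtin suffix.
    let five = 5____i32;
    (_s(), five)
}
```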

Unfortunately, leading-underscore identifiers are already allowed, so this produces churn. Plus, separating namespaces by charset (e.g. Haskell, where types must be capitalized and variables (term and type) are lower case) is unprecedented in Rust, and maybe now is not a great time to start. (Ok, we sort of do this with macros, but the ! is a sigil, not part of the identifier. panic!() lexes as "panic", "!", "(", ")".)

If we want any reasonable import story for literals, they'll probably need to live in either the type or free-function-and-constant namespace (the OP proposal probably should put them in the free-function namespace, while mine dumps them into the type namespace). Creating yet another namespace just for a comparatively small feature is more baroque than we really need.

Indeed; I don't think that there is any hope to be able to pull what are ultimately typesetting standards into a plaintext format! Source code is not a document for publication, in any traditional sense.


Of course typesetting standards don’t apply directly. However the expectations of professionals who use SI units in their daily work should be part of the considerations. Do recall that code is read many more times than it is written, so intelligibility to the reader should be a significant factor.

Just directly concatenating a unit with a literal value, using as the rationale that C++ did so, is not compelling; if it were, Rust would have a lot less divergence from C++. Or perhaps I’ve misunderstood the intent of Rust to be a language usable by people of science who are not professional programmers.

I'm not sure that this is an issue? If a project really believes in adhering to the standard, they can always use 123_unit form and enforce it by linter or code review? It is already the case that this form is the predominant one, and Rust does not attempt to enforce any sort of style, beyond warnings that can be turned off.

I also would like to be able to write let z = 1 + 2i; for complex numbers. Complex numbers are defined in the num crate, which is not privileged like core, alloc, or std.

This is not my rationale, and, in fact, C++ requires the leading underscore for anything defined with operator "" outside of std. My rationale is that current Rust behavior does not require the underscore for existing builtin literals, and this behavior should remain uniform across all literals, discarding the notion of "builtin literals" entirely.


I actually just thought of a really, really annoying problem, which deserves mention.

How does 10e100 lex? Clearly, it should just lex as a single float literal, since the alternative is vanishingly rare in comparison, but… what does the following code do?

enum e100 {}
impl FloatLit for e100 { .. }

I think we should just lint against this, since in principle you could write

<e100 as FloatLit>::float_lit(10);

and even more technically 10e0e100 would produce the expected result. We should still warn though, since this is just going to confuse people.
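
For reference, this is how things stand today: a type named `e100` is legal, and `10e100` still lexes as one float literal. The sketch below only shows current behavior; it does not involve the hypothetical `FloatLit` trait.

```rust
// A type may legally be named `e100` today (lints silenced for the demo).
#[allow(non_camel_case_types, dead_code)]
enum e100 {}

fn lexes_as_float() -> f64 {
    // `10e100` is a single float literal: 10 × 10^100.
    10e100
}
```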


Agreed about the linting. I’m also concerned about using ASCII or Unicode confusables as the initial character of suffixes that are not underscore-separated from the preceding Rust literal. My earlier example of that was lower-case l. Bad actors will enter the Rust ecosystem; I’d like to be proactive against them by linting against direct juxtaposition of confusables.

Aside from specific cases with deep traditional use, such as complex numbers (suffix i) and quaternions (suffixes i, j, and k), which latter are also used to represent rotations in 3-space, I’d like to lint against any suffix that is not separated from the preceding literal string by an underscore (_). Omitting the underscore saves typing one character, at the expense of readability.

Honestly, this all sounds like more pain in the neck than it gains “ergonomics”. How about we just don’t add more magic literal suffix syntax? I would hate to read code that relied on them apart from the (obvious) ones that we have today.

Even with such simple things as complex numbers and quaternions, conventions vary, so we couldn’t be sure if 2 - 3j means 2 - 3i the complex number (because i and j are used interchangeably, e.g. physicists and electrical engineers tend to prefer j, while mathematicians usually use i), or it is the quaternion 2 + 0i - 3j + 0k.

I still think reading ergonomics is an overwhelmingly more important question than writing, and Complex { re: 2, im: -3 } and Meter::new(1.0) or Length::new::<Meter>(1.0) seem so much easier to digest at first glance than random suffixes.

Basically, this boils down to the same argument as the question of why we don’t program in plain English but in English-like, more structured artificial languages. It’s nice to have some degree of similarity, but it has to stop at the point where it becomes more ambiguous than understandable. And I think the proliferation of arbitrary suffixes crosses that line.


Constructing things the long way is fine if you only do it occasionally, but from personal, recent experience: the more you have to do it, the more tempted you are to just start taking shortcuts. Conveniences that encourage stricter type safety can be valuable, even if they sacrifice some explicitness.

As an aside: succinctness and explicitness are both aspects of ergonomics. Meter::new(1.0) semantically adds nothing over (say) 1.0{m}, provided the reader is aware of what that syntax means, but it does take longer to visually parse and consumes more space, making more complex expressions harder to read.


Thank you everyone for your response and critique!

I’ll wait a bit more for this discussion to unfold further and at the end of this week will try to write “take 2” for this proposal. Currently my thoughts are:

  • We need this feature: Duration::from_secs(2) + Duration::from_millis(200) is annoying both to write and to read.
  • We need clear imports for custom literals.
  • But we probably don’t want to create a separate namespace for suffixes.
  • The feature should be extensible to string literal prefixes/suffixes.
  • Custom literals should be usable in match statements and ideally with runtime values.
  • Regarding 1s vs 1_s, I think the feature should support both, maybe with a recommendation to use 1_s in guidelines (excluding exceptions like complex numbers and quaternions).
  • For compound units (e.g. “m/s^2”), it looks like the best approach will be to use suffixes which accept additional arguments, something like 2.3_si[m/s^2] (I am not sure if we need an explicit ! here, see further). Here you’ll import a custom si suffix which will be able to support various units.

So here are my rough thoughts: I think the solution is to use macros for custom literals. Reasons are:

  • they will not conflict with variable names and naming conventions
  • macro output can be used in match statements without any problems
  • procmacros will allow more flexibility in future

In other words, 1s will be desugared into s!(1) and 2.3_si[m/s^2] into si!(2.3, "m/s^2"). Because in this design custom literals can only be defined by a macro, we can omit the !. If a user wants to use those suffixes on runtime values, they can write s!(var) or, with postfix macros, var.s!(). Macros which define custom suffixes should be marked with a #[custom_suffix] attribute. We will need a proper macro import system, though.
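
The desugared forms can already be written by hand with `macro_rules!`, which gives a feel for what the sugar would expand to. This is a sketch only: the macro bodies are assumptions, and the suffix syntax and marker attribute themselves do not exist.

```rust
/// Hypothetical suffix macro; `1s` would desugar to `s!(1)` in this design.
macro_rules! s {
    ($value:expr) => {
        ::std::time::Duration::from_secs($value)
    };
}

/// Hypothetical suffix macro for milliseconds.
macro_rules! ms {
    ($value:expr) => {
        ::std::time::Duration::from_millis($value)
    };
}

/// `2s + 200ms` after desugaring.
fn literal_sum() -> ::std::time::Duration {
    s!(2) + ms!(200)
}

/// The same macro applied to a runtime value, as in `s!(var)`.
fn from_runtime(var: u64) -> ::std::time::Duration {
    s!(var)
}
```

Taking `expr` here is the simplest choice; whether the real macros should instead accept only literal tokens is one of the open questions below.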

Unresolved questions:

  • Details on how macros will look internally. Should they accept expr or something like int_literal/float_literal?
  • Do we want a generic way to define a set of suffixes without code duplication? (e.g. u1-u256)

Excuse me, but to me the long form makes more sense, and it's the literal suffix that takes more time and effort to parse visually.

In addition, the question of ambiguity is a real problem even from the compiler's point of view.

The second part of this sentence is true but does not imply the first. See Time units by newpavlov · Pull Request #52556 · rust-lang/rust · GitHub for a much more lightweight proposal (already mentioned in this thread) that makes your example look pretty reasonable already: 2*S + 200*MS.

So, you'll need a much stronger motivation to argue that going from there to 2s + 200ms is worth all this machinery. (I am not saying I am against it, I am just saying your argument is not making a fair comparison.)
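
For readers who have not seen the constant-based form, `2*S + 200*MS` can be sketched in stable Rust as follows. The definitions of `S` and `MS` are assumptions made for this example; the linked pull request may define them differently.

```rust
use std::time::Duration;

// Assumed definitions; the linked pull request may define these differently.
const S: Duration = Duration::from_secs(1);
const MS: Duration = Duration::from_millis(1);

fn total() -> Duration {
    // Relies on the standard library's `impl Mul<Duration> for u32`.
    2 * S + 200 * MS
}
```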
