Expression vs statement ambiguities

The Rust Reference, under Expression statements, states:

An expression that consists of only a block expression or control flow expression, if used in a context where a statement is permitted, can omit the trailing semicolon. This can cause an ambiguity between it being parsed as a standalone statement and as a part of another expression; in this case, it is parsed as a statement. The type of ExpressionWithBlock expressions when used as statements must be the unit type.

I appreciate that, however one chooses to resolve grammatical ambiguities like this, there will always be some edge cases that cause paper-cuts. In this case, the above leads to situations like that raised yesterday in #98093:

fn test() -> bool {
    unsafe { std::str::from_utf8_unchecked(&[65]) }  == "A"
//  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//  parsed as ExpressionWithBlock statement, expecting unit type
//
//                                                   ^^^^^^ syntax error
}
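
For comparison, here is a minimal sketch of how the same comparison can be written so that it compiles today. Parenthesising the block, or binding its value with let, takes it out of statement position. The function names are illustrative and do not come from the issue.

```rust
// Parenthesising the block moves it out of statement position,
// so `==` is parsed as a binary operator on the block's value.
fn test_parenthesised() -> bool {
    (unsafe { std::str::from_utf8_unchecked(&[65]) }) == "A"
}

// Binding the block's value first has the same effect.
fn test_bound() -> bool {
    let s = unsafe { std::str::from_utf8_unchecked(&[65]) };
    s == "A"
}
```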

It occurs to me that Rust could potentially modify the above rule thusly:

An expression that consists of only a block expression or control flow expression, if used in a context where a statement is permitted, can omit the trailing semicolon. This can cause an ambiguity between it being parsed as a standalone statement and as a part of another expression; in this case, it is parsed as a statement — unless it is immediately followed by an infix operator, in which case it is parsed as that operator's left-hand operand. The type of ExpressionWithBlock expressions when used as statements must be the unit type.

Of course, this would then raise different ambiguities in the case of tokens which could be either prefix operators or infix operators: namely & (borrow/and), * (dereference/multiply) and - (negate/subtract): it would be necessary to resolve those ambiguities in favour of them being prefix operators so that currently valid code is not parsed any differently to the status quo.

I think such a change should therefore only result in some programs that are (surprisingly to some) currently rejected then being accepted.

I've searched around for previous discussions but unfortunately was only able to unearth:

Is this something that has been discussed before but I've not been able to find it? If not, is such a grammatical change viable? I would be happy to work on an RFC if it could potentially be well received (albeit I'd also be grateful if someone is willing to help steer me through the process, having not proposed one before).

It will limit our ability to make currently-infix operators prefix. Not sure this is a concern, but worth noting.

Putting that aside, I'm afraid that this will actually make the rules less understandable. I think it is easier to explain the current rules than rules under which some operators work and some do not.

Excellent point, noted.

Aye, this was my concern in the referenced issue (I should have mentioned it here too; apologies). That said, it's also why I emphasised that there will always be paper-cuts no matter how this is resolved; to me it boils down more to what is intuitively expected than to how easy the detail is to explain: that is, which will cause the fewest paper-cuts?

The big problem we have here is that -, & and * are each both prefix and infix operators.

So, for example, this compiles today:

fn foo() -> &'static i32 {
    {()}
    &4
}

(Yes, it's silly, and the help in the warning is wrong, but it compiles.)

And there are more reasonable versions where the block is a while loop or something.
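
As a rough sketch of what such a version might look like (the function names and bodies are made up for illustration), both of these compile today because the loop is parsed as a statement and the operator that follows it is parsed as a prefix operator:

```rust
// The `while` loop is a statement, and `-count` is a separate
// tail expression using unary negation.
fn negated_digit_count(mut n: u32) -> i32 {
    let mut count = 0;
    while n > 0 {
        n /= 10;
        count += 1;
    }
    -count
}

// Likewise, `*last` is a dereference here, not a multiplication
// with the `for` loop as its left-hand operand.
fn last_or(values: &[i32], fallback: &i32) -> i32 {
    let mut last = fallback;
    for v in values {
        last = v;
    }
    *last
}
```

A rule that treated every infix-capable token after a block as a binary operator would stop accepting both functions, since neither loop has a type that can be subtracted from or multiplied.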

So I think the rule might need to be "unless followed by an infix operator that is not a prefix operator", but that's getting more complicated, reducing the value.

Especially since {3} + {4} doing something different from {3} - {4} could be even more confusing than the current situation.
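
To make the asymmetry concrete, here is a small sketch of how these forms behave today, as far as I can tell; the function names are illustrative. After a unit-typed block in statement position, - already has a meaning (unary negation of what follows), whereas + there is a syntax error. In a let initializer there is no ambiguity, and both are binary.

```rust
// Compiles today: `{}` is a unit-typed block statement, and `- {4}`
// is a separate tail expression (unary minus applied to a block).
#[allow(unused_braces)]
fn minus_after_block() -> i32 {
    {}
    - {4}
}

// In expression position there is no ambiguity, so both operators
// are binary here.
#[allow(unused_braces)]
fn in_let_position() -> (i32, i32) {
    let sum = {3} + {4};
    let difference = {3} - {4};
    (sum, difference)
}
```

Under the amended rule, + after a block in statement position would become binary while - would keep its unary reading, which is the inconsistency described above.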

Yeah, that's exactly what I was proposing.

That's a fair point.

Hmm, come to think of it, it's not just "is not a prefix operator", but the broader "is not a token that can start an expression".

fn foo() -> i32 {
    {()}
    <i32>::MIN
}

fn bar() -> impl Fn() -> i32 {
    {()}
    || 0
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a3afd4e8492ffaf6f2231bd682736063

So what's left? =, >, + -- anything else?


I wonder what this would look like in the other direction. What if we made a breaking change over an edition? Is there a different simple rule that might be better?

Strawman: The "implicit semicolon" is added if the } is followed by a literal or an ident, otherwise it expects an infix expression.
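
A toy model of that strawman next to the status quo, purely as a sketch: the Next and Role types and the handful of token kinds are hypothetical, cover only a few cases, and have nothing to do with how rustc's parser is actually structured.

```rust
/// A few hypothetical kinds of token that may follow the closing `}`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Next {
    Literal, // e.g. `4`
    Ident,   // e.g. `x`
    Plus,    // `+`: infix only
    Minus,   // `-`: prefix or infix
    EqEq,    // `==`: infix only
}

/// How the block before the token ends up being treated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Statement,   // the block's statement ends; the token starts a new expression
    LeftOperand, // the block is the left operand of an infix operator
    SyntaxError, // the block is a statement, but the token cannot start an expression
}

/// Status quo: the block is always a statement, so whatever follows
/// must be able to start a new expression.
fn status_quo(next: Next) -> Role {
    match next {
        Next::Plus | Next::EqEq => Role::SyntaxError,
        Next::Literal | Next::Ident | Next::Minus => Role::Statement,
    }
}

/// Strawman: an implicit `;` only before a literal or an identifier;
/// anything else is expected to continue an infix expression.
fn strawman(next: Next) -> Role {
    match next {
        Next::Literal | Next::Ident => Role::Statement,
        Next::Plus | Next::Minus | Next::EqEq => Role::LeftOperand,
    }
}
```

The Minus case is where the two disagree about code that compiles today, which is why this would have to be an edition change.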

Dunno if that's even better, let alone whether it'd be worth the churn of changing it, but it might be interesting to ponder.

Disclaimer: while I'm on wg-grammar, this is very much my own opinion, and not that of the group.

The number of edge cases around "it's an expression, but if it's ()-valued it's a statement" makes me wonder whether it was worth it to elide semicolons after }.

It's definitely very convenient! And it definitely does a lot to make Rust feel more familiar, and is a useful middle ground between full ; omission and requiring ;s after everything including blocks.

But this is the one place where types impact the grammar of Rust, and that makes it feel off when I'm trying to do formal-ish things with the grammar. In practice the parse tree doesn't depend on the types lining up, because of rules like the above: if both parses would be valid, the statement parse is preferred and a () type constraint comes with it. Except this isn't even the full rule, because what about

let _ = {
    // …
    if /* … */ {
        // …
    } else {
        // …
    }
};

Here the if-else is an expression, not a statement, even though it would be syntactically valid for it to be a statement.
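
A runnable sketch of the same contrast, with made-up function names: in the first function the if-else is the block's tail expression and carries a value; in the second, the if is followed by another token and is therefore a unit-typed statement.

```rust
// The `if`/`else` is the block's tail expression, so it may have a
// non-unit type and its value becomes the value of the block.
fn parity(n: i32) -> &'static str {
    let label = {
        let is_even = n % 2 == 0;
        if is_even {
            "even"
        } else {
            "odd"
        }
    };
    label
}

// Here the `if` is followed by another token (`count`), so it is a
// statement and its type must be `()`.
fn count_if_even(n: i32) -> u32 {
    let mut count = 0;
    if n % 2 == 0 {
        count += 1;
    }
    count
}
```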

So in truth the rule is more like "it's an expression unless it's in statement position and followed by another token that is not the end of the containing block or a ;". Except that if the following token is ; and the type is (), some tooling will treat it as a statement followed by a redundant semicolon rather than as an expression statement.

Removing this behavior and requiring ; is silly. It's just churn for the purpose of churn. But I definitely wouldn't make the follow set behavior for deciding expression or statement any more complicated than it already is.
