Expression vs statement ambiguities

eggyal · June 15, 2022, 9:10am

The Rust Reference, under Expression statements, states:

An expression that consists of only a block expression or control flow expression, if used in a context where a statement is permitted, can omit the trailing semicolon. This can cause an ambiguity between it being parsed as a standalone statement and as a part of another expression; in this case, it is parsed as a statement. The type of ExpressionWithBlock expressions when used as statements must be the unit type.

I appreciate that, however one chooses to resolve grammatical ambiguities like this, there will always be some edge cases that cause paper-cuts. In this case, the above leads to situations like that raised yesterday in #98093:

fn test() -> bool {
    unsafe { std::str::from_utf8_unchecked(&[65]) }  == "A"
//  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//  parsed as ExpressionWithBlock statement, expecting unit type
//
//                                                   ^^^^^^ syntax error
}
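For reference, a minimal sketch of two ways to write this comparison so that it compiles under the current rule: wrap the block in parentheses so it can no longer be taken as a statement, or bind its value with let first. The function names test_parens and test_let are made up for illustration and do not come from the issue.

```rust
// Workaround 1: parenthesise the block, so the statement can no longer
// begin with a block expression.
fn test_parens() -> bool {
    (unsafe { std::str::from_utf8_unchecked(&[65]) }) == "A"
}

// Workaround 2: bind the block's value first, then compare.
fn test_let() -> bool {
    let s = unsafe { std::str::from_utf8_unchecked(&[65]) };
    s == "A"
}
```

Both functions return true, since byte 65 is ASCII "A".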

It occurs to me that Rust could potentially modify the above rule thusly:

An expression that consists of only a block expression or control flow expression, if used in a context where a statement is permitted, can omit the trailing semicolon. This can cause an ambiguity between it being parsed as a standalone statement and as a part of another expression; in this case, it is parsed as a statement — unless it is immediately followed by an infix operator, in which case it is parsed as that operator's left-hand operand. The type of ExpressionWithBlock expressions when used as statements must be the unit type.

Of course, this would then raise different ambiguities in the case of tokens which could be either prefix operators or infix operators: namely & (borrow/and), * (dereference/multiply) and - (negate/subtract): it would be necessary to resolve those ambiguities in favour of them being prefix operators so that currently valid code is not parsed any differently to the status quo.

I think such a change should therefore only cause some programs that are currently rejected (surprisingly to some) to be accepted.

I've searched around for previous discussions but unfortunately was only able to unearth:

a very early rust-dev thread that's only really relevant for historical interest; and
an RLIO topic from last year that discussed the purpose of having semicolons at all.

Is this something that has been discussed before but I've not been able to find it? If not, is such a grammatical change viable? I would be happy to work on an RFC if it could potentially be well received (albeit I'd also be grateful if someone is willing to help steer me through the process, having not proposed one before).

chrefr · June 15, 2022, 10:40am

It will limit our ability to make currently-infix operators prefix. Not sure this is a concern, but worth noting.

Putting that aside, I'm afraid that this will actually make the rules less understandable. I think the current rules are easier to explain than rules under which some operators work and some do not.

eggyal · June 15, 2022, 10:46am

Excellent point, noted.

Aye, this was my concern in the referenced issue (I should have mentioned it here too; apologies). That said, it's also why I emphasised that there will always be paper-cuts no matter how this is resolved; to me it boils down more to which behaviour is intuitively expected than to how easy the detail is to explain: that is, which will cause the fewest paper-cuts?

scottmcm · June 15, 2022, 8:31pm

The big problem we have here is that -, & and * are each both prefix and infix.

So, for example, this compiles today:

fn foo() -> &'static i32 {
    {()}
    &4
}

(Yes, it's silly, and the help in the warning is wrong, but it compiles.)

And there are more reasonable versions where the block is a while loop or something.
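A sketch of such a more reasonable version (the function negated_sum is hypothetical, not from the thread): today the while loop below is parsed as a complete statement and -total as the function's tail expression. If a block followed by - were instead treated as a left-hand operand, this would be read as (while ... { ... }) - total and would stop compiling.

```rust
fn negated_sum(values: &[i32]) -> i32 {
    let mut total = 0;
    let mut i = 0;
    // Parsed as a standalone statement of type ().
    while i < values.len() {
        total += values[i];
        i += 1;
    }
    // Parsed as a new expression (unary minus), not as `while ... - total`.
    -total
}
```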

So I think the rule might need to be "unless followed by an infix operator that is not a prefix operator", but that's getting more complicated, reducing the value.

Especially since {3} + {4} doing something different from {3} - {4} could be even more confusing than the current situation.
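For contrast, a small sketch of the same two expressions in a position where no statement can begin (the right-hand side of a let). There the blocks are ordinary operands and + and - behave alike; the asymmetry described above would arise only in statement position. The function names sum and difference are illustrative.

```rust
fn sum() -> i32 {
    // Not in statement position, so `{3}` is simply the left operand of `+`.
    let s = {3} + {4};
    s
}

fn difference() -> i32 {
    // Likewise here: `-` is parsed as binary subtraction.
    let d = {3} - {4};
    d
}
```

The compiler may warn about unnecessary braces, but both functions compile.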

eggyal · June 15, 2022, 8:33pm

Yeah, that's exactly what I was proposing.

That's a fair point.

scottmcm · June 15, 2022, 9:08pm

Hmm, come to think of it, it's not just "is not a prefix operator", but the broader "is not a token that can start an expression".

fn foo() -> i32 {
    {()}
    <i32>::MIN
}

fn bar() -> impl Fn() -> i32 {
    {()}
    || 0
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a3afd4e8492ffaf6f2231bd682736063
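A sketch of two more tokens that appear to fall in the same category, based on how current rustc treats a block statement followed by ( or [: they start a new tuple or array expression rather than a call or index on the block. The function names pair and array are made up for illustration.

```rust
fn pair() -> (i32, i32) {
    {()}
    (1, 2) // a tuple expression, not a call on the preceding block
}

fn array() -> [i32; 2] {
    {()}
    [1, 2] // an array expression, not an index into the preceding block
}
```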

So what's left? =, >, + -- anything else?

I wonder what this would look like in the other direction. What if we made a breaking change over an edition? Is there a different simple rule that might be better?

Strawman: The "implicit semicolon" is added if the } is followed by a literal or an ident, otherwise it expects an infix expression.

Dunno if that's even better, let alone whether it'd be worth the churn of changing it, but it might be interesting to ponder.
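To compare the rules discussed so far, here is a deliberately coarse toy model. It is only a sketch: the Next and Parse enums and the three functions are hypothetical names invented for this illustration, the token classes are far cruder than a real lexer's, and nothing here reflects how rustc is implemented.

```rust
/// Coarse class of the token that follows a block's closing `}`
/// in statement position (hypothetical classification).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Next {
    Literal,          // e.g. `4`, `"A"`
    Ident,            // e.g. `x`
    InfixOnly,        // e.g. `==`, `+`, `>`
    InfixOrExprStart, // e.g. `-`, `&`, `*`, `<`, `||`
}

#[derive(Debug, PartialEq)]
enum Parse {
    Statement,   // the block ends a statement ("implicit semicolon")
    LeftOperand, // the block is the left operand of what follows
}

/// Current rule: the block is always taken as a statement.
fn current_rule(_next: Next) -> Parse {
    Parse::Statement
}

/// Proposed rule as refined in the thread: an operand only if the next
/// token is an infix operator that cannot also start an expression.
fn proposed_rule(next: Next) -> Parse {
    match next {
        Next::InfixOnly => Parse::LeftOperand,
        _ => Parse::Statement,
    }
}

/// Strawman: a statement only before a literal or an identifier.
fn strawman_rule(next: Next) -> Parse {
    match next {
        Next::Literal | Next::Ident => Parse::Statement,
        _ => Parse::LeftOperand,
    }
}
```

In this model the strawman differs from the proposal exactly on the InfixOrExprStart class, which is why it would change the meaning of code such as a block followed by &4 and would need an edition.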

CAD97 · June 15, 2022, 11:16pm

Disclaimer: while I'm on wg-grammar, this is very much my own opinion, and not that of the group.

The number of edge cases around "it's an expression, but if it's ()-valued it's a statement" makes me wonder whether it was worth it to elide semicolons after }.

It's definitely very convenient! And it definitely does a lot to make Rust feel more familiar, and is a useful middle ground between full ; omission and requiring ;s after everything including blocks.

But this is the one place where types impact the grammar of Rust, and that makes it feel off when I'm trying to do formal-ish things with the grammar. In practice the parse tree doesn't depend on the types lining up because of rules like the above, since if both would be valid a statement is preferred and a () type constraint is preferred — except this isn't even the full rule, because what about

let _ = {
    // …
    if /* … */ {
        // …
    } else {
        // …
    }
};

Here the if-else is an expression, not a statement, even though it would be syntactically valid for it to be a statement.
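A concrete, compilable version of this shape, as a sketch (the function classify and its contents are invented for illustration): the if-else is the last thing in the inner block, so it is the block's value and may have a non-unit type.

```rust
fn classify(n: i32) -> &'static str {
    let label = {
        let is_even = n % 2 == 0;
        // Tail position of the block: an expression of type &'static str,
        // not a statement that would have to be ().
        if is_even {
            "even"
        } else {
            "odd"
        }
    };
    label
}
```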

So in truth it's more like "it's an expression unless it's in statement position and followed by another token which is neither a ; nor the end of the containing block". EXCEPT, if that following token is ; and the type is (), some tooling will treat it as a statement followed by a redundant semicolon rather than as an expression statement.

Removing this behavior and requiring ; is silly. It's just churn for the purpose of churn. But I definitely wouldn't make the follow set behavior for deciding expression or statement any more complicated than it already is.

system · September 13, 2022, 11:17pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
