Reserved syntax should be lexically valid

Edition 2021 added reserved syntax, making <simple-ident>#<simple-ident> and <simple-ident><string-literal> single tokens. However, in tests, I noted that use of these tokens is lexically ill-formed, except for the defined forms.

In lccc (LightningCreations/lccc on GitHub, the Lightning Creations Compiler Frontend for various languages), reserved syntax is planned to be used for various internal purposes, behind feature gates. In particular, two keywords with a k# prefix are defined so that the asm!()/global_asm!() macros, as well as a compiler-specific macro used for implementing some stdlib functions, can expand to actual syntax. However, I am unsure whether it is possible to gate syntax behind features.

Thus, I propose that any prefixed string/character literal or identifier be considered lexically valid, but that it be an error to use one anywhere (other than as input to a macro tt matcher or a function-like proc macro) unless it is considered valid by the stable Rust language, or the relevant feature is enabled.

Alternatively, if this is too permissive, then perhaps a subset of prefixes may be considered lexically valid, such as prefixes starting with two consecutive underscores (a prefix that I would consider "reserved for the implementation", but that's a different discussion really). However, as the only valid use (in stable) would be to consume and discard the token, I do not believe this to be a problem (though proc-macros make a notable exception, since it's possible to specifically match these tokens if they are made valid).

I'm personally of the mind that everything should be lexically valid (that is, TokenStream::from_str should never fail), so in theory I agree with you. I also can't think of a possible use of a prefixed identifier that shouldn't be able to be passed to macro_rules! valid_lex { ($($tt:tt)*) => () } to just drop it on the floor, unused.
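As a rough sketch of what that "drop it on the floor" macro already accepts on stable today: any sequence of well-formed tokens, including keywords, raw identifiers, and a `#` that is separated from its neighbours by whitespace. The function name `swallowed_everything` is made up for this illustration; the reserved `k#foo` form (written without spaces) is deliberately left out, because it does not lex in edition 2021.

```rust
// Accepts any sequence of token trees and expands to nothing.
macro_rules! valid_lex {
    ($($tt:tt)*) => {};
}

// Item position: keywords, a lifetime, literals, a raw identifier, and
// `k # foo` written with whitespace (three separate tokens).
valid_lex!(match fn 'a 1.0e10 "text" r#type k # foo);

// Hypothetical helper, only here so the example has something to assert on.
fn swallowed_everything() -> bool {
    // Statement position: the tokens are consumed and discarded.
    valid_lex!(anything # that + lexes => is fine);
    true
}

fn main() {
    assert!(swallowed_everything());
}
```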

But that said, if it's primarily for use by lccc, you're not using the rustc lexer anyway (I don't think?) so it doesn't really matter what's lexically allowed by the language for well-formed code, you can just allow __lccc#asm or whatever in your implementation.

In fact, I believe rustc already does use some tokens internally that don't have printable equivalents.

Well, in general, I'd like this to be available for any implementation, but also, it comes into the idea of extensions, and that features should be the mechanism to allow extensions in alternative compilers and that, with purely stable code, no compiler should translate an ill-formed program (that mandates a diagnostic). The issue, of course, is that the lexer runs before features can be checked, so you'd then run into this program being valid on lccc and invalid on others, despite no #[feature] attribute being present.

macro_rules! valid_lex { ($($tt:tt)*) => () }
valid_lex!(__lccc#asm);
fn foo(){}

I did think of a potential problem with string literals (namely, what escapes are valid and interpreted in the literal), but I'd be fine with those being excluded as, at least currently, the only use I have applies to identifier tokens/keywords explicitly.

Isn't the purpose of reserving something like foo"string literal" exactly to leave the lexing of such literals unspecified for now? The current reservation allows something similar to e.g. the r"foo bar" raw strings (that we already have) to be added in the future without additional breaking changes. The problem is: it depends on the prefix how a string is lexed! E.g. r"foo \" is a complete legal raw string, whereas "foo \" would be an unterminated string literal. Only by always erroring when lexing such literals can we keep all the flexibility we might want.
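To make the raw-string point concrete, here is a small sketch (the function name `raw_and_escaped` is invented for the example). With the r prefix the backslash is an ordinary character and the following quote ends the literal; without the prefix, the same text has to be written with an escaped backslash.

```rust
// Returns the same five-character string written two ways.
fn raw_and_escaped() -> (&'static str, &'static str) {
    // Raw string: no escape processing, so the literal ends at the second quote.
    let raw = r"foo \";
    // Ordinary string: the backslash must itself be escaped.
    let escaped = "foo \\";
    (raw, escaped)
}

fn main() {
    let (raw, escaped) = raw_and_escaped();
    assert_eq!(raw, escaped);
    // 'f', 'o', 'o', ' ', '\'
    assert_eq!(raw.len(), 5);
}
```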

I'm not sure how much of this applies to prefix#identifier as well, since I'm not sure if we really want to parse identifiers in a special way. On the other hand, as far as I know, the current implementation actually just disallows anything of the form foo#...; even foo# with nothing following it, not just foo#identifier or foo#123.

Yeah, I recognized that after the fact (as I indicated in my above reply). And for string literals, implementations can just use suffixes, which are lexically (but not syntactically) valid for string literals. (I also can't think of very many extensions that would use prefixed string literals and would be sufficiently internal that it wouldn't be reasonable to ask T-lang on a case-by-case basis)
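A minimal sketch of the suffix route, assuming a token-counting helper (`count_tts` and the suffix `my_suffix` are names picked for this example only). A string literal with an arbitrary suffix is accepted as a single token when passed to a macro, even though it would be rejected if it were parsed as an expression.

```rust
// Counts the token trees it is given, one `tt` at a time.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

fn main() {
    // Suffix attached to the literal: one token.
    assert_eq!(count_tts!("text"my_suffix), 1);
    // Separated by whitespace: a string literal followed by an identifier.
    assert_eq!(count_tts!("text" my_suffix), 2);
}
```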

The problem is that the restriction is the way it is explicitly to break matching it in macros. Everywhere except macros, it was already a non-breaking change to give a meaning to k#foo .

And anything that affects tokenization requires making things lexically invalid. For example, how many tokens is f#"hello {s + "s"}!"?

That shouldn't change with this, I wouldn't think. The only valid thing would be to match it with a tt pattern and discard it (the macro_rules! valid_lex example above). It could probably still match, but hard error on, ident fragments (which, IIRC, still match keywords), since it's still an identifier.

Yeah, I recognized the issue with string literals and, as above, would be fine with this only applying to identifiers.

Note: When I say the only valid thing would be to match it with a tt pattern, I'm excluding explicitly matching the token without the relevant feature. This keeps the following code ill-formed:

macro_rules! match_lccc_asm{
    {__lccc#__asm ( $($tt:tt)*)} => {} // this line is ill-formed: unrecognized prefix for identifier `__lccc#`
}

match_lccc_asm!{asm!("foo")}; // Cannot be used to detect lccc without first flipping `#[feature(lccc_asm_syntax)]`

I'm going to revise the proposal.

The revised proposal will be for reserved identifier syntax only, so <id>#<id>, as I doubt these will change lexing behaviour. Reserved string literals will not be made valid under the revised proposal, and implementations can use literal suffixes in the rare case they wish to provide a feature-gated extension there.

The remainder of the proposal will remain the same: the only valid use of these tokens will be to consume them with a tt matcher or a proc_macro. It is possible this could be extended to an ident matcher, as reserved keywords do match them (Rust Playground), but that will not be part of this proposal.
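For reference, the keyword behaviour mentioned here can be checked with something like the following sketch (`matches_ident` is a name made up for the example, not taken from the playground link).

```rust
// First arm matches a single identifier-like token; anything else falls through.
macro_rules! matches_ident {
    ($i:ident) => { true };
    ($($tt:tt)*) => { false };
}

fn main() {
    // An ordinary identifier matches the ident fragment.
    assert!(matches_ident!(foo));
    // So does a keyword.
    assert!(matches_ident!(match));
    // Other token sequences fall through to the tt arm.
    assert!(!matches_ident!(1 + 2));
}
```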

The tokens will not be valid as literal token matchers in macro_rules (see my "detect lccc as macro_rules" example, which will remain ill-formed without the relevant feature), under this proposal. Are there other issues that may arise in the future under this revised proposal, that anyone can see?

I can easily imagine circumstances in which id#id syntax could affect lexing.

Consider r# , which changes a keyword to an ident.

What if we need something else similar? Or a construct like id#id<tokens> that affects lexing of the tokens?

Keywords are identifiers too according to the lexer. I believe the lexer currently handles r# by setting a raw flag. It could be changed to store the r or store the entire r#keyword and split it when trying to parse a keyword. The lexer already outputs literals as a verbatim string of the literal as written that needs to be parsed to a value by the parser anyway.
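As a small illustration of the r# behaviour being discussed (surface behaviour only; it says nothing about how the lexer stores the flag internally): a keyword written with the r# prefix can be used wherever an ordinary identifier can.

```rust
// `match` is a keyword, but `r#match` is usable as a function name.
fn r#match(x: i32) -> i32 {
    x + 1
}

fn main() {
    // `type` is also a keyword; `r#type` is an ordinary variable name here.
    let r#type = r#match(1);
    assert_eq!(r#type, 2);
}
```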

Any reason why what you want to spell as __lccc#__asm can’t be made an unplottable token (i.e. one that does not exist in surface syntax)? Why even modify the common meaning of the language at all?

As far as I know, keywords and identifiers are lexed identically. In fact (as the playground link I showed demonstrates), it goes even further than that, as keywords match the ident macro matcher (which these tokens would not). Likewise, k# should not affect this at all (this proposal is even more restrictive than that definition, because the tokens wouldn't match ident matchers, only tt matchers, and only to either discard them or forward them into another macro that does likewise).

This is more likely to be problematic, although if it is arbitrary tokens, I'm not sure that wouldn't already cause problems (and if it's restricted to, say, literals, id<string-literal> and id#<string-literal> remains invalid, and only floating-point literals wouldn't be valid for id#id<literal>).

I prefer to avoid pure magic wherever possible (for example, I insist that I will never put #[lang = "owned_box"] on lccc's alloc::boxed::Box, to the point where there are 3 lang items that exist because of Box that are not Box), using feature-gated extensions instead wherever possible.