pre-RFC: a new symbol mangling scheme

If it's simple enough (should be), mrustc might be good enough to transpile it.

I think the reference implementation repo should also contain a native C version for demangling.

This duplicates the implementation increasing the maintenance cost. What value does this add?

AFAICT, the main thing this would allow is for C tools, like gdb, to be compiled with Rust support for platforms that Rust cannot target. Since Rust cannot compile to these platforms, there are no Rust programs to debug there, so gdb with Rust support would be useless, and gdb would just be better off completely disabling Rust support in this case.

Many projects, like GDB, will not want to have a Rust dependency, which is understandable: supporting multiple languages in a build system is a significant cost. For these kinds of projects I’d like them to have something they can simply copy.

GDB can also work non-natively with core files and with remote targets. However, I expect that most of the time your GDB host is going to be a common platform that Rust does support, and the oddball target is the one that's less likely to have Rust support.

The perf tool also has Rust symbol demangling support. We’ll want to update that, and something tells me it will be ~infinitely easier to write some C rather than trying to inject Rust into the Linux kernel build system.

Big +1 from me for a new mangling scheme.

I’ve found that not being able to back-map mangled symbols to a specific type instantiation is a huge problem in trying to understand backtraces, particularly in async code. I end up with dozens of frames of generic Future combinators but no concrete types to work out what’s actually going on.
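For example (legacy mangling, with a made-up hash): core::ptr::drop_in_place is generic over the type being dropped, but the mangled name carries only an opaque hash, so there's no way to get back to the concrete instantiation:

```
_ZN4core3ptr13drop_in_place17h0123456789abcdefE
    demangles to
core::ptr::drop_in_place::h0123456789abcdef
```

Dropping *what*? The symbol doesn't say.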

A few notes:

The current scheme’s similarity to C++ mangling has been pretty awkward in practice. While feeding a Rust mangled symbol into a C++ demangler kinda sorta gives a readable result, it isn’t great; a Rust-specific demangler gives a much better result. The trouble is that if a tool runs the symbol through a C++ demangler first, it becomes useless for Rust demangling. This means that any tool which might deal with a Rust symbol has to be changed to:

  1. distinguish the symbol as being Rust or C++ (or other)
  2. feed it through the right demangler accordingly

Right now, the distinguisher is just looking for a trailing hash, but it would be nicer to have something less ad-hoc. I think the ideal would be one of:

  1. A stock C++ demangler does a perfect job on Rust symbols, so no separate Rust demangler is needed
  2. A stock C++ demangler does nothing to Rust symbols, so the “demangled” symbols can also be fed through a Rust demangler
  3. Symbols can be accurately distinguished as Rust symbols from the symbol alone (i.e., without relying on debug info, which may not be available)

I think 1. is along the lines of what @eddyb is suggesting. I like it for its generality, but I’m concerned that it relies too much on C++ demangler consistency. In practice at FB we’ve found that different C++ demanglers often fail to produce the same demangling for complex symbols (or simply fail); I’d be worried they wouldn’t handle Rust mangling particularly well. It also potentially constrains Rust if we start adding things to the language which aren’t well represented in C++. And conversely, C++ is adding things too; what if the mangling between Rust and C++ becomes ambiguous?

It sounds like @michaelwoerister is aiming for 2 & 3: the _R prefix will make a C++ demangler ignore the Rust symbol, and make Rust symbols clearly distinguishable. It’s not ideal for the case of tooling with no Rust awareness, but I think it’s better as soon as Rust awareness is added.
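To illustrate, here's a sketch of the dispatch a tool could do once the _R prefix lands. demangle_rust and demangle_cxx are hypothetical stand-ins for real demanglers (e.g. the rustc-demangle crate, or a wrapper around __cxa_demangle):

```rust
fn demangle_any(sym: &str) -> String {
    if sym.starts_with("_R") {
        demangle_rust(sym) // unambiguously Rust (ideal 3 above)
    } else if sym.starts_with("_Z") {
        demangle_cxx(sym) // Itanium C++; leaves `_R` symbols alone (ideal 2 above)
    } else {
        sym.to_string() // unmangled or unknown scheme: pass through
    }
}

// Stubs so the sketch compiles; real tools would call actual demanglers.
fn demangle_rust(sym: &str) -> String { sym.to_string() }
fn demangle_cxx(sym: &str) -> String { sym.to_string() }

fn main() {
    println!("{}", demangle_any("_RNvC7mycrate4func")); // made-up symbol
}
```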


Throughout, the document uses the term “generic arguments”, which I find a bit confusing. I read it as meaning function arguments whose types are generic, but I think the actual meaning is more clearly expressed as “generic type parameters” or just “type parameters”. I’m not sure whether the former is commonly accepted terminology in Rust and I’m the odd one out, but it confused me a couple of times, especially in the discussion of align_of(), which has no arguments.


In “Closures and Closure Environments”, would the reference section go into more detail about how the closure numbering would actually work? I’m assuming it would be defined in terms of a traversal of the AST. The two cases that occurred to me are:

  1. Closures can be nested inside others, so presumably a depth-first preorder traversal would mean the outer closure would get N, and the inner N+1 (etc)
  2. If macros are present then they could arbitrarily generate or reorder closures with respect to the literal source

In “Methods (both inherent and trait)”, it’s not clear to me what the value is of having the instantiating crate as part of the path. There’s already been discussion on this point, but I’m wondering specifically whether this has any interactions with LTO (i.e., could it inhibit LTO, or could a given symbol be considered instantiated by multiple crates?).

I think this was an attempt to include both type parameters and const generics parameters (and maybe even type constructor parameters if we get HKT someday) under a single term.

imo "generic parameters" would be marginally better than "generic arguments" for this purpose, but we don't really have a nice, unambiguous term for "the things in angle brackets that we have to monomorphize over rather than pass values for at runtime".

Thanks a lot for your feedback, @jsgf!

  • Your description of the approach is accurate: the RFC suggests a mangling that is not compatible with Itanium C++ mangling, and a Rust symbol can be identified as Rust (and not anything else) by looking at its first two characters.
  • Regarding generics: I tried to stick to the distinction between parameters and arguments I know from my introductory CS courses: There are formal parameters (as in function or type parameters) and what you provide as the actual value for a parameter is then called the argument. I might not have been entirely consistent in my application of the terms. As @Ixrec rightly notes, I didn’t use the term type parameters because with const generics the arguments could also be constants. I’ll try to make this clearer in the actual RFC.
  • Regarding closures and other things where a disambiguation index will be needed: Yes, the detailed description would try to describe this precisely in terms of the (macro-expanded) AST. The example with nested closures is interesting. The inner closure would probably be addressed as foo::c$0::c$0 (see the sketch after this list). And macros can indeed lead to indices that are hard to infer from looking at the unexpanded source code.
  • Per-crate instantiations of generic functions interact with LTO in that LTO’ed code would still contain multiple copies of the function. The mangling scheme does not define whether things will be instantiated per crate or not, though; it just supports it because that’s the compiler’s current strategy. We might want to experiment with ODR linkage again, since the last time we did so was at the beginning of 2016, iirc, and LLVM might do better at optimizing under these circumstances now than it did back then. However, this is a discussion that’s not directly related to how symbols are mangled.
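To make the numbering concrete, here is a hypothetical example using that c$N notation (the exact scheme would be pinned down in the RFC); indices are relative to the enclosing item, assigned in a depth-first pre-order walk of the macro-expanded AST:

```rust
fn foo() -> i32 {
    let outer = || {          // foo::c$0        (first closure in foo)
        let inner = || 42;    // foo::c$0::c$0   (first closure inside c$0)
        inner()
    };
    let second = || 0;        // foo::c$1        (second closure directly in foo)
    outer() + second()
}

fn main() {
    assert_eq!(foo(), 42);
}
```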

At some point, deep in template land, Itanium mangled names become so long that the existing compression method is utterly ineffective, and human readability is already lost.

Therefore, I 40% jokingly propose that version code 0 (symbols starting with _R0) mean the following:

  • A Rust mangled symbol, starting at the version code (dropping the _R only), compressed with gzip and encoded with base32.
    The compiler will perform trial compression of all symbols greater than (TODO: 96?) characters and MUST use the compressed version if it is (TODO) smaller after b32 encoding.
    It is a decoding error to place a v0 compressed symbol inside a v0 compressed symbol. Display the raw symbol. (This avoids quines causing infinite loops.)

“You want compression? Here’s your compression.”
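Taking the joke at face value for a moment, a sketch of the trial compression (using the flate2 crate and the base32 crate's v0.4 API; the 96 and the size test stand in for the TODOs above):

```rust
use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// `body` is the symbol with the leading `_R` stripped, version code kept.
fn encode_v0(body: &str) -> String {
    if body.len() > 96 {
        let mut enc = GzEncoder::new(Vec::new(), Compression::best());
        enc.write_all(body.as_bytes()).unwrap();
        let gz = enc.finish().unwrap();
        // Unpadded RFC 4648 base32 (A-Z, 2-7) keeps the result symbol-safe.
        let b32 = base32::encode(base32::Alphabet::RFC4648 { padding: false }, &gz);
        if b32.len() + 1 < body.len() {
            // `+ 1` pays for the extra `0` version code.
            return format!("_R0{}", b32);
        }
    }
    format!("_R{}", body)
}

fn main() {
    let long = format!("v0{}", "NtNtNtC4some4long6symbol".repeat(8));
    println!("{}", encode_v0(&long));
}
```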


Yes, it's pretty unlikely IMO that gdb or binutils would accept a Rust dependency. At the same time, I think it's not important to worry about this. Someone, probably me, will write a demangler for use here.

One other thing I'd like to point out is that c++filt already has hacks in place for Rust -- it's not the case that the current scheme is sufficiently C++-like to just work. In fact these hacks don't go far enough (there's also a patch thread somewhere that goes deeper into this).

If one were to take that seriously, then I'd propose using zstd with a standard dictionary computed over some corpus of existing Rust symbols. That would get much better compression than gzip/zlib. And it would be better to use some larger base; at least use the full available character set.

This avoids quines causing infinite loops

Already a huge risk with compression - you can set up exponential explosions with small inputs.

I think it would be wise to make the standard Rust mangling scheme(s), whatever they end up being, intentionally incompatible with the standard C++ mangling scheme(s) used on the same target. This is because, if you have C++ and Rust components linked into the same program, the linker should not resolve a call to extern void foo(int, int) as targeting pub fn foo(i32, i32) -> () unless the author of the latter specifically declared it pub extern "C++".

I don’t have time to give any kind of detailed response but I just wanted to say, to paraphrase Spock: I have been, and always shall be, a friend (to Rust defining and owning its own principled name mangling scheme).

Good luck! And I look forward eagerly to any resolutions on this front 🙂

I'd be interested in how much better this would be than the Itanium compression scheme. The basic principle of Itanium and gzip is the same: build up a "dictionary" of things that have already been encountered and emit the "keys" instead of the data for subsequent occurrences. But I guess the more advanced derivatives of LZ77 (like zstd and even DEFLATE) have some more tricks up their sleeve that could provide better compression.

EDIT: It looks like zstd with its prepopulated dictionary would indeed be a great fit here. Who knows, maybe we'd really like to have this as a fallback for really long symbol names at some point :)
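To make the shared principle concrete, here is a minimal toy sketch of a dictionary coder over a type AST. The grammar (R, T…E, B<n>_) is invented for illustration and is not the scheme proposed in the RFC:

```rust
use std::collections::HashMap;

enum Ty {
    Named(String),
    Ref(Box<Ty>),
    Tuple(Vec<Ty>),
}

// Canonical, uncompressed serialization of a subtree; used only as the
// dictionary key so that structurally equal subtrees collide.
fn key(ty: &Ty) -> String {
    match ty {
        Ty::Named(n) => format!("{}{}", n.len(), n), // length-prefixed ident
        Ty::Ref(inner) => format!("R{}", key(inner)),
        Ty::Tuple(elems) => format!("T{}E", elems.iter().map(key).collect::<String>()),
    }
}

// Compressed emission: a subtree seen before becomes a backreference
// `B<index>_`; otherwise it is emitted in full and then recorded, so
// inner subtrees get lower indices than the subtrees containing them.
fn mangle(ty: &Ty, out: &mut String, dict: &mut HashMap<String, usize>) {
    let k = key(ty);
    if let Some(&idx) = dict.get(&k) {
        out.push_str(&format!("B{}_", idx));
        return;
    }
    match ty {
        Ty::Named(n) => out.push_str(&format!("{}{}", n.len(), n)),
        Ty::Ref(inner) => {
            out.push('R');
            mangle(inner, out, dict);
        }
        Ty::Tuple(elems) => {
            out.push('T');
            for e in elems {
                mangle(e, out, dict);
            }
            out.push('E');
        }
    }
    let idx = dict.len();
    dict.insert(k, idx);
}

fn main() {
    // (&str, &str): the second `&str` compresses to a backreference.
    let ty = Ty::Tuple(vec![
        Ty::Ref(Box::new(Ty::Named("str".into()))),
        Ty::Ref(Box::new(Ty::Named("str".into()))),
    ]);
    let mut out = String::new();
    mangle(&ty, &mut out, &mut HashMap::new());
    assert_eq!(out, "TR3strB1_E");
    println!("{}", out);
}
```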

The C++ ad-hoc substitution scheme is a nightmare. It’s a ton of complexity to figure out what a substitution actually maps to, and it’s not even clear how substitutions are to be interpreted. See https://github.com/itanium-cxx-abi/cxx-abi/issues/68 for example.

Having fixed a number of bugs in the cpp_demangle crate, I personally think that the Itanium ABI mangling is horrendous, though some of that is C++ madness. But it would be great to have a standard Rust mangling scheme.

  • Define a precise AST for demangled names and be careful to describe what the productions mean.
  • Define the serialization and ensure the grammar is trivial to parse. The Itanium ABI requires lookahead in some places. No more than 1 token please.
  • Ensure it’s easy to extend the grammar unambiguously without hacks.
  • Make the compression as simple as possible. I would suggest defining tokenization of mangled names, and then compressing using backreferences to ranges of tokens.
  • Avoid smuggling in other kinds of references, like the Itanium ABI template-parameter references (T_ etc).

Thanks for the feedback! That's very helpful.

The core algorithm (minus template parameter references) does not seem too ad-hoc to me. It seems like a reasonable way of implementing a dictionary coder for an AST. Is your criticism directed at the core algorithm or at the concrete way it's applied by the Itanium mangling?

The reference implementation I'm working on uses an AST and defines compression in terms of the AST. So I think we should be covered here?

The current implementation does not need more than 1 character of look-ahead.

Can you describe this in more detail? It seems like that would take up more space because you have to encode two numbers for a range, instead of just one number for a dictionary index. It also seems like it might make the compression algorithm more complex (while making decompression simpler).

Maybe "ad-hoc" isn't the right phrase, but because only a subset of AST productions are counted when enumerating substitution targets, that subset has to be known to any tool processing mangled names. That seems like unnecessary complexity to me.

The reference implementation I’m working on uses an AST and defines compression in terms of the AST. So I think we should be covered here?

Sure, as long as it's clearly specified.

The current implementation does not need more than 1 character of look-ahead.

Great! I hope you have a way to ensure that remains true as the specification evolves.

Can you describe this in more detail?

No, you're probably right that referring to AST nodes is a better option so you don't have to encode a length.

OTOH a downside of AST nodes is you need to have a clear processing model that precisely describes what kinds of nodes are allowed in what contexts ... and those constraints need to be faithfully implemented. E.g. if you have an AST node <ident>, and a substitution references that node where you're expecting an <expression>, depending on how your AST works you may or may not promote that <ident> to the right variant of your Expression AST node. Token or string backreferences would have the appealing property that that's all handled by existing code in the parser.

Yes, deciding what goes into the dictionary, and in which order, is a bit tricky. I hope that the Rust version is a bit simpler than the C++ version because the AST is simpler.

It’s a bit of a trade-off between implementation difficulty and compression ratio. My hope is that a clean reference implementation that can be easily ported will alleviate the problem somewhat.