pre-RFC: a new symbol mangling scheme


#41

That’s fine: assuming we don’t need 100% fidelity of representation (because we have crate metadata), we can encode as much info as practical, and deal with the rest by adding a hash uniquifier. We could still use punycode for idents, although removing diacritics and just dropping all chars that cannot be mapped to ASCII would be acceptable as well, IMO.
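
To illustrate the "drop what doesn’t map to ASCII, then add a hash uniquifier" idea, here is a minimal Rust sketch. The function names `ascii_fallback` and `fnv1a` are made up for this example, and FNV-1a is only a stand-in for whatever stable hash a real scheme would pick. The sketch drops non-ASCII characters outright rather than stripping diacritics, since the standard library has no Unicode normalization.

```rust
/// FNV-1a over the raw bytes. Assumption: a stand-in for whatever
/// stable hash the mangling scheme would actually specify.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical lossy encoding: keep only ASCII alphanumerics and `_`,
/// then append a hash of the *original* identifier so that identifiers
/// differing only in dropped characters still get distinct names.
fn ascii_fallback(ident: &str) -> String {
    let kept: String = ident
        .chars()
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect();
    format!("{}_{:016x}", kept, fnv1a(ident.as_bytes()))
}
```

With this, `ascii_fallback("föö")` keeps the readable `f` prefix and relies on the 16-digit hex suffix for uniqueness; the original spelling would have to be recovered from crate metadata.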


#42

The Itanium mangling scheme has a production for “unnamed types” that has a counter for disambiguating. I imagine that we could do something similar for “unnamed scopes”.

http://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangle.unnamed-type-name


#43

Oh, and another thing: as someone mentioned on the Reddit thread, maybe the right thing to do would be to optimize for naked-eye readability (think generated assembly)?


#44

That would be point 2 from the “Alternatives” section. It would have to support compression though, in order to keep symbol name length somewhat in check.

That again would be an argument against any form of compression. I don’t think that’s a good idea from a performance point of view. Usually, disassembler tools support demangling anyway.

Overall, I think it’s good that we have the hash suffix approach as a last resort if a non-opaque scheme turns out to be too complex. But before that I think we should try to do something that allows for pinpointing the piece of source code a given symbol comes from.


#45

So, syntactically it shouldn’t be a problem to support disambiguation of name components via indices. Every component can already have the F suffix, indicating that it’s in the value namespace. This suffix can be extended with anything that doesn’t clash with any of the possible follower productions (i.e. the next component, the list of generic arguments, or the E ending the fully qualified name). Itanium uses _<index> which seems reasonable. So the example with the two foo::bar() functions from above would yield the symbol names _ZN17my_crate_a1b2c3d43fooF3barFE and _ZN17my_crate_a1b2c3d43fooF3barF_0E. The same mechanism can be used to handle the macro hygiene case.
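
As a rough illustration of the suffix rule described above, the following Rust sketch builds both example symbols. The `Component` type and `mangle` function are hypothetical names for this example only, and the sketch assumes the index is only ever emitted directly after the `F` suffix; generic arguments are not handled.

```rust
/// One component of a fully qualified name (hypothetical type).
struct Component<'a> {
    name: &'a str,
    /// True if the component lives in the value namespace (`F` suffix).
    value_ns: bool,
    /// Optional disambiguation index, emitted as `_<index>` after `F`.
    index: Option<u32>,
}

/// Encodes a path as `_ZN` + length-prefixed components + `E`.
fn mangle(path: &[Component]) -> String {
    let mut out = String::from("_ZN");
    for c in path {
        out.push_str(&c.name.len().to_string());
        out.push_str(c.name);
        if c.value_ns {
            out.push('F');
            // Assumption: the index extends the `F` suffix only.
            if let Some(i) = c.index {
                out.push('_');
                out.push_str(&i.to_string());
            }
        }
    }
    out.push('E');
    out
}
```

Calling `mangle` on the components `my_crate_a1b2c3d4`, `foo` (value namespace) and `bar` (value namespace, no index) yields `_ZN17my_crate_a1b2c3d43fooF3barFE`; giving `bar` the index `0` yields `_ZN17my_crate_a1b2c3d43fooF3barF_0E`.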

The only possible complication I see with this scheme is how to assign indices. We don’t want to require the compiler to do expensive analysis of the containing scope of a given component. However, if we say that it can yield any index that makes things unambiguous, it could fall back on DefPath disambiguation indices. That would not compromise the traceability of a symbol name’s origin too much since those indices are also per-parent-scope and thus only depend on local information.
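
One cheap way to assign such indices using only local information is a per-parent-scope counter, sketched below in Rust. This is not how rustc computes DefPath disambiguators; `assign_indices` is a hypothetical helper, and the convention that the first occurrence gets no index is an assumption chosen to match the two `foo::bar()` symbols above.

```rust
use std::collections::HashMap;

/// Given the names defined directly inside one parent scope, in
/// definition order, returns each name with its disambiguation index.
/// The first occurrence of a name gets `None`; later ones get 0, 1, ...
fn assign_indices<'a>(siblings: &[&'a str]) -> Vec<(&'a str, Option<u32>)> {
    // Number of times each name has been seen so far in this scope.
    let mut seen: HashMap<&str, u32> = HashMap::new();
    siblings
        .iter()
        .map(|&name| {
            let count = seen.entry(name).or_insert(0);
            let index = count.checked_sub(1); // None for the first occurrence
            *count += 1;
            (name, index)
        })
        .collect()
}
```

Because the counter only looks at siblings within one parent, the index of a component does not change when unrelated scopes are edited.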


#46

I completely disagree that non-ASCII Unicode in identifiers would not be common if it were supported in the language. I would very much appreciate it if my identifiers were not mangled weirdly just because I chose not to write them in the ASCII subset, especially since there is existing prior art for mangling Unicode identifiers as Unicode.

I think you don’t want weak_odr, but linkonce_odr; I’m not sure why mingw would not support it well, since it’s required for C++ to compile. I don’t really mind this part of the RFC, tho.


#47

This is the status quo, but the reality is that you get ok-ish-but-not-excellent support from the existing tools, because those tools are, understandably, expecting C++. The pre-RFC mentions a number of things the existing tools don’t handle; and both gdb and lldb have various issues with the current approach. A rust-specific approach would improve all of this, and the painful period between a new mangling scheme and everybody having new tools is not really that long … I think it’s better to take the long view and have Rust be more in control of its own future.


#48

I second that it would be nice if c++filt/objdump -C would “just work”. It doesn’t appear to:

$ c++filt --version
GNU c++filt (GNU Binutils for Ubuntu) 2.26.1
Copyright (C) 2015 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) any later version.
This program has absolutely no warranty.
$ c++filt _RN12std_a1b2c3d43foo3FooIN12std_a1b2c3d43bar3BarIiEEEE
_RN12std_a1b2c3d43foo3FooIN12std_a1b2c3d43bar3BarIiEEEE

#49

Sadly, nobody seems to have written down what @nikomatsakis and I discussed back in 2016, and it wasn’t high priority enough to implement.

What we were going for crucially required that c++filt & friends work out of the box.

That means Itanium C++ mangling (and maybe even the MS one on windows), and encoding Rust types within that mangling scheme.

I still think that’s the best approach for now, although it does feel that we’re a bit late.


#50

I’d say that ideally the symbols should have the same name (and have linkonce_odr linkage), to allow the linker to pick one canonical copy and throw away the rest. Currently it cannot do so, except if it has an “identical code folding” pass (only some linkers do) and notices that the function bodies are identical (won’t necessarily work if the function references other symbols). This just creates unnecessary binary bloat.

One issue might be if different instantiating crates are compiled with different optimization levels, so perhaps the name mangling could include the optimization level.


#51

Edit: Also, :+1: to doing Unicode “properly” on platforms that support it – which includes at least Darwin, Windows, and anything GNU based.


#52

I’ve been thinking about this and have come to the conclusion that I don’t think this is an important goal. If there’s a way to unambiguously encode everything that Rust needs via vanilla Itanium that also demangles to proper Rust syntax, we could do it. But I don’t know how to. And, as @tromey states, it’s good for tools if they can tell from the symbol name if it’s Rust or C++.

A point of reference here is debuginfo and how we tried to shoehorn Rust concepts into an equivalent C++ DWARF representation. DWARF is very flexible, much more so than Itanium mangling I’d say, and it still did not work well. I’d like to not repeat this approach without a really good reason. (We had a good reason for debuginfo back then but we are still paying the cost of doing the workaround).

I don’t think the argument that existing tools don’t support the new encoding out of the box is a good one. Many tools out there (like GDB and valgrind) already support the current Rust mangling, which isn’t Itanium either and rather messy. Given a good specification and a reference implementation, a demangler can be implemented in half a day. And Rust-based projects can just pull in the reference implementation from crates.io.

I think this is kind of off-topic. The compiler’s strategy of dealing with generic code is not defined by symbol mangling. The mangling scheme just has to support all the strategies the compiler might choose. If we were to switch to an ODR based strategy, the scheme proposed above would support it by just omitting the instantiating-crate suffix.

Noted. I still think that doing platform-specific things without a really good reason is unnecessarily asking for trouble.


#53

I’m not sure how this is a problem. You pointed out punycode - just like that can be unambiguous, you can make any encoding unambiguous.

We also have the advantage of crate “namespace roots” meaning anything that’s not a crate (e.g. C++'s std) can be reused to encode tuples, trait impls, etc.


#54

The problem I see is not encoding something unambiguously in the Itanium mangling scheme. We could indeed put disambiguating information into idents or some other syntactic form, like the current scheme does with the hash or how we encoded enum compression into field names for DWARF.

The problem is that the result would be a kind of contortion that c++filt still would not turn into nice Rust output – at the expense of requiring a less compact, less intuitive encoding that Rust-aware tools still would need to have special handling for. At the same time there are already a number of tools with first-class Rust support (e.g. GDB, LLDB, and valgrind) that would quickly pick up a new encoding, regardless of whether it’s Rust-mangling -> Rust-rendering or Itanium-mangling -> Rust-rendering. So in my opinion we should not compromise on the quality of the mangling scheme and complicate implementation of Rust-aware tooling just to make an easily substitutable C++ tool (semi-)usable.


#55

Another thing to keep in mind is that this name mangling scheme would have an associated library for demangling Rust symbols. So what’s stopping us from creating C bindings for that crate and contributing patches to c++filt and friends?

Existing C++ tools already need to switch between several name mangling formats anyway, so it wouldn’t be too difficult to add yet another demangling implementation. Then, assuming we’ve submitted a patch and things are up to date, users would be able to use tools like c++filt on Rust symbols like @eddyb intended.
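
For a sense of what such a C binding could look like, here is a minimal Rust sketch. Everything in it is an assumption for illustration: `demangle_path` is a toy stand-in for the reference demangler that only understands plain `_RN<len><name>...E` paths (no generics), and the `rust_demangle` name and signature are invented, not an existing API.

```rust
use std::ffi::{c_char, c_int, CStr};

/// Toy stand-in for the reference demangler (hypothetical): decodes
/// `_RN` + length-prefixed components + `E` into a `::`-separated path.
fn demangle_path(sym: &str) -> Option<String> {
    let mut rest = sym.strip_prefix("_RN")?;
    let mut parts = Vec::new();
    while !rest.starts_with('E') {
        let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
        if digits == 0 {
            return None; // not a length-prefixed component
        }
        let len: usize = rest[..digits].parse().ok()?;
        let end = digits.checked_add(len)?;
        parts.push(rest.get(digits..end)?);
        rest = &rest[end..];
    }
    (rest == "E" && !parts.is_empty()).then(|| parts.join("::"))
}

/// C-callable wrapper: writes the NUL-terminated result into `out` and
/// returns 0, or returns -1 on any failure. A real binding would also
/// export this under a fixed symbol name; that attribute is omitted here.
///
/// # Safety
/// `mangled` must be a valid NUL-terminated string and `out` must point
/// to at least `out_len` writable bytes.
pub unsafe extern "C" fn rust_demangle(
    mangled: *const c_char,
    out: *mut c_char,
    out_len: usize,
) -> c_int {
    if mangled.is_null() || out.is_null() {
        return -1;
    }
    let Ok(sym) = (unsafe { CStr::from_ptr(mangled) }).to_str() else {
        return -1;
    };
    let Some(name) = demangle_path(sym) else {
        return -1;
    };
    if name.len() + 1 > out_len {
        return -1; // caller's buffer is too small
    }
    unsafe {
        std::ptr::copy_nonoverlapping(name.as_ptr(), out.cast::<u8>(), name.len());
        *out.add(name.len()) = 0;
    }
    0
}
```

A caller-provided buffer keeps allocation on the C side, which is roughly the shape a tool like c++filt would need in order to call into the library.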


#56

Yes, this seems to be a case of “Be in charge of your own destiny!” That seems like a better place to be rather than relying upon pretending to follow C++ mangling rules so that tools that understand C++ will automagically understand Rust. That seems fraught with corner-cases and future issues, etc. Better for Rust to stand on its own.


#57

You’re assuming that GNU binutils is willing to accept a dependency on Rust, which seems highly unlikely to me.


#58

Any particular reason? Licensing?


#59

Binutils is a core component of bootstrapping a new platform. Introducing a large dependency such as Rust will significantly increase the difficulty of bootstrapping.


#60

I think the reference implementation repo should also contain a native C version for demangling.