This is a proposal for a new name mangling scheme for the Rust compiler. It should solve the known problems of the existing scheme (see the “Motivation” part). The pre-RFC version does not yet include the strict technical description but the “Guide Level Explanation” should give a good idea what it looks like. It would be great to hear what people think; whether it makes sense to fully flesh out the remaining details or if we should take a different approach altogether.
- Feature Name: symbol_name_mangling_v2
- Start Date: 2018-10-01
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
Summary
This RFC proposes a new mangling scheme that describes what the symbol names for everything generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the Itanium C++ ABI.
Motivation
Due to its ad-hoc nature, the compiler’s current name mangling scheme has a number of drawbacks:
- It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.
- Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.
- The current scheme is inconsistent: most paths use Itanium style encoding, but some of them don’t.
- The symbol names it generates can contain
.
characters which is not generally supported on all platforms. [1][2][3]
The proposed scheme solves these problems:
- It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time.
- It encodes information about generic parameters in a reversible way.
- It has a consistent definition that does not rely on pretty-printing certain language constructs.
- It generates symbols that only consist of the characters
A-Z
,a-z
,0-9
,_
, and$
.
This should make it easier for third party tools to work with Rust binaries.
Guide-level explanation
The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples.
Requirements for a Symbol Mangling Scheme
A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is:
- The scheme must provide an unambiguous string encoding for everything that can end up in a binary’s symbol table.
“Unambiguous” means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme:
-
A mangled symbol should be decodable to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.
-
It should be possible to predict the symbol name for a given source-level construct. For example, given the definition
fn foo<T>() { ... }
, the scheme should allow to construct, by hand, the symbol names for e.g.foo<u32>
orfoo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()
. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name even for simple cases. -
A mangling should be platform-independent. This is mainly achieved by restricting the character set to
A-Z
,a-z
,0-9
,_
, and$
. All other characters might have special meaning in some context (e.g..
for MSVCDEF
files) or are simply not supported (e.g. Unicode). -
The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for compiler and linker) and that generating a symbol should not be too computationally expensive.
Note that a source-level definition can contain components that will not show up in symbol names, like lifetimes (as in fn foo<'a>()
). It is an explicit non-goal of this RFC to define a mangling for cases like the above. One might want to cover them “for completeness” but they are not actually needed.
The Mangling Scheme by Example
This section will develop an overview of the mangling scheme by walking through a number of examples. We’ll start with the simplest case – and see how that already involves things that might be surprising.
Free-standing Functions and Statics
A free-standing function is fully identified via its absolute path. For example, the following function
mod foo {
fn bar() {}
}
has the path foo::bar
and N3foo3barE
is a mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is the encoding that the Itanium C++ ABI name mangling scheme uses for “nested names” (i.e. paths). The scheme proposed here will also use this format.
However, the symbol name above does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define mod foo { fn bar() {} }
somewhere. So in order to avoid conflicts in such cases, fully qualified names always include the crate name and disambiguator, as in N15mycrate_4a3b56d3foo3barE
(the crate disambiguator is used to disambiguate different versions of the same crate. It is an existing concept and not introduced by this RFC).
There is one more possible ambiguity that we have to take care of: Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form crate_id::foo::bar
not uniquely identifying the item bar
because the following snippet is legal Rust code:
fn foo() {
fn bar() {}
}
mod foo {
fn bar() {}
}
The function foo
lives in the value namespaces while the module foo
lives in the type namespace. They don’t interfere. In order to make the symbol names for the two distinct bar
functions unique, we thus add a suffix to name components in the value namespace, so case one would get the symbol name N15mycrate_4a3b56d3fooF3barFE
and case two get the name N15mycrate_4a3b56d3foo3barFE
(notice the difference: 3fooF
vs 3foo
).
As opposed to C++ and other languages that support function overloading, we don’t need to include the argument types in the symbol name. Rust does not allow two functions of the same name but different arguments.
The final symbol name for the function would also include the prefix _R
that is common to all symbol names generated by this scheme:
_RN15mycrate_4a3b56d3foo3barFE`
<><-------------------------->
| |
prefix fully qualified name
Generic Functions
Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are listed as suffix, starting with I
, after the name they belong to. So the instance
std::mem::align_of::<f64>
would be mangled to
_RN12std_a1b2c3d43mem8align_ofFIdEE
^^^
|||
start of argument list ---+|+--- end of argument list
|
f64
where I
starts the list of arguments, d
designates f64
and E
ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future we might also need to represent values when const generics get added to the language.) These kinds of types are:
- basic types (
char
,()
,str
,!
,i8
,i16
, …) - reference and pointers types, shared and
mut
- tuples
- arrays, with and without fixed size (e.g.
[u8]
,[u8; 17]
, or as part of a slice type&[char]
) - structs, enums, closures, and other named types, possibly with their own set of type arguments
- function types such as
fn(&i32) -> u16
Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their fully qualified name (plus arguments) like is done for function symbols. Composites like references, tuples, and function types all follow simple grammar given in the reference-level explanation below. Here are some examples manglings to get a general feel of what they look like:
-
std::mem::align_of::<usize>
:_RN12std_a1b2c3d43mem8align_ofFIjEE
-
std::mem::align_of::<&char>
:_RN12std_a1b2c3d43mem8align_ofFIRcEE
-
std::mem::align_of::<std::mem::Discriminant>
:_RN12std_a1b2c3d43mem8align_ofFIN12std_a1b2c3d43mem12DiscriminantEEE
-
std::mem::align_of::<&mut (&str,())>
:_RN12std_a1b2c3d43mem8align_ofFIWTRrvEEE
There’s one more thing we have to take into account for generic functions: The compiler may produce “crate-local” copies of a monomorphization. That is, if there is a function foo<T>
which gets used as foo<u32>
in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it’s own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an $<crate-id>
suffix to the symbol. So for example the mangling for std::mem::align_of::<usize>
would actually look like this:
_RN12std_a1b2c3d43mem8align_ofFIjEE$foo_a1b2c3d4 (for crate "foo/a1b2c3d4")
_RN12std_a1b2c3d43mem8align_ofFIjEE$bar_11223344 (for crate "bar/11223344")
Closures and Closure Environments
The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don’t have a name, we need to generate one. The scheme takes a simple approach here: Each closure gets assigned an index (unique within the item defining it) and from that we generate a name of the form c$<index>
. The $
makes sure that the name cannot clash with user-defined names. The full name of a closure is then constructed like for any other named item:
mod foo {
fn bar(x: u32) {
let a = |x| { x + 1 }; // ~ c$0
let b = |x| { x + 2 }; // ~ c$1
a(b(x))
}
}
In the above example we have two closures, the one assigned to a
and the one assigned to b
. The first one would get the local name c$0
and the second one the name c$1
. Their full names would then be N15mycrate_4a3b56d3foo3barF3c$0FE
and N15mycrate_4a3b56d3foo3barF3c$1FE
respectively. The type of their environment would be the same, except for not having the F
suffix to their local name.
Inherent Methods
Inherent methods (that is, methods that are not part of a trait implementation) are represented by a symbol of the form:
_RM <self-type> <method-name> [<generic-arguments>] [<instantiating-crate>]
The M
designates the symbol as an inherent method. The self-type is encoded like any other type argument and already contains the concrete type arguments of the impl
defining the method. The method name is unique among all inherent methods for the given type, so we don’t need to further qualify it. The method can have type arguments of its own. These are encoded like other argument lists as I <type>+ E
. If the method is generic in any way, it will also need the instantiating crate suffix, like any other generic item.
Here’s an example for a non-generic method:
mod foo {
struct Bar;
impl Bar {
pub fn panic_please() { panic!() }
}
}
The resulting symbol name looks like:
_RMN15mycrate_4a3b56d3foo3BarE12panic_please
<-------------------------><------------>
self-type method name
A method with a generic self-type is a bit longer, since it also contains the instantiating-crate-suffix:
mod foo {
struct Bar<T>;
impl<T> Bar<T> {
pub fn panic_please() { panic!() }
}
}
The symbol for foo::Bar<char>::panic_please
would look like this:
_RMN15mycrate_4a3b56d3foo3BarIcEE12panic_please$downstream_crate_x_abcd1234
<----------------------------><------------><-------------------------->
self-type method name instantiating crate
Trait Methods
Trait methods are similar to inherent methods, but in addition to the self-type the symbol name must also contain the trait being implemented:
_RX <trait-name-and-params> <self-type> <method-name> [<generic-arguments>] [<instantiating-crate>]
The X
signifies that this is a trait method. The trait being implemented is encoded N <name-component>+ [I<generic-param>+ E] E
, like a named type. Here is a complex example with generics in all the places:
mod foo {
trait Foo<T> {
fn id<T>(x: T) -> T;
}
}
mod bar {
struct Bar<T>;
}
mod baz {
impl<T1, T2> Foo<T1> for Bar<T2> {
fn id<T>(x: T) -> T { x }
}
}
The mangling for <Bar<char> as Foo<isize>>::id::<usize>
would be:
_RXN15mycrate_4a3b56d3foo3FooIiEEN15mycrate_4a3b56d3bar3BarIcEE2idIjE$downstream_crate_x_abcd1234
<----------------------------><----------------------------><-><-><-------------------------->
trait self-type method instantitating crate
One thing that’s interesting here is that baz
, the module the impl is situated in, does not show up anywhere in the mangled name.
Items Within Specialized Trait Impls
In Rust one can define items within generic items, e.g. functions or impls, like in the following example:
fn foo<T>(a: T) -> (u32, T) {
static mut X: u32 = 0;
unsafe {
X += 1;
(X, a)
}
}
The X
here (or any other such nested definition) does not inherit the generic context. X
is non-generic, and a function defined in its place would be too. Consequently, when giving the path to something defined within a generic item, one does not specify the generic arguments because they add no information. The fully qualified name of X
is thus my_crate/a1b2c3d4::foo::X
and its symbol name:
_RN15mycrate_4a3b56d3fooF1XFE
However, there is at least one case where the type arguments do matter for a defintion like this, and that is when impl specialization is used. Consider the following piece of code:
trait Foo<T> {
fn foo() -> T;
}
struct Bar<T>(T);
impl<T> Foo<T> for Bar<T> {
default fn foo() -> T {
static MSG: &str = "sry, no can do";
panic!("{}", MSG)
}
}
impl<T: Default> Foo<T> for Bar<T> {
fn foo() -> T {
static MSG: &str = "it's a go!";
println!("{}", MSG);
T::default()
}
}
Notice that both MSG
statics have the path <Bar as Foo>::foo::MSG
if you just leave off the type arguments. However, we also don’t have any concrete types to substitute the arguments for. Therefore, we have to encode the type parameters and their bounds for cases like this so that the symbol name will be a mangling of something like <Bar<T> as Foo<T>>::foo::MSG where T: Clone
:
_RXI1TIN12std_a1b2c3d47default7DefaultEEEN15mycrate_4a3b56d3FooI1TEEN15mycrate_4a3b56d3BarI1TEE3foo3MSG
-------------------------------------- "where clause"
-- T
---------------------------------- bounds to T
-------------------------------- std::default::Default
--------------------------- Foo<T>
--------------------------- Bar<T>
foo ----
MSG ----
Compiler-generated Items (Drop-Glue, Shims, etc)
The compiler generates a number of things that can end up needing an entry in the symbol table:
-
Drop-glue is what recursively calls
Drop::drop()
for components of composite type. Generating symbol names for it is straightforward. They are of the form_RG<type>
where<type>
is the usual mangling as used for generic arguments. -
Various “shims”, that is, compiler-generated implementations of built-in traits like
Fn
,FnMut
,FnOnce
, orClone
, or for dynamic dispatch via trait objects. These are similar in structure to drop glue. Their precise mangling is specified in the reference-level explanation below.
Unicode Identifiers
Rust allows unicode identifiers but our character set is restricted to ASCII alphanumerics, _
, and $
. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all identifiers via Punycode, a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string
"Gödel, Escher, Bach"
is encoded as
"Gdel, Escher, Bach-d3b"
which, as opposed to something like Base64, still gives a pretty good idea of what the original string looked like.
Each component of a name, i.e. anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way also start with the number of bytes to read, but that number is prefixed with a 0
. As an example, the function:
mod gödel {
mod escher {
fn bach() {}
}
}
would be mangled as:
_RN15mycrate_4a3b56d08gdel_5qa6escher4bachFE`
<-------->
unicode component
Compression/Substitution
The length of symbol names has an influence on how much work compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust’s generics can lead to rather long names (which are often not visible in the code because of type inference and impl Trait
). For example, the return type of the following function:
fn quux(s: Vec<u32>) -> impl Iterator<Item=(u32, usize)> {
s.into_iter()
.map(|x| x+1)
.filter(|&x| x > 10)
.zip(0..)
.chain(iter::once((0, 0)))
}
is
std::iter::Chain<
std::iter::Zip<
std::iter::Filter<
std::iter::Map<
std::vec::IntoIter<u32>,
[closure@src/main.rs:16:11: 16:18]>,
[closure@src/main.rs:17:14: 17:25]>,
std::ops::RangeFrom<usize>>,
std::iter::Once<(u32, usize)>>
It would make for a symbol name if this types is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach.
The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built or parsed, we build up a dictionary of “substitutions”, that is we keep track of things a subsequent occurrence of which could be replaced by a substitution marker. The substitution marker is then the lookup key into this dictionary. The things that are eligible for substitution are (1) all prefixes of qualified names (including the entire name itself) and (2) all types except for basic types. If a substitutable item is already present in the dictionary it does not generate a new key. Here’s an example in order to illustrate the concept:
std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>
$0: ---
$1: ---------
$2: ----------------
$3: --------------
$4: --------
$5: ------------------
$6: -----------------------
$7: ----------------------------------------------------------------
$8: ----------------------------------------------------------------------------------
The indices on the left are the dictionary keys. The prefixes std
, std::iter
, and std::iter::Chain
all get added to the dictionary because we have not seen them before. After that we encounter std
again. We’ve already seen it, so we don’t add anything to the dictionary. The same goes for when we encounter std::iter
the second time. Next we encounter std::iter::Zip
, which we have not seen before, so it’s added to the dictionary. Next we encounter std
again (already seen, no insertion), then std::vec
and std::vec::IntoIter
which both generate a new entry. Next we see std::vec::IntoIter<u32>
, the first full type. It generates an entry too. The second type parameter is the same as the first. No part of it introduces a new entry. After the next >
we have completely processed std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>
, which adds another type entry. Finally, the full std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>
adds another entry.
Using the dictionary above, we can compress to:
std::iter::Chain<$1::Zip<$0::vec::IntoIter<u32>, $6>>
A couple of things to note:
- The first occurrence of a dictionary entry is never substituted. We don’t store the dictionary anywhere and need to be able to reconstruct it from the compressed version.
- Longer substitutions are preferred to shorter ones.
std::iter::Chain<$1::Zip<$0::vec::IntoIter<u32>, $4::IntoIter<u32>>>
would also decompress to the original version but the compiler is supposed to always pick the longest substitution available.
The mangled version of a substitution marker is S <key - 1> _
(and S_
for key 0
) like it in Itanium mangling. So the above definition would be mangled to:
_RN12std_a1b2c3d44iter5ChainINS0_3ZipINS_3vec8IntoIterIjEES5_EEE
The uncompressed version would be:
_RN12std_a1b2c3d44iter5ChainIN12std_a1b2c3d44iter3ZipIN12std_a1b2c3d43vec8IntoIterIjEEN12std_a1b2c3d43vec8IntoIterIjEEEEE
Reference-level explanation
This is still to be done. Not part of the pre-RFC.
Drawbacks
Why should we not do this?
- The scheme is rather complex, especially due to compression (albeit not more complex than prior art)
- The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current scheme mangling is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information.
Rationale and alternatives
The alternatives considered are:
- Keeping the current scheme. It does meet the minimum requirements after all. It also has pretty big downsides.
- Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safe guard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments.
- Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long.
- Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn’t really be less complex than the Itanium inspired scheme proposed above.
The Itanium mangling (and by extension the scheme proposed here) could be considered somewhat arcane. But it is well-known from C++ and provides a good trade-off between readability, complexity, and length of generated symbols.
Prior art
The mangling scheme described here is an adaptation of the Itanium C++ ABI scheme, which is the scheme used by the GCC toolchain (and clang when it’s not compiling for MSVC). In fact, the scheme proposed here tries to stay as close as possible to Itanium mangling and only deviates where something does not make sense for Rust.
One notable improvement the proposed scheme makes upon Itanium mangling is explicit handling of unicode identifiers. The idea of using Punycode for this is taken from the Swift programming language’s mangling scheme (which is also based on Itanium mangling).
Unresolved questions
- Should we introduce a
St
substitution for the::std::
to the compression scheme (like Itanium does). This would preclude mixing symbols from different versions of the standard library into a single binary (we’d not have a crate disambiguator for libstd). It’s unclear whether that can occur in practice. - Similar to the above, common items, like built-in bounds, could get predefined abbreviations.
- Is the compression scheme unambiguous? That is, is it always clear which substitutions the compiler should choose? (a reference implementation of the algorithm will solve this)
- Is the scheme for disambiguating specialized impls sound?
- Should symbols include information that might help during debugging/analyzing a program but that is not strictly necessary for avoiding name conflicts? Examples of such information would be names and types of function parameters or the ABI of functions.
- Should named items (everything of the form
N...E
) not start withN
but instead with something that gives a hint of what it is? E.g.F
for functions,S
for statics,C
for closures, etc? This is not needed for disambiguation but it would add more information to the symbol name without really increasing the complexity of the scheme or the length of names. (Although it makes compression a bit less straightforward to describe.)
Appendix - Interesting Examples
TODO (not part of the pre-RFC)
- specializing impls
- impl Trait
- closure environment as a type parameter
- various examples of compression