pre-RFC: a new symbol mangling scheme

I actually think having at least two independent implementations is important to making sure the spec is accurate and complete, i.e. that bugs in the reference implementation don’t become de-facto spec.

If compression ratio is really important, then I think you need a methodology for evaluating alternatives over a good test corpus, so that you actually measure their impact. We’ve all seen formats that tried to introduce DIY compression based on intuitions and made some really bad decisions (hello DWARF!). You’ll want to determine the maximum set of ASCII symbols you can use, and use them all. You should probably also figure out exactly what “human readability” means to you and push right up against that boundary.
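
To make the "measure, don't guess" idea concrete, here is a minimal sketch of such a comparison. Both schemes are toy stand-ins invented for illustration (the `_R` prefix, the `S<index>_` back-reference syntax, and all function names are assumptions, not the scheme proposed in this thread); the point is only that each candidate gets run over the same corpus and the totals are compared.

```rust
/// A candidate mangling scheme: maps a `::`-separated path to a symbol.
/// Both schemes below are hypothetical toys, not the proposal itself.
type Scheme = fn(&str) -> String;

/// Length-prefixed components: "std::vec::Vec" -> "_R3std3vec3Vec".
fn length_prefixed(path: &str) -> String {
    let mut out = String::from("_R");
    for part in path.split("::") {
        out.push_str(&part.len().to_string());
        out.push_str(part);
    }
    out
}

/// Same as above, but a component already seen in this symbol is replaced
/// by a back-reference "S<index>_" (made-up syntax).
fn with_backrefs(path: &str) -> String {
    let mut out = String::from("_R");
    let mut seen: Vec<&str> = Vec::new();
    for part in path.split("::") {
        match seen.iter().position(|s| *s == part) {
            Some(i) => out.push_str(&format!("S{}_", i)),
            None => {
                out.push_str(&part.len().to_string());
                out.push_str(part);
                seen.push(part);
            }
        }
    }
    out
}

/// Total mangled size in bytes over the whole corpus.
fn total_bytes(scheme: Scheme, corpus: &[&str]) -> usize {
    corpus.iter().map(|p| scheme(p).len()).sum()
}
```

Calling `total_bytes(length_prefixed, &corpus)` and `total_bytes(with_backrefs, &corpus)` on the same slice gives directly comparable numbers. Even this toy shows how intuition can mislead: for a path like `a::a`, the back-reference is longer than the component it replaces.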

I’m not sure a range would need two numbers, as long as each possible subtree is self-contained and self-terminating.

E.g. the start of a Polish notation subtree is enough to determine its entire range, as each prefix fully determines the shape its children take.

Also, byte offsets would allow relatively efficient zero-allocation demangling, if I’m understanding correctly, whereas tokens would require a separate tokenization step.
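
A small sketch of that argument, using a made-up prefix grammar (the tags `u`, `i`, `R`, `T` and their arities are assumptions for illustration only, not the actual encoding): since every tag fixes how many children follow, the end of a subtree can be found from its start offset alone, without allocating or tokenizing first.

```rust
/// Number of children each tag takes in a toy prefix grammar (hypothetical).
fn arity(tag: u8) -> Option<usize> {
    match tag {
        b'u' | b'i' => Some(0), // leaf types
        b'R' => Some(1),        // reference to one type
        b'T' => Some(2),        // pair of two types
        _ => None,              // unknown tag
    }
}

/// Given only the start offset of a subtree, return its exclusive end offset.
/// Returns None if the input is truncated or contains an unknown tag.
fn subtree_end(input: &[u8], start: usize) -> Option<usize> {
    let mut pending = 1usize; // nodes that still have to be read
    let mut pos = start;
    while pending > 0 {
        let tag = *input.get(pos)?;
        // One node consumed, `arity` children now owed.
        pending = pending - 1 + arity(tag)?;
        pos += 1;
    }
    Some(pos)
}
```

For the input `TRuTiu`, i.e. `T(R(u), T(i, u))`, a back-reference that stores only offset 1 is enough to recover the range `1..3` for `Ru`.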

Good point.

Such a test corpus should not be too hard to come by.

I personally would like to stick to A-Za-z0-9_. Others in this thread have expressed a preference for allowing symbols to be UTF-8.

This is a tricky one. Here are some examples from the reference implementation test cases:

  • _RN7std_xxx3fooITNS_3BarES1_ES2_EE:
    std[xxx]::foo<(std[xxx]::Bar,std[xxx]::Bar),(std[xxx]::Bar,std[xxx]::Bar)>
  • _RN7std_xxx3fooFINMNS_4QUUXE3barINS0_3BARSEEEEE:
    std[xxx]::foo<std[xxx]::QUUX::bar<std[xxx]::foo::BAR>>
  • _RNXlN7foo_xxx3BarIxEE4quuxF1_Cs_IcEE:
    <i32 as foo[xxx]::Bar<i64>>::quux::{closure}'2<char>

As soon as definitions get nested, it's almost impossible to demangle the name in your head, and compression makes it worse still. So I'd say anything Itanium-based is past human-readable except for simple cases. One can still glean some useful information from a mangled name, though, which might be all we are interested in here.

I guess a good next step would be to collect a test corpus of symbol names for trying out compression schemes.

I wonder if we can take a step back, give up on symbol mangling completely, and instead use simple IDs (either auto-incremented or hash-based) with a separate file that includes a table for debuggers to use. If I am not mistaken, this would be similar to the .pdb files used on Windows. Or is there an additional motivation for mangling besides a desire to keep all information in one file?
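
Roughly, the auto-incremented variant could look like the sketch below. `SymbolTable`, `intern`, `resolve`, and the `_R<id>` symbol shape are all hypothetical names chosen for illustration; how the table would be serialized to a side file is left out.

```rust
use std::collections::HashMap;

/// Hypothetical side table mapping opaque symbol IDs back to full paths.
#[derive(Default)]
struct SymbolTable {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl SymbolTable {
    /// Return the opaque symbol for `path`, assigning the next ID on first use.
    fn intern(&mut self, path: &str) -> String {
        if let Some(&id) = self.ids.get(path) {
            return format!("_R{}", id);
        }
        let id = self.names.len() as u32;
        self.names.push(path.to_string());
        self.ids.insert(path.to_string(), id);
        format!("_R{}", id)
    }

    /// What a debugger would do with the table: map a symbol back to its path.
    fn resolve(&self, symbol: &str) -> Option<&str> {
        let id = symbol.strip_prefix("_R")?.parse::<usize>().ok()?;
        self.names.get(id).map(|s| s.as_str())
    }
}
```

The symbol itself carries no information here: without the table, `_R1` cannot be turned back into a path at all.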

@newpavlov This is basically what debuginfo already provides (not just on Windows). However, you don’t always have debuginfo available.

I guess my point is that trading debuggability of stripped binaries for a significantly simpler symbol construction algorithm (which, IIUC, will also result in somewhat smaller binaries as a nice bonus) is not a bad compromise to make.

I think we’ll want to provide the option in the compiler to emit completely opaque symbols (something like plain 160-bit hashes) for this case. But I don’t think that would be a good default.
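
As a rough illustration of what "completely opaque" means, here is a sketch that hashes the full path into a fixed-width hex symbol. It uses 64-bit FNV-1a purely as a small, dependency-free stand-in; a real option would presumably use a wider hash such as the 160-bit one mentioned above, and the `_R` prefix and function names are assumptions.

```rust
/// 64-bit FNV-1a, used here only as a small stand-in for a wider hash.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Build an opaque symbol: hypothetical "_R" prefix plus 16 hex digits.
fn opaque_symbol(path: &str) -> String {
    format!("_R{:016x}", fnv1a_64(path.as_bytes()))
}
```

Every symbol has the same length regardless of how deeply nested the path is, and it stays within A-Za-z0-9_, but nothing about the original name can be read back from it.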

For anyone interested: I opened the actual RFC :sunny:
