Proposal: support *.s natively via llvm-mc


#1

I saw there has been a lot of discussion about inline assembly, including possibly adding an asm! thing to the Rust language. I think that even if Rust had asm!, there would still be a need to support out-of-line assembly language code. And it seems relatively easy to support out-of-line assembly language code (easier than adding asm!) by simply ensuring that llvm-mc is included.

My strawman proposal is:

  • Ensure that the standalone llvm-mc assembler is always installed, even if the platform already has an assembler like binutils’s gas or Microsoft’s MASM.
  • Ensure that llvm-mc is easy for build scripts (build.rs) to invoke, e.g. by ensuring it is in $PATH when the build script starts.
  • Ensure that the version of llvm-mc used in Rust is the same across platforms. For example, if Rust 1.7 ships with llvm-mc 3.7.0 on Windows, then llvm-mc 3.7.0 must be used on Linux and Mac.
  • (Eventually) add support for assembling *.s files directly to Cargo, so that build scripts using the GCC crate aren’t necessary.
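As a rough sketch of what the second bullet could look like from a build script today: the snippet below assumes an `llvm-mc` executable is reachable through $PATH (which is exactly what this proposal would guarantee, and is not guaranteed now) and that the Rust target triple is also accepted by llvm-mc's `-triple` option, which holds for common targets but is an assumption in general. The source path `src/asm/example.s` and the helper name `assembler_args` are made up for illustration.

```rust
// build.rs: a minimal sketch under the assumptions stated above.
use std::env;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Build the llvm-mc argument list for assembling one source file
/// into an object file for the given target triple.
fn assembler_args(target: &str, src: &Path, obj: &Path) -> Vec<String> {
    vec![
        "-filetype=obj".to_string(),
        format!("-triple={}", target),
        src.display().to_string(),
        "-o".to_string(),
        obj.display().to_string(),
    ]
}

fn main() {
    // Cargo sets TARGET and OUT_DIR for every build script.
    let target = env::var("TARGET").expect("TARGET is set by Cargo");
    let out_dir = PathBuf::from(env::var("OUT_DIR").expect("OUT_DIR is set by Cargo"));

    // Hypothetical assembly source file; replace with a real one.
    let src = Path::new("src/asm/example.s");
    let obj = out_dir.join("example.o");

    let status = Command::new("llvm-mc")
        .args(assembler_args(&target, src, &obj))
        .status()
        .expect("failed to run llvm-mc; is it in PATH?");
    assert!(status.success(), "llvm-mc reported an error");

    // Archiving the object file and telling Cargo to link it is
    // deliberately left out of this sketch.
}
```

The argument construction is kept in its own function so that it can be checked without actually running an assembler.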

Why wouldn’t adding asm! to Rust be sufficient support for assembly language?

  1. It is a lot of very tedious work to write assembly language code. By its nature, it is difficult to maintain. Thus, it is useful to share assembly language code across projects as much as possible, in a single form. For an example of how having to maintain real-world assembly language code in multiple formats can go terribly wrong, take a look at the Go version of Intel’s P-256 ECC point multiplication code vs the BoringSSL version of the same code, which is a variant of the OpenSSL version of the code. (I’ve started this thread, partially, to avoid needing to create and maintain a third variant, Rust + asm!, of that code and ~50,000 lines of similar code.)

  2. As far as cryptography is concerned, one of the reasons we use assembly language, besides performance, is to ensure that the compiler of the higher-level language doesn’t do things that would leak sensitive information through side channels. In general, we (I) feel more comfortable that this happens when the sensitive code is not touched at all by the higher-level compiler.

  3. As far as high-assurance computing is concerned, we want to prove that our tools are correct, prove that our code is correct, and thus prove that when we feed our code to our tools, the output is correct. This is much easier to do when the assembler is separate from rustc.

  4. Usually inline assembler doesn’t support the entire feature set that standalone assemblers support. But, usually there’s a reason that standalone assemblers support those features.

  5. It is difficult to understand a program that mixes Rust code, assembler .macro directives, and assembly language code.

  6. asm! is a feature that looks simple but is actually quite hard to do well. It may be better to try other approaches, like the MSVC 64-bit approach of not supporting inline assembler at all and supporting intrinsics instead. Supporting external assembly language code reduces the pressure on the compiler team to add asm! before it is ready and before alternatives have been fully evaluated.

Why not just use the native assembler?

  1. The native assembler on Windows is MASM. The native assembler on Linux is gas. The assembly language syntax for each of these is completely different. As a result, the OpenSSL team has written a preprocessing system that takes assembly code embedded in Perl (yes, Perl) code and emits the correct assembly language syntax, which is then fed into the native assembler. (Actually, OpenSSL doesn’t support the native assembler on Windows, but only supports nasm, which is even more inconvenient.) Note that this preprocessor code may itself have bugs that make the generated assembly language code incorrect. You can see an example of a fix for a significant bug of this form buried in this BoringSSL commit. Having one assembler syntax per architecture instead of one per (operating system * architecture) pair helps avoid the need for such problematic things.

  2. The system assembler is often too old. In particular, when Intel and ARM add new instructions, the assemblers need to be updated to take advantage of these instructions. Otherwise, the assembly language code has to encode the instructions as byte sequences. Standardizing on llvm-mc and keeping it up to date minimizes the need for such error-prone manual encoding of instructions. For example, this is what OpenSSL’s assembly language code for AES-NI looks like (note that this is using its Perl preprocessor):

# AESNI extension
sub aeskeygenassist
{ my($dst,$src,$imm)=@_;
    if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/)
    {	&data_byte(0x66,0x0f,0x3a,0xdf,0xc0|($1<<3)|$2,$imm);	}
}
sub aescommon
{ my($opcodelet,$dst,$src)=@_;
    if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/)
    {	&data_byte(0x66,0x0f,0x38,$opcodelet,0xc0|($1<<3)|$2);}
}
sub aesimc	{ aescommon(0xdb,@_); }
sub aesenc	{ aescommon(0xdc,@_); }
sub aesenclast	{ aescommon(0xdd,@_); }
sub aesdec	{ aescommon(0xde,@_); }
sub aesdeclast	{ aescommon(0xdf,@_); }
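To make concrete what those Perl helpers compute, here is the same byte arithmetic as a small Rust sketch. The function names simply mirror the Perl ones and are illustrative only; like the original, the sketch handles only the register-to-register forms with xmm0 through xmm7, taking the destination register first.

```rust
/// Hand-encode a register-to-register AES-NI instruction:
/// prefix 0x66, opcode 0x0f 0x38 <opcodelet>, then a ModRM byte with
/// mod = 0b11, reg = dst, rm = src. Returns None outside xmm0..xmm7.
fn aes_common(opcodelet: u8, dst: u8, src: u8) -> Option<[u8; 5]> {
    if dst > 7 || src > 7 {
        return None;
    }
    Some([0x66, 0x0f, 0x38, opcodelet, 0xc0 | (dst << 3) | src])
}

/// aesenc xmm<dst>, xmm<src>
fn aesenc(dst: u8, src: u8) -> Option<[u8; 5]> {
    aes_common(0xdc, dst, src)
}

/// aesdeclast xmm<dst>, xmm<src>
fn aesdeclast(dst: u8, src: u8) -> Option<[u8; 5]> {
    aes_common(0xdf, dst, src)
}

/// aeskeygenassist xmm<dst>, xmm<src>, imm8
fn aeskeygenassist(dst: u8, src: u8, imm: u8) -> Option<[u8; 6]> {
    if dst > 7 || src > 7 {
        return None;
    }
    Some([0x66, 0x0f, 0x3a, 0xdf, 0xc0 | (dst << 3) | src, imm])
}

fn main() {
    // Each call stands in for one mnemonic that an up-to-date
    // assembler would accept directly.
    println!("{:02x?}", aesenc(1, 2));
    println!("{:02x?}", aesdeclast(0, 7));
    println!("{:02x?}", aeskeygenassist(1, 2, 0x01));
}
```

Every one of these byte sequences replaces a single line such as `aesenc xmm1, xmm2` that an assembler which knows the instruction would encode for you, which is the maintenance cost the point above is about.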

Would this be hard to do?

I don’t think so. llvm-mc is already built as part of building llvm as part of Rust. The main risk, AFAICT, is figuring out how well llvm-mc currently supports generating COFF output for the -msvc target. A secondary risk is how to deal with any conflict of Rust-provided llvm-mc with another llvm-mc in $PATH. My suggestion is just to name the rust-provided llvm-mc executable rust-as or rust-llvm-mc or similar, to avoid this.

What projects would benefit from this?

I am mostly familiar with crypto libraries. Any crypto library that wants to maximize performance could benefit from this right away. In particular, crypto code that also wants optimal performance on Windows would benefit from this. For example, note how this issue in rust-crypto was closed by disabling the AES-NI optimizations for -msvc targets.

Isn’t llvm-mc too primitive? What about alternatives like gas, nasm, yasm?

Until recently, llvm-mc’s macro support was not that great. However, now it seems to be good enough (or at least there are patches pending which make it good enough) to be useful for sophisticated use cases. And, in particular, I believe it is good enough to reasonably replace the Perl preprocessor for OpenSSL’s code. Also, llvm-mc supports the same macro language for all architectures, so it is possible to write macros that abstract architecture differences, which is useful at least for crypto code. So, despite my initial biases (which were to just use yasm for x86/x64 code, which is actually easier for my projects), I think llvm-mc is the better choice, and that its advantages over other alternatives will only increase over time.

Thoughts?


#2

I agree about just using a proper assembler for these kinds of large pieces of code written in assembly language. I always thought that asm! support would be used when you wanted to mix just a little bit of inline assembly into a mostly Rust module, but the kind of stuff you’re discussing, thousands of lines of code, definitely seems more appropriate separated out and built with dedicated tools.

The question I have is, why bundle an assembler over just improving how easy it is to package Rust and Rust applications for distributions, so that they can be distributed as packages with build dependencies and then build-depend on whichever assembler the author prefers?

We don’t bundle a C compiler for compiling any C code that you might want to include, and while assembly language dialects may be more different, especially once you get to sophisticated macro support, just being able to add the appropriate build-depends at the packaging level seems simpler than Rust bundling its own assembler.

In addition, bundling other third party tools like this tends to be frowned upon by distribution packagers, who would prefer each dependency to be packaged and updated separately rather than some projects doing their own bundling of other projects.


#3

One concrete point is that unless rustc switched to dynamically linking LLVM (which might be more expensive?), or else bundled the functionality directly into the rustc binary, llvm-mc would be an extra 12MB to distribute.


#4

If I understood correctly, the proposal is to include llvm-mc as part of rust distribution. While the proposal explains why llvm-mc distribution would be useful, it does not seem to explain why including it as part of rust distribution is useful.


#5

In my experience, it is very useful to be able to have a guarantee that functionality will exist with uniform behaviour (for a library author), and have that functionality exist automatically (for a user of the library), so that you can use it without having to handle the case when it doesn’t exist. For instance, see the discussion under “Why not just use the native assembler?” in the main post; similarly, needing a C compiler causes issues for Rust libs that bind to native libraries on platforms like Windows.


I’m definitely in favour of the idea, in theory, but I’m not sure about it in practice. It seems like it may “accidentally” tie the Rust language to LLVM’s choice of asm syntax & behaviour, meaning other non-rustc (and even non-LLVM) compilers need to support it to get access to certain crates… possibly even very widely used crates, if, for instance, a crypto crate used in a project like hyper requires assembly. There’s also a similar question of guaranteeing compatibility between versions of the same compiler, which may not be at all guaranteed (I personally have no idea what llvm-mc’s stability guarantees are.)

(As prior art, compilers like Clang and GCC can be invoked directly on .s files.)


#6

I like the idea of a cross-platform, ubiquitously supported method of writing assembly in Rust; it certainly makes many applications much nicer. My only concerns are:

  • Shipping a literal llvm-mc binary may be difficult in terms of interacting with other installations (mentioned elsewhere on this thread)
  • Do we know if the syntax accepted by llvm-mc is stable? We’d want to make sure that updating LLVM doesn’t force us to eventually vendor our own parser or something like that.

#7

tl;dr: I’d like to propose these as the next steps:

  • I will build an experimental version of ring where some of the assembly language files are .S llvm-mc-compatible source files, instead of .pl PerlAsm, and try to make that work using the latest Nightly -msvc.
  • I will write up an RFC for adding llvm-mc to the installation package for Rust targeting -msvc (x86 and x86_64).

This doesn’t work for Windows, specifically for users of the -msvc target. Not having a gas-/llvm-mc-compatible assembler on Windows is the main issue I’m trying to solve.

How about for now, this change is only made for -msvc? For example, we could bundle llvm-mc.exe with the -msvc target and have cargo ensure that AS is defined as the path to the assembler for build scripts to use.
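A minimal sketch of how a build script might consume that, assuming Cargo exported the assembler path in an AS environment variable (it does not do so today; that is the suggestion above). The helper name `choose_assembler` and the fallback of a plain `llvm-mc` looked up through $PATH are illustrative choices, not an existing convention.

```rust
// build.rs: pick the assembler from AS, falling back to llvm-mc.
use std::env;
use std::ffi::OsString;

/// Return the assembler to run: the value of AS when it is set and
/// non-empty, otherwise the given fallback name.
fn choose_assembler(as_var: Option<OsString>, fallback: &str) -> OsString {
    match as_var {
        Some(value) if !value.is_empty() => value,
        _ => OsString::from(fallback),
    }
}

fn main() {
    // Hypothetical: AS would be defined by Cargo for -msvc targets.
    let assembler = choose_assembler(env::var_os("AS"), "llvm-mc");

    // Re-run the build script if the user changes AS.
    println!("cargo:rerun-if-env-changed=AS");
    println!(
        "cargo:warning=assembling with {}",
        assembler.to_string_lossy()
    );
}
```

Reading the variable in one place means the same build script works unchanged on platforms where the user or distribution supplies the assembler and on -msvc where the bundled one would be used.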

I don’t think anybody really cares about an extra 12MB, which is almost nothing in terms of build tools. Further, I did some research and the Clang toolset avoids dynamic linking because it makes startup time for the tools much slower. So, I think we should just do what is currently being done, with respect to linking.

[quote=“sanxiyn, post:4, topic:2879”] If I understood correctly, the proposal is to include llvm-mc as part of rust distribution. While the proposal explains why llvm-mc distribution would be useful, it does not seem to explain why including it as part of rust distribution is useful.[/quote]

Again, the point is to be able to use the same assembly language source files (possibly/probably with a lot of platform-specific conditional logic) on Windows targets that are used on Linux and Mac OS X targets, without the user having to manually install another toolset, and without a library’s build script needing to fetch and run an executable from the internet.

[I think huon’s concerns are similar to yours, so I didn’t quote huon.]

On every platform other than Windows, for all practical purposes, crates that use assembly language that isn’t inline asm already have this issue. They expect gas or llvm-mc to be available and they use it, usually by sticking to a subset of the functionality that seems to be “universally” available. IMO, extending this to the -msvc toolchain wouldn’t significantly hurt.

In the long term, it would be nice to have a 100% uniform, perpetually-100%-backward-compatible solution across all (officially-supported) platforms. I think the llvm-mc team is dedicated to gas compatibility as far as is practical, and to backward compatibility with older versions of llvm-mc.

The harder question is how to deal with the politics of getting Linux distros to accept that rustc version X requires specifically llvm-mc version Y and substituting gas or an older version of llvm-mc isn’t acceptable. Again, my suggestion is to just punt on that for now, and experiment with seeing if/how people use cross-platform assembly language .s/.S to see if it is even worth tackling the politics of Linux distros w.r.t. this issue at all.