Explicit monomorphization for compilation time reduction

Background

It is widely believed that generic code slows down compilation[1][2]. The Rust Performance Book has a whole section on ways to avoid IR bloat caused by generic functions. As an extreme example, glam generates the implementations of all its public interfaces with macros to avoid generics entirely, partly because generics can increase compile time.

Proposal

I believe we can mitigate this problem by allowing users to create Explicit MOnomorphizations (EMOs) of generic items (functions, methods, structs, enums). When an EMO is created in an upstream crate, things such as MIR and machine code are generated and stored in that crate's compilation artifact, so downstream crates can skip generating them if they use that specific flavor of the generic item.

This idea is so simple that I thought it must have been brought up previously, but I did my googling and found nothing. Was it proposed under a different name? Or is it just silly and I'm missing something obvious?

Anyway, the syntax in my mind looks like this. We introduce a new keyword mono (of course open to bikeshedding), and we can write:

pub fn generic_fn<T>(t: T) {}

mono generic_fn::<i32>;

And for methods:

pub struct MonoStruct;

impl MonoStruct {

    pub fn generic_method<T>(self, t: T) {}

    mono fn generic_method::<i32>;

}

Of course we can create a struct EMO:

pub struct GenericStruct<T>;

mono GenericStruct::<i32>;

Creating an EMO for a generic method of a generic struct is more verbose and should be unusual:

pub struct GenericStruct<T>;

impl<T> GenericStruct<T> {
    pub fn generic_method<U>(self, u: U) {}
}

mono impl::<i32> GenericStruct<i32> {
    mono fn generic_method::<f32>;
}

Note that EMOs have the same visibility as their generic origin. EMOs of a private item make no sense.

Implementation

I'm not familiar enough with the compiler to lay out all the implementation details. My high-level understanding is that EMOs can be implemented in the mono item collector. Every EMO item becomes a root of the mono item graph. Before creating a new mono graph node, we check whether an EMO version already exists. If so, the node is not created.
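
To make the lookup step concrete, here is a toy model of it. This is only a sketch of the idea and not how rustc's collector is actually structured; the names Instance, Collector, and visit are made up for illustration.

```rust
/// Toy stand-in for a monomorphized instance: an item path plus its
/// concrete type arguments. (Hypothetical type, not a rustc type.)
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Instance {
    pub path: &'static str,
    pub type_args: Vec<&'static str>,
}

/// Toy collector. `upstream_emos` models the EMOs found in the
/// compilation artifacts of upstream crates.
pub struct Collector {
    upstream_emos: Vec<Instance>,
    local_items: Vec<Instance>,
}

impl Collector {
    pub fn new(upstream_emos: impl IntoIterator<Item = Instance>) -> Self {
        Self {
            upstream_emos: upstream_emos.into_iter().collect(),
            local_items: Vec::new(),
        }
    }

    /// Called whenever the current crate uses `inst`.
    /// Returns true if a new local mono item had to be created.
    pub fn visit(&mut self, inst: Instance) -> bool {
        // An upstream EMO already provides this instance: create nothing.
        if self.upstream_emos.contains(&inst) {
            return false;
        }
        // Already collected locally: create nothing.
        if self.local_items.contains(&inst) {
            return false;
        }
        self.local_items.push(inst);
        true
    }

    /// The instances this crate still has to generate code for.
    pub fn local_items(&self) -> &[Instance] {
        &self.local_items
    }
}
```

In this model, the cost mentioned later for cross crate EMOs shows up as a larger upstream_emos collection to search.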

Cross crate EMO

We can allow creating in crate B an EMO of an item defined in crate A. This is useful because an app author may want to create a crate with all the EMOs the app needs, and then the app's supporting library crates can all depend on it to avoid wasting compilation time. This method works like C++ precompiled headers.

However, allowing it can make the EMO lookup described above slower, because we have to check every dependency crate for possible EMOs.

Interaction with other language features

Lifetime parameters

AFAIK, lifetime parameters don't matter after borrow checking, which happens before monomorphization, so EMOs can ignore lifetime parameters. We can write:

pub fn generic_fn<'a, T>(t: &'a T) {}

mono generic_fn::<i32>;
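
As a rough illustration in today's Rust (without the proposed mono syntax), a single function pointer obtained from generic_fn::<i32> is usable with any lifetime, which is consistent with one instantiation per type argument. The return value and the names GENERIC_FN_I32 and call_with_two_lifetimes are additions for this sketch only.

```rust
// Same shape as the example above, but returning the reference so the
// result can be observed.
pub fn generic_fn<'a, T>(t: &'a T) -> &'a T {
    t
}

// Only the type argument is supplied; the lifetime stays generic in the
// pointer type (`for<'a>`), so one pointer serves every lifetime.
pub static GENERIC_FN_I32: for<'a> fn(&'a i32) -> &'a i32 = generic_fn::<i32>;

pub fn call_with_two_lifetimes() -> (i32, i32) {
    static LONG_LIVED: i32 = 1; // borrowed for 'static
    let short_lived = 2; // borrowed only for this function body
    (
        *(GENERIC_FN_I32)(&LONG_LIVED),
        *(GENERIC_FN_I32)(&short_lived),
    )
}
```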

Specialization

An EMO is not a specialization. Any more specific implementation wins over EMO.

Inlining

One downside of EMO is that it makes inlining less likely. To mitigate this, we can allow #[inline] attributes on EMOs.

Prior art

C++ compilers support precompiled headers, which is the closest thing I know to EMO.

EDIT: As @InfernoDeity pointed out, C++ has explicit instantiation.

I agree that it seems like a common-sense item. I think the subject has come up before, but I don't think there have been any RFCs.

Well don't mind if I do!

You don't need a new keyword. A builtin macro would work just as well:

pre_instantiate!(generic_fn::<i32>);
pre_instantiate!(generic_fn::<u32>);
pre_instantiate!(generic_fn::<i64>);
// etc

(a macro has the added bonus of being explicit, and easily google-able)

Also... thinking about it, it's probably possible to write such a macro right now. It would probably need to do some type system gymnastics to actually instantiate the function without being given arguments, but it seems feasible.
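
One possible shape for such a macro, as a sketch only: it is a hypothetical macro_rules helper (not a built-in), and it asks the caller to spell out the function pointer type instead of doing any type system gymnastics. It forces the instantiation to be generated in the defining crate by storing the function in a #[used] static; whether downstream crates then reuse that copy is up to the compiler and is not guaranteed by this code.

```rust
// Hypothetical helper: `pre_instantiate!(NAME: fn-pointer-type = path)`.
macro_rules! pre_instantiate {
    ($name:ident : $sig:ty = $f:expr) => {
        // Coercing the fn item to a fn pointer in a static requires the
        // instantiation to exist in this crate; #[used] keeps the static
        // from being dropped as dead code.
        #[used]
        pub static $name: $sig = $f;
    };
}

// Returns its argument so the instantiation can be exercised.
pub fn generic_fn<T>(t: T) -> T {
    t
}

pre_instantiate!(GENERIC_FN_I32: fn(i32) -> i32 = generic_fn::<i32>);
pre_instantiate!(GENERIC_FN_U32: fn(u32) -> u32 = generic_fn::<u32>);
pre_instantiate!(GENERIC_FN_I64: fn(i64) -> i64 = generic_fn::<i64>);
```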

I like the idea of using a macro.

Without modifying the compiler, even if we get the instantiation to work, the compiler will still monomorphize it in downstream crates, basically making the extra instantiation meaningless. :sweat_smile:

The nightly compiler has -Zshare-generics, which does do (some of?) this automatically.

The main reason glam doesn't use generics in its API comes down not to performance (though that is a factor, including debug mode performance, which isn't normally something Rust cares about), but rather to a) API simplicity and b) explicit SIMD implementations, which mean that the implementation isn't uniform between types, precluding the use of generics.

The generics-heavy interface that glam "competes" with is nalgebra. But it's important to note that glam isn't trying to be nalgebra, even for the CG subset. Rather, it's more directly comparable to the C++ library GLM, which, even though it's in C++ with the full power of explicit template instantiations, implements it and pushes consumers towards using glm::vec3, not glm::vec<float, 3>.

GLM actually does provide the glm::vec template for users who prefer that style, but there is no generic implementation of the templates. Rather, the templates just have explicit instantiations to say that glm::vec<float, 3> = glm::vec3, and so on. (This is required to provide fields for the N generic, but theoretically not required for the T generic.)

This, especially in glam, is due to implementation. There is no uniform implementation strategy (yet?), because the types are implemented using explicit SIMD (where possible).

And then there's the compiler errors angle. While nalgebra type errors aren't quite highly-generic-network-crate levels of obtuse, they can get difficult to understand, due to the generality of functionality provided by nalgebra. With glam, however, because we're just dealing with simple types at the API level, type errors are very straightforward, clearly pointing at "you said X, you likely meant Y," rather than some failed trait obligation only loosely related to what you're actually trying to do. (To this point I actually have a private crate whose entire point is to wrap nalgebra types into specific types!)


I think the best direction for this to go is for 1) -Zshare-generics to mature and its heuristics improve to where it's enabled by default in debug mode (if it isn't already? I recall some talk about it maybe being so), and 2) a built-in macro which guarantees a specific instantiation is generated and acts as a hint to the compiler that it's a good candidate for sharing.

Given share-generics, as CAD97 mentioned, does this need a language feature at all, at least for functions?

Perhaps something like this could work to force monomorphization (probably hidden in a macro):

pub fn foo<T>() { ... }

pub const _: *const () = &foo::<u32> as *const _ as _;
pub const _: *const () = &foo::<String> as *const _ as _;

How would you monomorphize all the (non generic) methods of a generic type that way?

Ah, so it's called share generics. Thanks for pointing that out @CAD97!

According to Experiment with sharing monomorphized code between crates · Issue #47317 · rust-lang/rust · GitHub, it's enabled in stable by default for debug builds. Makes me wonder why people don't mention it at all when optimizing compilation time.

It does, because share-generics only shares generics instantiated in parent crates.

So, let's say you have this code:

// crate A
pub fn foo<T>() { ... }

// crate B1
use A::foo;
let x = foo::<u32>();

// crate B2
use A::foo;
let x = foo::<u32>();

// crate B3
use A::foo;
let x = foo::<u32>();

You end up instantiating foo::<u32> three times. If you add pre_instantiate!(foo::<u32>); to A, then the B crates can all share that single instantiation.
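
Until something like pre_instantiate! exists, a common workaround is a non-generic wrapper in A. The sketch below uses modules to stand in for the crates so that it fits in one file; foo is given a return value, and foo_u32 and use_foo are made-up names. Since foo_u32 is not generic, its body (including foo::<u32>) is normally compiled once in A, although the compiler may still choose to inline a small wrapper across crates.

```rust
// Stand-in for crate A.
pub mod a {
    pub fn foo<T: Default>() -> T {
        T::default()
    }

    /// Non-generic wrapper: instantiates foo::<u32> inside "crate A".
    pub fn foo_u32() -> u32 {
        foo::<u32>()
    }
}

// Stand-ins for crates B1..B3: they call the wrapper instead of
// instantiating foo::<u32> themselves.
pub mod b1 {
    pub fn use_foo() -> u32 {
        super::a::foo_u32()
    }
}

pub mod b2 {
    pub fn use_foo() -> u32 {
        super::a::foo_u32()
    }
}

pub mod b3 {
    pub fn use_foo() -> u32 {
        super::a::foo_u32()
    }
}
```

The drawback compared to the macro is that every caller has to switch to the wrapper's name.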

I think scottmcm was trying to say pre_instantiate! can be implemented as a crate, instead of a builtin macro. Actually I'm dwelling on it and will try to implement said crate.

C++ has template struct foo<int>; (if you have template<typename T> struct foo {...};), and likewise for function templates and variable templates, which is exactly this. It also has extern template ...;, which prevents the compiler from instantiating the template in a translation unit (unless an explicit instantiation is given as above).

I've actually wanted to use this in a couple (albeit more limited, in the case of rust) cases.

share-generics doesn't really do things like guarantee that a particular instantiation of a generic function has a particular runtime address (even inside dylib crates). It's more a case of the compiler reducing executable size via the as-if rule than of the language guaranteeing this sharing.

Thanks for the information. I updated the "Prior art" section.

…and its use is considered archaic, and is strongly recommended against by modern C++ guidelines and communities.

This would need a much stronger motivation, then. If you aren't familiar with the compiler internals, then where does your claim come from that doing this would significantly improve compile times? Guessing about performance of one's own code is already difficult. Speculating on the performance of a massive code base one isn't well-acquainted with is, I would say, pointless.

Instead, to motivate this, you should show actual benchmarks that pinpoint that this kind of change would matter around the hot spots in the execution of the compiler.

Well, I did NOT claim it. In the background section, I clearly said what the motivation was: feedback from the community that generics slow down compilation.

I totally agree that benchmarks should be shown to motivate a change. However, this post is not an RFC or a PR. It is an idea, and I wanted to hear what the community thinks about it, especially whether it has been proposed before.

I understand that you feel people should be well-acquainted with the code base before they suggest possible performance optimizations, with which I respectfully disagree. In hindsight, share-generics, which is very similar to EMO, did improve performance.

Would you mind elaborating on why this is the case?

My main uses for it have been to ensure that the template is valid for a particular instantiation, but also, if I have a massive template/generic function instantiated with a common fundamental or library type, in several DSOs, I'd rather instantiate that once, not 50 times for 50 shared objects. I will agree that, because of COMDATs, it's less of an issue for static linking, but not all linking is static.

This sounds similar to what Haskell calls "specialize"

https://wiki.haskell.org/Inlining_and_Specialisation#How_do_I_use_the_SPECIALISE_pragma.3F

This is really useful for rust-gpu and rust-cuda for having generic shaders and kernels

You end up instantiating foo::<u32> three times. If you add pre_instantiate!(foo::<u32>); to A, then the B crates can all share that single instantiation.

Maybe we can just add specialized crates, e.g. «A_u32», with preinstantiated implementations, which can be imported directly instead of A. This requires no changes, but will pollute the namespace.

Thanks. I've confirmed your approach works for both functions and methods. Now it's a matter of writing appropriate proc macros to ease their use.
