Type system refactoring

A short while ago I started a topic on implementing an IR (or two) for the compiler. Since then I have been investigating how we’d best go about doing it. One of the more obvious things I’ve discovered is that the current type system code will need a good amount of refactoring in order to make any architectural changes going forward.

middle::ty and its satellite modules have become, over time, a dumping ground for a variety of other systems. There are fields in the context object for many things that are only loosely related to typing.

  • used_unsafe - A set of used unsafe nodes; this seems to exist solely to allow for “unused unsafe” warnings.
  • node_lint_levels - A map from nodes to what lints are allowed/warned/denied. This exists on tcx only to allow for the enum sizes lint - in trans!
  • stability - A map from a node to its stability level. This seems very out of place and again only serves to power a warning.
  • used_mut_nodes - A set of local variables, marked mut, that get used (in a way that means they need to be mut). Again, this is for a lint.

On top of this, many fields are just tracking/caching small amounts of information that are specific to certain nodes. We have a cache for object-safety that is just a map to boolean values! We have so many caches it’s ridiculous. Any possible large-scale change to the compiler is going to have to start at middle::ty. This is what I think needs to be done:

1: More complete representation of the type system

There is no specific type you can go and look at and say “this represents the definition of a struct type”, instead struct types have to be constructed from several different maps. The same goes for other types.

So, I propose that we create a “definition” representation of types that is allocated in an arena and interned, like types currently are. This definition representation would have the name, the type parameters, and the contents of the type. Importantly, instead of an opaque DefId that you use to look up information in a map, a type would carry a pointer to the actual definition. This is a big win from a performance perspective, as you 1) don’t have to handle the very unlikely case that the entry isn’t in the map and 2) don’t have to do a much more expensive hash-table lookup.
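To make the shape concrete, here's a rough sketch of what I mean. Everything below uses simplified stand-in types; the names are illustrative, not what the compiler actually has today:

```rust
/// An interned type. A struct type points directly at its definition
/// instead of carrying an opaque DefId.
enum Ty<'tcx> {
    Int,
    Struct(&'tcx StructDef<'tcx>),
}

/// The definition itself: allocated once (in the compiler this would be
/// an arena with the 'tcx lifetime) and shared by reference.
struct StructDef<'tcx> {
    name: &'static str,
    fields: Vec<(&'static str, Ty<'tcx>)>,
}

fn main() {
    let point = StructDef {
        name: "Point",
        fields: vec![("x", Ty::Int), ("y", Ty::Int)],
    };
    let ty = Ty::Struct(&point);

    if let Ty::Struct(def) = ty {
        // No hash-table lookup and no "is the entry missing?" case:
        println!("{} has {} fields", def.name, def.fields.len());
    }
}
```

The point is that once you have a struct type in hand, its definition is right there; there's no DefId key, no map, and no "what if the entry is missing" branch.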

2: Remove anything that isn’t directly related to typing from the context

We have a certifiable God Object going on here. As I mentioned earlier, it is incredibly bloated, and much of the data doesn’t need to be around for most of the compilation. There needs to be a small-as-possible typing context that contains only what is necessary for working with types. Unless we make stability and lints part of the type system, they have no reason to be leeching off of the ubiquity of the type context.

On the same topic:

3: Stop putting everything in hash maps

I get it, hash maps are incredibly useful data structures. But I think we have a problem with hash-map abuse. Consider this an intervention.

A lot of this data should be stored inline where it’s actually used/useful. Part of the problem is that we don’t have anywhere to put a lot of this data (hence the first point), but that doesn’t stop it from being a problem. Whether or not a trait is object-safe should be part of the representation of the trait, not in a hashmap from a DefId to a bool. What traits (like Copy and Sized) a type implements should be part of the representation of those definitions. Whether or not a definition is an associated type shouldn’t even be something we have to store. It should just… be.
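For example, the difference in shape is roughly this (again with made-up stand-ins, not the real tcx fields):

```rust
use std::collections::HashMap;

type DefId = u32; // stand-in for the real DefId

// Today's shape: a side table keyed by DefId that every caller has to
// consult, and which may not have been populated yet.
struct SideTables {
    object_safety_cache: HashMap<DefId, bool>,
}

// Proposed shape: the fact is simply part of the trait's definition.
struct TraitDef {
    name: String,
    object_safe: bool,
}

fn main() {
    let tables = SideTables { object_safety_cache: HashMap::new() };
    let some_trait: DefId = 42;

    // Side-table version: a hash lookup plus an "is it there?" check.
    let cached: Option<bool> = tables.object_safety_cache.get(&some_trait).cloned();
    println!("cached object-safety: {:?}", cached);

    // Inline version: the answer just... is.
    let def = TraitDef { name: "Show".to_string(), object_safe: true };
    println!("{} object-safe: {}", def.name, def.object_safe);
}
```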

4: Use a Typed AST

There is pretty much no part of the compiler that doesn’t need the types of various parts of the AST. Instead of constantly looking up the same types in the same places and doing the same “is it there” checks, we need a secondary AST that incorporates the types directly. This would be a massive boost not only to performance, but to productivity. Currently, seemingly simple tasks like "what is the type of this expression" require more work than you'd expect.
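Something shaped roughly like this is what I have in mind (illustrative stand-in types only, not a concrete proposal for the real node structure):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
}

// Each node in the typed AST carries its type inline.
struct TypedExpr {
    kind: ExprKind,
    ty: Ty,
}

enum ExprKind {
    Lit(i64),
    Add(Box<TypedExpr>, Box<TypedExpr>),
}

fn describe(e: &TypedExpr) -> String {
    match e.kind {
        ExprKind::Lit(n) => format!("literal {} of type {:?}", n, e.ty),
        ExprKind::Add(..) => format!("addition of type {:?}", e.ty),
    }
}

fn main() {
    let sum = TypedExpr {
        kind: ExprKind::Add(
            Box::new(TypedExpr { kind: ExprKind::Lit(1), ty: Ty::Int }),
            Box::new(TypedExpr { kind: ExprKind::Lit(2), ty: Ty::Int }),
        ),
        ty: Ty::Int,
    };
    // "What is the type of this expression?" is a field access, not a
    // lookup into a node-id -> type side table that might be missing.
    println!("{}", describe(&sum));
    assert_eq!(sum.ty, Ty::Int);
}
```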


What I’m hoping is that people that understand this code better can provide some more details. As I said, refactoring this code is almost certainly going to be required for any major changes going forward.


Funny you should say that, I was pointing out to @nikomatsakis the other day how pointers are never worse than DefId. That's the easiest change to make IMO and it would simplify anything dealing with structs and enums by a lot.

The "typed AST" is the CFG I keep mentioning, and which @nikomatsakis has started calling "MIR". I never tried to start working on that because I've been viewing "restructure trans to use only Datum" as a pre-requisite and my two attempts at doing that have failed due to my limited time and repeated rebases.

This is particularly bad for structs, which seem to take a sadistic delight in making it hard to find out information about themselves. Also trait/impl items.

In general, I'm all for refactoring and improving the compiler's internals! I do think there are some advantages to the current side-table system, but by and large I agree we could centralize a lot of the information. Another thing that would be helpful, I think, is to start moving away from big global tables that span the entire crate and move instead to per-fn tables. I envision this as an important part of an eventual MIR design, as it will allow for internal parallelization, and should make it easier to encode/decode inlined fns and the like as well.

However, I do want to proceed carefully when it comes to full-on refactoring into MIR -- this is indisputably a good thing, but I want it to be carefully designed to match up with a minimal, formalized Rust as well, and there are some subtle questions about how to handle match expressions and the like. I think that doing this right will take some thought and there's a bit too much going on right now for me to want to devote mental energy to that. That said, I do see it as a priority post 1.0.

One word of caution with introducing pointers. We should wait until the work on the new destructor semantics lands, because these kinds of inter-connected, arena-allocated pointer graphs are a bit trickier under the new rules (though I think they are mostly permitted, now, as long as they don't include non-trivial destructors, so we should be ok).

So an initial attempt at implementing a StructDef type that holds just field information for a type netted a ~8% speed gain in the translation pass, while compiling libsyntax, from 12.151s to 11.139s.


That's definitely a good idea. I guess it would also be prudent to try to keep this isolated to individual function bodies, so that there is no artificial requirement that this typed AST representation is created for all function bodies when you just want to compile one function. We'll want to be able to handle functions as independently as possible from each other in order to facilitate parallel and incremental compilation.

It would also be great if this typed AST could be serialized and cached on disk in order to do incremental type inference and checking. I think it's actually not that hard to determine if type checking needs to be re-run for a given function.

cc @nrc if you are not already following this

So I’ve been working on the first point, and have come up with an addendum:

1.1: Combine Struct and Enum Types

In refactoring, I’ve found that most code ends up handling struct types and enum types almost identically. This makes sense given that structs are really just special cases of algebraic data types: they are ADTs with a single constructor. Combining the two internally would simplify a lot of code that currently has to handle either a ty_struct or a ty_enum. This is especially obvious when you realise that not all ty_struct types actually come from struct definitions in source; we also use them for certain enum variants.

Basically, instead of ty_struct and ty_enum, we would have a ty_data that links to the relevant information. Structs would become a special case of ty_data with only a single constructor variant.
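A rough sketch of the shape, with purely illustrative names (none of this is final):

```rust
// One definition kind covers both structs and enums: a list of variants,
// each with its own fields. A struct is just the single-variant case.
struct DataDef {
    name: &'static str,
    variants: Vec<VariantDef>,
}

struct VariantDef {
    name: &'static str,
    fields: Vec<&'static str>,
}

impl DataDef {
    fn is_struct(&self) -> bool {
        self.variants.len() == 1
    }
}

fn main() {
    let point = DataDef {
        name: "Point",
        variants: vec![VariantDef { name: "Point", fields: vec!["x", "y"] }],
    };
    let option = DataDef {
        name: "Option",
        variants: vec![
            VariantDef { name: "None", fields: vec![] },
            VariantDef { name: "Some", fields: vec!["0"] },
        ],
    };
    for def in &[&point, &option] {
        for v in &def.variants {
            println!("{}::{} ({} fields), struct: {}",
                     def.name, v.name, v.fields.len(), def.is_struct());
        }
    }
}
```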


Nice post.

I hope that consolidating all the maps into one might have a considerable impact on memory usage and cache performance.

That's super promising.


@Aatch I think this would be a really good change. I’ve been doing some Rust formalization as a side effect of a research project I’m working on, and in our tooling/formalisms I’ve found it makes much more sense to treat them as a unified concept. Generally I’m in favor of de-sugaring as much of the surface language away as possible and keeping our internal representations as small as possible.

I’ve got a branch that is mostly ready to go. I just got distracted with CFG-related things (found a bug, fixed it! :smiley:). It doesn’t merge ty_struct and ty_enum into ty_data, but it does use the same data structure for both. Some imprecise measurements put typechecking at ~8% faster and translation at ~5% faster.

@Aatch: would it make sense to include tuples into this ty_data type? I would imagine they would also share a lot of the same logic with structs and enums.

Good question; it took me a minute to figure out the answer. But no, tuples are different enough that it would just end up complicating things.

For a start, tuple types are not defined anywhere. They just... are. Like other primitive types. You cannot point to the definition of the tuple type itself. As such, there's also no defining location for its fields.

You also cannot have a generic tuple. You can obviously have a tuple that contains type parameters, but a tuple does not have type parameters itself. I.e., for (T, U, V), the types T, U and V must come from somewhere else. You can't refer to a tuple in isolation like you can a struct or enum.
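To illustrate (plain Rust, nothing compiler-internal): a tuple struct is a real definition with its own type parameters that you can name in isolation, while a bare tuple type only exists where its component types are already in scope.

```rust
// A tuple struct: a real definition you can point at and instantiate.
struct Pair<T, U>(T, U);

fn make_pair() -> Pair<i32, bool> {
    Pair(1, true)
}

// A plain tuple type can mention type parameters, but they must come from
// the enclosing item; `(T, U)` has no definition or parameters of its own.
fn make_tuple<T, U>(t: T, u: U) -> (T, U) {
    (t, u)
}

fn main() {
    let Pair(a, b) = make_pair();
    let (c, d) = make_tuple(2i32, false);
    println!("{} {} {} {}", a, b, c, d);
}
```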

Lastly, while there is code that handles tuples and tuple structs similarly, they are generally handled quite differently due to the fact that tuple types have far less information to carry around. There is no privacy for fields, no source-code definition, no attributes associated with the type (e.g. repr(C)), and other things.

I think that combining the two would be possible, but would actually complicate the code, rather than simplify it.

This all sounds great!

Especially that

How is this idea going? Are there any attempts at implementing it?

Just a semi-tangential thought: I've thought a bit about ways of allowing for the execution of arbitrary, user-supplied code in various contexts (primarily game modifications and simulation modules). PNaCl is cool, but depressingly under-documented and scary.

I started considering AOT compiling asm.js code to native without bothering with the actual JavaScript parts. The problem with that is that a naive translation is going to incur significant overhead from address bounds checking.

Then it occurred to me: Rust already does this at compile time. It's just that this information is thrown away by the time it gets to LLVM/asm.js.

So, just to throw this out: it'd be really cool if the hypothetical MIR contained sufficient information that it could either be directly used as, or lowered into, a form that could be used as a "memory safe" IR format for libraries. Obviously, you'd need to also re-run the borrow checker at load-time on the IR module, but it would allow you to run user code safely and very fast.

Hell, that might even become a selling point of Rust: the ability to generate provably memory-safe binary modules.
