Machine learning primitives in rustc -- An Opportunity


#1

Machine Learning in Systems

Jeff Dean (of Google fame) gave a presentation two weeks ago (slides), where he stated

Learning Should Be Used Throughout our Computing Systems

Traditional low-level systems code (operating systems, compilers, storage systems) does not make extensive use of machine learning today

This should change

Then he went on to show how they have done some experiments replacing B-trees, hashmaps, and bloom filters in database indices with neural networks. For B-trees they obtained for 5-100x fold improvement in memory consumption and ~2 fold faster lookups, and similar (if not as large) improvements for hashmaps and bloom filters.

After some more examples of machine learning applied to systems (datacenter cooling controller, distributed execution device placement), he went on to explain:

[Where should we use ML in systems?] Anywhere We’re Using Heuristics To Make a Decision!

Compilers: instruction scheduling, register allocation, loop nest parallelization strategies, …

Networking: TCP window size decisions, backoff for retransmits, data compression, …

Operating systems: process scheduling, buffer cache insertion/replacement, file system prefetching, …

Job scheduling systems: which tasks/VMs to co-locate onsame machine, which tasks to pre-empt, …

ASIC design: physical circuit layout, test case selection, …

As Rust is targeted squarely at system programming, I think we should be asking how rust as a language is well suited to machine learning if ML is likely to make (significant?) inroads into all of these domains that are squarely in rust’s sweet spot

At this point I can imagine you saying, “that’s all well and good, but that’s what we have libraries for! You’ve stumbled into the internals forum by mistake!” And that’s where we come to the second leg of this journey.

Machine Learning is a language/compiler issue

If you have not read the excellent Julia-lang blog post regarding the interplay between languages and machine learning, you should go and do that now because they’ve explained the issue (and opportunity) better than I can do here.

But in summary: most machine learning (especially deep learning) libraries today maintain a separate notion of data-flow graphs with control flow, execution and scoping semantics, and a separate execution runtime which deals with parallelism, distributed execution etc. This is all separate from that provided by the language compiler/runtime/library ecosystem. This duplication of effort causes problems such as difficulty in debugging, lack of type safety, complicated runtime machinery, complicated deployment paths etc.

Combining this trend in the first half of this post, there is a real opportunity here to take a modern language/compiler (such as Rust/rustc) and add proper first class support for machine learning primitives (tensor types and supporting machinery mainly, perhaps autodiff), then extending the compiler to reason about tensor dataflow in addition to the types it already understands.

There are many benefits that come from doing this including re-use and leverage of existing compiler optimizations, deployment/debug/profiling tooling, library ecosystem/infrastructure etc. That I can expand on further if anyone would like.

Fortunately, there is already work in this direction, see this recent paper and associated LLVM conference talk, where a staged IR is layered on top of LLVM (working up from the bottom), and then a type safe DSL is embedded into Swift from the top. It is still early and not yet open source, but I like to view this as a sign of things likely to come.

I think that Rust is well positioned to capitalize on these trends similar to the direction Swift was being taken in that talk and paper above, and I wanted to open a dialogue here,

  1. Does this community see the importance of machine learning in systems and the opportunity that Rust has similarly? This is an area I work closely in so I’d be happy to continue this discussion/give more supporting evidence if others disagree.
  2. How could rustc iteratively experiment/advance in this space? I think const generics are a really important first step (from what I understand of them), and I’d imagine that gives much of the bedrock which first class tensor types would be based on? Then I’d think we’d want MIR optimization passes operating on these tensor types and either doing all the tensor level optimizations at this point, or emitting out to a DSL like DLVM? Is it possible to register MIR passes in a library similar to procedural macros? I think that would be one way this stuff could be experimented on outside of rustc itself.

Thanks!


#2

Wow! Thanks for the writeup, this is really interesting… I’ve always thought of machine learning as something that is slow and cumbersome, so I was surprised when I initially heard about the speedups they got. These types of advancements in the way we program make me excited for the future, and I definitely think that it’d be great if Rust could support this - if not at a language level, then at a crate level.


#3

Thanks for the bringing up the idea with your writeup! And I would be interested to hear some more on the topic from experienced rustaceans, in particular on viability of using the mentioned or pre-existing optimisation features in LLVM (e.g. what is the status of tensor optimizations in stable LLVM?).

I can see the importance of having good support for ML in a systems language such as rust, and I guess that a lot can be achieved through libraries. What would you like to see in the language and compiler itself that would be essential, say beyond const generics for tensors you already mentioned?

One question about the const generics for array sizes: That optimization is based on the tensor dimensions being known at compile time which is not always the case. Would you also need some on-demand specialization (recompiling some modules, or some kind of JIT) for it to be useful?


#4

Big note: rustc has very little support for reasoning about concurrency (it knows that static variables need to be Sync, that Send and Sync can be used in Trait Object definitions, and general auto-trait support), and no support for OS threads. Most of Rust’s support for concurrency is in the standard library.

I’d rather ML be the same way.


#5

Can you show some example of how we might use this? How would I code with these machine learning primitives in Rust?

One thing I’ve been wanting lately, and which ties into this, is the ability for my code to reason bayesianly about what to do next. Basically I’ve been implementing UDP-based, peer-to-peer protocols where I need to ask questions like “how long should I wait before re-sending this packet?”, “how many times should I attempt to contact host X before trying host Y instead” etc. The result is that my code is chock-full of arbitrary magic numbers that I pulled out my ass and it’s crying out for a more principled way to do things.

I think what I’d really like is the ability to describe the state of the world with a set of probabilities, like “probability that host X is still online”, “probability that this packet has arrived”, then use these probabilities to guide control flow, and update them as evidence comes in. I’d write a library to do this, but I don’t know what this library would look like or how I’d use it.

Since this sounds similar to what you’re after maybe you have some examples of code like this.


#6

To exemplify some of the indirection that is currently causing grief, consider the following simple tensorflow example:

x = tf.placeholder(tf.float32, shape=(1024, 1024))
y = tf.matmul(x, x)

with tf.Session() as sess:
  rand_array = np.random.rand(1024, 1024)
  print(sess.run(y, feed_dict={x: rand_array}))

This could be this in some fantasy version of Rust:

let x: Array<Array<f32, 1024>, 1024> = random::rand2::<f32>(1024, 1024);
let y = x * x;
println!("{:?}", y);
// extra
let z = y * 3.1415f32;
println!("z = {:?}", z);

It would then compile down to some format that represents a tensorflow graph intermingled with a couple of dynamic (and in this case synchronized) code sections of regular Rust. More complex examples would result in more complex graphs.


#7

I’m far from being an expert in language design or even Rust but there is nothing in your example that requires an ML specific language level feature. It could almost be done in today’s Rust.


#8

If I understood correctly, DLVM is based on the idea of Lightweight Modular Staging (LMS). As far as I know, one of the main selling points of LMS is that it is just a Scala library and requires no special support from the compiler. Therefore, I would like to understand: what is the main reason why we cannot have ML DSL implemented as a library in Rust?


#9

The challenge here is in my opinion to either through non-intrusive markup, language extensions, or with an exceedingly clever compiler build sub-graphs of computations that can be effectively dispatched to a compute device. While contrived, for the example it would be possible for the matrix multiplication to happen on the GPU. Note that you’d usually expect to see longer graphs of computation that can be dispatched.


#10

Suppose I have a function like this:

fn foo(x: Matrix<(50, 50)>) -> Matrix<(10,)> {
    let output0 = sigmoid(x * W0 + b0);
    let output1 = sigmoid(output0 * W1 + b1);
    output1
}

This has the same computation graph as a simple neural network you might construct in tensorflow. Now how do I back-propagate over this? Can I use this function as part of a larger, differentiable function and back-propagate over that? How are the weights initialized?

This is the level of integration that would require built-in language support. But I’m wondering what the code might look like in practice.


#11

As the recent crate wyrm demonstrates, and has been pointed out by several in this thread, these types of operations can be done in libraries today. See an example of wyrm’s define-by-run syntax in the example linked above to see what this looks like:

This is already quite nice syntax, and while I do think there are ergonomic improvements to be made in the future, I think those are outside of this internals discussion.

The part that I think could benefit from compiler assistance/integration is the set of compiler optimizations that should be happening under the hood to support this syntax/library (or is probably not happening eg in wyrm because of the difficulties involved):

  1. CFG and dataflow analysis (including liveness analysis for memory savings)
  2. Operator fusion (both vertical and horizontal)
  3. Type safety for eg matrix computations (this is something rustc can/will help with const generics, but as @gavento points out, this might benefit from RTTI ala something like trait objects or monomorphisms (similar to TensorFlow’s recurrent length bucketing strategies)).
  4. Etc…

I can go into more detail for any of these analyses/optimizations since I work with an open source deep learning compiler project for my day job (Intel nGraph). Though this post/idea is all on my free time.

In general, I agree with @notriddle’s opinion that things like concurrency should stay outside the compiler to improve velocity/experimentation etc, but given how much of the above list is something the compiler is reasoning about already, building out an out-of-compiler set of representations/optimization pass machinery/analysis passes is not only wasteful, but will likely be more difficult when trying to reason about/optimize across the domains of these two compilers. Example: if I take a one dimensional slice out of a tensor, sort it, and then broadcast it back across multiple dimensions to do a binary operation across another tensor, then I lose the ability to optimize across this operation by doing an in-place sort unless I know the liveness of the pre-sliced tensor.

I think this is a somewhat different situation than the concurrency primitives being outside of rustc given that the primitives of concurrency are quite dissimilar to the core operations of rustc, unlike what I’ve described above.

That being said, I do think we are a long way out from adding all of this stuff to rustc today, but I think the process of laying the groundwork can start. For example, it sounds like the current consensus is that external MIR plugins/passes are currently discouraged (due to the understandable desire for MIR APIs/details to remain internal). So perhaps a next question is to figure out

  1. Given that we’d want to start out using some of the existing/new deep learning compilers (nGraph, NNVM+TVM, etc) at the library level, is there a natural way to iteratively transition/experiment to rustc aware optimizations that can leverage/benefit/extend planned rustc MIR optimizations?
  2. If MIR is the right level to be operating on these types of constructs? I think so, because once you lower to LLVM, you’ve lost a lot of your higher level info.
  3. How can we enable quick iteration/experimentation with MIR passes given the desire to keep this interface private? I think a similar analogy is that of the Linux device driver contract: If you upstream your driver code, then its use of internal APIs will likely change, but those will be taken care of by the people making the changes, whereas if you don’t upstream, then we’re free to change our internal APIs and break your driver and it’s your responsibility. But this seems like a burden on rustc…

#12

Let’s be clear - there are two opportunities here:

  1. Supporting ML in Rust (the focus of the above discussion), and
  2. Exploiting ML in optimising Rust code (Jeff Dean’s point about the use of ML in compilers).

Perhaps these could be investigated independently to start with?


#13

I don’t see that it needs language-level integration. If you want to check out something superficially similar, timely dataflow presents as roughly idiomatic Rust iterator code, but compiles to a dataflow representation under the hood, at which point you could do autodiff; I do auto-incremental computation personally, but it is up to you.


#14

To be clear, I wasn’t talking about generating computation graphs at runtime then manually passing them off to some compute engine (like you’d do with (eg.) pytorch), I was talking about the ML computation graphs being generated at compile time by rustc. Since we were talking about integrating ML at the language-level I (mistakenly) thought other people were talking about the same thing.

To be even clearer: 2u32 + 3u32 doesnt return a computation graph which computes 5u32, it just returns 5u32. And in my example matrix0 + matrix1 doesn’t return a computation graph which computes a matrix, it just returns a matrix.


#15

Right, but matrix0 + matrix1 can do whatever you ask the + operator implementation to perform for you, which can be registering the operation in a dataflow graph that you subsequently analyse for backprop or autodiff or whatever you would like to do with that information.

There is a big difference between language-based and library-based, and which largely comes down to the difference between “everyone must” and “anyone may” use your idioms. I suspect there is a pretty high bar for “all of Rust needs to be representable as dataflow” when TF demonstrates that you can get things done with just a library (rather than a fork of Python).

That being said, you could probably drop in on the MIR folks and see if you can help tool up some transformations from SSA/whatever to PDGs, if they don’t have those yet. There are years of compilers research on treating general computation as dataflow, but almost all of the recent ML success has been with libraries and DSLs instead.


#16

i see no problems with the language itself. const generics would be nice to have, but constant size tensors are an edge case in machine learning.

i use the tensorflow bindings for now, with a lot of boilerplate to be able to train with it.

the libraries are the problem, i have not found anything i could rely on.

maybe a layered approach would be the right one - using gfx-rs to abstract the gpu and use SPIR-V kernels in a crate like ndarray. this could add additional resources to gfx-rs and ndarray and would avoid repeating work.

and then building a graph crate upon that. just my two cents.,


#17

@glyn Agreed that using ML in rustc is another interesting area of investigation, and I agree that rustc supporting ML is we’ve talked about here and should probably try and stick to in this thread for clarity.

@frankmcsherry While I agree that you can provide a nice API using library defined operators (as you do with timely dataflow, and as wyrm does as I mentioned above) and then call out to an external (library hosted) execution environment, that execution environment now has to be it’s own compiler that has only partial knowledge of the computation (only the regions handled by your supplied operators). In addition, there’s a good chance that rustc is probably blind to that section of code (assuming JIT compilation of the matrix computations) as well and can’t reason about it from a performance standpoint.

I also wouldn’t cite TF as an example of how things should be: Calling 1M lines of C++ a python ‘library’ is a bit of a stretch when it implements an entire programming language (with it’s own scoping rules etc) and runtime with the issues that causes. One reason pytorch has been able to compete with TensorFlow is in large part because it leverages the existing language/control flow etc of Python and then try to fix performance through other means (tracing JIT compiler is their current attempt).

There is a big difference between language-based and library-based, and which largely comes down to the difference between “everyone must” and “anyone may” use your idioms. I suspect there is a pretty high bar for “all of Rust needs to be representable as dataflow”

I agree about the high bar required, but there are a rather small number of primitives that are quite central and consistent that aren’t subject to much variation (dense matmul). The idioms/syntax/higher level libraries then reuse/repackage/enhance these similar to SIMD intrinsics.

Agreed about MIR transformations. I guess one question is: how many MIR transformations/optimizations could be implemented in rustc today that would benefit normal code and any ML libraries (like wyrm) that might emit MIR code in the future?


#18

If I understand, what is being proposed is as follows:

The current Rust ELF uses libc, but in the future we won’t be depending on libc and will make syscalls directly. But the proposed Rust ELF will also depend on either CUDA or OpenCL. This doesn’t seem right to me. Rather, we want to borrow from two approaches: BLAS from HPC and regex from … everywhere. For example:

  1. There is a string input (regexes are short, but in this case, it’s probably an external file).

  2. There is a crate that can compile the DSL that will be able to run (probably in a thread or otherwise async as it’s long running) and offer Future<Klass> or Future<f32> when the computation is complete.

  3. The library can have different backends so you can run your system on deeplearning-gpu or deeplearning-simd. Or it’s a runtime binding like blas (but this requires stable linkage in Rust and I’m not sure this exists yet)

Benefits:

  • The cost to people who aren’t using deep learning is 0.
  • The release train is separated so Rust can continue at it’s cadence while the quickly iterating Deep Learning libraries can be iterated on at their own cadences.
  • Flexibility so there’s no one true blessed system for deep learning baked into Rust.
  • As a library, you could perhaps even call it from Python as well. This would be a good way to get more people using Rust. :wink:

Drawbacks:

  • Doesn’t co-opt the clever rustc compiler writers to work on this.
  • You don’t get to write TF flow code in/as vanilla Rust.

For using DSLs in Rust, check out how some of the parser libraries handle grammers. e.g. peg. The grammar is a DSL that is ‘built’ at build.rs time.


#19

Thanks for the picture @ehiggs, I should have been more concrete because that diagram does not describe what I was proposing: rustc itself is currently hardware agnostic and IMO should stay that way. LLVM/Cretonne should continue to be the ones responsible for platform dependent lowering/codegen.

Let me describe the main pain point with the library approach that is currently available to ML library authors in rustc today (ala BLAS and regex):

Take the let mat 2 = matrix0 + matrix1 example used above. As the library author, I have implemented the + operator here to do deferred execution (rather than immediate) so I can register the operation in a library implemented dataflow graph that I can perform my optimizations in etc. No big problems here other than having to build my own (matrix focused) compiler infrastructure in a library.

Now if the user grabs some values out of mat2 and uses them in an additional computation that creates a new matrix,

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let mat2 = mat0 + mat1;
  let factor = mat2[0] + mat2[1];
  //rustc takes over here ^^^ since Index<Matrix> returns a scalar
  let mat3 = mat2 / factor;
  mat3
}

Then because the library author loses ‘sight’ of the computation at the point it gives the user the two scalar values from mat2, the library can no longer reason about the fact that mat2's storage can be reused for mat3 and the element-wise division can be done in-place.

Instead, the library author just has to hope that MIR/LLVM will be able to do appropriate inlining and loop fusion and memory analysis across several library calls. And if our library does it’s own platform specific codegen (perhaps for an accelerator), which is quite likely these days, then the chance of this analysis happening correctly falls to zero.


#20

@binarybana, Thanks for the feedback. To be clear, in your example, what I would have envisioned might have been:

use matex;

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let comp = matex::compile::<FnMut<(Matrix<2, f32>, Matrix<2, f32>>)->Matrix<2, f32>>("{m = $0 + $1; return m / (m[0] + m[1])}")
  
  comp(mat0, mat1)
}

And even though you’re not keen on working with the computations as strings, a nice thing about it is that a REPL falls out of the work in a straight forward manner. That said, this could be made nicer for production code by implementing a language plugin:

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let comp = matex::compile!(|ref x, ref y| {
    m = x + y; 
    m / (m[0] + m[1])
  });
  
  comp(mat0, mat1)
}

Alternatively:

#[matex_openacc_style]
fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let mat2 = mat0 + mat1;
  mat2 / (mat2[0] + mat2[1])
}

As you’re probably familiar, running matrix operations on an accelerator means copying data to and from the accelerator is non trivial and should be explicitly managed (unless you have some kind of sufficiently smart compiler/scheduler) or you end up with severe performance degradation. So I think going without something to mark the code as ‘special’ is not a good idea for Rust which tries to reduce the magic.