Machine learning primitives in rustc -- An Opportunity

Thanks for bringing up the idea with your writeup! I would be interested to hear more on the topic from experienced Rustaceans, in particular on the viability of using the mentioned or pre-existing optimisation features in LLVM (e.g. what is the status of tensor optimizations in stable LLVM?).

I can see the importance of having good support for ML in a systems language such as Rust, and I guess that a lot can be achieved through libraries. What would you like to see in the language and compiler itself that would be essential, say beyond the const generics for tensors you already mentioned?

One question about the const generics for array sizes: that optimization is based on the tensor dimensions being known at compile time, which is not always the case. Would you also need some on-demand specialization (recompiling some modules, or some kind of JIT) for it to be useful?
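
For concreteness, here is a minimal sketch of the compile-time dimension checking that const generics enable when shapes are statically known (the Matrix type below is hypothetical, not an existing crate); dynamically shaped tensors would indeed need runtime checks or some form of specialization:

// Hypothetical const-generic matrix: dimension mismatches become type errors.
struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>, // row-major storage of length R * C
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    fn zeros() -> Self {
        Matrix { data: vec![0.0; R * C] }
    }

    // An R x C matrix times a C x K matrix yields an R x K matrix;
    // any other shape combination fails to type-check.
    fn matmul<const K: usize>(&self, rhs: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = Matrix::<R, K>::zeros();
        for i in 0..R {
            for j in 0..K {
                let mut acc = 0.0;
                for k in 0..C {
                    acc += self.data[i * C + k] * rhs.data[k * K + j];
                }
                out.data[i * K + j] = acc;
            }
        }
        out
    }
}

fn main() {
    let a = Matrix::<3, 4>::zeros();
    let b = Matrix::<4, 2>::zeros();
    let _c = a.matmul(&b);      // 3 x 2 result
    // let _bad = b.matmul(&b); // shape mismatch: rejected at compile time
}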

Big note: rustc has very little support for reasoning about concurrency (it knows that static variables need to be Sync, that Send and Sync can be used in Trait Object definitions, and general auto-trait support), and no support for OS threads. Most of Rust’s support for concurrency is in the standard library.

I’d rather ML be the same way.


Can you show some example of how we might use this? How would I code with these machine learning primitives in Rust?

One thing I’ve been wanting lately, and which ties into this, is the ability for my code to reason bayesianly about what to do next. Basically I’ve been implementing UDP-based, peer-to-peer protocols where I need to ask questions like “how long should I wait before re-sending this packet?”, “how many times should I attempt to contact host X before trying host Y instead” etc. The result is that my code is chock-full of arbitrary magic numbers that I pulled out my ass and it’s crying out for a more principled way to do things.

I think what I’d really like is the ability to describe the state of the world with a set of probabilities, like “probability that host X is still online”, “probability that this packet has arrived”, then use these probabilities to guide control flow, and update them as evidence comes in. I’d write a library to do this, but I don’t know what this library would look like or how I’d use it.
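
Perhaps something as simple as this sketch, where every name is hypothetical: keep a Beta-distributed belief per question and update it as evidence arrives, then use the posterior to guide control flow instead of a magic number.

// Hypothetical sketch: a Beta-distributed belief such as
// "probability that host X is still online".
struct Belief {
    successes: f64, // alpha: observed "yes" evidence (e.g. replies received)
    failures: f64,  // beta: observed "no" evidence (e.g. timeouts)
}

impl Belief {
    fn new() -> Self {
        Belief { successes: 1.0, failures: 1.0 } // uninformative prior
    }

    fn observe(&mut self, succeeded: bool) {
        if succeeded { self.successes += 1.0 } else { self.failures += 1.0 }
    }

    // Posterior mean of Beta(successes, failures).
    fn probability(&self) -> f64 {
        self.successes / (self.successes + self.failures)
    }
}

fn main() {
    let mut host_online = Belief::new();
    host_online.observe(true);  // got a reply
    host_online.observe(false); // timed out
    if host_online.probability() > 0.5 {
        println!("retry host X");
    } else {
        println!("fall back to host Y");
    }
}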

Since this sounds similar to what you’re after maybe you have some examples of code like this.


To exemplify some of the indirection that is currently causing grief, consider the following simple tensorflow example:

x = tf.placeholder(tf.float32, shape=(1024, 1024))
y = tf.matmul(x, x)

with tf.Session() as sess:
  rand_array = np.random.rand(1024, 1024)
  print(sess.run(y, feed_dict={x: rand_array}))

This could look like the following in some fantasy version of Rust:

let x: Array<Array<f32, 1024>, 1024> = random::rand2::<f32>(1024, 1024); // fantasy fixed-size array type, randomly initialized
let y = x * x;   // matrix multiplication, as in tf.matmul above
println!("{:?}", y);
// extra
let z = y * 3.1415f32;
println!("z = {:?}", z);

It would then compile down to some format that represents a tensorflow graph intermingled with a couple of dynamic (and in this case synchronized) code sections of regular Rust. More complex examples would result in more complex graphs.

I’m far from being an expert in language design or even Rust, but there is nothing in your example that requires an ML-specific language-level feature. It could almost be done in today’s Rust.

If I understood correctly, DLVM is based on the idea of Lightweight Modular Staging (LMS). As far as I know, one of the main selling points of LMS is that it is just a Scala library and requires no special support from the compiler. Therefore, I would like to understand: what is the main reason why we cannot have an ML DSL implemented as a library in Rust?

The challenge here, in my opinion, is to build sub-graphs of computations that can be effectively dispatched to a compute device, whether through non-intrusive markup, language extensions, or an exceedingly clever compiler. While the example is contrived, it would be possible for its matrix multiplication to happen on the GPU. Note that you’d usually expect to see longer graphs of computation that can be dispatched.

Suppose I have a function like this:

fn foo(x: Matrix<(50, 50)>) -> Matrix<(10,)> {
    // W0, b0, W1, b1 are weight and bias parameters captured from the
    // surrounding scope; sigmoid is applied element-wise.
    let output0 = sigmoid(x * W0 + b0);
    let output1 = sigmoid(output0 * W1 + b1);
    output1
}

This has the same computation graph as a simple neural network you might construct in tensorflow. Now how do I back-propagate over this? Can I use this function as part of a larger, differentiable function and back-propagate over that? How are the weights initialized?

This is the level of integration that would require built-in language support. But I’m wondering what the code might look like in practice.

As the recent wyrm crate demonstrates, and as several in this thread have pointed out, these types of operations can be done in libraries today; wyrm’s define-by-run example (linked above) shows what this looks like.

This is already quite nice syntax, and while I do think there are ergonomic improvements to be made in the future, I think those are outside of this internals discussion.
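
To illustrate the define-by-run idea for readers who haven’t seen it, here is a deliberately simplified sketch (this is not wyrm’s actual API): every operation executes immediately and is also recorded on a tape, so gradients can be obtained by replaying the tape in reverse.

// Minimal scalar reverse-mode autodiff tape, for illustration only.
#[derive(Clone, Copy)]
enum Op { Leaf, Add(usize, usize), Mul(usize, usize) }

struct Tape { values: Vec<f64>, ops: Vec<Op> }

impl Tape {
    fn new() -> Self { Tape { values: vec![], ops: vec![] } }
    fn leaf(&mut self, v: f64) -> usize {
        self.values.push(v); self.ops.push(Op::Leaf); self.values.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.values.push(self.values[a] + self.values[b]);
        self.ops.push(Op::Add(a, b)); self.values.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.values.push(self.values[a] * self.values[b]);
        self.ops.push(Op::Mul(a, b)); self.values.len() - 1
    }
    // Reverse pass: accumulate d(output)/d(node) for every node on the tape.
    fn grad(&self, output: usize) -> Vec<f64> {
        let mut g = vec![0.0; self.values.len()];
        g[output] = 1.0;
        for i in (0..self.ops.len()).rev() {
            match self.ops[i] {
                Op::Leaf => {}
                Op::Add(a, b) => { g[a] += g[i]; g[b] += g[i]; }
                Op::Mul(a, b) => {
                    g[a] += g[i] * self.values[b];
                    g[b] += g[i] * self.values[a];
                }
            }
        }
        g
    }
}

fn main() {
    let mut t = Tape::new();
    let (w, x, b) = (t.leaf(2.0), t.leaf(3.0), t.leaf(1.0));
    let wx = t.mul(w, x);
    let y = t.add(wx, b); // y = w*x + b, recorded as it runs
    let g = t.grad(y);
    println!("dy/dw = {}, dy/dx = {}, dy/db = {}", g[w], g[x], g[b]); // 3, 2, 1
}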

The part that I think could benefit from compiler assistance/integration is the set of compiler optimizations that should be happening under the hood to support this syntax/library (or is probably not happening, e.g. in wyrm, because of the difficulties involved):

  1. CFG and dataflow analysis (including liveness analysis for memory savings)
  2. Operator fusion (both vertical and horizontal; see the sketch after this list)
  3. Type safety for e.g. matrix computations (this is something rustc can/will help with via const generics, but as @gavento points out, this might also benefit from RTTI à la trait objects versus monomorphization (similar to TensorFlow’s recurrent-length bucketing strategies)).
  4. Etc…
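
As a concrete illustration of item 2, here is a small sketch of vertical fusion over plain slices (the functions are made up for illustration): two element-wise passes with an intermediate buffer versus a single fused pass.

// Unfused: each op walks the whole buffer and materializes an intermediate.
fn scale_then_relu_unfused(x: &[f32], a: f32) -> Vec<f32> {
    let scaled: Vec<f32> = x.iter().map(|v| v * a).collect(); // pass 1 + temp buffer
    scaled.iter().map(|v| v.max(0.0)).collect()               // pass 2
}

// Fused: one pass, no intermediate allocation. This is the kind of rewrite a
// graph-level optimizer performs automatically across library/operator calls.
fn scale_then_relu_fused(x: &[f32], a: f32) -> Vec<f32> {
    x.iter().map(|v| (v * a).max(0.0)).collect()
}

fn main() {
    let x = vec![-1.0f32, 0.5, 2.0];
    assert_eq!(scale_then_relu_unfused(&x, 2.0), scale_then_relu_fused(&x, 2.0));
}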

I can go into more detail for any of these analyses/optimizations since I work with an open source deep learning compiler project for my day job (Intel nGraph), though this post/idea is all done in my own free time.

In general, I agree with @notriddle’s opinion that things like concurrency should stay outside the compiler to improve velocity/experimentation etc. But given how much of the above list is something the compiler is already reasoning about, building out an out-of-compiler set of representations, optimization-pass machinery, and analysis passes is not only wasteful, it will likely make it more difficult to reason about and optimize across the domains of the two compilers. Example: if I take a one-dimensional slice out of a tensor, sort it, and then broadcast it back across multiple dimensions to do a binary operation against another tensor, then I lose the ability to optimize this into an in-place sort unless I know the liveness of the pre-sliced tensor.

I think this is a somewhat different situation than the concurrency primitives being outside of rustc given that the primitives of concurrency are quite dissimilar to the core operations of rustc, unlike what I’ve described above.

That being said, I do think we are a long way out from adding all of this stuff to rustc today, but I think the process of laying the groundwork can start. For example, it sounds like the current consensus is that external MIR plugins/passes are currently discouraged (due to the understandable desire for MIR APIs/details to remain internal). So perhaps the next step is to figure out:

  1. Given that we’d want to start out using some of the existing/new deep learning compilers (nGraph, NNVM+TVM, etc.) at the library level, is there a natural way to iteratively transition/experiment toward rustc-aware optimizations that can leverage/extend planned rustc MIR optimizations?
  2. Is MIR the right level for operating on these types of constructs? I think so, because once you lower to LLVM, you’ve lost a lot of the higher-level information.
  3. How can we enable quick iteration/experimentation with MIR passes given the desire to keep this interface private? I think a good analogy is the Linux device driver contract: if you upstream your driver code, then its use of internal APIs will likely change, but those changes will be taken care of by the people making them; whereas if you don’t upstream, then we’re free to change our internal APIs and break your driver, and it’s your responsibility. But this seems like a burden on rustc…

Let’s be clear - there are two opportunities here:

  1. Supporting ML in Rust (the focus of the above discussion), and
  2. Exploiting ML in optimising Rust code (Jeff Dean’s point about the use of ML in compilers).

Perhaps these could be investigated independently to start with?

I don't see that it needs language-level integration. If you want to check out something superficially similar, timely dataflow presents as roughly idiomatic Rust iterator code, but compiles to a dataflow representation under the hood, at which point you could do autodiff; I do auto-incremental computation personally, but it is up to you.

To be clear, I wasn’t talking about generating computation graphs at runtime and then manually passing them off to some compute engine (like you’d do with, e.g., pytorch); I was talking about the ML computation graphs being generated at compile time by rustc. Since we were talking about integrating ML at the language level, I (mistakenly) thought other people were talking about the same thing.

To be even clearer: 2u32 + 3u32 doesn’t return a computation graph which computes 5u32, it just returns 5u32. And in my example matrix0 + matrix1 doesn’t return a computation graph which computes a matrix, it just returns a matrix.

Right, but matrix0 + matrix1 can do whatever you ask the + operator implementation to perform for you, which can be registering the operation in a dataflow graph that you subsequently analyse for backprop or autodiff or whatever you would like to do with that information.
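
For concreteness, a minimal sketch of that idea (the Expr type below is hypothetical and not any particular crate’s API): the + operator computes nothing and instead records a node in an expression graph that a library backend can later analyse, differentiate, and execute.

use std::ops::Add;

// Hypothetical expression-graph node.
#[derive(Clone, Debug)]
enum Expr {
    Input(&'static str),
    Add(Box<Expr>, Box<Expr>),
}

impl Add for Expr {
    type Output = Expr;
    fn add(self, rhs: Expr) -> Expr {
        Expr::Add(Box::new(self), Box::new(rhs)) // record, don't compute
    }
}

fn main() {
    let matrix0 = Expr::Input("matrix0");
    let matrix1 = Expr::Input("matrix1");
    let sum = matrix0 + matrix1; // builds a graph node, no arithmetic happens
    println!("{:?}", sum);       // Add(Input("matrix0"), Input("matrix1"))
}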

There is a big difference between language-based and library-based, which largely comes down to the difference between “everyone must” and “anyone may” use your idioms. I suspect there is a pretty high bar for “all of Rust needs to be representable as dataflow” when TF demonstrates that you can get things done with just a library (rather than a fork of Python).

That being said, you could probably drop in on the MIR folks and see if you can help tool up some transformations from SSA/whatever to PDGs, if they don’t have those yet. There are years of compilers research on treating general computation as dataflow, but almost all of the recent ML success has been with libraries and DSLs instead.

I see no problems with the language itself. Const generics would be nice to have, but constant-size tensors are an edge case in machine learning.

I use the tensorflow bindings for now, with a lot of boilerplate to be able to train with them.

The libraries are the problem; I have not found anything I could rely on.

Maybe a layered approach would be the right one: using gfx-rs to abstract the GPU and using SPIR-V kernels in a crate like ndarray. This could add additional resources to gfx-rs and ndarray and would avoid repeating work.

And then building a graph crate upon that. Just my two cents.


@glyn Agreed that using ML in rustc is another interesting area of investigation, and I agree that rustc supporting ML is what we've talked about here and what we should probably stick to in this thread for clarity.

@frankmcsherry While I agree that you can provide a nice API using library-defined operators (as you do with timely dataflow, and as wyrm does as I mentioned above) and then call out to an external (library-hosted) execution environment, that execution environment now has to be its own compiler, one with only partial knowledge of the computation (only the regions handled by your supplied operators). In addition, there's a good chance that rustc is blind to that section of code as well (assuming JIT compilation of the matrix computations) and can't reason about it from a performance standpoint.

I also wouldn't cite TF as an example of how things should be: calling 1M lines of C++ a Python 'library' is a bit of a stretch when it implements an entire programming language (with its own scoping rules etc.) and runtime, with the issues that causes. One reason pytorch has been able to compete with TensorFlow is in large part because it leverages the existing language/control flow of Python and then tries to fix performance through other means (a tracing JIT compiler is their current attempt).

There is a big difference between language-based and library-based, which largely comes down to the difference between “everyone must” and “anyone may” use your idioms. I suspect there is a pretty high bar for “all of Rust needs to be representable as dataflow”

I agree about the high bar required, but there is a rather small number of primitives that are quite central, consistent, and not subject to much variation (e.g. dense matmul). The idioms/syntax/higher-level libraries then reuse/repackage/enhance these, similar to SIMD intrinsics.

Agreed about MIR transformations. I guess one question is: how many MIR transformations/optimizations could be implemented in rustc today that would benefit normal code and any ML libraries (like wyrm) that might emit MIR code in the future?

If I understand correctly, what is being proposed is as follows:

The current Rust ELF uses libc, but in the future we won't be depending on libc and will make syscalls directly. But the proposed Rust ELF will also depend on either CUDA or OpenCL. This doesn't seem right to me. Rather, we want to borrow from two approaches: BLAS from HPC and regex from ... everywhere. For example:

  1. There is a string input (regexes are short, but in this case, it's probably an external file).

  2. There is a crate that can compile the DSL and run it (probably in a thread or otherwise asynchronously, as it's long running), offering a Future<Klass> or Future<f32> when the computation is complete.

  3. The library can have different backends so you can run your system on deeplearning-gpu or deeplearning-simd. Or it's a runtime binding like blas (but this requires stable linkage in Rust, and I'm not sure this exists yet).

Benefits:

  • The cost to people who aren't using deep learning is 0.
  • The release train is separated, so Rust can continue at its cadence while the rapidly evolving deep learning libraries can iterate at their own cadence.
  • Flexibility so there's no one true blessed system for deep learning baked into Rust.
  • As a library, you could perhaps even call it from Python as well. This would be a good way to get more people using Rust. :wink:

Drawbacks:

  • Doesn't co-opt the clever rustc compiler writers to work on this.
  • You don't get to write TF flow code in/as vanilla Rust.

For using DSLs in Rust, check out how some of the parser libraries handle grammars, e.g. peg. The grammar is a DSL that is 'built' at build.rs time.

Thanks for the picture @ehiggs, I should have been more concrete because that diagram does not describe what I was proposing: rustc itself is currently hardware agnostic and IMO should stay that way. LLVM/Cretonne should continue to be the ones responsible for platform dependent lowering/codegen.

Let me describe the main pain point with the library approach that is available to ML library authors in Rust today (à la BLAS and regex):

Take the let mat2 = matrix0 + matrix1 example used above. As the library author, I have implemented the + operator here to do deferred execution (rather than immediate) so I can register the operation in a library-implemented dataflow graph on which I can perform my optimizations, etc. No big problems here other than having to build my own (matrix-focused) compiler infrastructure in a library.

Now if the user grabs some values out of mat2 and uses them in an additional computation that creates a new matrix,

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let mat2 = mat0 + mat1;
  let factor = mat2[0] + mat2[1];
  //rustc takes over here ^^^ since Index<Matrix> returns a scalar
  let mat3 = mat2 / factor;
  mat3
}

Then because the library author loses ‘sight’ of the computation at the point it gives the user the two scalar values from mat2, the library can no longer reason about the fact that mat2's storage can be reused for mat3 and the element-wise division can be done in-place.

Instead, the library author just has to hope that MIR/LLVM will be able to do appropriate inlining, loop fusion, and memory analysis across several library calls. And if our library does its own platform-specific codegen (perhaps for an accelerator), which is quite likely these days, then the chance of this analysis happening correctly falls to zero.
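
For illustration, here is roughly what an optimizer with full visibility could emit for cool_comp, written by hand over plain slices (a sketch only, assuming a flattened representation of the matrices): mat2's storage is reused for the result and the division happens in place, which a library-level graph compiler cannot arrange once the scalar indexing escapes its view.

// Hand-written equivalent of the optimized cool_comp.
fn cool_comp_fused(mat0: &[f32], mat1: &[f32]) -> Vec<f32> {
    let mut mat2: Vec<f32> = mat0.iter().zip(mat1).map(|(a, b)| a + b).collect();
    let factor = mat2[0] + mat2[1];
    for v in &mut mat2 {
        *v /= factor; // in place: mat2's buffer becomes mat3
    }
    mat2
}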

@binarybana, Thanks for the feedback. To be clear, in your example, what I would have envisioned might have been:

use matex;

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let comp = matex::compile::<FnMut(Matrix<2, f32>, Matrix<2, f32>) -> Matrix<2, f32>>("{m = $0 + $1; return m / (m[0] + m[1])}");
  
  comp(mat0, mat1)
}

And even though you’re not keen on working with the computations as strings, a nice thing about it is that a REPL falls out of the work in a straightforward manner. That said, this could be made nicer for production code by implementing a language plugin:

fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let comp = matex::compile!(|ref x, ref y| {
    m = x + y; 
    m / (m[0] + m[1])
  });
  
  comp(mat0, mat1)
}

Alternatively:

#[matex_openacc_style]
fn cool_comp(mat0: Matrix<2, f32>, mat1: Matrix<2, f32>) -> Matrix<2, f32> {
  let mat2 = mat0 + mat1;
  mat2 / (mat2[0] + mat2[1])
}

As you’re probably aware, when running matrix operations on an accelerator, copying data to and from the accelerator is non-trivial and should be explicitly managed (unless you have some kind of sufficiently smart compiler/scheduler), or you end up with severe performance degradation. So I think going without something to mark the code as ‘special’ is not a good idea for Rust, which tries to reduce the magic.

The language plugin concept looks really neat, and doable with little effort and change to the existing language. Although for me personally it doesn’t eliminate the strain caused by meshing immediate and deferred execution.

I think it’s an interesting mid-term solution to improve the ergonomics of interacting with these kind of libraries. But I also find it interesting to explore the fully integrated aspect.

I’ll try to outline the caveats from the (excellent!) article on Julia Lang linked in the initial post:

Bundled approaches optimize specific, generally high-level approaches to machine learning. The library provides excellent support for feature X, but researchers frequently violate the assumptions that make these perform well. This hints at the need for even lower-level primitives being first-class citizens.

Building and reasoning around “deferred execution” (my terminology) is hard on the programmer and the compiler. The best example, I think, is when you want to do step-through debugging. While it’s conceivable that you can build a sophisticated language and runtime that supports this, you always end up with two different contexts: the immediate one and the deferred one. Two separate domains and languages, each with different limitations. Some models could benefit from using more generalized language constructs (loops, recursion, …), but those are hard to fit into the sometimes esoteric, specialized language.

This is partly why the article provokes you to think about a fully integrated language. And in my opinion, one that might depend on CUDA/OpenCL when generating code :). It could also be that tensorflow becomes such a strong de facto standard that it provides a low-level virtual machine which can be targeted as a general-purpose backend.

