Pre-RFC: Differential programming support

A few months ago, I summed up the state of Machine Learning in Rust. Since then, Rust hasn't improved much in the required areas I listed, while the design of Machine Learning / Deep Learning systems has moved more and more into the compiler, driven by leading language designers and compiler engineers; the first-class support in tensorflow-swift (the point of this proposal) is the prime example. Some projects have implemented the same idea elsewhere: Julia's Zygote, because Julia allows that kind of IR manipulation, and TVM, through its own DSL. For the record, this has come up before, but the difference now is that there's an actual implementation. See the Swift Differentiable Programming Mega-Proposal.

A bird's-eye view of the matter is as follows:

The central data structure in ML/DL is the multi-dimensional array, aka ndarray, aka tensor. An ndarray can hold fixed data or parameters that we want to fit to the data. A layer has some parameters, may hold some data, and really behaves like a function (for some reason, it took the field years to realize that). So basically we have some ndarrays and some functions, expressing a DAG of computations.

To find a good set of parameters, we need to do mathematical optimization; in the case of Deep Learning, that's some form of gradient descent. To be able to do gradient descent, we need to assume the functions involved are (almost everywhere) differentiable. Once we can calculate derivatives, we iteratively update the parameters and hope our "model" has learned something.

The status quo approach to finding derivatives is Automatic Differentiation (AD); for efficiency, it is done by constructing the reverse DAG of expressions, aka the adjoint DAG, and computing the derivatives backwards, aka backpropagation. The leading idea is to support this differentiation at compile time, aka differential programming.
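
As a tiny hand-worked example of what the reverse pass computes (standard chain-rule bookkeeping, not specific to any library): take f(w_1, w_2) = \sin(w_1 w_2).

\text{forward pass: } u = w_1 w_2, \quad y = \sin u
\text{reverse pass (adjoints } \bar{v} = \partial y / \partial v\text{): } \bar{y} = 1, \quad \bar{u} = \bar{y} \cos u, \quad \bar{w}_1 = \bar{u} \, w_2, \quad \bar{w}_2 = \bar{u} \, w_1

One backward sweep over the adjoint DAG yields every partial derivative of the scalar output at once, which is exactly why backpropagation fits a loss with many parameters.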

What does it mean in Rust and for Rust?

The sketch means supporting something like the following (slightly misleading, since we don't have good syntax for it yet):

#[differentiable]
fn f(x: Tensor<f32>, w: Tensor<f32>, b: Tensor<f32>) -> Tensor<f32> {
    x * w + b  // matrix multiplication and addition
}

where w and b are parameters and x is our data, and the compiler finds df/dw and df/db at x. That expands to

fn df_dw(x: Tensor<f32>, w: Tensor<f32>) -> Tensor<f32> {
    x  // or the transpose of x, to keep it column-oriented
}

A better idea for the expansion is to go through something closure-like:

#[derive(Differentiation)]
struct Linear {
    #[differentiable]
    w: Tensor<f32>,

    #[differentiable]
    b: Option<Tensor<f32>>,
}

impl FnOnce<(Tensor<f32>,)> for Linear { ... }  // call a Linear value as a function (needs unstable fn_traits / unboxed_closures)

let f = |x: Tensor<f32>| linear(x);  // where `linear` is a Linear value, called via the FnOnce impl above

and the compiler derives df_dw and df_db, which are

let df_dw = |x| x;
let df_db = |x| I; // the identity_tensor
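
For reference, those closures just encode the calculus of the affine map, written here in scalar form (the tensor case is the same up to transposes and the identity):

f(x, w, b) = x w + b, \quad \frac{\partial f}{\partial w} = x \;\text{(or } x^\top \text{ in the matrix layout)}, \quad \frac{\partial f}{\partial b} = 1 \;\text{(the identity } I \text{ for tensors)}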

More importantly, this is not only about neural nets. Such support could close the huge gap that Rust currently has in Scientific Computing in general, and in particular in areas where solving differential equations comes into play (finance, weather prediction, etc.), and I believe it would open up much greater possibilities for Rust.

I know this would entail a huge undertaking in rustc, but it would serve at least as a reference if the community wants to do it in the future.

Some other useful references:

  1. A Differentiable Programming System to Bridge Machine Learning and Scientific Computing

  2. Automatic differentiation in ML: where we are and where we should be going

  3. Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator

  4. A Simple Differentiable Programming Language

3 Likes

For the people who aren't already familiar with differential programming, explaining what #[differentiable] actually means in context would be very helpful. The idea of making Rust better for differential programming has come up before, but so far the response has been "do it as a library," mainly because nobody here really understands what is needed from the compiler for this to work.

Secondly, an example of how you would use this feature for actual work would be helpful; as is, you've just defined a function with an attribute on it, and this doesn't say anything about what that even means, let alone how it's used.

14 Likes

Say #[differentiable] was implemented as a compiler macro / attribute, etc.

What would something like

fn f(x: Tensor<f32>, w: Tensor<f32>, b: Tensor<f32>) -> Tensor<f32> {
    x * w + b  // matrix multiplication and addition
}

expand to?

4 Likes

And just to add some of my other thoughts on feasibility, and in compiler vs as a library:

According to your links, you can implement automatic differentiation as source code transformation or operator overloading. While operator overloading is trivially implementable for libraries with their own unique types (because operator behavior is defined through traits), from what I've read that's a runtime implementation.

So a compile-time solution would involve transforming the source code, which you can implement as a library today through proc macros. The real question then becomes: "do you need full type information while running the macro?" Because if you don't, you can still implement auto-differentiation as a library instead of needing compiler support. Though, while some of the difficulties associated with source transformation, such as parsing, are partially mitigated through Rust's type system, I can imagine that it would still be a huge undertaking.

I unfortunately don't know quite enough to follow all the links you posted. Examples mapping what is written to what gets output would be great. I really do think most of this could be implemented as a library with macros, but I could be wrong.

3 Likes

It's not easy to write the expansion out fully since we don't have the syntax for it, but I've tried to add more to the proposal. Note that AD is not differential programming.

@CAD97 @samsieber This can be done neither "as a library" nor as a proc macro. See the Swift mega-proposal.

1 Like

Note that this is not proposing yet another DL library, which has been requested/posted thousands of times. For a more complete understanding of my position, please see my blog post linked at the top of the post.

I think Rust is powerful enough to implement this as a library. From that mega-proposal:

Some languages provide the ability to define custom code transformations:

  • Macros enable syntax-based code transformations at compile-time. Hygienic macros (macro systems that avoid accidental variable capture) are available in a variety of languages, including Lisp, Julia, Rust, and Scala, to name a few. As an example: generated type-safe schema wrappers can be implemented using hygienic macros in Scala.
  • Compiler plugin systems enable programmers to write plugins that extend the behavior of a compiler. Compiler plugins are more popular in bootstrapped languages, like Haskell, Rust and Scala, where the plugin can be written in the language itself. As an example: a continuation-passing-style code transformation can be implemented as a compiler plugin in Scala.

One might make the case that derivative code generation for differentiation is better implemented as a custom code transformation. While that may be true in theory, Swift does not yet support custom code transformations in practice. This proposal presents differentiable programming as a system of high-level language features and semantics; derivative code generation is an implementation detail. If a system for custom code transformations is added to Swift one day, it may be possible to reimplement derivative code generation using that system without changing the high-level differentiable programming features proposed here.

Swift does not yet support custom code transformations in practice

I'd argue that Rust does support custom code transformations in practice, via proc macros and, to a much lesser extent, macro_rules!. And that's why I think this can be implemented as a library in Rust. I looked through the code examples in the introductory comment of that Swift mega-proposal, and I think everything I saw could be done with attribute proc macros.
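
To make that concrete, here is a rough, hand-written sketch of the kind of code such an attribute macro could emit for the earlier f example, using plain f64 instead of a Tensor type so it compiles on its own. The f_vjp name and the (value, pullback) shape are just my choices for illustration, not anything from the Swift proposal:

// What the user writes:
fn f(x: f64, w: f64, b: f64) -> f64 {
    x * w + b
}

// What a hypothetical #[differentiable] attribute macro might generate alongside it:
// the value plus a "pullback" closure mapping an output adjoint to input adjoints
// (reverse mode, i.e. a vector-Jacobian product).
fn f_vjp(x: f64, w: f64, b: f64) -> (f64, Box<dyn Fn(f64) -> (f64, f64, f64)>) {
    let y = x * w + b;
    // d(x*w + b)/dx = w,  d/dw = x,  d/db = 1
    let pullback: Box<dyn Fn(f64) -> (f64, f64, f64)> =
        Box::new(move |dy| (dy * w, dy * x, dy));
    (y, pullback)
}

fn main() {
    let (y, pullback) = f_vjp(2.0, 3.0, 1.0);
    assert_eq!(f(2.0, 3.0, 1.0), y);
    let (dx, dw, db) = pullback(1.0);
    println!("y = {y}, df/dx = {dx}, df/dw = {dw}, df/db = {db}");
    // prints: y = 7, df/dx = 3, df/dw = 2, df/db = 1
}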

The one drawback I see to this is that debugging probably won't be as nice.

7 Likes

The one wrinkle here is that it might take some major proc-macro skills to do this. I know of one high-profile programmer in the Rust ecosystem who pumps out lots of proc-macro related things. But I think that auto-differentiation is possible as a library, even if it's not immediately obvious that it can be done.

I think my initial approach for AD as a library would be:

  • Implement a Differentiable trait, exposing a gradient method
  • Implement a Differentiable wrapper struct, for use with arguments inside functions
  • Overload the operators for the wrapper struct
  • Write a proc macro that derives the Differentiable trait for a function by copying the function and doing some substitutions on the parameter types
  • You could also write a proc macro derive for structs

Note that this is woefully incomplete as a strategy and would need a lot of refining, but it seems possible; a rough sketch of the first three bullets is below.
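
To give a flavour of the first three bullets (nothing more than a sketch; the names here, Dual, var, constant, are made up for the example): a dual-number style wrapper whose overloaded operators carry a derivative along with each value.

use std::ops::{Add, Mul};

// A forward-mode "dual number": a value plus its derivative with respect
// to one chosen input.
#[derive(Clone, Copy, Debug)]
struct Dual {
    val: f64,
    grad: f64,
}

impl Dual {
    // The input we are differentiating with respect to (seeded with grad = 1).
    fn var(val: f64) -> Self {
        Dual { val, grad: 1.0 }
    }
    // Constants have zero derivative.
    fn constant(val: f64) -> Self {
        Dual { val, grad: 0.0 }
    }
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        Dual { val: self.val + rhs.val, grad: self.grad + rhs.grad }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // Product rule: (uv)' = u'v + uv'
        Dual {
            val: self.val * rhs.val,
            grad: self.grad * rhs.val + self.val * rhs.grad,
        }
    }
}

fn main() {
    // d/dx (x * x + 3) at x = 2.0 is 2x = 4
    let x = Dual::var(2.0);
    let y = x * x + Dual::constant(3.0);
    println!("value = {}, derivative = {}", y.val, y.grad); // value = 7, derivative = 4
}

The proc macro from the later bullets would then be responsible for rewriting parameter types to this wrapper so that ordinary-looking function bodies get differentiated.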

I came across a paper on a Python library that does auto-differentiation as a library: https://papers.nips.cc/paper/7863-tangent-automatic-differentiation-using-source-code-transformation-for-dynamically-typed-array-programming.pdf

Now, it actually needs to parse the Python code, but I think we could get away with changing parameter types and implementing operator overloading... I think.

1 Like

Yes, Tangent served as a prototype that led to tensorflow-swift. We need more IR-level manipulation. I'm not sure about the proc-macro idea; I'd be surprised if it works.

At the very least, it is possible to implement a subset of differential programming features via a procedural macro. If you really want to see this in the compiler, you should implement that subset in a proc macro, and then find out how far you can take that approach.

Once you know what the limits of that approach are, then you could make a case to the compiler team that there is a need for a new language feature to solve a specific limitation. Perhaps when you find out what the actual blocker is, you may find that the language-level support needed is lower level than "support differential programming" and may be useful for other areas than just machine learning.

This is pretty much how the development of async/await progressed too, and differential programming looks a lot more amenable to implementation via procedural macros than async/await was.

17 Likes

I'll note that I'm currently applying/looking for a PhD student position (yes I'm applying at MPI-SWS), and have some very good experience working with proc macros (though maybe not quite as much as said prominent author). Code generation/manipulation is sort of my specialty recently.

I don't want to advertise too much on this forum, but if somebody wants this done enough to pay for it, I'm almost certainly a promising candidate to experiment along these lines.

(If I had any real preexisting knowledge of how this works or why it'd be useful, I'd probably take a stab towards it anyway, as it seems like an interesting problem right down my alley.)

2 Likes

I want to understand what exactly the benefit is of having language support, or of using code transformation, over the "dual numbers" approach, which is both theoretically and implementation-wise simple and flexible.

I have been seeking the reason but I still haven't found the answer.

The idea, if I understand it correctly, is to have the differentiation calculations take place at compile time rather than runtime. This makes the runtime computations much more efficient.

The dual-number approach implements forward-mode differentiation and is efficient for a function f: R^n -> R^m where n << m. However, in DL we are computing derivatives of a loss function f: R^n -> R with respect to the parameters, which would require O(n) evaluations, one per partial derivative of the loss. In the opposite case, f: R^n -> R^m with n >> m, reverse-mode differentiation is a lot more efficient.
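
Spelled out in terms of Jacobians (standard AD bookkeeping, not specific to any framework):

f : \mathbb{R}^n \to \mathbb{R}^m, \quad J = \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}
\text{forward mode: one pass computes } J v \text{ (one column of } J\text{), so the full gradient of a loss (} m = 1 \text{) costs } n \text{ passes}
\text{reverse mode: one pass computes } v^\top J \text{ (one row of } J\text{), so the full gradient of a loss costs a single backward pass}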

For more details, have a look at the survey Automatic differentiation in ML.

2 Likes

Seems like a good grad project :wink:

I've now tried to implement the reverse-mode automatic differentiation from "Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator", and I have some insight into why "operator overloading" approaches don't work nicely in Rust and why the code-transformation approach is desirable.

The backpropagation phase computes gradients by modifying related nodes in place. Thus, each node must have a mutable reference to its sub-expression nodes. But Rust hates multiple mutable references!

For instance, even a simple square function |x| x * x needs two mutable references to x simultaneously. In general, an arbitrary DAG can be constructed, and Rust cannot handle that nicely.

Thus, it requires either some kind of interior mutability or a non-local transformation of the code. A code transformation can probably output backpropagation code without the mutable-reference issues. I think this is one of the main points of this feature?
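
A stripped-down illustration of both the problem and the usual interior-mutability escape hatch (Rc<RefCell<...>> as shared, mutable nodes; the names Node, Var and mul are made up for the example):

use std::cell::RefCell;
use std::rc::Rc;

// A node in the expression DAG: a value plus an accumulated adjoint (gradient).
struct Node {
    val: f64,
    grad: f64,
}

type Var = Rc<RefCell<Node>>;

fn var(val: f64) -> Var {
    Rc::new(RefCell::new(Node { val, grad: 0.0 }))
}

// y = a * b. Returns the value together with a backward step that pushes the
// output adjoint into BOTH operands; with plain `&mut Node` this is exactly the
// aliasing problem, because for |x| x * x the two operands are the same node.
fn mul(a: &Var, b: &Var) -> (f64, Box<dyn Fn(f64)>) {
    let y = a.borrow().val * b.borrow().val;
    let (a, b) = (Rc::clone(a), Rc::clone(b));
    let backward: Box<dyn Fn(f64)> = Box::new(move |dy| {
        let (av, bv) = (a.borrow().val, b.borrow().val);
        a.borrow_mut().grad += dy * bv; // d(ab)/da = b
        b.borrow_mut().grad += dy * av; // d(ab)/db = a
    });
    (y, backward)
}

fn main() {
    let x = var(3.0);
    let (y, backward) = mul(&x, &x); // the square function |x| x * x
    backward(1.0);                   // seed the output adjoint with 1
    println!("y = {y}, dy/dx = {}", x.borrow().grad); // y = 9, dy/dx = 6
}

A source-to-source transformation could instead emit straight-line gradient code (like the closures earlier in the thread) and sidestep the shared mutability entirely.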

1 Like

As I said elsewhere, I believe differential programming should be done in a crate and not at the language level, as:

  • it is a research topic that is still evolving (whereas Rust tries to keep its core stable over time)
  • I believe that it can be done at the crate level with macros and appropriate types (see the work done in Julia for some inspiration)

As a side note, I am aware of very few actual uses of (non-overloading-based) automatic differentiation in the wild (libraries like TensorFlow do not require special support to implement it, as their operations are well constrained). What I have seen is a number of scientific applications that use operator overloading based on libraries such as Sacado (which requires no special language support and has been, at least partially, implemented in Rust).

2 Likes
  1. No, it's now out of the prototype/research phase; see the Swift mega-proposal.
  2. Julia allows easier manipulation of its IR.

TensorFlow, PyTorch, and MXNet all do Automatic Differentiation (AD) with a mix of operator overloading and tracing at runtime. Derivatives are not first-class citizens of Python.

TF4S (tensorflow-swift) is the next generation, and if there's one feature that would close the gap for Rust in scientific applications, it would be this one, because it encapsulates a lot of other important features.

Also, as I explained above, the hyperdual crate implements forward-mode differentiation, which is not efficient for DL, and AD is not differential programming.

1 Like

I just spent some time reading the Swift proposal:

  • Writing a Differentiable trait for types, with an associated derive macro (which would define the type of the TangentVector), should be fairly straightforward (a possible Rust sketch of this trait appears below).

  • Marking a function as the user-defined forward|reverse derivative of another function (with respect to one or more parameters) could be a simple matter of renaming it following a predefined convention.

  • Defining a gradient! macro that takes a function and the name of the parameters of interest in order to call the appropriate forward|reverse derivative should be doable.

  • Marking a function as #differentiable is harder, as we cannot know whether its user-defined input types are differentiable at macro time. It could default to all parameters unless told otherwise (the Swift proposal offers several ideas in that space).

  • Automatically building the forward derivative of a #differentiable function should be a mechanical process once we know which inputs should be differentiated.

  • I am not familiar enough with tape-based implementations (which is what Swift uses) to know how to automatically build the reverse derivative of a #differentiable function, but it seems doable (it can be done mechanically for simple functions with no control flow or mutability before moving on to a proper, general implementation).

Steps 1 to 3 should be doable in a relatively short time.
Steps 4 and 5 are more delicate, as they require some AST manipulation, but they are doable (and would be a good way to explore how Rust's type system interacts with the transformations and what is truly required of a differentiable type (Clone?)).
Step 6 requires a review of the state of the art, some design brainstorming, a solid tape implementation, and the knowledge gained from the previous steps.
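
For illustration, the trait from the first bullet could start out as small as this; the names are my guess at a Rust rendering of Swift's Differentiable protocol (its TangentVector and move(along:)), not a finished design:

// Sketch of the core trait; the derive macro would fill in TangentVector
// and the impl for user-defined structs.
trait Differentiable {
    /// The type of gradients / directions associated with `Self`.
    type TangentVector;

    /// Move `self` along a tangent direction (Swift's `move(along:)`).
    fn move_along(&mut self, direction: &Self::TangentVector);
}

// The kind of impl the derive macro would emit for a primitive type.
impl Differentiable for f64 {
    type TangentVector = f64;

    fn move_along(&mut self, direction: &f64) {
        *self += *direction;
    }
}

fn main() {
    let mut w = 1.0_f64;
    w.move_along(&0.5); // a single gradient-descent-style update step
    assert_eq!(w, 1.5);
}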

The result could be as efficient as the Swift implementation, the hardest part being producing nice error messages when users forget to differentiate something along the way.

I do not have the time to start prototyping (I already have Advent of Code, a crate in progress (for which I do manual gradient computation), and a PhD to finish), but I stand by my position that this is doable now, with macros, on stable Rust.

For those interested but not familiar with the subject, here is a nice tutorial on forward-mode vs reverse-mode differentiation, with an associated, runtime-evaluated Rust implementation: reverse-mode-automatic-differentiation

11 Likes