Roadmap 2017 request - needs of HPC


#1

In this comment @nikomatsakis articulated that he wanted to see more motivation, use cases, blockers, … for using Rust in HPC domains.

HPC is a huge field, and the current issues are largely domain dependent. I can only talk about the current blockers for the particular domain in which I am proficient with, which is solving systems of partial differential equations. I’ve decided to split the issues into 3 categories: core issues (probably useful things outside HPC), fundamental HPC issues (useful in other domains of HPC beyond my own), and domain issues (useful within my domain), from less to more specific:

Core issues (useful in other domains beyond HPC):

  • Native numeric traits: NativeFloat for f32/64/(16), NativeSignedInt for i8/16/32/64/size, NativeUnsignedInt for u8/16/32/64/size, NativeInt for i/uXX, maybe NativeComplex. This is not about building a hierarchy, modelling mathematical number categories, nor about supporting user defined BigXXX types. These should be simple traits for abstracting over native/primitive machine types. These traits could drive some core language features like const fn as trait methods.

  • Native SIMD support: SIMD vector types/algorithms with software fallback. In particular users must be able to write v0 = v1 * v2 + v3; kind of code and that should generate a fused multiply add SIMD instruction (FMA) when available. This would probably drive the design of plugins and improve eDSL support in Rust.

  • SIMD support on slices of primitive types: as opposed to operations on SIMD vector types, often users want to do operations in slices: s0 = s1 * s2 + s3; where the type of sX is [f32]. This requires figuring out the best vector type to use for load/stores, and the corresponding SIMD instructions at compile-time (depending on the arch) and possibly at run-time as well (when compiling for multiple archs), and dealing with the remainder, very short slices, … properly. More plugin / eDSL work would be required.

  • Bitwise manipulation instructions: like SIMD instructions but for doing bit manipulation, and higher level algorithms that work on slices of words.

  • Abstracting over memory layout: high-performance code needs to be able to play and switch the memory layout of the data in a localized way, without causing a rewrite of the whole application. Tools/techniques/idioms/language support for abstracting over the memory layout and be able to change it at will without major code changes would be really helpfult. For example given a struct S with fields x, y, z and a slice of S, [S], I would like to be able to change the layout of S from Array-Of-Structs to Structs-Of-Arrays to be able to benchmark the difference with as little code modification as possible. A possible way to do this could be with improved support for Dynamically Sized Types (DSTs), but Jonathan Blow’s jai language (See the Data-Oriented-Structures section has also an interesting take on this.

  • Accelerator support: in the language. I’ll just leave it at this, no major language has this, although things are converging (e.g. C++ executors and HPX support accelerators, C++ HSA proposal is evolving fast).

Fundamental HPC issues (useful in other domains within HPC beyond my own):

  • Multi-dimensional iteration: iterating over a collection of elements with some multi-dimensional structure. In general these structures are graphs, and trees play a very important role in HPC. However, in a lot of cases these structures are just regular vectors/matrices/tensor-like grids. Being able to efficiently iterate over a rank-N tensor with N-different lengths across its components, with/without strides, with low/high (aka ascending/desceding, row/col) linear order of the elements in memory, with run-time/compile-time lengths/strides, and sub-tiles/matrices/views of these tensors is required for a lot of HPC applications. A couple of C++ experts from the national labs (lbl, sandia) and AMD have recently written a ISO C++ paper called polymorphic multi-dimensional array views with their (very comprehensive) take on this. Type-level values is one of the C++ features they exploit to make this a zero-cost abstraction. Doing this with external iterators that can still be auto-vectorized is hard.

  • Linear-algebra vocabulary types and traits: dense and sparse vector, matrix/tensor traits that allow using different implementations of the types (e.g. to store additional data), and native implementations of these types with different memory layouts (e.g. SIMD vector types for small ones, row/col order, …), with/without compile-time sizes… Type-level values, better const fn support, probably in traits as well, and maybe even HKTs will be required for this.

  • Serial linear-algebra algorithms: that work on the linear algebra traits, and use SIMD and/or bindings to the system BLAS libraries. Ideally one should be able to at least solve a system of equations easily on dense or sparse matrices/vectors with generic syntax, e.g., let x = linsolve(A, b), solve an eigen-value problem, perform an SVD, QR decomposition, Givens rotations, POD, … All in serial, in shared memory, using SIMD. These are the “fundamental” linear algebra building blocks for creating more complex and parallel algorithms.

  • Parallel linear-algebra algorithms: same as above but in parallel. Ideally one should be able to specify where and maybe how the work is done, e.g., using a tokio reactor, or some different executor (like a thread-pool, something that uses OpenMP, a GPU…). The latest executors C++ proposal and parallel algorithms (and the HPX C++ implementation) of these allow all of this for the C++ STL algorithms. The parallel algorithms should probably return a future and be async.

  • Linear-algebra EDSL: At the end of the day one wants to write Matlab-like code. Ideally if I write something like let x = linsolve(A, y) + linsolve(A, z); the matrix A should only be “inverted” (exposition only, not actually inverted) and loaded into memory once, and the whole thing should maximize cache efficiency, do a single memory traversal through y and z and whatnot. This really requires plugins, creating an expression tree, …

  • Asynchronous File I/O: the current Rust Async I/O story is purely Network I/O focused. Lots of HPC applications need to write to a single huge file (typically in parallel but from different processes at the same time, not from different threads within the same process). This takes a long time, and can be overlapped with work, so… futures to the rescue? There are some widely-used C libraries that deal with this (HDF5, NetCDF, parallel-NetCDF, …) and lots of applications just write their own wrapper over MPI-I/O… which leads me to the next point.

  • A good MPI wrapper: type-safe, that uses futures, with plugins for serializing data-types, mapping lambdas to reductions, …

  • A good geometry library: this depends on vector, matrices and linear algebra, but it would be great to have a trait-based very generic and fast geometry library (a CGAL on steroids). At the end of the day, every application is going to implement their own vector, triangle, surface, polygon… types with their own data-quirks and layouts, but the subset of classic primitives is pretty standard (point, quaternions, line, ray, segment, triangle, square, polygons, polylines, planes boxes, AABB, OBB, polyhedra, polysurfaces, signed-distance field…) and the operations one wants to perform are also standard (intersection, intersection test, split, overlap, union, difference, xor, centroid computation, surface computation, bounding box computation, translations, rotations, …). When dealing with large number of primitives the data-structures are also pretty standard (binary-/quad-/octree, point set, bounded volume hierarchy, half-edge data-structure, …). So having everything as trait, plus providing good implementation of the primitives and data-structures that one can reuse to implement their own with the extra quirks and then implement the trait for them would be awesome.

  • Other generic mathematical algorithms: things that build on the linear algebra stuff and are commonly needed: interpolation (linear/bilinear/trilinear, least squares, bsplines, radial basis functions…), integration (quadrature rules, trapezoid/simpson/… rules, newtons method…), projections/restrictions, krylov solvers, FFT, …

  • Optimization: steepest descent, quasi-newton, newton-krylov, automatic differentiation…

Domain issues (probably only useful in my domain):

  • Mesh data-structures for PDE solvers: structured, unstructured, block structured, hierarchical, hierarchical Cartesian, … grids, serialization for them, distributed memory support, load-balancing…

  • Scientific visualization: way to interface the core geometry libraries and mesh data-structures to write to common visualization formats like VTK, create custom visualizations, write plugins for scientific visualization software that uses VTK like ParaView, Visit,…

  • Libraries for building PDE solvers: something like Deal.II for Finite Elements, Clawpack for Finite Volumes, … There are also tons of good python libraries as well.

  • Multi-grid solvers…

  • For an overview, e.g., see the Trillinos C++ framework from Sandia. The C alternative is Argonne National Laboratory’s PetsC library. Both of them offer the full-stack, and inter-operate to some degree.

One thing I realized when writing this is that as I get closer and closer to my domain, while I would appreciate if there were libraries available, as long as the fundamental libraries and abstractions are there and work, I don’t mind writing my own (since I probably want to do custom stuff anyways). But having to write the full stack of file I/O, distributed computing, linear algebra and solvers, mathematical functions, multi-dimensional iteration, graphs… just to be able to get started writing what I am interested in (PDE solvers) is a titanic task.

Other people doing HPC will probably have completely different needs in terms of Domain issues, but I guess that for things like Machine Learning, Bioinformatics, HFT, … at least some of the fundamental issues will also apply. If you have similar or different needs please discuss them!


Focus of Rust evolution
#2

Thanks @gnzlbg for taking the time to write this down! Your list is quite comprehensive with lots of details, I’ll try to see if I’m missing s.th. for my purpose and post it here later.

In the case that you’ve missed it: there is also a discussion about numeric / scientific computing here:


#3

I think that one stumbling block is simply that the most Rust contributors – in fact, most Rust programmers – lack the experience needed to write such code. HPC is still very conservative (there is a reason it still uses quite a bit of Fortran), and many of these libraries are, as you said, not reasonable for one person to write.


#4

It would be nice to see if something like Halide could be used for rust.

Not only abstraction of presentation of the memory but of the organization of the computation http://people.csail.mit.edu/jrk/jrkthesis.pdf


#5

Hi, so this in fact comes up in Machine Learning and well I was wondering whether we should start this effort on. In machine learning frameworks like Theano, Tf etc… they are essentially compilers over multi dimensional arrays. Thus we briefly had a discussion here; https://www.reddit.com/r/rust/comments/5of7t8/is_anyone_interested_in_trying_to_build_a_graph/ on whether we should try something. Also note that HPC has to be done in this way, as whenever you have different parallel backends, making things like kernel fusion and what know can only be achieved via systematic approach (e.g. not implementing only the basic BLAS opeartions, but be able to generate new efficient kernels on the fly (compilation)). I’m interested in this area, however I have no expertise in the low level bits and I work more on abstract program optimization and autodiff. If you are interested for a chat please feel free to join up here: https://gitter.im/rust-ml/Lobby