HPC is a huge field, and the current issues are largely domain dependent. I can only talk about the current blockers for the particular domain in which I am proficient, which is solving systems of partial differential equations. I’ve decided to split the issues into 3 categories: core issues (probably useful outside HPC), fundamental HPC issues (useful in other domains of HPC beyond my own), and domain issues (useful within my domain), from less to more specific:
Core issues (useful in other domains beyond HPC):
Native numeric traits: `NativeComplex`. This is not about building a hierarchy, modelling mathematical number categories, nor about supporting user-defined `BigXXX` types. These should be simple traits for abstracting over native/primitive machine types. These traits could drive some core language features like `const fn` as trait methods.
Native SIMD support: SIMD vector types/algorithms with software fallback. In particular, users must be able to write `v0 = v1 * v2 + v3;`-style code, and that should generate a fused multiply-add SIMD instruction (FMA) when available. This would probably drive the design of plugins and improve eDSL support in Rust.
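A minimal sketch of what such a vector type could look like in today's stable Rust, with only the scalar software fallback shown (the type name `F32x4` and the whole API are hypothetical, not a proposal; a real implementation would lower `*` followed by `+` to a single FMA instruction per lane when the target supports it):

```rust
use std::ops::{Add, Mul};

// Hypothetical 4-lane f32 vector; scalar fallback only.
#[derive(Copy, Clone, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl Mul for F32x4 {
    type Output = F32x4;
    fn mul(self, rhs: F32x4) -> F32x4 {
        let mut out = [0.0; 4];
        for i in 0..4 {
            out[i] = self.0[i] * rhs.0[i];
        }
        F32x4(out)
    }
}

impl Add for F32x4 {
    type Output = F32x4;
    fn add(self, rhs: F32x4) -> F32x4 {
        let mut out = [0.0; 4];
        for i in 0..4 {
            out[i] = self.0[i] + rhs.0[i];
        }
        F32x4(out)
    }
}

fn main() {
    let (v1, v2, v3) = (F32x4([1.0; 4]), F32x4([2.0; 4]), F32x4([3.0; 4]));
    // Ideally this compiles down to one FMA per lane on capable hardware.
    let v0 = v1 * v2 + v3;
    assert_eq!(v0, F32x4([5.0; 4]));
}
```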
SIMD support on slices of primitive types: as opposed to operations on SIMD vector types, users often want to do operations on slices: `s0 = s1 * s2 + s3;`, where the slices have type `[f32]`. This requires figuring out the best vector type to use for loads/stores and the corresponding SIMD instructions at compile time (depending on the arch) and possibly at run time as well (when compiling for multiple archs), and properly dealing with the remainder, very short slices, … More plugin/eDSL work would be required.
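To make the remainder problem concrete, here is a sketch of such a slice kernel written by hand (the function name `slice_fma` and the fixed lane width are illustrative assumptions): a chunked main loop that a vectorizer could turn into SIMD code, plus a scalar tail loop for lengths that are not a multiple of the lane width.

```rust
/// Hypothetical element-wise fused multiply-add over slices:
/// s0[i] = s1[i] * s2[i] + s3[i].
fn slice_fma(s0: &mut [f32], s1: &[f32], s2: &[f32], s3: &[f32]) {
    assert!(s0.len() == s1.len() && s1.len() == s2.len() && s2.len() == s3.len());
    const LANES: usize = 4; // pretend vector width; really arch-dependent
    let n = s0.len() / LANES * LANES;
    // Main loop over full "vectors" of LANES elements.
    for i in (0..n).step_by(LANES) {
        for j in 0..LANES {
            s0[i + j] = s1[i + j].mul_add(s2[i + j], s3[i + j]);
        }
    }
    // Scalar remainder loop for the leftover tail.
    for i in n..s0.len() {
        s0[i] = s1[i].mul_add(s2[i], s3[i]);
    }
}

fn main() {
    let mut s0 = vec![0.0f32; 5]; // 5 = one full chunk + a remainder of 1
    slice_fma(&mut s0, &[1.0; 5], &[2.0; 5], &[3.0; 5]);
    assert_eq!(s0, vec![5.0; 5]);
}
```

The point of the language/library work would be to get this loop structure (and the arch-specific lane width) generated for you from `s0 = s1 * s2 + s3;`.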
Bitwise manipulation instructions: like SIMD instructions but for doing bit manipulation, and higher level algorithms that work on slices of words.
Abstracting over memory layout: high-performance code needs to be able to play with and switch the memory layout of its data in a localized way, without causing a rewrite of the whole application. Tools/techniques/idioms/language support for abstracting over the memory layout, so that it can be changed at will without major code changes, would be really helpful. For example, given a `struct S` with fields `x, y, z` and a slice of `[S]`, I would like to be able to change the layout of `S` from Array-of-Structs to Struct-of-Arrays to be able to benchmark the difference with as little code modification as possible. A possible way to do this could be improved support for Dynamically Sized Types (DSTs), but Jonathan Blow’s jai language (see the Data-Oriented-Structures section) also has an interesting take on this.
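For reference, the two layouts being switched between look like this in plain Rust (names illustrative); the goal of the abstraction would be that call sites like the two sum functions below collapse into one piece of code that is oblivious to which layout is in use:

```rust
// Array-of-Structs: one allocation, fields interleaved in memory.
#[derive(Clone, Copy)]
struct S {
    x: f32,
    y: f32,
    z: f32,
}

// Struct-of-Arrays: one contiguous array per field, SIMD-friendly.
struct SoA {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
}

// Today, each layout needs its own traversal code:
fn sum_x_aos(s: &[S]) -> f32 {
    s.iter().map(|p| p.x).sum()
}
fn sum_x_soa(s: &SoA) -> f32 {
    s.x.iter().sum()
}

fn main() {
    let aos = vec![S { x: 1.0, y: 0.0, z: 0.0 }; 3];
    let soa = SoA { x: vec![1.0; 3], y: vec![0.0; 3], z: vec![0.0; 3] };
    // Same answer, different memory layout.
    assert_eq!(sum_x_aos(&aos), sum_x_soa(&soa));
}
```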
Accelerator support: in the language. I’ll just leave it at this, no major language has this, although things are converging (e.g. C++ executors and HPX support accelerators, C++ HSA proposal is evolving fast).
Fundamental HPC issues (useful in other domains within HPC beyond my own):
Multi-dimensional iteration: iterating over a collection of elements with some multi-dimensional structure. In general these structures are graphs, and trees play a very important role in HPC. However, in a lot of cases these structures are just regular vectors/matrices/tensor-like grids. Being able to efficiently iterate over a rank-N tensor with N different lengths across its components, with/without strides, with low/high (aka ascending/descending, row/col) linear order of the elements in memory, with run-time/compile-time lengths/strides, and sub-tiles/matrices/views of these tensors is required for a lot of HPC applications. A couple of C++ experts from the national labs (LBL, Sandia) and AMD have recently written an ISO C++ paper called polymorphic multi-dimensional array views with their (very comprehensive) take on this. Type-level values is one of the C++ features they exploit to make this a zero-cost abstraction. Doing this with external iterators that can still be auto-vectorized is hard.
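The rank-2 special case of such a view can be sketched in a few lines (the `View2` type is a hypothetical illustration, with run-time dims and strides only; the hard part the paper addresses is generalizing this to rank N with compile-time extents and keeping it zero-cost):

```rust
/// Minimal strided 2-D view over a flat buffer. Row-major here, but
/// swapping the strides gives column-major without touching the data.
struct View2<'a> {
    data: &'a [f64],
    dims: [usize; 2],    // [rows, cols]
    strides: [usize; 2], // elements to skip per step in each dimension
}

impl<'a> View2<'a> {
    fn get(&self, i: usize, j: usize) -> f64 {
        assert!(i < self.dims[0] && j < self.dims[1]);
        self.data[i * self.strides[0] + j * self.strides[1]]
    }
}

fn main() {
    // A 2x3 matrix stored flat, row-major.
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let m = View2 { data: &data, dims: [2, 3], strides: [3, 1] };
    // Its transpose: same buffer, dims and strides swapped.
    let t = View2 { data: &data, dims: [3, 2], strides: [1, 3] };
    assert_eq!(m.get(1, 2), 6.0);
    assert_eq!(t.get(2, 1), 6.0);
}
```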
Linear-algebra vocabulary types and traits: dense and sparse vector, matrix/tensor traits that allow using different implementations of the types (e.g. to store additional data), and native implementations of these types with different memory layouts (e.g. SIMD vector types for small ones, row/col order, …), with/without compile-time sizes… Type-level values, better `const fn` support (probably in traits as well), and maybe even HKTs will be required for this.
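The "vocabulary trait" idea, in its simplest possible form (trait and type names here are made up for illustration): algorithms are written once against the trait, and any storage scheme, dense, sparse, or SIMD-backed, can plug in underneath.

```rust
/// Hypothetical vocabulary trait for a rank-2 container.
trait Matrix {
    fn rows(&self) -> usize;
    fn cols(&self) -> usize;
    fn get(&self, i: usize, j: usize) -> f64;
}

/// One possible storage scheme; sparse/SIMD-backed types would
/// implement the same trait.
struct DenseRowMajor {
    data: Vec<f64>,
    rows: usize,
    cols: usize,
}

impl Matrix for DenseRowMajor {
    fn rows(&self) -> usize {
        self.rows
    }
    fn cols(&self) -> usize {
        self.cols
    }
    fn get(&self, i: usize, j: usize) -> f64 {
        self.data[i * self.cols + j]
    }
}

/// A generic algorithm, independent of the storage layout.
fn trace<M: Matrix>(m: &M) -> f64 {
    (0..m.rows().min(m.cols())).map(|i| m.get(i, i)).sum()
}

fn main() {
    let m = DenseRowMajor { data: vec![1.0, 2.0, 3.0, 4.0], rows: 2, cols: 2 };
    assert_eq!(trace(&m), 5.0);
}
```

What this toy version cannot express (and what type-level values, `const fn` in traits, and maybe HKTs would buy) is compile-time sizes and layouts that the optimizer can see through at zero cost.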
Serial linear-algebra algorithms: that work on the linear algebra traits, and use SIMD and/or bindings to the system BLAS libraries. Ideally one should be able to at least solve a system of equations easily on dense or sparse matrices/vectors with generic syntax, e.g., `let x = linsolve(A, b)`; solve an eigenvalue problem; perform an SVD, QR decomposition, Givens rotations, POD, … All in serial, in shared memory, using SIMD. These are the “fundamental” linear-algebra building blocks for creating more complex and parallel algorithms.
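As a stand-in for what such a `linsolve` might dispatch to on a small dense problem, here is a toy Gaussian elimination with partial pivoting over a flat row-major buffer (a real implementation would of course call LAPACK/BLAS or hand-tuned SIMD kernels; the signature is an assumption for illustration):

```rust
/// Toy dense solve of A x = b. `a` is n*n row-major, `b` has length n.
/// Returns None if the matrix is (numerically) singular.
fn linsolve(mut a: Vec<f64>, mut b: Vec<f64>) -> Option<Vec<f64>> {
    let n = b.len();
    for k in 0..n {
        // Partial pivoting: pick the largest entry in column k.
        let p = (k..n).max_by(|&i, &j| {
            a[i * n + k].abs().partial_cmp(&a[j * n + k].abs()).unwrap()
        })?;
        if a[p * n + k].abs() < 1e-12 {
            return None;
        }
        if p != k {
            for j in 0..n {
                a.swap(k * n + j, p * n + j);
            }
            b.swap(k, p);
        }
        // Eliminate column k below the pivot.
        for i in k + 1..n {
            let f = a[i * n + k] / a[k * n + k];
            for j in k..n {
                a[i * n + j] -= f * a[k * n + j];
            }
            b[i] -= f * b[k];
        }
    }
    // Back substitution.
    let mut x = vec![0.0; n];
    for i in (0..n).rev() {
        let s: f64 = (i + 1..n).map(|j| a[i * n + j] * x[j]).sum();
        x[i] = (b[i] - s) / a[i * n + i];
    }
    Some(x)
}

fn main() {
    // 2x + y = 3, x + 3y = 5  =>  x = 0.8, y = 1.4
    let x = linsolve(vec![2.0, 1.0, 1.0, 3.0], vec![3.0, 5.0]).unwrap();
    assert!((x[0] - 0.8).abs() < 1e-9 && (x[1] - 1.4).abs() < 1e-9);
}
```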
Parallel linear-algebra algorithms: same as above but in parallel. Ideally one should be able to specify where and maybe how the work is done, e.g., using a tokio reactor, or some different executor (like a thread-pool, something that uses OpenMP, a GPU…). The latest C++ executors proposal and the parallel algorithms (and their HPX implementation) allow all of this for the C++ STL algorithms. The parallel algorithms should probably return a future and be async.
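The kind of kernel an executor parameter would let you retarget can be sketched with plain `std::thread::scope` (the function name and the explicit thread count are illustrative; a real design would take an executor/pool argument and probably return a future instead of blocking):

```rust
/// Blocking parallel dot product: split the slices into chunks,
/// reduce each chunk on its own scoped thread, sum the partials.
fn par_dot(a: &[f64], b: &[f64], n_threads: usize) -> f64 {
    assert_eq!(a.len(), b.len());
    let chunk = (a.len() + n_threads - 1) / n_threads;
    std::thread::scope(|s| {
        let handles: Vec<_> = a
            .chunks(chunk)
            .zip(b.chunks(chunk))
            .map(|(ca, cb)| {
                s.spawn(move || -> f64 {
                    ca.iter().zip(cb).map(|(x, y)| x * y).sum()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let dot = par_dot(&[1.0, 2.0, 3.0, 4.0], &[4.0, 3.0, 2.0, 1.0], 2);
    assert!((dot - 20.0).abs() < 1e-12);
}
```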
Linear-algebra EDSL: at the end of the day one wants to write Matlab-like code. Ideally, if I write something like `let x = linsolve(A, y) + linsolve(A, z);`, the matrix `A` should only be “inverted” (exposition only, not actually inverted) and loaded into memory once, and the whole thing should maximize cache efficiency, do a single memory traversal through `y` and `z`, and whatnot. This really requires plugins, creating an expression tree, …
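The expression-tree trick can at least be imitated with today's traits, C++-expression-template style (all names here are illustrative): `+` builds a node instead of allocating a result, and evaluation walks the tree once per element, so a sum of two vectors is a single traversal with no temporaries. What the library can't easily do, and where plugins would come in, is rewriting across calls like the double `linsolve` above.

```rust
use std::ops::Add;

/// An expression that can produce element i on demand.
trait Expr {
    fn at(&self, i: usize) -> f64;
    fn len(&self) -> usize;
}

/// Leaf node: a borrowed slice.
struct Slice<'a>(&'a [f64]);
impl<'a> Expr for Slice<'a> {
    fn at(&self, i: usize) -> f64 {
        self.0[i]
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Interior node: lazy element-wise sum of two sub-expressions.
struct Sum<L, R>(L, R);
impl<L: Expr, R: Expr> Expr for Sum<L, R> {
    fn at(&self, i: usize) -> f64 {
        self.0.at(i) + self.1.at(i)
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

// `+` builds a Sum node; no work happens yet.
impl<'a> Add<Slice<'a>> for Slice<'a> {
    type Output = Sum<Slice<'a>, Slice<'a>>;
    fn add(self, rhs: Slice<'a>) -> Self::Output {
        Sum(self, rhs)
    }
}

/// Single pass over the whole tree.
fn eval<E: Expr>(e: &E) -> Vec<f64> {
    (0..e.len()).map(|i| e.at(i)).collect()
}

fn main() {
    let (y, z) = ([1.0, 2.0], [3.0, 4.0]);
    let x = eval(&(Slice(&y) + Slice(&z)));
    assert_eq!(x, vec![4.0, 6.0]);
}
```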
Asynchronous File I/O: the current Rust Async I/O story is purely Network I/O focused. Lots of HPC applications need to write to a single huge file (typically in parallel but from different processes at the same time, not from different threads within the same process). This takes a long time, and can be overlapped with work, so… futures to the rescue? There are some widely-used C libraries that deal with this (HDF5, NetCDF, parallel-NetCDF, …) and lots of applications just write their own wrapper over MPI-I/O… which leads me to the next point.
A good MPI wrapper: type-safe, that uses futures, with plugins for serializing data-types, mapping lambdas to reductions, …
A good geometry library: this depends on vectors, matrices and linear algebra, but it would be great to have a trait-based, very generic and fast geometry library (a CGAL on steroids). At the end of the day, every application is going to implement its own vector, triangle, surface, polygon… types with their own data-quirks and layouts, but the subset of classic primitives is pretty standard (point, quaternions, line, ray, segment, triangle, square, polygons, polylines, planes, boxes, AABB, OBB, polyhedra, polysurfaces, signed-distance field…) and the operations one wants to perform are also standard (intersection, intersection test, split, overlap, union, difference, xor, centroid computation, surface computation, bounding box computation, translations, rotations, …). When dealing with large numbers of primitives the data-structures are also pretty standard (binary-/quad-/octree, point set, bounding volume hierarchy, half-edge data-structure, …). So having everything as traits, plus providing good implementations of the primitives and data-structures that one can reuse to implement their own with the extra quirks and then implement the trait for them, would be awesome.
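One classic primitive plus two classic operations from that list, as a sketch of how small and standard these building blocks are (the `Aabb` type and method names are illustrative, not a proposed API):

```rust
/// Axis-aligned bounding box in 3-D.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Aabb {
    min: [f64; 3],
    max: [f64; 3],
}

impl Aabb {
    /// Overlap test: the boxes intersect iff they overlap on every axis.
    fn intersects(&self, o: &Aabb) -> bool {
        (0..3).all(|k| self.min[k] <= o.max[k] && o.min[k] <= self.max[k])
    }

    /// Smallest box containing both.
    fn union(&self, o: &Aabb) -> Aabb {
        let mut r = *self;
        for k in 0..3 {
            r.min[k] = r.min[k].min(o.min[k]);
            r.max[k] = r.max[k].max(o.max[k]);
        }
        r
    }
}

fn main() {
    let a = Aabb { min: [0.0; 3], max: [1.0; 3] };
    let b = Aabb { min: [0.5; 3], max: [2.0; 3] };
    let c = Aabb { min: [3.0; 3], max: [4.0; 3] };
    assert!(a.intersects(&b) && !a.intersects(&c));
    assert_eq!(a.union(&b), Aabb { min: [0.0; 3], max: [2.0; 3] });
}
```

The trait-based version would abstract over the scalar type, the dimension, and the underlying storage, which is exactly where the linear-algebra vocabulary traits above come in.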
Other generic mathematical algorithms: things that build on the linear algebra stuff and are commonly needed: interpolation (linear/bilinear/trilinear, least squares, B-splines, radial basis functions…), integration (quadrature rules, trapezoid/Simpson/… rules, Newton’s method…), projections/restrictions, Krylov solvers, FFT, …
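For scale, the simplest of these building blocks, the composite trapezoid rule, is a handful of lines; the value of a library is having hundreds of such pieces tested, generic, and composable:

```rust
/// Composite trapezoid rule: integrate f on [a, b]
/// with n uniform subintervals.
fn trapezoid(f: impl Fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    let h = (b - a) / n as f64;
    let interior: f64 = (1..n).map(|i| f(a + i as f64 * h)).sum();
    h * (0.5 * (f(a) + f(b)) + interior)
}

fn main() {
    // Integral of x^2 on [0, 1] is exactly 1/3.
    let approx = trapezoid(|x| x * x, 0.0, 1.0, 1000);
    assert!((approx - 1.0 / 3.0).abs() < 1e-6);
}
```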
Optimization: steepest descent, quasi-Newton, Newton–Krylov, automatic differentiation…
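The first item on that list in its most naive form, as a sketch (fixed step size, caller-supplied gradient, 2-D only; a real library would be generic over the vector type via the linear-algebra traits and would do a line search):

```rust
/// Steepest descent with a fixed step: repeatedly move against
/// the gradient of the objective.
fn steepest_descent(
    grad: impl Fn(&[f64; 2]) -> [f64; 2],
    mut x: [f64; 2],
    step: f64,
    iters: usize,
) -> [f64; 2] {
    for _ in 0..iters {
        let g = grad(&x);
        x[0] -= step * g[0];
        x[1] -= step * g[1];
    }
    x
}

fn main() {
    // Minimize f(x, y) = (x - 1)^2 + (y + 2)^2; minimum at (1, -2).
    let grad = |p: &[f64; 2]| [2.0 * (p[0] - 1.0), 2.0 * (p[1] + 2.0)];
    let x = steepest_descent(grad, [0.0, 0.0], 0.1, 200);
    assert!((x[0] - 1.0).abs() < 1e-9 && (x[1] + 2.0).abs() < 1e-9);
}
```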
Domain issues (probably only useful in my domain):
Mesh data-structures for PDE solvers: structured, unstructured, block structured, hierarchical, hierarchical Cartesian, … grids, serialization for them, distributed memory support, load-balancing…
Scientific visualization: way to interface the core geometry libraries and mesh data-structures to write to common visualization formats like VTK, create custom visualizations, write plugins for scientific visualization software that uses VTK like ParaView, Visit,…
Libraries for building PDE solvers: something like Deal.II for Finite Elements, Clawpack for Finite Volumes, … There are tons of good Python libraries as well.
For an overview see, e.g., the Trilinos C++ framework from Sandia. The C alternative is Argonne National Laboratory’s PETSc library. Both of them offer the full stack, and they inter-operate to some degree.
One thing I realized when writing this is that, as I get closer and closer to my domain, while I would appreciate it if libraries were available, as long as the fundamental libraries and abstractions are there and work, I don’t mind writing my own (since I probably want to do custom stuff anyways). But having to write the full stack of file I/O, distributed computing, linear algebra and solvers, mathematical functions, multi-dimensional iteration, graphs… just to be able to get started writing what I am interested in (PDE solvers) is a titanic task.
Other people doing HPC will probably have completely different needs in terms of Domain issues, but I guess that for things like Machine Learning, Bioinformatics, HFT, … at least some of the fundamental issues will also apply. If you have similar or different needs please discuss them!