The optimal CPU/ISA for Rust


#1

Hi folks,

Triggered by the recent blowup around Intel CPUs (which seems to be a general problem of speculative execution, maybe even of out-of-order execution (OOE) in general), I’m again thinking about CPU/machine design (I’m still very fascinated by the old Burroughs machines).

Let’s imagine we could design our own CPUs, specifically for Rust applications (and we’d invent our OS, too):

What would an optimal ISA look like? What kind of instructions/semantics do we need most? Which ones (of today’s CPUs) don’t we need at all?

Do we need any kind of hidden speculative execution, or could we have explicit instructions for that? IOW: make massive parallelization (micro-threads) really fast, so the compiler could generate parallelizable code paths, and allow the CPU/VM to choose whether to run them in parallel or even speculatively (e.g. speculation forbidden for bounds checks or access control).

What about memory addressing? Should we have tagged memory? Maybe typed memory instead of a linear address space?

Should we replace caches by local memory with paging, and replace the whole TLB magic by old-fashioned segment-based addressing (while having potentially many small segments, IOW: heap management in the ISA itself)?

Should we replace (user-accessible) registers by a stack (like Burroughs did)?

Let me know your opinions / ideas.

–mtx


#3

To a large extent Rust (and LLVM) is shaped by what existing architectures can do and doesn’t go much beyond that. If you were reinventing CPUs you’d probably also want to reinvent languages for them.

But for a Rust-machine specifically, perhaps look at what MIR does?


#4

Well, Rust has an OS written in it; it’s called Redox. https://www.redox-os.org/


#5

CPUs and their ISAs evolve continuously, as a function of their intended areas of application and of the available and anticipated semiconductor and packaging technologies. The primary driver is ROI (return on investment). The Burroughs architectures were developed at a time when restricted memory size and small gate count were major limiting factors. Those architectures are not really appropriate for today’s CMOS technology with microscopic gate sizes, where heat dissipation and unwanted quantum tunneling are major issues.

The recent side-channel attacks on speculative-execution CPU features do suggest that barrel processors, an implementation technique that has always been appropriate for hard real-time applications such as avionics engine control, might become more generally relevant, particularly in conjunction with lightweight threads such as Rust offers.


#6

Seeing how many recent side-channel attacks are based on cache timing and other monitoring of low-level CPU operation, I would personally be in favor of making access to high-precision CPU clocks (TSC, APIC timer) and performance monitoring counters a privileged OS operation, only giving untrusted code access to a low-resolution clock.

That would mitigate most attacks (e.g. by making cache timing attacks impractically slow) without being a major hindrance to legitimate performance analysis applications (just give the profiler process appropriate permissions).
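To make the "low-resolution clock" idea concrete, here is a minimal sketch of what such a clock could look like from the untrusted side. `CoarseClock` and `coarsen` are hypothetical names made up for illustration, and the granularity is an arbitrary assumption; a real mitigation would have to live in the OS or hardware, not in a user-space wrapper like this one.

```rust
use std::time::{Duration, Instant};

/// Round a duration down to a multiple of `granularity`.
/// A zero granularity leaves the value unchanged.
fn coarsen(elapsed: Duration, granularity: Duration) -> Duration {
    let g = granularity.as_nanos();
    if g == 0 {
        return elapsed;
    }
    let ticks = elapsed.as_nanos() / g;
    // ticks * g never exceeds the original value; clamp before narrowing to u64.
    let nanos = (ticks * g).min(u64::MAX as u128) as u64;
    Duration::from_nanos(nanos)
}

/// Hypothetical clock handed to untrusted code: it only ever reports
/// time in steps of `granularity`.
struct CoarseClock {
    start: Instant,
    granularity: Duration,
}

impl CoarseClock {
    fn new(granularity: Duration) -> Self {
        CoarseClock { start: Instant::now(), granularity }
    }

    /// Time since the clock was created, rounded down to the granularity.
    fn now(&self) -> Duration {
        coarsen(self.start.elapsed(), self.granularity)
    }
}
```

With a granularity of, say, 100 microseconds, two reads taken a few hundred nanoseconds apart will usually return the same value, which is the property that makes single-shot cache timing harder.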


#7

Direct access to high-precision clocks isn’t the only way to get a high-precision timer: you can just update a counter in a loop on another core. Making hardware parallelism a privileged operation is not gonna work out.
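For readers who have not seen the trick, a rough sketch of such a counter-based timer follows. `ticks_during` is a made-up helper name, and the tick unit is whatever the spinning thread manages on the machine at hand, so the numbers are only meaningful relative to each other.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Run `work` while a second thread increments a shared counter, and
/// return how many increments were observed while `work` was running.
fn ticks_during<F: FnOnce()>(work: F) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let ticker = {
        let counter = Arc::clone(&counter);
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            // The "clock": spin and count until told to stop.
            while !stop.load(Ordering::Relaxed) {
                counter.fetch_add(1, Ordering::Relaxed);
            }
        })
    };

    // Wait until the ticker thread is actually running.
    while counter.load(Ordering::Relaxed) == 0 {
        std::hint::spin_loop();
    }

    let before = counter.load(Ordering::Relaxed);
    work();
    let after = counter.load(Ordering::Relaxed);

    stop.store(true, Ordering::Relaxed);
    ticker.join().expect("ticker thread panicked");
    after - before
}
```

Nothing here needs a privileged instruction, only a second thread and shared memory, which is why restricting the hardware clocks alone does not remove the timing source.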


#8

Just going to throw this out there since I’m a huge fan and both LLVM and Rust just got support for it: RISC-V

https://riscv.org/

Why? Because I think, as much as an instruction set architecture and a programming language are capable of doing so, they share a lot of the same design criteria, philosophy, and goals.

First of all, RISC-V is a free and open source instruction set architecture with open source core designs. It’s overseen by the RISC-V foundation, whose role is rather like the Mozilla Foundation’s relationship with Rust: provide coordination and infrastructure for the project, but strongly encourage outside collaboration and ensure the project is a community effort. RISC-V is very much intended to be a research and experimentation platform, where many people can try different approaches and the community moves on with the ones that work the best in a collaborative, open-source manner.

Second, both are something of hypermodern projects, incorporating tons of hard-won knowledge about the problems of previous systems, and aiming to solve them strategically through better design.

Beyond that, here is how RISC-V matches up with Rust’s “safe, concurrent, practical” credo:

Safe: the RISC-V architecture was developed from the ground up with a security-oriented mindset, and includes numerous uncommon safety features, including an every-word-tagged memory architecture and cores where control and data circuits are physically isolated from each other. The RISC-V foundation also announced that all existing RISC-V cores are NOT vulnerable to Meltdown/Spectre-style attacks (more on that below).

Concurrent: RISC-V cores are amenable to massive parallelism. For example, a company named Esperanto just announced a massively parallel RISC-V CPU, consisting of 4096 ET-Minion “puny” cores and 16 more powerful ET-Maxion cores to drive them, while letting you target all of them with a single ISA. RISC-V is also a great testbed for research into new strategies for inter-core connectivity, such as a Labeled von Neumann Architecture which treats the CPU more like a traditional packet network and applies MPLS-style labels to messages flowing between cores, providing both efficient inter-core message routing and quality-of-service guarantees which prevent a single core from monopolizing CPU resources.

Practical: the modest goal of the RISC-V project is world domination. It aims to be the one true ISA, usable for everything from microcontrollers to, at least hypothetically, complex server-class CPUs à la Xeon, POWER, or SPARC.

Regarding Meltdown/Spectre, the RISC-V foundation just made an announcement about it, highlighting that no extant RISC-V cores are vulnerable (due to lack of speculative execution features):

https://riscv.org/2018/01/more-secure-world-risc-v-isa/

Research into adding speculative execution features to RISC-V is just now starting, with an announcement last month of the taping out of the first out-of-order silicon RISC-V chip. For RISC-V, the announcement of Meltdown/Spectre is almost perfectly timed, as they are greenfielding speculative execution right now with zero legacy, 20/20 hindsight, and an extremely solid foundation upon which they can build things like capability-based access controls for things like caches and system memory which could physically prevent speculation units from accessing anything outside the current protection domain at the circuit level. The RISC-V foundation had this to say:

The RISC-V community has an historic opportunity to “do security right” from the get-go with the benefit of up-to-date knowledge. In particular, the open RISC-V ISA makes it possible for many different groups to experiment with alternative mitigation techniques and share results. The RISC-V Foundation was formed with an open and inclusive governance model to allow for contributions from leading experts across academia and industry.

If this all sounds exciting to you, and you’d like to play around with Rust and RISC-V, support has just recently landed:

The HiFive1 is a relatively cheap ($60) hobbyist board in an Arduino form factor, and a Rust library is available to access its peripherals if you’d like to play around with things like blinking LEDs or other Arduino-style projects.


#9

Getting precise time measurement from a spin loop relies on several assumptions:

  • That the CPU clock rate is stable, without major power management-induced disturbances.
  • That the spinning thread is alone on its CPU core, or that both hyper-threading and OS-driven preemption are disabled.

The further away you get from these assumptions, the less precise your time measurements get. Now, whether they get imprecise enough to make cache timing side-channel attacks impractical is a complex question, which does not have a single absolute answer. It depends on…

  • System characteristics and operating conditions: a battery-powered smartphone or laptop, which is configured for aggressive power management and can have dozens of CPU-hungry background apps, is much less amenable to spin loop timing than the average HPC center node, which is usually set up to run with all power management features disabled and minimal background system activity.
  • The details and circumstances of the cache timing attack: measuring the ~100x latency difference between L1d and main memory is much easier than measuring the ~10x latency difference between L1d and L3 or the ~3x latency difference between L1d and L2.

…but I think it is fair to say that overall, we give the attacker a harder time by forcing the use of this less reliable timing source. It’s not as good as a more in-depth hardware redesign for better handling of secret information, which would be best in the long run, but it is still a short-term fix worth having in the meantime :slight_smile:


#10

Let’s say you can take a system that leaks my wallet to the attacker 100% of time and make it a system that does that only 1% of time.

Will that stop attackers? Hell no, distributing malicious JavaScript costs next to nothing. Will that make me trust the system any more? I don’t think this needs an answer.
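A back-of-the-envelope sketch of why a 1% success rate is little comfort: assuming each attempt is independent and succeeds with probability p (a simplification, not a measured figure), the chance of at least one success in n attempts is 1 - (1 - p)^n. The function name below is made up for illustration.

```rust
/// Probability that at least one of `attempts` independent tries succeeds,
/// when each try succeeds with probability `p`.
fn p_at_least_one(p: f64, attempts: u32) -> f64 {
    1.0 - (1.0 - p).powf(f64::from(attempts))
}
```

Under that assumption, `p_at_least_one(0.01, 100)` is already about 0.63, and with a thousand attempts the result is above 0.9999.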

Also, attacks only ever get better.


#11

When you consider that this is, modulo adjustment of probabilities, the value proposition of pretty much every single software security measure ever built, it does not look that bad! :wink:

But again, I will be the first to admit that the better longer-term solution is for hardware people to take the handling of secret data more seriously. For another area where this would be useful, there is a reason why cryptographers have spent so much time researching constant-time algorithms, and hardware should help us more in writing those instead of always making our lives harder.
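As an illustration of what "constant-time" means at the source level, here is a minimal sketch of a byte-string comparison without an early exit. `ct_eq` is a name chosen for this example. Note the caveat that motivates the complaint above: neither the compiler nor the CPU promises to preserve this timing behaviour, so real code should rely on a vetted library rather than on a sketch like this.

```rust
/// Compare two byte strings without returning early at the first mismatch.
/// The lengths are treated as public information.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Accumulate all differences; the loop always visits every byte.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

A naive `a == b` may stop at the first differing byte, so its running time can depend on how long the matching prefix of a secret is; the version above at least avoids that data-dependent branch in the source.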


#12

I disagree here. In the early ’80s the range of CPU architectures was incomparably greater than what we have now, that’s true. Sometimes a restriction of possibilities such as we have seen in the last 35 years is due to purely technological reasons: some ways of doing things are simply more efficient. But this is not the only thing that happened with CPU architecture: there were historical contingencies that give a very Darwinian flavor to the story (animals are not optimally designed, or mammal eyes would not have a blind spot).

The driving factor for evolution was the synergy between C, Unix and minicomputers. C and Unix were designed on and for PDP-11s (PDP-11 assembler quirks are still visible in modern C) and Unix spread because Unix + PDP-11s gave an unrivalled price/performance ratio for the time. In the early eighties, Pascal and C became the only two contenders for low-level non-assembly languages – a Pascal compiler for a Burroughs machine does very little, the language is incredibly close to the instruction set.

Pascal is an incomparably better language than C, but C had its killer weapon: modularity and linking loaders, thus a faster development cycle.

And microprocessors became competitive with minis. Once upon a time you had the choice between Pascal and C to develop on a Macintosh (and the 68000’s ISA is a lot like the PDP-11’s). But Pascal soon fell by the wayside for the reason I just gave.

So the name of the game in ISA design became running C efficiently. The discovery that pipelines were a key factor for efficiency (in general) led to the invention of RISCs. And now we are in a world where descriptors and microprogramming are completely gone, and LLVM is a virtual load-store machine, completely tailored for RISCs (the non-RISC x86 is the one dinosaur that survived), blissfully ignorant of descriptors or microprogramming. With descriptors you can implement the greater part of a garbage collector in hardware. That’d be useful to have sometimes, wouldn’t it? I don’t think that these two techniques I’ve just mentioned are incompatible with pipelines, just that people don’t care enough to try.

Rust is made for LLVM, but it addresses some problems that no language or ISA ever has, so it could lead to a renewal of ISA design.

OK, I forgot to mention that an important historical factor was the forever changing speed ratios between register memory, RAM, and storage. But these are not enough to explain the railroading to C+LLVM+RISC that we have seen happen.


#13

I can’t judge the veracity of this historical account, but I don’t see how that means Rust might enable or benefit from a revolution in ISA design. Rust is deliberately designed such that it maps extremely well to current CPU architectures and neither requires nor benefits from features they don’t have. After optimization, it looks very much like C or C++ code in terms of code organization, operations used, etc.

Sure, there are slight differences in how common some operations are (e.g., bounds checks are more common). If a CPU vendor cared they could potentially design a couple instructions to accelerate those operations that are more common in Rust code, though not by much since (as I said) they already work very well on current architectures.
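To show where those bounds checks come from, here is a small sketch with illustrative function names. Indexing with `data[i]` is checked at run time and panics when out of range, `get` exposes the same check as an `Option`, and iterating over the slice gives the compiler an easier job proving that no check is needed. Whether a given check is actually removed depends on the optimizer, so treat this as a sketch rather than a guarantee.

```rust
/// Sum the first `n` elements by index; each `data[i]` is bounds-checked
/// and panics if `i >= data.len()`.
fn sum_indexed(data: &[u32], n: usize) -> u32 {
    let mut total = 0u32;
    for i in 0..n {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Sum all elements with an iterator; there is no index to check.
fn sum_iter(data: &[u32]) -> u32 {
    data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

/// The same bounds check, surfaced as a value instead of a panic.
fn get_checked(data: &[u32], i: usize) -> Option<u32> {
    data.get(i).copied()
}
```

On current CPUs the check is a compare and a well-predicted branch, which fits the point that a dedicated instruction would not buy much.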

It’s conceivable there might be future extensions to Rust tailored to a new direction in ISA design. However, these extensions will likely be easier to apply to Rust than to C because it’s less entrenched, has fewer implementations, has accumulated less legacy cruft, makes it easier to write compiler plugins or DSLs, etc., rather than solely because of language design differences.


#14

It’s conceivable there might be future extensions to Rust tailored to a new direction in ISA design.

That’s what I meant, should have been more careful about the wording. The ability to parcel out a slice by means of reborrowing has enormous potential for parallel architectures, but is still too limited and should be extended to more complex data structures before it becomes interesting for hardware. But that probably won’t be for Rust 1.
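For anyone unfamiliar with what "parcelling out a slice" looks like today, here is a minimal sketch using only the standard library (scoped threads need Rust 1.63 or later). The function names are made up for the example. The borrow checker accepts it because `split_at_mut` and `chunks_mut` hand out non-overlapping mutable sub-slices.

```rust
use std::thread;

/// Split a slice in two and let each half be filled by its own thread.
fn split_and_fill(data: &mut [u64]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| left.fill(1));
        s.spawn(|| right.fill(2));
    });
}

/// Double every element, handing one disjoint chunk to each thread.
fn double_in_parallel(data: &mut [u64], parts: usize) {
    let parts = parts.max(1);
    // Ceiling division, and never a zero chunk length (chunks_mut would panic).
    let chunk_len = ((data.len() + parts - 1) / parts).max(1);
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_len) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= 2;
                }
            });
        }
    });
}
```

This works nicely for flat slices; doing the same for trees or graphs is where the current rules get in the way, which is the limitation mentioned above.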


#15

Mill is designed around C code, but Unsafe Rust is very close to C, and the ways it differs don’t seem to affect anything. The Mill also has support for multiple return values built-in, which should map cleanly to tuple returns.
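For reference, this is the kind of tuple return meant here, as a tiny sketch with an illustrative function name. How a Mill backend would actually lower it is an assumption on my part, not something checked against a real toolchain.

```rust
/// Return quotient and remainder together, or `None` when dividing by zero.
fn div_rem(n: u32, d: u32) -> Option<(u32, u32)> {
    if d == 0 {
        None
    } else {
        Some((n / d, n % d))
    }
}
```

A caller would destructure the result, e.g. `if let Some((q, r)) = div_rem(17, 5) { ... }`, so both values are logically returned at once rather than through an out-pointer as is common in C.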

Also, I think being statically scheduled would help with constant-time code.