I know it’s really counterintuitive, but it’s true, at least when you double round after every operation. Here’s an example with four-bit and six-bit significands for simplicity. Consider adding 1.010 >> 1 to 1.000 << 3 after extending to six bits:
1000.00
0000.101
---------
1000.101
Rounding this to six bits yields 1000.10 (half to even), and further rounding to four bits yields 1000 (again half to even). Directly rounding to four bits would round up to 1001, since the fractional part is 0.5 + 0.125, i.e. more than half. The double-rounded result of 1000 is more than half an ULP off!
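If you want to poke at this, here is a minimal Rust sketch (mine, purely illustrative) that reproduces the example with fixed-point integers; round_half_even is just a throwaway helper, not anything from std:

```rust
/// Round `v` to the nearest multiple of `step` (a power of two),
/// ties to even, mirroring IEEE 754's default rounding.
fn round_half_even(v: i64, step: i64) -> i64 {
    let q = v / step;
    let r = v % step;
    let half = step / 2;
    // Round up if the remainder is more than half a step, or exactly
    // half a step and the kept part is odd.
    let up = r > half || (r == half && q % 2 != 0);
    (q + if up { 1 } else { 0 }) * step
}

fn main() {
    // Work in units of 2^-3, so 1000.101 (binary) = 8.625 = 69 units.
    let exact = 69;

    // Direct rounding to a four-bit significand: multiples of 1 = 8 units.
    let direct = round_half_even(exact, 8); // 72 units = 1001 binary

    // Double rounding: first to six bits (multiples of 2^-2 = 2 units),
    // then to four bits.
    let once = round_half_even(exact, 2); // 68 units = 1000.10 binary
    let twice = round_half_even(once, 8); // 64 units = 1000 binary

    println!("direct: {direct}, double rounded: {twice}");
}
```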
It seems plausible that performing an entire chain of calculations in higher precision and rounding once at the end might improve things, if not always, then at least in most cases. But double rounding after every operation means there is no opportunity for the extra bits to help, only for the information discarded by the first rounding to make the second worse.
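For a concrete instance with real types (my own construction, not something from this thread), a single f64 addition followed by a cast to f32 already shows it, assuming the default round-to-nearest-even:

```rust
fn main() {
    let a: f64 = 1.0;
    let b: f64 = 2f64.powi(-24) + 2f64.powi(-53); // exactly representable in f64
    // The exact sum is 1 + 2^-24 + 2^-53; correctly rounded to f32 that is 1 + 2^-23.
    let correct = 1.0f32 + f32::EPSILON; // f32::EPSILON == 2^-23
    // Rounding to f64 first lands exactly on the f32 tie point 1 + 2^-24,
    // and the second rounding (ties to even) then drops back down to 1.0.
    let double_rounded = (a + b) as f32;
    println!("correct: {correct:.8}, double rounded: {double_rounded:.8}");
}
```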
Note that double rounding going wrong is not an IEEE 754 quirk. It can happen with base 10 and other rounding modes (let’s take round-half-up): 0.049 rounded to one decimal digit is 0.0, but first rounding to two decimal digits yields 0.05, and rounding that again yields 0.1.
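The same thing in code, with a throwaway round-half-up helper on scaled integers (again just illustrative):

```rust
/// Round `v` to the nearest multiple of `step`, ties upward (non-negative fixed point).
fn round_half_up(v: u32, step: u32) -> u32 {
    (v + step / 2) / step * step
}

fn main() {
    let v = 49; // 0.049 in units of 10^-3
    let direct = round_half_up(v, 100); // 0 units, i.e. 0.0
    let double = round_half_up(round_half_up(v, 10), 100); // 100 units, i.e. 0.1
    println!("direct: {direct}, double rounded: {double}");
}
```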
Having spent a good chunk of the last few weeks rounding floating-point numbers, I respectfully disagree. Regardless, a 128 KiB lookup table for every transcendental function would add one or even several megabytes of binary size. That is a pretty big cost.
But why can’t it be a newtype struct f8(u8); with inline assembly to access the hardware support for those numbers? More generally, I agree with @Aatch here.