As of today, Rust uses a naive implementation for converting integers to decimal strings in the integer Debug/Display methods. I propose adding an optimized path for the most commonly used case, that is, converting to decimal characters.
This should not be measurable overall, but it might give us a minor speedup in the serde and rustc serializers.
I wrote a reasonably optimized version here. Further optimizations are possible, but I tried to keep the code size small (which I think is important); beyond this point it's a road of diminishing returns.
The benchmarks are nothing scientific, but I tried to avoid the most common pitfalls. Timing includes the formatting machinery (the buffer is always preallocated, though), so it should be reasonably close to the real-world code path.
Running with rust 1.2 nightly @ x64 Linux - Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz
test tests::new_08 ... bench: 562 ns/iter (+/- 111) (-18%)
test tests::new_16 ... bench: 1424 ns/iter (+/- 60) (-08%)
test tests::new_32 ... bench: 3342 ns/iter (+/- 92) (-16%)
test tests::new_64 ... bench: 7692 ns/iter (+/- 373) (-48%)
test tests::stdlib_08 ... bench: 626 ns/iter (+/- 12)
test tests::stdlib_16 ... bench: 1540 ns/iter (+/- 113)
test tests::stdlib_32 ... bench: 3887 ns/iter (+/- 72)
test tests::stdlib_64 ... bench: 11436 ns/iter (+/- 317)
One problem I can see with your benchmarks is the 0ns baseline: this basically means your code got inlined to nothing (which you usually want to avoid in benchmarks).
Apart from that, good job. I see that you use a lookup table for the common case. AFAIK Java also does this, though on some platforms direct arithmetic conversion is faster.
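For anyone following along, here is a minimal sketch of what a lookup-table conversion can look like. It is not the code from the proposal: the names `DEC_DIGITS_LUT` and `u64_to_decimal` are made up for illustration, and it returns a `String` instead of going through the formatting machinery. The idea is that each division by 100 yields two digits, which are copied straight out of a 200-byte table.

```rust
/// Two ASCII digits for every value in 0..100, packed back to back.
/// (Hypothetical name, chosen for this sketch.)
const DEC_DIGITS_LUT: &[u8; 200] = b"0001020304050607080910111213141516171819\
2021222324252627282930313233343536373839\
4041424344454647484950515253545556575859\
6061626364656667686970717273747576777879\
8081828384858687888990919293949596979899";

/// Formats `n` as decimal, producing two digits per division step.
/// (Hypothetical helper, not the proposed implementation.)
fn u64_to_decimal(mut n: u64) -> String {
    // u64::MAX has 20 decimal digits, so 20 bytes always suffice.
    let mut buf = [0u8; 20];
    let mut pos = buf.len();

    // Fill the buffer from the right, two digits at a time.
    while n >= 100 {
        let pair = (n % 100) as usize * 2;
        n /= 100;
        pos -= 2;
        buf[pos..pos + 2].copy_from_slice(&DEC_DIGITS_LUT[pair..pair + 2]);
    }

    // At most two digits remain.
    if n >= 10 {
        let pair = n as usize * 2;
        pos -= 2;
        buf[pos..pos + 2].copy_from_slice(&DEC_DIGITS_LUT[pair..pair + 2]);
    } else {
        pos -= 1;
        buf[pos] = b'0' + n as u8;
    }

    // Every byte written is an ASCII digit, so this cannot fail.
    String::from_utf8(buf[pos..].to_vec()).expect("ASCII digits are valid UTF-8")
}
```

Calling `u64_to_decimal(12345)` should give the same string as `12345u64.to_string()`. The table costs 200 bytes of read-only data, which is the code-size trade-off being discussed.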
The baseline (_warmup) is just there to warm up the processor (get it to a stable frequency); it takes 0ns because it's a noop loop. I'll remove that line to avoid further confusion.
Modern processors have highly parallel ALUs, but I still doubt the arithmetic alternatives will be faster. I'll give it a try, though!
On my system, a noop loop takes more than 0ns – so either there's something wrong with my system, or your _warmup || 1 closure gets inlined away. I'll investigate and come back with my findings.
Edit: I had forgotten to compile my benchmark with -O.
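For reference, a rough sketch of how to keep a timed loop from being optimized away when that is not what you want. It assumes a current stable toolchain, where `std::hint::black_box` is available (it was not on stable at the time of this thread), and `time_iterations` is a made-up helper, not part of the benchmark above. Note that `black_box` is documented as best-effort, so treat this as a hint to the optimizer rather than a guarantee.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the total elapsed time.
/// (Hypothetical helper for illustration.)
fn time_iterations<F: FnMut() -> u64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Passing the result through black_box asks the optimizer to
        // treat it as used, so the call is less likely to be removed.
        black_box(f());
    }
    start.elapsed()
}
```

Something like `time_iterations(1_000, || 1)` would then time a closure equivalent to `|| 1` without the loop body being trivially discarded, although the measured time per iteration can still be too small to be meaningful.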