Stabilizing basic functions on arrays and slices

Proposal:

  • Resolve https://github.com/rust-lang/rust/issues/27750 in Rust 1.5 (if possible, otherwise 1.6) by making clone_from_slice stable and removing its feature gate.

  • Resolve https://github.com/rust-lang/rust/issues/27740 in Rust 1.5 (if possible, otherwise 1.6) by deprecating std::slice::bytes (the MutableByteVector trait that gives slices the set_memory method, and the one-off, oddly placed copy_memory function).

  • In Rust 1.6 define and document clone_from_slice to do at most one bounds check on each slice, and mark clone_from_slice #[inline(always)] to maximize the chances that one or both of the bounds checks will be removed. Further, document that for primitive types (at least), clone_from_slice is equivalent in performance to calling memcpy other than the two bounds checks.

  • In Rust 1.6, add clone_from_value to slices:

/// Sets every element of the slice to `value` using `Clone::clone_from`.
/// For primitive types, this is the Rust analog of the C `memset` function
/// plus at most one check of the length of the slice.
#[inline(always)]
fn clone_from_value(&mut self, value: &T) where T: Clone {
    for b in self {
        b.clone_from(value);
    }
}
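
For concreteness, here is a sketch of a clone_from_slice that would satisfy the proposed semantics (hypothetical code in the style of the snippet above, not the current unstable implementation):

/// Copies elements from `src` into `self` using `Clone::clone_from`,
/// stopping at the shorter of the two slices, and returns the number of
/// elements copied.
#[inline(always)]
fn clone_from_slice(&mut self, src: &[T]) -> usize where T: Clone {
    let n = std::cmp::min(self.len(), src.len()); // the only length check
    // `zip` stops at the shorter slice, so the loop body itself needs no
    // per-element bounds checks.
    for (dst, s) in self.iter_mut().zip(src) {
        dst.clone_from(s);
    }
    n
}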

Justification:

  • We need a stable, safe, and fast API for copying slices (where safe is defined here as "no use of unsafe"). A timely conclusion is highly desirable because the lack of this fundamental functionality makes it difficult to use Rust as a replacement for C.

  • copy_memory is redundant with clone_from_slice, and clone_from_slice is more general because it works on all types, not just [u8]. Avoiding this redundancy is good because it spares the developer from having to choose between the two.

  • It is important to define the performance characteristics of these functions, because developers (me, at least) are looking for functions that specifically offer such performance guarantees.

  • It would be nice if alternatives like copying with for loops (e.g. using a.into_iter().zip(b.into_iter())) were optimized better, but even if they were optimized perfectly, developers would prefer to use a named function with clear performance guarantees.

Thoughts? I would be happy to contribute the patches to making this happen.

Note that this is based on my own experience writing crypto code (to replace C code) in Rust, and also the experience of others writing similar code. See https://github.com/DaGenix/rust-crypto/blob/c861644f6c84618ceacc8dd042e7a3b27e5a12bc/src/cryptoutil.rs#L19 for an example; note in particular all the uses of unsafe.

The 1.5 beta has already been shipped, so these changes can happen for 1.6 at the earliest. I’ll make sure this post is discussed at the lib team meeting today, where we decide on features to put into FCP (final comment period) for stabilization in the 1.6 cycle.

Oh, and I should say: thanks much for the thoughtful proposal!

FWIW, I find it extremely unlikely that two bounds checks (a single comparison and a single well-predicted branch) will even be a blip on the radar compared to the rest of the operation, but maybe you've seen otherwise in practice? (Also, I don't think #[inline(always)] is quite the right tool: #[inline], maybe, but the former is too blunt, and is often a deoptimisation due to forcing code-bloat at every location it is used, even non-perf critical ones.)

Also, there are unfortunately no actual guarantees that clone_from_slice is the same as memcpy; it's one reason specialisation is useful (one can have a specialised clone_from_slice for T: Copy), and a large reason that copy_memory/set_memory exist.

I haven't measured the difference for that. However, in cryptography, at least, we are often dealing with fixed-sized arrays with fixed-sized inputs, so there would be no reason to ever do any bounds checks because the compiler could in theory determine statically that there is no out-of-bounds access.

I think it has to be #[inline(always)] so that the compiler can choose the best implementation to use. For example, if you're copying 64-bit aligned data and/or the inputs are only 8 bytes long, then that's useful information to pass to the memset/memcpy intrinsics so they can make the best optimizations.
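
A toy illustration of the kind of caller I mean (assuming clone_from_slice gets inlined):

// Once inlined into a caller where both arguments are fixed-size arrays,
// the length check is statically true and the copy can lower to two
// aligned 64-bit moves instead of a call to memcpy.
fn copy_counter(dst: &mut [u64; 2], src: &[u64; 2]) {
    dst.clone_from_slice(src); // 16 bytes, 8-byte aligned, equal lengths
}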

I understand that currently there is no such guarantee. I'm saying that clone_from_slice should be changed to make that guarantee. My point is that copy_memory/set_memory only exist because the implementation of clone_from_slice is not as good as it could be.

Note that I don't expect clone_from_slice and clone_from_value to necessarily be implemented with Rust code. I expect that they should be implemented using whatever memcpy/memset intrinsics LLVM provides.

As I said, I strongly disagree that it has to be #[inline(always)]. The compiler can decide to inline things if it thinks it will be beneficial (likely for primitives), and #[inline] will encourage it to do so more often than it otherwise would, but forcing it to do so unconditionally is very unfortunate, because these functions are used for more than just fixed-size arrays/primitives. (We've had serious compile-time and code-size problems in the past due to #[inline(always)].)

Yes, and as I said, there's no way to change the implementation to get the guarantees without specialisation. clone_from_slice/clone_from_value cannot be guaranteed to be raw byte copies (i.e. to call byte-copy functions directly) without it, because only a subset of Clone types (specifically Copy ones) are guaranteed to be safely duplicable with a memcpy. For instance, clone_from_slice-ing slices of Vec<T>s will require following pointers, and may require new allocations etc.
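
To make that concrete (a sketch):

// Cloning a Vec must allocate a fresh buffer and clone each element.
// A raw byte copy would instead make both sides share one heap
// allocation, leading to a double free when both are dropped.
let src: Vec<Vec<u8>> = vec![vec![1, 2, 3], vec![4, 5]];
let mut dst: Vec<Vec<u8>> = vec![Vec::new(), Vec::new()];
for (d, s) in dst.iter_mut().zip(&src) {
    d.clone_from(s); // follows the heap pointer; may allocate
}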

Whether to use the memcpy intrinsic or whether to use a loop around clone_from is something that can be decided by the compiler. i.e. until specialization arrives, these functions can be intrinsics. I don't think it would be good to wait for specialization to be implemented because having a safe and efficient memset and memcpy is fundamental.

Note that the reason I posted this is that I noticed that my Rust replacement for OpenSSL's HASH_UPDATE and HASH_FINAL code was doing tons of bounds checks and was generally slower. (FWIW, My goal is to write all my Rust crypto code without unsafe except for the calls to the assembly language primitives.) In particular, PBKDF2 requires a lot (hundreds of thousands) of very small memcpy and memsets when using 100,000 iterations, so any overhead actually does add up.

OK. I think perhaps I was misunderstanding the difference between the two and maybe there's some miscommunication about how exactly the implementations would go. Let's not worry about this part.

I agree with the two resolutions of the issues. Sounds very good to me. copy_memory is redundant with equivalent operations being available through the I/O traits anyway, so let’s make the copy from slice to slice generic over Clone.

One way or another, I think we’ll find a way to specialize clone_from_slice for memcpy-able types. It already compiles quite well (fully vectorized).

I dread to bring it up… but the name is really not that great.

I would prefer it be named std::slice::copy<T: Clone>(source: &[T], destination: &mut [T]) as a free function, were it not for the Clone vs Copy confusion / distinction. The Clone trait facilitates high level copying, it just has to have another name…

But we might have specialization by provisional and ad-hoc means as a crutch. We already have std::intrinsics::needs_drop, we could also have an intrinsic to tell if a type is Copy, so this is very far from impossible.
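
A sketch of what that crutch could look like (is_copy is made up here purely for illustration; only needs_drop actually exists):

// Hypothetical ad-hoc specialization using an invented `is_copy`
// intrinsic, analogous to the real std::intrinsics::needs_drop.
fn clone_from_slice<T: Clone>(dst: &mut [T], src: &[T]) -> usize {
    let n = std::cmp::min(dst.len(), src.len());
    unsafe {
        if std::intrinsics::is_copy::<T>() { // hypothetical intrinsic
            // Copy types are plain old data, so a raw byte copy is sound.
            std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), n);
        } else {
            for (d, s) in dst.iter_mut().zip(src) {
                d.clone_from(s);
            }
        }
    }
    n
}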

The clone_from_value suggestion is already served well by a simple elementwise assignment in a for loop. This single-iterator case already compiles really well and isn't tricky (unlike the two-iterator/slice operations for copying).

let s = &mut [....];
for elt in s {
    *elt = x;
}

if x is a single byte, or if llvm can identify that x is all the same byte (such as None::<i32> which is 8 zero-bytes per element), it will replace the loop with memset.

One line of code, guaranteed to be optimized, easy to teach because it has a name:

s.fill(x); // See the RFC regarding the new proposed name.

Three lines of code, not guaranteed to be optimized, hard to teach because it doesn’t have a name:

for elt in s.iter_mut() {
    *elt = x;
}

Note that the compiler doesn’t optimize this simpler variant of the above well at all:

for i in 0..s.len() {
    s[i] = x;
}

In particular, every iteration has a bounds check in it.

The existence of copy_memory in unstable Rust currently, and the existence of memset and memcpy in C, and the existence of std::fill and std::copy in C++ are evidence that it is worth having a named function for these.

Note also that bluss has already showed that the compiler will optimize a for loop over two slices efficiently, if it is formed in a particular way: https://users.rust-lang.org/t/how-to-zip-two-slices-efficiently/2048
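
Roughly, the pattern from that thread (a sketch):

// Re-slicing both slices to an explicit common length lets LLVM hoist
// the bounds checks out of the loop and vectorize the copy, or replace
// it with a memcpy outright.
fn copy_slices(dst: &mut [u8], src: &[u8]) {
    let n = std::cmp::min(dst.len(), src.len());
    let (dst, src) = (&mut dst[..n], &src[..n]);
    for i in 0..n {
        dst[i] = src[i];
    }
}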

I just discovered another issue: the above is only true in release builds. In dev builds, the loops are very slow. This slowness was enough to cause my dev-mode tests to time out on Travis CI, while the release-mode tests completed quickly. (The difference in execution time of the tests in release vs. debug mode was more than 10 minutes.)

In other words, we need functions for copying/filling that are fast even in dev mode, and thus we can't (always) rely on the loop optimizer for them.

Iterator::next badly needs some analogue of GCC's __attribute__((always_inline)); it would speed up a lot of other loops in debug mode besides slice copying.

There is #[inline(always)], but as discussed a bit above this is often a deoptimisation, and I suspect it won’t help as much as you might hope, since the lack of optimisations means there’s still a lot of code run (i.e. it essentially just removes the overhead of a single function call; there will still be a ton of loads/stores/generally pointless code).

This is only possible in very few cases, basically for Copy types and the equivalent of memset and memcpy. For all other cases, you have a generic method call that needs to be instantiated without an optimizer, and there is no way rustc can emit an efficient loop for you.

Oh, I never noticed #[inline(always)] works in Debug. Nice.
Yes, it's just a function call, but it's a call in a loop, in all iterator-based loops.
I should probably measure and report the impact.

clippy has a lint against #[inline(always)], and it found a good number of matches in both servo and rustc. We chalk those up as false positives. Still, one could argue that it sets a bad example…

It’s not always bad, it just has to be applied with care (and profile the result). Since rustc, llvm etc change, I guess the uses inside libstd could need tuning. The servo people seem very aware of this, and I trust they have good reasons for their inlining decisions (and they need to look at libstd too).
