Removal of all unstable placement features

I was thinking if it is possible to do a syntax-based transformation of the RHS at HIR→HAIR level to perform the proper lowering, e.g.

place <- S { a: $expr1, b: $expr2 }

would expand to

place.a <- $expr1;
place.b <- $expr2;

and

place <- [$expr; $n]

would expand to

for i in 0..$n {
    place[i] <- $expr;
}

therefore

place <- S {
    a: [[[0; 1024], [x; 1024]]],
    b: foo(),
};

would become

// iteration 1
place.a <- [[[0; 1024], [x; 1024]]];
place.b <- foo();
// iteration 2
place.a[0] <- [[0; 1024], [x; 1024]];
place.b <- foo();
// iteration 3
place.a[0][0] <- [0; 1024];
place.a[0][1] <- [x; 1024];
place.b <- foo();
// iteration 4
for i in 0..1024 { place.a[0][0][i] <- 0; }
for i in 0..1024 { place.a[0][1][i] <- x; }
place.b <- foo();
// iterator 5
/* (convert all `<-` into std::ptr::write) */

For sure this requires support of safely GEPing place.a[0][1][i] from the uninit pointer place.

2 Likes

Personally; I really dislike the arrow form a <- b being desugared to anything other than bind (b >>= \a -> cont). I think Haskell / Idris folks like me would appreciate some other syntax for placement features.

APL used <- for assignment, you know.
We shouldn't disappoint those guys either!

Who to disappoint; such difficult choices ^,-

It seems to me that you'd only confuse the APL folks with <- when we use = for assignment and <- for placement "assignment".

1 Like

Just anything but := :poop:

I have an idea: let’s use =.

…no, really. I think a viable approach would be to remove the concept of placement as a special construct, in favor of a guarantee that whenever any old expression could be constructed in place, it will be. For example, a large function return value will be constructed in the caller’s stack frame; let foo = [expr; 123] will write expr directly to the corresponding slots of foo; and so on. Similar to the existing concept of copy elision, but guaranteed rather than optional, and performed earlier in the pipeline so it applies even when optimizations are off.

11 Likes

How does that work when the destination place is anything other than a simple path into a local? For example, how do you append into a Vec without extra moves? Even if you overload = to call into some variant of the Place trait (very undesirable IMO), the order of evaluation will have to be different because normally LHS = RHS evaluates RHS first, and LHS afterwards, but for placement to work you need to evaluate LHS first.

3 Likes

Please do not derail the thread with syntax bikeshedding (except for maybe avoiding new operator at all), syntax was discussed a lot already, and is not the problem in this case.
Haskell/Idris folks is probably the last group of users I’d personally care about, they have every piece of punctuation occupied for something highly useless for non-PL-enthusiasts.

The problem is guaranteed temporary elimination / RVO.

8 Likes

How do you see this working for enums? I don’t see a pure-desugaring way to just assign the fields, though it can clearly be made to work within the compiler.

How does that work when the destination place is anything other than a simple path into a local? For example, how do you append into a Vec without extra moves?

Well, to start with, with an API like vec.push_with(|| Foo), this could 'just work' as a consequence of return elision; I can think of a few possibilities for what the implementation could look like.

But that's a bit ugly. Ideally we wouldn't need an alternate API at all, whether push_with or the nicer looking vec <- Foo; either way, we'd be left with Vec::push as a footgun that, essentially, you should never use. It would be better if there was some sort of function attribute or other marker that would allow vec.push(Foo) to construct Foo in place. That is, some equivalent of the Place "sandwich" (first the container allocates, then the caller evaluates, then the container finalizes) would be implicit in the method call, at least on the caller's end (again, not sure what exactly the callee's end would look like).

Admittedly, that would be a bit weird, because regardless of the specific design, it would imply calling into some functionality of vec before evaluating Foo. But that isn't totally unprecedented: in particular, when a method call goes through deref coercion, the Deref impl is called before the method arguments are evaluated. Now, in that case there's the mitigating factor that Deref impls are supposed to be pure and as boring as possible, although that guideline isn't always followed (lazy_static…). Still, if the choice is between some slightly surprising behavior that's unlikely to cause many problems in practice, versus a prominent new language feature with dedicated syntax… well, I like the idea of the former.

4 Likes

I believe this is possible at HAIR level. There is rustc::mir::ProjectionElem::Downcast after all. If the RHS of the <- expression can be kept intact at MIR level, we could even do it less-awkwardly as a MIR transformation pass.

2 Likes

cc @vadimcn

either way, we’d be left with Vec::push as a footgun that, essentially, you should never use.

How would compilation under a different model that solves this problem scale?

For example, suppose Vec::push is behind a compilation firewall:

// crate A
pub struct MyVec(Vec<i32>);
impl MyVec { #[inline(never)] fn push(&mut self, v: i32) { self.0.push(v) } }

// crate B
let mut x: A::MyVec;
x.push(3);

Basically, when push is called in crate B, nothing can be known about the internals of push. To solve this problem, we would somehow need to place 3 directly into the heap going through all Vec's opaque logic, but crate B cannot, by design, know any of that. So I wonder if this problem is even solvable.

While this solves the problem, every closure is a new different type from all other closures, so this hits compile-times more than is necessary, e.g., vec.push_with(|| a) and vec.push_with(|| b) would require two monomorphizations, but since the type is always the same the solution should IMO require one at most.


EDIT: I don't want to be discouraging here. I honestly don't know of any good solution to this problem, it is a really tough one to solve!

3 Likes

C++ solves it by making all constructors do placement new (unconditionally) via the hidden this variable. This is the only way afaik.

Rust would have to do something similar.

In C++, constructors can be called in several ways, for example:

  • Without the new operator: stack allocation is implicit.
  • With the new operator: for heap allocation.
  • With the new operator and an explicit placement address.
  • As a nameless temporary of a return statement: “return-value optimisation” is mandatory.
  • As a local (seemingly stack allocated) object returned by value from a function: where “named return-value optimisation” is permitted.
  • As the right-hand side of an assignment: copy elision.

In all these cases the placement address is implicitly passed on to the constructor.


Interestingly, constructors in C++ may only fail by throwing an exception. This allows them to be used in all these cases without call-side checks. In Rust we would probably want to differentiate between fallible and infallible constructors, perhaps by the use of the Try trait?

1 Like

I'm not proposing that the compiler would make decisions based on the body of push, but based on some attribute or other marker on the declaration of push. I know that's vague, but I wanted to leave open the space of possible designs.

Here's a strawman for a more specific design:

Add a new trait as an alternative for the Fn trait for functions (or function-like objects) that want to allow constructing their arguments in place – call them "placer functions", and call the trait PlacerFn. (There probably also need to be PlacerFnMut and PlacerFnOnce.) In the simplest case, implementations would manually impl this trait (XX: but how to make this work with method syntax?). However, there might be some potential to allow wrapper functions that delegate to placer functions to be written as normal functions with an attribute attached, as long as they satisfy certain conditions and have the compiler generate an impl automatically.

A placer function receives its arguments in stages, through a series of calls: that way, given a function like Vec::insert(&mut self, index: usize, element: T), the implementation can receive the index before returning a place in which to construct the element (so it could shift existing elements forward, and return a pointer to the newly vacated &vec[index]). As it happens, when these kinds of functions have more than one argument, they usually take the value as their last argument, so we can simplify things by requiring arguments to be received in order from left to right; this also avoids messing with the order of evaluation on the caller side.

Here's a possible definition of the PlacerFnOnce trait:

// These names all suck and should be bikeshedded
trait PlacerFnOnce<UpfrontArgs, PlacedArg, RemainingArgs> {
    type Output;
    type Place: Place<PlacedArg> + PlacerOrNormalFnOnce<RemainingArgs, Output=Output>;
    extern "rust-call" fn provide_arguments(self, UpfrontArgs) -> Self::Place;
}

Whereas the Fn trait has one generic parameter, Args, representing a tuple of all the arguments, this has three, splitting the argument list into three parts:

  1. UpfrontArgs is a tuple of any arguments that don't need to be constructed in place, but should be passed up front, by value – such as the index parameter from Vec::insert.
  2. PlacedArg is the argument that needs to be constructed in place.
  3. RemainingArgs is a tuple of the rest of the arguments, if any, which will be provided in a subsequent call. (There always needs to be a subsequent call, even if there are no more arguments: the call will also act as the equivalent of InPlace::finalize, signaling that the in-place construction has completed successfully.)

[An alternative design could stick with a single Args parameter and use associated types to determine how to split it into parts; this would help avoid ambiguity, but would make the trait definition look a bit more complicated.]

The return value of provide_arguments must impl two traits: Place<PlacedArg>, which is the same as the existing, soon-to-be-deprecated Place trait (i.e. it provides a pointer to place the argument into), and PlacerOrNormalFnOnce, for the subsequent call.

PlacerOrNormalFnOnce is a trait that has blanket impls for types that impl either PlacerFnOnce or the normal FnOnce. (This can be achieved with specialization, but would require the lattice rule, or some other mechanism, to handle cases where someone tries to impl both PlacerFnOnce and FnOnce on the same type.) The compiler could also use this trait to decide which of PlacerFnOnce or FnOnce to use for a function call in the first place.

So the function call desugaring takes three steps:

  1. Call provide_arguments with the upfront arguments to obtain a Place;
  2. Get the pointer from the Place and construct the argument in place;
  3. Do a recursive call on the Place object itself with the remaining arguments. (This is always done by-value through PlacerOrNormalFnOnce, even if the original call was through PlacerFn or PlacerFnMut, so that the callee can take ownership of the Place.)

If the Place itself impls PlacerFnOnce, that indicates that there's another argument somewhere in the list that needs to be constructed in place, so it goes through the whole process again. On the other hand, if the Place impls normal FnOnce, that's the end of the line: it will be called with the remaining arguments and produce the final function return value.

Similarly to the existing <- behavior, if evaluation of an in-place argument diverges (i.e. it panics, executes break or return, etc.), then the Place will be dropped instead of being passed by value into a method call; the Drop impl should handle reverting the container to its previous state.

Also note that the PlacerFn* traits would not be usable as trait objects, at least not unless you provided the Place associated type up front as part of the trait object type. That's probably not the end of the world; I'd love to see an expansion of object safety in general, but that's a quite separate topic.

TL;DR: basically the same thing under the hood as the existing Placer, but implicit in function calls.

It was pointed out that although I got rid of most of placement and closed the tracking issues, box_syntax still remains (I didn’t have the energy at the time to chase that down as it’s actually used internally, e.g. in liballoc and it’s less clear to me how to proceed with removal).

I’ve created a new tracking issue at https://github.com/rust-lang/rust/issues/49733 for just box syntax. It’s centered around “how can this be removed”.

This is an interesting new take on a way to provide places to callers - putting aside the detail I think I like the thrust.

By itself it has similar problems to the original RFC (not clear how to use a place), but supplemented with the other half of the story (e.g. the other thread, as you note) it could certainly work.

A general question though: is it true that you always want something to be constructed in-place? If your construction involves a number of transformations of the data (maybe creating data in other places) before finishing you could end up chasing all over the heap, even though you’re just working with usizes and would prefer to do things on-stack. Maybe this is excessively hypothetical.

Well, let's see. I guess you can split this concern into two parts:

Returning primitives/small values in registers

ABIs usually specify that primitive integers and floats are returned in registers, while 'large' structs are returned through a hidden out parameter (for small structs, it depends on the architecture). Thus, having the compiler do the out-parameter transformation at an earlier stage wouldn't make much difference for structs, although rustc might use its own size threshold which could be different from the default ABI's. But we definitely want to keep integer returns in registers; sticking out parameters on every function that returns an integer would be dumb and slow. Thus, "guaranteed copy elision" would actually have to be "guaranteed copy elision for large values", for some arbitrary definition of "large".

For most purposes, that shouldn't make any difference; the whole point is to avoid copying large values. But it does mean that some things you might try to do in unsafe code, which… might have some arguable claim to being sound if copy elision were guaranteed for everything, wouldn't be sound. Such as the following:

unsafe fn get_integer_and_pointer() -> (i32, *mut i32) {
    let mut x = 5;
    let ptr_x = &mut x as *mut i32;
    (x, ptr_x)
}

fn user() {
    unsafe {
        let tuple = get_integer_and_pointer();
        // The return of tuple should have been copy-elided, so
        // tuple.1 should still point to &tuple.0.
        *tuple.1 = 6;
        assert_eq!(tuple.0, 6);
    }
}

For now, I think it's fine to just say "don't do that". :slight_smile:

However, I was thinking that as a hypothetical future extension, as part of support for self-borrowing (immovable) structs, guaranteed copy elision could actually be considered by the lifetime checker, so you could do something like the above, but in entirely safe code. In that case, there'd have to be some rule that self-borrowing structs are always copy-elided even if they're "small", and maybe some way for structs to opt into this behavior even if they don't contain self-borrows, for the sake of unsafe code…

Well, that sounds like an extra complication for this hypothetical extension, but not the end of the world, and of course I have no idea if that's even a direction we'll ever want to go down. :slight_smile:

Cache efficiency

With small values excluded, there's no longer a question of having to "chase" pointers where we otherwise wouldn't have to. The only change would be for large values which were already initialized through a pointer: in some cases, that pointer might now point directly to the heap, whereas without copy elision, the pointer would point to the caller's stack frame, and the caller would then copy that data to the heap.

In most cases, that's a clear win: the data has to be written to the heap one way or the other, and copy elision just eliminates a temporary copy. However, there might be cases where one version or the other makes more efficient use of the CPU cache. In particular, any chunk of stack memory we've used recently is probably already in L1 cache, whereas some heap allocation might not be. Again, since the heap is the ultimate destination, we're going to have to access it one way or another, but:

  • Accessing the non-cached memory will evict something else from L1; the timing of that, relative to other instructions that might access that "something else", can make a difference. But there's no reason to prefer evicting later vs. earlier; if anything, writing directly to the heap is likely to be better (albeit not much) since it reduces the total amount of memory being accessed.

  • If we end up having to block on a read of the existing data in that cache line, that could be slow. In theory that shouldn't be an issue, because we're writing all-new data! However, the way CPUs are implemented, even just writing to any part of an uncached cache line will typically trigger a read of the rest of the cache line, which needs to finish before the instruction can be retired. (That assumes the cache is in write-allocate mode.) On modern out-of-order CPUs, though, that's no problem. The processor will keep executing instructions while it waits; the write will hang around in the store queue, and even if the code tries to read back the value, the read will be satisfied from the queue (store-to-load forwarding) rather than having to wait for the data to actually be available in the cache.

    On teeny embedded CPUs, that functionality may not be available, so any write to an uncached location may block execution for a while… but in most cases, without copy elision, the same wait would just occur later on when copying to the heap. However, at least ARM Cortex-M7 has special cache behavior to handle bulk writes more efficiency, so it could theoretically be faster to copy later. But even then, you have to contend with the extra cost of the copy itself – which applies even when the heap data is in cache to start with… plus code size bloat.

Anyway, getting out of the weeds, I think it's safe to say any difference here would be rather marginal, and in many cases could be an improvement.

1 Like

So from my POV, the problem is that we promised RVO/NRVO, but never quite followed on that promise.

Isn't this a misoptimization for the current semantics of ExprRepeat, as it evaluates $expr multiple times? The correct version would be this:

if $n > 0 {
    place[0] = $expr;
    place[1..] = [place[0]; $n-1];
}

Maybe we should also have a variant of ExprRepeat that works with non-Copy but constexpr $expr.

For years there was supposed to be a guaranteed MIR optimization that does exactly this, except also in a few more cases (e.g., NRVO). It just never happened to be implemented because nobody found the time for it.

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.