[Idea] Returning unsized locals via stack

Motivation

-> impl Trait

  • if/else can't return diff. types
  • not object-safe
  • can't return T[]

Idea

New calling convention

  • do not fully clean stack on fn return
  • return pointers to objects on fn's stack
  • live objects shifted on stack to close gaps
    • either by the callee upon ret or
    • by the caller perhaps easier to implement and offeres more flexibility
    • shifting overhead on par with missing return value optimization
  • order of objects on stack needs to be known to shift efficiently
    • only allow returning (a, b, c) this way if compiler can prove a, b, c sit on stack in this order

Details

How to write fn return types? Perhaps

// S is a concrete struct type
fn foo() -> (dyn Trait, dyn u32[], dyn S) {...}

Such tuples values can be

  • constructed in return position
  • assigned from function returns - with or without destructuring

It's forbidden to

  • move or copy them
  • put them into other composite data types

Would treating them as not :Sized be enough to enforce these restrictions?..

Further possibilities:Explicit stack management

Explicit stack management

Suppose

// invoke foo() and immediately shift a, b, c on the stack
// closing the gaps and adjusting sp
let (a, b, c) : (dyn Trait, u32[], S) = foo();

// do not close the gaps or adjust sp automatically
let (a, b, c) : (dyn Trait, u32[], S) = #[no-compaction] foo();

// perhaps we shall compact manually later - syntax for that to be invented

// perhaps we shall soon return (a, b, c)
// and it's more efficient for our
// own caller to do compaction

// perhaps we are confident we shall be allocating very little stack
// from now on in this fn so it's better not to shift and let the stack
// get cleaned when this fn returns

// perhaps an equivalent of proposed alloca scopes:
// https://reviews.llvm.org/D97768
// will be supported by Rust surface syntax
// ... and objects returned this way are to be treated
// as if they were allocated via alloca
Minor technical details

CC = calling convention

  • callers of fn-s using the new CC will probably need to use rbp or equiv. register
  • fn-s using unsized locals/alloca are probably already using it

.

  • fn-s using the new CC may need to return an extra pointer-sized value
  • it would equal to stack position just before the 1st argument of the fn
  • this is the value sp would have been restored to on ret had the CC been regular
  • on x86-64 the value may go into some register not used by the regular CC
  • this may be necessary for the caller to do stack shifting

.

  • implementing the new CC will require LLVM-level C++ work
    • on each supported platform
  • hopefully if stack compaction is done by the caller
    • this can be a very simple adaptation of the currently used CC
    • ideally just some minor subclassing

.

  • to do stack compaction we needs to know the memory size for each object being shifted
  • the callee will probably trivially "know" each size
  • the caller will have to do extra work to learn the sizes
    • a couple of memory reads to extract size from dyn Trait's vtable
    • a multiplication for T[]
  • this appears to be the cost of
    • simplifying LLVM work (see above)
    • gaining flexibility (see "Explicit stack management" above)
  • my uneducated estimation is that it may be worth it to pay these costs
1 Like

This would be a weird requirement (compiler-dependent type system?).

But it's possible to reorder things arbitrarily in linear time without using extra space, so I don't think it's necessary.

This bit is awkward, but by necessity..

Update: I have re-thought the rule, see post number 11

If we refer to positions in a tuple type which are

  • dyn Trait
  • dyn Struct
  • T[]
as `dyn` tuple positions, then the rule I had in mind could be spelled as
  • it is allowed to use local variables a and b
  • in dyn positions
  • while constructing a return tuple
  • placing a before b

only if

  • a is assigned to before b in each execution path through the fn
----------
The rule could be more subtle Say if - the tuple type is `(A, B, C, D`) - both `A` and `B` are of types statically known to `fn` after monomorphisation - return tuple is constructed as `(a, b, c, d)` then - it doesn't matter which order `a` and `b` are assigned to however - if `c` is an unsized local, meaning - it's an `alloca`, e.g. `dyn Trait` of `T[]`

then it is important that

  • c is assigned to before d
  • even if d is of a statically known type

The "subtle" rule however is so messy that the crude version is possibly better.

My point is, you could remove the rule altogether. If you're going to require the compiler to move the outputs on the stack, it might as well reorder these things on the stack. Granted, it's simpler when they are already in the same order, but it's also doable (and also in linear time) when they are not. The optimizer can try to put them in the optimal order in the first place, but it doesn't have to.

My thinking was:

  • if things are in a "wrong" order on stack
  • some explicit action can be taken to put them in the "right" order

The explicit action could be to allocate more space on stack and create fresh copies. I believe the action needs to be explicit to put pressure on programmer to try and lay things out correctly from the start if at all possible.

The idea is to make this cheap. Linear time is good. "Free" is better :slight_smile: I am assuming free space on stack is generally available.

But your idea is already not free. You said it yourself: "live objects shifted on stack to close gaps". With or without the extra complicated rule, the compiler can try to make it free or cheap, and probably succeed in many cases. Adding such an awkward rule can prevent many useful functions from being written, for dubious gain in optimization. What if somebody wants to do if ... { (a, b) } else { (b, a) }? Your rule would prevent it.

Can be free when "Explicit stack management" from my 1st post is used. Otherwise

e.g. the cost is very predictable and on par with what we're already often paying.

And all I'm saying is you don't need the rule for that. Even if the compiler has to shift with reordering at runtime, maybe it's up to 2x more moving than if it has to shift without reordering.

Making the programmer do extra work in tricky cases doesn't guarantee that the cost will be lower. The programmer may have to program in the same reshuffling manually if it's not allowed to be done by the compiler.

Precisely, that was my goal.

Or invoke some sort of macro. Or whatever. But that action would make programmer aware there is a cost involved.

This is currently possible by combining:

with:

To be fair, stackbox on its own can already quite suffice for many things, provided you take a &'out mut stackbox::Slot<T> parameter: you can then yield a StackBox<'out, U> where U = T : Sized, or one of the coercion-enabled cases (mostly: U = [Item] if T = [Item; N]).

3 Likes

CPS is great :+1:

a new calling convention sketched above could be useful as well
Rust being one of the languages that could use it


The most awkward part is the requirement to order items on stack. I still see it as necessary. This comment is to spell out the "rule" better:

  • I see a case for introducing a new kind of local let bindings in an fn: so called dyn let bindings
    • already existing unsized local traits: dyn Trait
    • already existing unsized arrays - suggestion here is to change syntax from T[] to dyn T[]; I'm suggesting to change syntax so as to avoid false similarity with unsized members of structs
    • new kind of local let bindings:
      • regular local variables of known concrete type but allocated dynamically via alloca
        struct S {..}
        fn ...() .. {
          let s : dyn S = ...;
        
      • the difference is that programmer has a degree of control over where exactly on the stack they will be located
      • if such a variable is assigned to after an unsized local like dyn Trait has been assigned to then the programmer can be assured that it will be allocated on stack after the said dyn Trait
      • further when concrete types are returned via the new CC they become exactly this: dyn ConcreteType
  • once we have introduced the new kind of local let bindings the rule is now this:
    • when constructing a tuple with dyn members to be returned via the new CC
    • only regular (non-dyn) local bindings, only dyn local bindings or a mix thereof can be used
    • if a mix is used all regular (non-dyn) local bindings have to come before all dyn local bindings
    • all bindings used to construct such tuples dyn or not have to be declared in the order they are used to construct them
    • all dyn bindings used to construct such tuples which are unsized (dyn Trait, dyn T[]) have to be assigned to in the order they have been declared
    • all dyn bindings used to construct such tuples which are sized (dyn ConcreteType) have to be assigned to only after all preceding unsized dyn bindings (dyn Trait, dyn T[]) used to construct such tuples have been assigned too; this is because only after that has happened the exact location of the said dyn ConcreteType on stack can be determined
    • the compiler then guarantees to lay them them out on stack - both dyn and not dyn - in the order they were declared; so all dyn will come after all non-dyn
    • if these rules cannot be followed the programmer needs to resort to explicitly creating new copies of some of the local variables involved or to invoke some sort of compiler provided facility to automate this; the surface syntax for the new facility may look like a macro invocation
  • the purpose of introducing this rather complex rule is that the objects returned via such tuple can be compacted - shifted up the stack - in the simplest possible manner
    • in particular the goal is to ensure that the code doing the shifting doesn't have any conditional branches; just a sequence of straight-forward size/location calculations and memcpy-s
  • the purpose of not always automating this - and indeed forcing programmer to manually create extra copies or to manually invoke the built-in mechanism helping with this - is to make programmer aware that the fn as written is indeed incurring extra costs and to push programmer where possible into writing fn-s in such a manner that this extra shifting of objects is not necessary

This is also true if your conditions are not satisfied. The sequence of shifts required is always known at compile time, regardless of the order.

Actually, the return values don't even have to be in any specific order on the stack. Only the dyn pointers that have to be returned along with them have to be in the right order.

1 Like

How about this?

let a : dyn Trait;
let b : dyn Trait;
if (..) {
  a = ..
  b = ..
} else {
  b = ..
  a = ..
}
return (a, b);

Do you still know the order at compile time?

True if shifting is done by the callee. Do you however find reasons given in "Explicit stack management" (post no. 1) sufficient to instead make the caller responsible for shifting?

Do you agree that if the caller is doing the shifting then actual objects have to be in order, not just the pointers?

Also probably not a very convincing reason but it seems much simpler to emit LLVM IR/write LLVM C++ to do the shifting in the caller.

No, I don't see the point really of complicating the calling conventions so much: leaving trash on the stack to avoid memmoves of outputs is pretty crazy, and you have to invent all these complicated rules to even make it possible.

Yes, but that's why it's the callee that should move the outputs to the right place. Then you don't have to do any calculations at runtime or make any rules about code ordering, seems vastly simpler.

I'd wager in cases like my example above you still need to either

  • trace the order of allocation dynamically
  • sort the resulting pointers before shifting
  • try to apply some code transformation to avoid either - but my hunch is it's not always possible

I agree these are two justifiable alternative points of view here

  • yours: it's not worth it to complicate the CC so it's better
    • if it's the callee that always does the shifting
  • mine: to squeeze every last ounce of performance out of a low-level language, it's worth it to
    • have the complicated rules about code
    • perform extra copying in otherwise impossible cases
    • have the caller do the shifting

Update: "your" approach has some performance advantages too:

  • (as I noted earlier) the callee may likely have an easier time computing object sizes
  • it may be conductive to smaller overall code size to do the move in the callee

Update2: ideally I'd let the caller choose between who does the shifting but I couldn't come up with a nice way to allow this

In this example, not only do you have variable size return types, you also have variable size local variables, which is another feature that doesn't exist yet, I thought you're only proposing this for return types from functions.

Supporting variable sized local variables as well would indeed add additional complexity.

Actually it does exist on unstable and my idea heavily relies on it.

Had there been no unsized locals on horizon there would be no way to use the results in the caller.

I merely proposed a minor syntax tweak (let a : dyn T[] = ..) and added dyn ConcreteType.

OK, I see. So in that case yes if you combine the two then indeed there may be a runtime check required in those cases.

I don't know how the feature is implemented. I would think that if it were to be implemented well, then the compiler should already compact the stack at the end of the if statement, when merging the two if branches. Essentially the same thing that you're proposing for function calls, but applied to local blocks instead.

I don't think it's possible to do this invisibly. Imagine you've taken a reference to an item in such array... This compaction is a move and needs to be spelled explicitly as such.

I had the possibility of doing it explicitly in mind. I didn't spell it out to save space but it is part of what I was implying in "Explicit stack management" part of my post no. 1.


It might be possible to imagine a scheme under which the caller decides who does the shifting

Hmm.. suppose every fn returning dyn values is broken into two parts:

  • the main part that
    • at such a point where
      • all return values are in their right places
      • two extra registers contain
        • stack position before 1st fn-s arg
        • stack position after the last dyn returned value
    • ends with a jump back into the caller
      • without adjusting rsp
      • without restoring rbp
  • additional part that
    • assumes all registers and stack are mostly preserved as the 1st part left them
    • does stack compaction
    • sets rsp after the place where last dyn return object has been moved to
    • restores rbp
    • jumps back into the caller
      • the address for this jump would end up at the tip of the stack after the caller invoked the addition part

This would enable the caller

  • that does want compaction done for it to
    • first call the main part
    • and then the additional part.
  • that does not want compaction doe for it
    • first call the main part
    • then restore rbp and set rsp itself

It seems the call site for callers that don't want compaction would end up being 7 bytes longer and call sites that do want compaction would end up 5 bytes longer on x86-64.

I don't know how difficult it is to ab-use LLVM to generate this. The thinking was first generate the main fn and then tear it down into two parts - but I don't know how feasible it is. Certainly doing "caller always shifts" seems simpler.

I'm still assuming all complex code rules are in use here. Even if the callee does the shifting it appears much easier to do - and the costs appear clearer to the programmer - if things are laid out consecutively on stack. One could say that this kind of a flexible scheme still mandates consecutive layout of objects on stack - and hence the complex code rules.

Heh I can imagine fun ways for the 1st "half" of the function to communicate to the caller where the 2nd "half" is:

  • it can put that into one more register

  • alternatively it can "call" into the caller thus placing it's own instruction address on stack again; this looks pretty interesting..

    • if the caller does want compaction done by callee it just executes ret
      • it's just one byte at call site
      • it might be easy to implement at LLVM level
    • if the caller does not want compaction done it instead does something like
      mov <some-reg>, rsp
      mov [rbp], rbp
      
      which again isn't that much - just 7 bytes
  • hmm something similar is possible on ARM; "bad" news this needs to be carefully developed for each target architecture

I still don't like this... this means that final results are composed twice:

  • first unshifted before 1st part ends
  • then shifted after 2nd part ends

However the RFC specifies that you can't initialize them separately, but you're assuming you can.