by the caller: perhaps easier to implement and offers more flexibility
shifting overhead on par with missing return value optimization
order of objects on stack needs to be known to shift efficiently
only allow returning (a, b, c) this way if compiler can prove a, b, c sit on stack in this order
Details
How to write fn return types? Perhaps
// S is a concrete struct type
fn foo() -> (dyn Trait, dyn u32[], dyn S) {...}
Such tuple values can be
constructed in return position
assigned from function returns - with or without destructuring
It's forbidden to
move or copy them
put them into other composite data types
Would treating them as not :Sized be enough to enforce these restrictions?..
Further possibilities:
Explicit stack management
Suppose
// invoke foo() and immediately shift a, b, c on the stack
// closing the gaps and adjusting sp
let (a, b, c) : (dyn Trait, dyn u32[], dyn S) = foo();
// do not close the gaps or adjust sp automatically
let (a, b, c) : (dyn Trait, dyn u32[], dyn S) = #[no-compaction] foo();
// perhaps we shall compact manually later - syntax for that to be invented
// perhaps we shall soon return (a, b, c)
// and it's more efficient for our
// own caller to do compaction
// perhaps we are confident we shall be allocating very little stack
// from now on in this fn so it's better not to shift and let the stack
// get cleaned when this fn returns
// perhaps an equivalent of proposed alloca scopes:
// https://reviews.llvm.org/D97768
// will be supported by Rust surface syntax
// ... and objects returned this way are to be treated
// as if they were allocated via alloca
Minor technical details
CC = calling convention
callers of fn-s using the new CC will probably need to use rbp or equiv. register
fn-s using unsized locals/alloca are probably already using it
fn-s using the new CC may need to return an extra pointer-sized value
it would equal the stack position just before the 1st argument of the fn
this is the value sp would have been restored to on ret had the CC been regular
on x86-64 the value may go into some register not used by the regular CC
this may be necessary for the caller to do stack shifting
implementing the new CC will require LLVM-level C++ work
on each supported platform
hopefully if stack compaction is done by the caller
this can be a very simple adaptation of the currently used CC
ideally just some minor subclassing
to do stack compaction we need to know the memory size of each object being shifted
the callee will probably trivially "know" each size
the caller will have to do extra work to learn the sizes
a couple of memory reads to extract size from dyn Trait's vtable
a multiplication for T[]
this appears to be the cost of
simplifying LLVM work (see above)
gaining flexibility (see "Explicit stack management" above)
my uneducated estimation is that it may be worth it to pay these costs
Update: I have re-thought the rule, see post number 11
If we refer to positions in a tuple type which are
dyn Trait
dyn Struct
T[]
as `dyn` tuple positions, then the rule I had in mind could be spelled as
it is allowed to use local variables a and b
in dyn positions
while constructing a return tuple
placing a before b
only if
a is assigned to before b in each execution path through the fn
----------
The rule could be more subtle
Say if
- the tuple type is `(A, B, C, D)`
- both `A` and `B` are of types statically known to `fn` after monomorphisation
- return tuple is constructed as `(a, b, c, d)`
then
- it doesn't matter which order `a` and `b` are assigned to
however
- if `c` is an unsized local, meaning
- it's an `alloca`, e.g. `dyn Trait` or `T[]`
then it is important that
c is assigned to before d
even if d is of a statically known type
The "subtle" rule however is so messy that the crude version is possibly better.
My point is, you could remove the rule altogether. If you're going to require the compiler to move the outputs on the stack, it might as well reorder these things on the stack. Granted, it's simpler when they are already in the same order, but it's also doable (and also in linear time) when they are not. The optimizer can try to put them in the optimal order in the first place, but it doesn't have to.
some explicit action can be taken to put them in the "right" order
The explicit action could be to allocate more space on stack and create fresh copies. I believe the action needs to be explicit to put pressure on the programmer to try and lay things out correctly from the start if at all possible.
The idea is to make this cheap. Linear time is good; "free" is better. I am assuming free space on stack is generally available.
But your idea is already not free. You said it yourself: "live objects shifted on stack to close gaps". With or without the extra complicated rule, the compiler can try to make it free or cheap, and probably succeed in many cases. Adding such an awkward rule can prevent many useful functions from being written, for dubious gain in optimization. What if somebody wants to do if ... { (a, b) } else { (b, a) }? Your rule would prevent it.
And all I'm saying is you don't need the rule for that. Even if the compiler has to shift with reordering at runtime, maybe it's up to 2x more moving than if it has to shift without reordering.
Making the programmer do extra work in tricky cases doesn't guarantee that the cost will be lower. The programmer may have to program in the same reshuffling manually if it's not allowed to be done by the compiler.
To be fair, stackbox on its own can already quite suffice for many things, provided you take a &'out mut stackbox::Slot<T> parameter: you can then yield a StackBox<'out, U> where U = T : Sized, or one of the coercion-enabled cases (mostly: U = [Item] if T = [Item; N]).
a new calling convention sketched above could be useful as well
Rust being one of the languages that could use it
The most awkward part is the requirement to order items on stack. I still see it as necessary. This comment is to spell out the "rule" better:
I see a case for introducing a new kind of local let binding in an fn: so-called dyn let bindings
already existing unsized local traits: dyn Trait
already existing unsized arrays - suggestion here is to change syntax from T[] to dyn T[]; I'm suggesting to change syntax so as to avoid false similarity with unsized members of structs
new kind of local let bindings:
regular local variables of known concrete type but allocated dynamically via alloca
struct S {..}
fn ...() .. {
let s : dyn S = ...;
the difference is that programmer has a degree of control over where exactly on the stack they will be located
if such a variable is assigned to after an unsized local like dyn Trait has been assigned to, then the programmer can be assured that it will be allocated on stack after the said dyn Trait
further when concrete types are returned via the new CC they become exactly this: dyn ConcreteType
once we have introduced the new kind of local let bindings the rule is now this:
when constructing a tuple with dyn members to be returned via the new CC
either only regular (non-dyn) local bindings, only dyn local bindings, or a mix thereof can be used
if a mix is used all regular (non-dyn) local bindings have to come before all dyn local bindings
all bindings used to construct such tuples dyn or not have to be declared in the order they are used to construct them
all dyn bindings used to construct such tuples which are unsized (dyn Trait, dyn T[]) have to be assigned to in the order they have been declared
all dyn bindings used to construct such tuples which are sized (dyn ConcreteType) have to be assigned to only after all preceding unsized dyn bindings (dyn Trait, dyn T[]) used to construct such tuples have been assigned to; this is because only after that has happened can the exact location of the said dyn ConcreteType on stack be determined
the compiler then guarantees to lay them out on stack - both dyn and not dyn - in the order they were declared; so all dyn will come after all non-dyn
if these rules cannot be followed the programmer needs to resort to explicitly creating new copies of some of the local variables involved or to invoke some sort of compiler provided facility to automate this; the surface syntax for the new facility may look like a macro invocation
the purpose of introducing this rather complex rule is that the objects returned via such tuple can be compacted - shifted up the stack - in the simplest possible manner
in particular the goal is to ensure that the code doing the shifting doesn't have any conditional branches; just a sequence of straight-forward size/location calculations and memcpy-s
the purpose of not always automating this (indeed forcing the programmer to manually create extra copies or to manually invoke the built-in mechanism helping with this) is to make the programmer aware that the fn as written is incurring extra costs, and to push the programmer, where possible, into writing fn-s in such a manner that this extra shifting of objects is not necessary
This is also true if your conditions are not satisfied. The sequence of shifts required is always known at compile time, regardless of the order.
Actually, the return values don't even have to be in any specific order on the stack. Only the dyn pointers that have to be returned along with them have to be in the right order.
let a : dyn Trait;
let b : dyn Trait;
if .. {
a = ..
b = ..
} else {
b = ..
a = ..
}
return (a, b);
Do you still know the order at compile time?
True if shifting is done by the callee. Do you however find reasons given in "Explicit stack management" (post no. 1) sufficient to instead make the caller responsible for shifting?
Do you agree that if the caller is doing the shifting then actual objects have to be in order, not just the pointers?
Also probably not a very convincing reason but it seems much simpler to emit LLVM IR/write LLVM C++ to do the shifting in the caller.
No, I don't see the point really of complicating the calling conventions so much: leaving trash on the stack to avoid memmoves of outputs is pretty crazy, and you have to invent all these complicated rules to even make it possible.
Yes, but that's why it's the callee that should move the outputs to the right place. Then you don't have to do any calculations at runtime or make any rules about code ordering, seems vastly simpler.
In this example, not only do you have variable size return types, you also have variable size local variables, which is another feature that doesn't exist yet, I thought you're only proposing this for return types from functions.
Supporting variable sized local variables as well would indeed add additional complexity.
OK, I see. So in that case yes if you combine the two then indeed there may be a runtime check required in those cases.
I don't know how the feature is implemented. I would think that if it were to be implemented well, then the compiler should already compact the stack at the end of the if statement, when merging the two if branches. Essentially the same thing that you're proposing for function calls, but applied to local blocks instead.
I don't think it's possible to do this invisibly. Imagine you've taken a reference to an item in such array... This compaction is a move and needs to be spelled explicitly as such.
I had the possibility of doing it explicitly in mind. I didn't spell it out to save space but it is part of what I was implying in "Explicit stack management" part of my post no. 1.
It might be possible to imagine a scheme under which the caller decides who does the shifting
Hmm.. suppose every fn returning dyn values is broken into two parts:
the main part that
at such a point where
all return values are in their right places
two extra registers contain
stack position before 1st fn-s arg
stack position after the last dyn returned value
ends with a jump back into the caller
without adjusting rsp
without restoring rbp
additional part that
assumes all registers and stack are mostly preserved as the 1st part left them
does stack compaction
sets rsp after the place where last dyn return object has been moved to
restores rbp
jumps back into the caller
the address for this jump would end up at the tip of the stack after the caller invoked the additional part
This would enable the caller
that does want compaction done for it to
first call the main part
and then the additional part.
that does not want compaction done for it
first call the main part
then restore rbp and set rsp itself
It seems the call site for callers that don't want compaction would end up being 7 bytes longer and call sites that do want compaction would end up 5 bytes longer on x86-64.
I don't know how difficult it is to ab-use LLVM to generate this. The thinking was first generate the main fn and then tear it down into two parts - but I don't know how feasible it is. Certainly doing "caller always shifts" seems simpler.
I'm still assuming all complex code rules are in use here. Even if the callee does the shifting it appears much easier to do - and the costs appear clearer to the programmer - if things are laid out consecutively on stack. One could say that this kind of a flexible scheme still mandates consecutive layout of objects on stack - and hence the complex code rules.
Heh I can imagine fun ways for the 1st "half" of the function to communicate to the caller where the 2nd "half" is:
it can put that into one more register
alternatively it can "call" into the caller, thus placing its own instruction address on stack again; this looks pretty interesting..
if the caller does want compaction done by callee it just executes ret
it's just one byte at call site
it might be easy to implement at LLVM level
if the caller does not want compaction done it instead does something like
mov rsp, <some-reg>
mov rbp, [rbp]
which again isn't that much - just 7 bytes
hmm something similar is possible on ARM; "bad" news this needs to be carefully developed for each target architecture
I still don't like this... this means that final results are composed twice: