All tail-calls are sibling calls


#1

Tail-call elimination: the third rail. I figure there should be a warning sign discussing this topic; it’s been beat to death a number of times, so how dare I revive it yet again. But please. Forget about Rust, the language, for a second, and think about stack frames.

Consider a long sequence of tail calls: f() -> g() -> h() -> j() -> k()… there are two possibilities, either the sequence repeats or it doesn’t (Law of the Excluded Middle, On Interpretation, Aristotle, ca 350 BC). However, if the sequence is long enough that TCE actually matters, it obviously repeats, because the last time I wrote 6174 different functions that all individually tail-called each other was… never. Because nobody writes code like that – okay, so this isn’t quite rigorous – we know that at some point the sequence repeats and calls the same function over and over; furthermore, because that function (and its descendants) call at most a finite group of functions a finite number of times, like this: a() -> b() -> a() -> c() -> b() -> a() …; we call it the “state-machine”.

In the C calling convention, whoever started this chain has to figure out how to end it, which means that the calling function is going to deallocate the stack space that this chain is using when it finishes. In principle this is hard, because the caller doesn’t know how much stack space the final callee needs. But it does know enough for us to do TCE if all of the functions use the same amount of stack space to store their arguments, which is an optimization currently implemented by both gcc and llvm as “sibling-call optimization”.

Now, of the functions in the state-machine, each requires a certain size of stack arguments, and this means that one of them requires the largest stack-argument-size, because any finite set has a supremum. It therefore is possible for all of the other functions in the state-machine to also use this amount of space, which is only a very slight overallocation in any case, if only the caller knew in advance what sort of party it was really attending (the functional kind, with immutable glowsticks and pattern-matching ecstasy or something just bear with me here dammit). As before, if all functions in the chain use the same amount of stack space (for the function signature), llvm can perform tail-call elimination.

So consider a couple of functions like this:

func a(foo: int) -> int { foo > 5 ? b(foo, 1) : 6 }

func b(foo: int, bar: int) -> int { foo + bar }

func main(ohcrapiforgotthesignatureofmain) { //do some stuff a(sna) //do some other stuff }

We can imagine the stack like this:

(in a) main | foo >

(in b) a | foo | bar>

Clearly, if we tail-call here, main() doesn’t know what to do. But we can also write it like this:

(in a) main | 0x00 | foo>

(in b) main | foo | bar>

Now even if main() is uniquely responsible for dereferencing everything it doesn’t matter; main() was going to pop two variables off the stack whether or not a() had tail-called b().

We can achieve this effect by padding the signature of any function which wants to make a tail-call, and we can calculate it all at compile-time: if a function returns with a become statement, we can calculate the tree of it’s possible tail-call targets (a() becomes b() becomes c() or d(), e.g.), and pad the signature to the size of the largest callee. Here’s some relatively simple code that solves this problem for some made-up functions.

But what if a larger function wants to become smaller? For the most part, this is trivial: the becomee will just put its args in the top slots and ignore the extra, which is automatically deleted by the original caller whenever the tail-call chain ends. So if b(foo, bar) becomes a(foo) then:

(in b) main | foo | bar >

(in a) main | foo (ignored) | foo >

a() doesn’t need to know that it has too much stack space, and main() will have no problem freeing it anyway. Thus, tail-calls work with cdecl, whether we’re going to become bigger or smaller. With this method, we restrict the use of become to functions that are compiled in the same compilation unit as the caller (due to requisite signature padding), and we expect the compiler to do some extra work for us to export a meaningful .h file:

int a(char __pad[4], int foo);

int b(int foo, int bar);

But wait, you said, LLVM only does sibling-call optimization if the function signatures have matching types!

This objection is valid. In principle, types are a lie: one could simply replace the signature of any affected function with an array of u8[] and recover the original types inside the function without doing any actual computation. In practice, this is probably not feasible. On the other hand, it’s certainly conceivable that LLVM could be convinced to add the relevant functionality, since it’s not really costing anything.

But please don’t think I’m saying that tail-call elimination should be implemented, much less by 1.0, or even that tail-call elimination is feasible to implement, which is a decision that should be made by someone who has some experience with the Rust codebase (i.e. not me), only that it can be implemented, which is really the interesting technical question, to me, anyway. Also, I might have missed something. ^,~


#2

I thought the main issue with TCO was that syntactical tail calls might not compile to tail calls because of destructors.

Functional languages have historically used GC for almost everything, and if no operations are needed to clean up GC pointers, then you get guaranteed TCO “for free”.

This is the only function of the proposed become foo(...); “combined tail call return”, AFAICT: to call any destructors before the foo call (that and to require the backend to either apply TCO or error out).


#3

Nah, there’s the additional question of the caller deallocating the stack frame, which until now has been considered impossible to make work with tail-call elimination. For the most part everyone gave up on this part, as I recall.

Destructors work okay with TCE in principle, you have to call the destructor before the callee returns even if you don’t mess with the stack, but if the function tail-calls and passes the address of a stack pointer which is going to be destroyed upon return there’s a problem, because TCE should mean that the pointer goes out of scope, and now you’ve lost the ability to pass stack variables, which sucks. This is why, for example, gcc will refuse to do TCE if a function calls alloca(), though sometimes it does work. But even without doing any real work this at most entails a check that forbids you from using become if any of your stack variables escape (i.e. tell the programmer to fix her code); it doesn’t affect all tailcalls like the cdecl thing.


#4

If you’re going to specifically identify tail call functions, you can just switch the calling convention to a custom callee-cleanup one, which allows the number of arguments to change (without having to do any magic with calculating stack sizes in advance). See also this ancient bug and this RFC bug.


#5

Not the same thing. The proposal relates to functions that make tail calls, not functions that recieve them. The ancestor (which normal-calls into the tail-call chain) doesn’t need to know anything about the callee, whether it makes tailcalls, etc – there’s no need to mess with calling conventions at all. Furthermore the change just modifies the function signature – a much smaller burden than changing the actual generated code.

You can’t change the calling convention because you lose the ability to tail-call functions that expect the normal convention. In this case there are no requirements (not even alloca()) placed on a function that receives a tailcall: very different from changing the conventions! In theory you could even tailcall a function written in C (but there’s still the issue of return types which makes that impossible in all but the most trivial of cases).

I’ve read both the bugs and several novels worth of mailing list threads about the issue. I might be wrong, but I’m not ignorant of history. Plus, cdecl is faster anway (albeit trivially so).


#6

I see. But there’s not much difference, is there? If the function that receives a tail call does not itself tail call, then it will necessarily grow the stack if it makes further calls, so it doesn’t matter whether it reuses existing stack space or not. Therefore, you can just mark all tail calling functions as callee-cleanup (which fact would have to be preserved in metadata, of course).

Changing calling conventions is very likely to be easier than calculating in advance functions’ argument stack space requirements, which depends on ABI and such. And that wouldn’t work for virtual calls.


#7

Certainly, calling conventions also depend on the ABI “and such”; it just happens to be handled by LLVM at the moment. The platform-specific size information is easy to store for the most common platforms.

(If the size of a function’s arguments can’t be inferred from its type, how could a rust program import, say, a C header?)

[quote=“comex, post:6, topic:1696”]And that wouldn’t work for virtual calls. [/quote]

It should work just fine; the function’s arguments are encoded in its type information. What would happen is that functions which take dummy arguments for a tail-call would also have their type information modified, because a caller which is unaware of the padding wouldn’t be able to call the function properly. But in principle anything is possible.

This virtual call problem is equally significant in the case of fastcall, and you would probably solve it in the exact same way, by changing the type of fastcall functions. After all, if I pass a fastcall function pointer around, whoever calls it better not clean up its arguments. Arguably it’s easier with padding, because one can arrange to ensure that padding is unnecessary (although as we get increasingly abstract it’s harder for me to imagine realistic situations where it matters, but the mathematician in me doesn’t like loose ends very much), so one can usually pass a tailcaller where one would also sometimes pass a non-tailcaller.

Also, in the padding case, the functions remain callable from outside of Rust, and can similarly call outside of Rust; it just requires a little finesse (or a C macro which can even be autogenerated).

#define a(int foo) a_({0, 0, 0, 0}, foo) int a_(char _pad[4], int foo);