Vorpal
February 12, 2025, 5:05am
Another datapoint in favour of supporting `become` (which lowers to tail calls) is that Python recently added an experimental interpreter based on tail calls (with Clang). They are apparently seeing speedups in the range of 10% (mean) to 40% (max). That is a pretty big leap.
opened 10:09PM - 06 Jan 25 UTC · closed 01:15PM - 07 Feb 25 UTC · labels: type-feature, performance, interpreter-core
# Feature or enhancement
## Proposal
Prior discussion at: https://github.com/faster-cpython/ideas/issues/642
I propose adding a tail-calling interpreter to CPython for significantly better performance on compilers that support it.
This idea is not new, and has been implemented by:
1. Protobuf https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
2. Lua (Deegen) https://sillycross.github.io/2022/11/22/2022-11-22/
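For readers unfamiliar with the technique, here is a minimal sketch of tail-call dispatch in C, in the spirit of the Protobuf and Deegen interpreters above. It is not CPython's generated code: the opcodes, `Vm` struct, and handler names are invented for illustration, and it only relies on `musttail` (clang 13+); CPython's variant additionally applies `preserve_none` (clang 19) to the handlers.

```c
/* Toy tail-call-threaded interpreter sketch (not CPython's real code).
 * Build with: clang -O2 tc.c */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };   /* invented opcodes, not CPython's */

typedef struct {
    const uint8_t *ip;   /* next bytecode instruction */
    int *sp;             /* top of the value stack */
} Vm;

/* Every handler shares one signature; musttail requires the caller's and
 * callee's signatures and calling conventions to match. */
typedef int Handler(Vm vm);

static Handler op_push1, op_add, op_halt;

static Handler *const dispatch_table[] = {
    [OP_PUSH1] = op_push1,
    [OP_ADD]   = op_add,
    [OP_HALT]  = op_halt,
};

/* Dispatch is a guaranteed tail call into the next opcode's handler, so the
 * interpreter runs as jumps between small functions without growing the C
 * stack.  CPython additionally marks handlers __attribute__((preserve_none))
 * so more interpreter state stays in registers; omitted here for brevity. */
#define DISPATCH(vm) \
    __attribute__((musttail)) return dispatch_table[*(vm).ip]((vm))

static int op_push1(Vm vm) { vm.ip++; *vm.sp++ = 1;                 DISPATCH(vm); }
static int op_add(Vm vm)   { vm.ip++; vm.sp--; vm.sp[-1] += *vm.sp; DISPATCH(vm); }
static int op_halt(Vm vm)  { return vm.sp[-1]; }  /* top of stack is the result */

int main(void) {
    const uint8_t code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    int stack[16];
    Vm vm = { .ip = code, .sp = stack };
    printf("%d\n", dispatch_table[*vm.ip](vm));   /* prints 2 */
    return 0;
}
```

Because every `DISPATCH` is a guaranteed tail call, each handler stays a small function that the compiler can optimize in isolation, which is the property these interpreters rely on for better register allocation and code layout.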
CPython currently has a few interpreters:
1. A switch-case interpreter (MSVC)
2. A computed goto interpreter (Clang, GCC)
3. An uop interpreter (everything)
The tail-calling interpreter will be the 4th that coexists with the rest. This means no compatibility concerns.
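For contrast with the sketch above, this is roughly the shape of the computed-goto interpreter listed here: one large function that jumps through a table of label addresses, rather than tail calls between per-opcode functions. Again a toy sketch with invented opcodes, not CPython's `ceval.c`.

```c
/* Toy computed-goto dispatch loop (a GCC/Clang extension). */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };   /* invented opcodes */

static int run(const uint8_t *ip) {
    /* &&label takes the address of a label; the table maps opcodes to them. */
    static void *labels[] = { &&push1, &&add, &&halt };
    int stack[16], *sp = stack;
#define DISPATCH() goto *labels[*ip++]
    DISPATCH();
push1:  *sp++ = 1;            DISPATCH();
add:    sp--; sp[-1] += *sp;  DISPATCH();
halt:   return sp[-1];
#undef DISPATCH
}

int main(void) {
    const uint8_t code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    printf("%d\n", run(code));   /* prints 2 */
    return 0;
}
```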
## Performance
Preliminary benchmarks by me suggest excellent performance improvements: a 10% geometric-mean speedup on pyperformance, and up to a 40% speedup on Python-heavy benchmarks (https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21). These benchmarks compared clang-19 builds of main and my branch, both with ThinLTO and PGO, on AMD64 Ubuntu 22.04. PGO seems especially crucial for the speedups, based on my testing. For those outside of CPython development: a 10% speedup is roughly equal to two minor CPython releases' worth of improvements. For example, CPython 3.12 sped up by roughly 5%.
The speedup is so significant that if accepted, the new interpreter will be faster than the current JIT compiler.
## Drawbacks
1. Maintainability (this will introduce more code)
2. Portability
I will address maintainability by using the interpreter generator introduced in CPython 3.12. The generator will let us automatically produce most of the infrastructure needed for this change. Preliminary estimates suggest the new generator will be only around 200 lines of Python code, most of which is conceptually shared with the other generators.
As for portability, the problem should resolve itself over time (see the next section).
## Portability and Precedent
At the moment, this is only supported by clang-19 on AArch64 and AMD64, with partial support in clang-18 and gcc-next, though performance on those is likely poor. The reason is that we need both the `__attribute__((musttail))` and `__attribute__((preserve_none))` attributes for good performance; GCC only has `gnu::musttail`, not `preserve_none`.
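To make that requirement concrete, here is a minimal sketch (invented names, not CPython's code) of how the two attributes are meant to combine; it should build with clang 19 or later on AMD64/AArch64:

```c
#include <stdio.h>

typedef struct { int x; } State;   /* stand-in for interpreter state */

/* preserve_none: this calling convention says the callee preserves no
 * registers for its caller, freeing nearly all registers to carry
 * interpreter state across the chain of tail calls. */
__attribute__((preserve_none)) static int finish(State s) {
    return s.x;
}

__attribute__((preserve_none)) static int step(State s) {
    s.x += 1;
    /* musttail: this call is guaranteed to compile to a jump, not a call.
     * It requires the caller's and callee's signatures and calling
     * conventions (here both preserve_none) to match. */
    __attribute__((musttail)) return finish(s);
}

int main(void) {
    State s = { 41 };
    printf("%d\n", step(s));   /* prints 42 */
    return 0;
}
```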
There is prior precedent for adding compiler-specific optimizations to CPython. See for example the original computed goto issue by Antoine Pitrou: https://bugs.python.org/issue4753. At the time, computed gotos were a GCC-only feature not available in Clang, but we added them anyway, and Clang gained the feature a few years later. The key point is that GCC will likely catch up and add these attributes eventually.
**EDIT**: Clarified that poor performance on GCC is only likely, not confirmed. I have read https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 but have not tested on GCC trunk myself, so the claim of bad performance there is speculation. I can try GCC after the PR lands and we can test from there. However, when testing clang with only `musttail` and no `preserve_none`, performance was quite bad.
## Implementation plan
1. Parse error labels in `_PyEval_EvalFrameDefault`.
2. Implement the rest.
3. Add likely/unlikely attributes to `DEOPT_IF/EXIT_IF` (see the sketch after this list).
4. Support GCC 15.0 if we determine the performance is good.
5. Add option to Windows build script.
6. Mention in What's New. (Note: we NEED PGO, otherwise performance is not very good on clang.)
7. Open new issue to add it as an option in the Windows build script.
8. Open new issue to auto-detect it under ``--enable-optimizations`` in configure.
9. Open new issue on improving code quality by moving some parameters elsewhere to free up registers.
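As a rough illustration of plan item 3: the idea is to mark the deoptimization branch as cold so the hot path stays straight-line. One conventional way to express this in C is `__builtin_expect`; the `DEOPT_IF` below is an illustrative stand-in, not CPython's actual macro, and the real change may spell it differently.

```c
#include <stdio.h>

#define UNLIKELY(x)           __builtin_expect(!!(x), 0)
#define DEOPT_IF(cond, label) do { if (UNLIKELY(cond)) goto label; } while (0)

/* A specialized fast path that bails out to a generic path when its
 * assumption does not hold; the bail-out is hinted as unlikely. */
static int add_specialized(int a, int b, int args_are_ints) {
    DEOPT_IF(!args_are_ints, deopt);   /* rare: fall back */
    return a + b;                      /* hot path */
deopt:
    fprintf(stderr, "deoptimizing\n");
    return a + b;                      /* generic fallback (same result here) */
}

int main(void) {
    printf("%d\n", add_specialized(2, 3, 1));  /* prints 5 */
    return 0;
}
```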
# Worries about new bugs
The existing computed goto interpreter is well-tested, so it is fair to worry that the new interpreter could be buggy.
I doubt logic bugs will be the primary concern, because we are using the interpreter generator, which means the base interpreter and the new one share common code. If the new one has logic bugs, the base interpreter likely has them too.
The other concern is compiler bugs. To allay such fears, I will point out that the GHC calling convention (the convention behind `preserve_none`) has been around for 5 years [1], and `musttail` has been around for almost 4 years [2].
[1]: https://reviews.llvm.org/D69024
[2]: https://reviews.llvm.org/D99517
cc @pitrou as the original implementer of computed gotos, and @markshannon
## Future Use
Kumar Aditya pointed out this could be used in regex and pickle as well. Likewise, Neil Schemenauer pointed out marshal and pickle might benefit from this for faster Python startup.
### Has this already been discussed elsewhere?
https://discuss.python.org/t/a-new-tail-calling-interpreter-for-significantly-better-interpreter-performance/76315
### Linked PRs
* gh-128718
* gh-129078
* gh-129112
* gh-129113
* gh-129115
* gh-129417
* gh-129481
* gh-129525
* gh-129608
* gh-129728
* gh-129754
* gh-129803
* gh-129809
* gh-129812
If such speedups can be achieved with this technique, it seems highly desirable to support this in Rust too.
3 Likes
I strongly disagree. Rust is a systems language, tailored for use cases where the tradeoffs of various algorithms matter. The standard library collections reflect this.
5 Likes