24 December 2025
Some time ago I posted an apology piece
about Python's tail-calling interpreter results. I apologized for communicating performance
results without noticing that a compiler bug had occurred.
I can proudly say today that I am partially retracting that apology, but
only for two platforms: macOS AArch64 (Xcode Clang) and Windows x86-64 (MSVC).
In our own experiments, the tail-calling interpreter for CPython
was found to beat the computed
goto interpreter by 5% on pyperformance on AArch64 macOS using Xcode Clang,
and by roughly 15% on pyperformance on Windows using an experimental internal
version of MSVC. The Windows build is compared against a switch-case interpreter, but
in theory this shouldn't matter too much; more on that in the next section.
This is, of course, a hopefully accurate result. I tried to be more diligent
here, but I am of course not infallible. However, I have found that sharing early and making a fool of myself often works well, as it has led to people catching bugs in my code, so I shall continue doing so :).
Also, this assumes the change doesn't get reverted later in Python 3.15's
development cycle.
Brief background on interpreters
Just a recap: there are two currently popular ways of writing C-based
interpreters.
Switch-cases:
switch (opcode) { ... }
Where we just switch-case to the correct instruction handler.
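As a rough sketch (with placeholder names: ip points at the bytecode stream, and INST_1/INST_2 stand in for real opcode constants), the dispatch loop looks something like this:
for (;;) {
    uint8_t opcode = *ip++;        /* fetch the next opcode */
    switch (opcode) {              /* jump 1: dispatch into the handler */
        case INST_1:
            /* ... handler body ... */
            break;                 /* jump 2: back to the top of the loop */
        case INST_2:
            /* ... handler body ... */
            break;
    }
}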
And the other popular way is a
GCC/Clang extension called labels-as-values/computed gotos.
goto *dispatch_table[opcode];
INST_1: ...
INST_2: ...
Which is basically the same idea, but we instead jump directly to the address of the
next handler's label. Traditionally, the key optimization here is that it needs
only one jump to get to the next instruction, while in the switch-case
interpreter, a naive compiler would need two jumps.
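To make the "one jump" point concrete, here is a rough sketch (again with placeholder names) of how each handler ends by jumping straight to the next handler's label:
/* inside the interpreter function, where the labels are visible */
static void *dispatch_table[] = { &&INST_1, &&INST_2 /* , ... */ };
goto *dispatch_table[*ip++];          /* initial dispatch */
INST_1:
    /* ... handler body ... */
    goto *dispatch_table[*ip++];      /* one indirect jump straight to the next handler */
INST_2:
    /* ... handler body ... */
    goto *dispatch_table[*ip++];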
With modern compilers, however, the benefit of computed gotos is much smaller,
mainly because both compilers and hardware have gotten better. In Nelson Elhage's
excellent investigation
of the next kind of interpreter (covered below),
the speedup of computed gotos over switch-case on modern Clang was
only in the low single digits on pyperformance.
A third way, suggested decades ago but not really feasible until recently,
is call/tail-call threaded interpreters. In this scheme, each bytecode
handler is its own function, and we tail-call from one handler to the next
in the instruction stream:
return dispatch_table[opcode](...);
PyObject *INST_1(...) { ... }
PyObject *INST_2(...) { ... }
This wasn't too feasible in C for one main reason: tail call optimization
was merely an optimization. It's something the C compiler might do, or
might not do. This means that if you're unlucky and the C compiler chooses not
to perform the tail call, your interpreter might stack overflow!
Some time ago, Clang introduced __attribute__((musttail)), which allowed
for mandating that a call must be tail-called. Otherwise, the compilation
will fail. To my knowledge, the first time this was popularized for use
in a mainstream interpreter was in
Josh Haberman's Protobuf blog post.
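To illustrate the shape of such an interpreter, here is a minimal sketch assuming Clang's __attribute__((musttail)); the names (py_handler, INST_1, ip) are illustrative placeholders, not CPython's actual macros or signatures:
#include <stdint.h>

typedef struct _object PyObject;              /* stand-in for Python.h's PyObject */
typedef PyObject *(*py_handler)(uint8_t *ip); /* every handler shares one signature */
static py_handler dispatch_table[256];

static PyObject *INST_1(uint8_t *ip)
{
    /* ... do the work for this instruction ... */
    ip += 1;                                   /* advance to the next opcode */
    __attribute__((musttail)) return dispatch_table[*ip](ip);
}
Because every handler is forced to end in a genuine tail call, the C stack never grows, and each handler becomes a small standalone function instead of one branch of a gigantic switch.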
Later on, Haoran Xu noticed that the GHC calling convention combined with
tail calls produced efficient code. They used this for their baseline
JIT in a paper and termed the technique
Copy-and-Patch.
So where are we now?
After using a fixed Xcode Clang, our performance numbers on CPython
3.14/3.15 suggest that the tail-calling interpreter does provide a
modest speedup over computed gotos: around 5% on the geometric mean of
pyperformance.
To my understanding, uv already ships Python 3.14 on macOS with tail calling,
which might be responsible for some of the speedups you see there.
We're planning to ship the official 3.15 macOS binaries on python.org with
tail calling as well.
However, you're not here for that. The title of this blog post
is clearly about MSVC Windows x86-64. So what about that?
Tail-calling for Windows
[!CAUTION]
The MSVC features discussed below are, to my knowledge, undocumented.
They are not guaranteed to stick around unless the MSVC team decides to keep them. Use at your own risk!
These are the preliminary pyperformance results
for CPython on MSVC with tail-calling vs
switch-case. Any number above 1.00x is a speedup
(e.g. 1.01x == 1% speedup); anything below 1.00x is a slowdown.
The speedup is a geometric mean of around 15-16%, with a
range from a ~60% slowdown (one or two outliers) to a 78% speedup.
However, the key thing is that the vast majority of benchmarks sped up!
[!WARNING]
These results are on an experimental internal MSVC compiler; public results are below.
To verify this and make sure I wasn't wrong yet again, I checked the results
on my machine with Visual Studio 2026. These are the results from
this issue.
Mean +- std dev: [spectralnorm_tc_no] 146 ms +- 1 ms -> [spectralnorm_tc] 98.3 ms +- 1.1 ms: 1.48x faster
Mean +- std dev: [nbody_tc_no] 145 ms +- 2 ms -> [nbody_tc] 107 ms +- 2 ms: 1.35x faster
Mean +- std dev: [bm_django_template_tc_no] 26.9 ms +- 0.5 ms -> [bm_django_template_tc] 22.8 ms +- 0.4 ms: 1.18x faster
Mean +- std dev: [xdsl_tc_no] 64.2 ms +- 1.6 ms -> [xdsl_tc] 56.1 ms +- 1.5 ms: 1.14x faster
So yeah, the speedups are real! For a large-ish library like xDSL, we see
a 14% speedup, while for smaller microbenchmarks like nbody and spectralnorm,
the speedups are greater.
Thanks to Chris Eibl and Brandt Bucher, we managed to get the
PR for this
on MSVC over the finish line. I also want to sincerely thank the MSVC team. I can't say this enough: they have been a joy to work with,
I'm very impressed by what they've done, and I want to congratulate them
on releasing Visual Studio 2026.
This is now listed in the What's New for 3.15 notes:
Builds using Visual Studio 2026 (MSVC 18) may now use the new tail-calling interpreter. Results on an early experimental MSVC compiler reported roughly 15% speedup on the geometric mean of pyperformance on Windows x86-64 over the switch-case interpreter. We have observed speedups ranging from 15% for large pure-Python libraries to 40% for long-running small pure-Python scripts on Windows. (Contributed by Chris Eibl, Ken Jin, and Brandt Bucher in gh-143068. Special thanks to the MSVC team including Hulon Jenkins.)
Where exactly do the speedups come from?
I used to believe that tail-calling interpreters get their speedup
from better register use. While I still believe that, I now suspect it is
not the main reason for the speedups in CPython.
My main guess now is that
tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.
Let me show an example. At the time of writing, CPython 3.15's interpreter loop
is around 12k lines of C code. That's 12k lines in a single function
for the switch-case and computed goto interpreters.
This has caused many issues for compilers in the past, too many to list in fact.
I have a EuroPython 2025 talk about this. In short, this overly large function
breaks a lot of compiler heuristics.
One of the most beneficial optimisations is inlining. In the past, we've found
that compilers sometimes straight up
refuse to inline even the
simplest of functions in that 12k LOC eval loop. I want to stress that this
is not the fault of the compiler. It's actually doing the correct
thing: you usually don't want to increase the code size of something that is already
super large. Unfortunately, this doesn't bode well for our interpreter.
You might say just write the interpreter in assembly!
However, the whole point of this exercise is to not do that.
OK, enough talk, let's take a look at the code now. Taking a real
example, we examine BINARY_OP_ADD_INT, which adds two Python integers.
Cleaning up the code so it's readable, things look like this:
TARGET(BINARY_OP_ADD_INT) {
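    /* Rough sketch of the body, pieced together from the source lines quoted
       in the assembly listings below (left_o, right_o, value, and res are the
       handler's locals); the real CPython handler has more macros, guards,
       and error paths than shown here. */
    if (!_PyLong_CheckExactAndCompact(left_o) ||
        !_PyLong_CheckExactAndCompact(right_o)) {
        /* ... deoptimize to the unspecialized BINARY_OP ... */
    }
    res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
    PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
    /* ... pop the operands, push res, and dispatch the next instruction ... */
}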
Seems simple enough. Let's take a look at the assembly for switch-case on
VS 2026. Note again, this is a non-PGO build for easy source information;
PGO generally makes some of these problems go away, but not all of them:
if (!_PyLong_CheckExactAndCompact(value_o)) {
00007FFC4DE24DCE mov rcx,rbx
00007FFC4DE24DD1 mov qword ptr [rsp+58h],rax
00007FFC4DE24DD6 call _PyLong_CheckExactAndCompact (07FFC4DE227F0h)
00007FFC4DE24DDB test eax,eax
00007FFC4DE24DDD je _PyEval_EvalFrameDefault+10EFh (07FFC4DE258FFh)
...
res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
00007FFC4DE24DFF mov rdx,rbx
00007FFC4DE24E02 mov rcx,r15
00007FFC4DE24E05 call _PyCompactLong_Add (07FFC4DD34150h)
00007FFC4DE24E0A mov rbx,rax
...
PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
00007FFC4DE24E17 lea rdx,[_PyLong_ExactDealloc (07FFC4DD33BD0h)]
00007FFC4DE24E1E mov rcx,rsi
00007FFC4DE24E21 call PyStackRef_CLOSE_SPECIALIZED (07FFC4DE222A0h)
Huh… none of our functions were inlined. Surely that must mean they were
too big or something, right? Let's look at PyStackRef_CLOSE_SPECIALIZED:
static inline void
PyStackRef_CLOSE_SPECIALIZED(_PyStackRef ref, destructor destruct)
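{
    /* Body elided; judging from the tail-calling assembly further below, it
       checks a tag bit, decrements a reference count, and calls the destructor
       when the count drops to zero: only a handful of instructions. */
}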
That looks … inlineable?
Here's how BINARY_OP_ADD_INT looks with tail calling on VS 2026 (again,
no PGO):
if (!_PyLong_CheckExactAndCompact(left_o)) {
00007FFC67164785 cmp qword ptr [rax+8],rdx
00007FFC67164789 jne _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h)
00007FFC6716478F mov r9,qword ptr [rax+10h]
00007FFC67164793 cmp r9,10h
00007FFC67164797 jae _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h)
...
res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
00007FFC6716479D mov eax,dword ptr [rax+18h]
00007FFC671647A0 and r9d,3
00007FFC671647A4 and r8d,3
00007FFC671647A8 mov edx,1
00007FFC671647AD sub rdx,r9
00007FFC671647B0 mov ecx,1
00007FFC671647B5 imul rdx,rax
00007FFC671647B9 mov eax,dword ptr [rbx+18h]
00007FFC671647BC sub rcx,r8
00007FFC671647BF imul rcx,rax
00007FFC671647C3 add rcx,rdx
00007FFC671647C6 call medium_from_stwodigits (07FFC6706E9E0h)
00007FFC671647CB mov rbx,rax
...
PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
00007FFC671647EB test bpl,1
00007FFC671647EF jne _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)
00007FFC671647F1 add dword ptr [rbp],0FFFFFFFFh
00007FFC671647F5 jne _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)
00007FFC671647F7 mov rax,qword ptr [_PyRuntime+25F8h (07FFC675C45F8h)]
00007FFC671647FE test rax,rax
00007FFC67164801 je _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0E4h (07FFC67164814h)
00007FFC67164803 mov r8,qword ptr [_PyRuntime+2600h (07FFC675C4600h)]
00007FFC6716480A mov edx,1
00007FFC6716480F mov rcx,rbp
00007FFC67164812 call rax
00007FFC67164814 mov rcx,rbp
00007FFC67164817 call _PyLong_ExactDealloc (07FFC67073DA0h)
Would you look at that, suddenly our trivial functions get inlined :).
You might also ask: surely this does not happen on PGO builds? Well, the issue
I linked above says it actually does! So yeah, happy days.
Once again I want to stress: this is not the compiler's fault! It's just that
the CPython interpreter loop is not the best thing to optimize.
How do I try this out?
Unfortunately, for now, you will have to build from source.
With VS 2026, after cloning CPython, for a release build with PGO:
$env:PlatformToolset = "v145"
./PCbuild/build.bat --tail-call-interp -c Release -p x64 --pgo
Hopefully, we can distribute this in an easier binary form in the future
once Python 3.15's development matures!