# Feature or enhancement

### Proposal:
Right now, if you `./configure --enable-shared`, you get a `bin/python3` that uses `libpython3.x.so`, and if you leave the option out (or explicitly `./configure --disable-shared`), you get a `bin/python3` that statically links libpython into itself, and no `libpython3.x.so`.
It's very useful to have a `libpython3.x.so` available for applications that embed Python in various ways. At the same time, there are some performance benefits from not having the extra layer of indirection in `bin/python3`. It would be useful to have a build option that gives you a best-of-both-worlds build (at the cost of more disk space): a `bin/python3` that statically links libpython, plus a `libpython3.x.so` for other binaries that might need it.
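For context on the embedding use case: an application that embeds Python links against libpython, typically the shared `libpython3.x.so`. Here is a minimal sketch of such an embedder (my example, not from the PR; it should build with `gcc app.c $(python3-config --embed --cflags --ldflags)` on Python 3.8+):

```c
/* app.c -- a minimal embedding application, the kind of consumer that
 * typically links against libpython3.x.so rather than a static archive. */
#include <Python.h>

int main(void)
{
    Py_Initialize();                      /* start the embedded interpreter */
    PyRun_SimpleString("print('hello from embedded Python')");
    return Py_FinalizeEx() < 0 ? 1 : 0;   /* shut down, reporting any error */
}
```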
As a data point, this is useful enough that Debian currently does this in their Python package in a roundabout way: they build Python twice, once with `--enable-shared` and once without, and they then assemble the package by taking the `libpython3.x.so` from the former build and everything else from the latter build. (See, for instance, the `debian/rules` file for Debian python3 3.13.3-2: in lines 395-412 they do an `--enable-shared` build into `$(buildd_shared)`, in lines 438-450 they do a non-shared build into `$(buildd_static)`, in line 878 they do a `make -C $(buildd_static) install`, and in lines 939-940 they copy `libpython3.x.so.1.0` out of `$(buildd_shared)`.)
There is no particular need to do two separate builds, since the behavior changes from `--enable-shared` happen after most of the compilation, in generating the final binaries. All that needs to happen is for the Makefile to build the interpreter binary the way it would for a static build, and also build the shared library as it would if it were a dependency of the interpreter, as sketched below.
I have implemented this change and will open a PR momentarily.
Details on the performance benefits: while numbers like this are always a little bit folklore, I can point to three specific things. First, Debian made this change in 2002 based on a reported 50% speedup/penalty in https://bugs.debian.org/131813 (interestingly, the maintainer was not able to reproduce the problem, but the end user nonetheless saw the benefit from the change).
Second, there is obviously a benefit from loading one fewer file at process startup, though the impact is most obvious when your files are not in cache and process startup dominates your runtime. I see a ~15% penalty from `python3 -c True` on an AWS t2.medium VM when using the shared library:
```
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 -c True' --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches'
Benchmark 1: b/python/install/bin/python3 -c True
  Time (mean ± σ):     183.1 ms ±   7.5 ms    [User: 14.9 ms, System: 16.4 ms]
  Range (min … max):   173.1 ms … 195.1 ms    10 runs

Benchmark 2: c/python/install/bin/python3 -c True
  Time (mean ± σ):     155.8 ms ±   7.4 ms    [User: 14.3 ms, System: 16.6 ms]
  Range (min … max):   142.5 ms … 166.9 ms    11 runs

Summary
  c/python/install/bin/python3 -c True ran
    1.17 ± 0.07 times faster than b/python/install/bin/python3 -c True
```
Finally, in a free-threaded build, running the old "pystones" benchmark in multiple threads is about 10% slower with the shared library:
```
ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 ~/threadstone.py'
Benchmark 1: b/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.208 s ±  0.064 s    [User: 3.722 s, System: 0.545 s]
  Range (min … max):    2.155 s …  2.379 s    10 runs

Benchmark 2: c/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      1.993 s ±  0.052 s    [User: 3.353 s, System: 0.487 s]
  Range (min … max):    1.921 s …  2.122 s    10 runs

Summary
  c/python/install/bin/python3 ~/threadstone.py ran
    1.11 ± 0.04 times faster than b/python/install/bin/python3 ~/threadstone.py
```
where `threadstone.py` is
```python
import pystone
import concurrent.futures

# Run 500 batches of 1000 pystones each across a thread pool; on a
# free-threaded build the batches actually run in parallel.
t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()
```
and `pystone.py` is taken from just before 61fd70e.
This particular penalty is very understandable. In a shared library, thread-local storage for variables (globals or statics) in that library is allocated dynamically and on demand, with the help of a function call into the C runtime that must be made whenever you access the variable and don't already have the right pointer cached. In the main executable, thread-local storage can be allocated up front, statically, at a fixed offset from the register that holds the thread-local storage area. So code that makes heavy use of thread-local storage will perform better if compiled directly into the main binary. (A convenient thing about how ELF handles this is that there is a relocation type for thread-local accesses, and while the generated code starts off including the call to the helper function, the relocation allows the linker to overwrite that call with effectively no-op instructions when the code is being linked into a main executable. So the same .o file can be used in both cases, without having to tell the compiler up front whether the code is going into a main executable or a shared library, and without giving up the performance benefit.)
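As an illustration of the two code paths (a sketch I'm adding for exposition; the file and function names are invented):

```c
/* tls_demo.c -- illustrates the thread-local access patterns described
 * above; the names here are invented for the example. */
static __thread int counter;   /* thread-local: one instance per thread */

int bump(void)
{
    /* Compiled with -fPIC for a shared library, this access generally uses
     * the general-dynamic (or local-dynamic) TLS model: a TLS relocation
     * plus a call to __tls_get_addr() in the C runtime.  When the same
     * object file is linked into the main executable instead, the linker
     * relaxes the access to a fixed offset from the thread pointer
     * (%fs on x86-64), and the function call disappears. */
    return ++counter;
}
```

Compiling with `gcc -fPIC -c tls_demo.c` and running `objdump -dr tls_demo.o` should show the `__tls_get_addr` call site and its TLS relocation; after the object is linked into an executable, the call is gone and the access is a plain `%fs`-relative load.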
See more details on the benchmarks, more benchmarks, and the generated assembly code for thread-local storage access in astral-sh/python-build-standalone#592.
### Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

### Links to previous discussion of this feature:

No response