# Feature or enhancement

### Proposal:
Right now, if you `./configure --enable-shared`, you get a `bin/python3` that uses `libpython3.x.so`, and if you leave the option out (or explicitly `./configure --disable-shared`), you get a `bin/python3` that statically links libpython into itself, and no `libpython3.x.so`.
It's very useful to have a `libpython3.x.so` available for applications that embed Python in various ways. At the same time, there are some performance benefits from not having the extra layer of indirection in `bin/python3`. It would be useful to have a build option that gives you a best-of-both-worlds build (at the cost of more disk space): a `bin/python3` that statically links libpython, plus a `libpython3.x.so` for other binaries that might need it.
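For context on the embedding use case: an application that embeds Python links against libpython, typically the shared `libpython3.x.so`. Here is a minimal sketch of such an embedder (my example, not from the PR; it should build with `gcc app.c $(python3-config --embed --cflags --ldflags)` on Python 3.8+):

```c
/* app.c -- a minimal embedding application, the kind of consumer that
 * typically links against libpython3.x.so rather than a static archive. */
#include <Python.h>

int main(void)
{
    Py_Initialize();                      /* start the embedded interpreter */
    PyRun_SimpleString("print('hello from embedded Python')");
    return Py_FinalizeEx() < 0 ? 1 : 0;   /* shut down, reporting any error */
}
```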
As a data point, this is useful enough that Debian currently does this in their Python package in a roundabout way: they build Python twice, once with `--enable-shared` and once without, and they then assemble the package by taking the `libpython3.x.so` from the former build and everything else from the latter build. (See, for instance, the `debian/rules` file for Debian python3 3.13.3-2: in lines 395-412 they do an `--enable-shared` build into `$(buildd_shared)`, in lines 438-450 they do a non-shared build into `$(buildd_static)`, in line 878 they do a `make -C $(buildd_static) install`, and in lines 939-940 they copy `libpython3.x.so.1.0` out of `$(buildd_shared)`.)
There is no particular need to do two separate builds, since the behavior changes from `--enable-shared` happen after most of the compilation, in generating the final binaries. All that needs to happen is for the Makefile to build the interpreter binary the way it would for a static build, and also build the shared library as it would if it were a dependency of the interpreter, as sketched below.
I have implemented this change and will open a PR momentarily.
Details on the performance benefits: while numbers like this are always a little bit folklore, I can point to three specific things. First, Debian made this change in 2002 based on a reported 50% speedup/penalty in https://bugs.debian.org/131813 (interestingly, the maintainer was not able to reproduce the problem, but the end user nonetheless saw the benefit from the change).
Second, there is obviously a benefit from loading one fewer file at process startup, though the impact is most obvious when your files are not in cache and process startup dominates your runtime. I see a ~15% penalty from `python3 -c True` on an AWS t2.medium VM when using the shared library:
```
ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 -c True' --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches'
Benchmark 1: b/python/install/bin/python3 -c True
  Time (mean ± σ):     183.1 ms ±   7.5 ms    [User: 14.9 ms, System: 16.4 ms]
  Range (min … max):   173.1 ms … 195.1 ms    10 runs

Benchmark 2: c/python/install/bin/python3 -c True
  Time (mean ± σ):     155.8 ms ±   7.4 ms    [User: 14.3 ms, System: 16.6 ms]
  Range (min … max):   142.5 ms … 166.9 ms    11 runs

Summary
  c/python/install/bin/python3 -c True ran
    1.17 ± 0.07 times faster than b/python/install/bin/python3 -c True
```
Finally, in a free-threaded build, running the old "pystones" benchmark in multiple threads is about 10% slower with the shared library:
```
ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 ~/threadstone.py'
Benchmark 1: b/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.208 s ±  0.064 s    [User: 3.722 s, System: 0.545 s]
  Range (min … max):    2.155 s …  2.379 s    10 runs

Benchmark 2: c/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      1.993 s ±  0.052 s    [User: 3.353 s, System: 0.487 s]
  Range (min … max):    1.921 s …  2.122 s    10 runs

Summary
  c/python/install/bin/python3 ~/threadstone.py ran
    1.11 ± 0.04 times faster than b/python/install/bin/python3 ~/threadstone.py
```
where `threadstone.py` is
```python
import pystone
import concurrent.futures

# Run 500 batches of 1000 pystones each across a thread pool; on a
# free-threaded build the batches actually run in parallel.
t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()
```
and `pystone.py` is taken from just before 61fd70e.
This particular penalty is very understandable. In a shared library, thread-local storage for variables (globals or statics) in that library is allocated dynamically and on demand, with the help of a function call into the C runtime that must be made whenever you access the variable and don't already have the right pointer cached. In the main executable, thread-local storage can be allocated up front, statically, at a fixed offset from the register that holds the thread-local storage area. So code that makes heavy use of thread-local storage will perform better if compiled directly into the main binary. (A convenient thing about how ELF handles this is that there is a relocation type for thread-local accesses, and while the generated code starts off including the call to the helper function, the relocation allows the linker to overwrite that call with effectively no-op instructions when the code is being linked into a main executable. So the same .o file can be used in both cases, without having to tell the compiler up front whether the code is going into a main executable or a shared library, and without giving up the performance benefit.)
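As an illustration of the two code paths (a sketch I'm adding for exposition; the file and function names are invented):

```c
/* tls_demo.c -- illustrates the thread-local access patterns described
 * above; the names here are invented for the example. */
static __thread int counter;   /* thread-local: one instance per thread */

int bump(void)
{
    /* Compiled with -fPIC for a shared library, this access generally uses
     * the general-dynamic (or local-dynamic) TLS model: a TLS relocation
     * plus a call to __tls_get_addr() in the C runtime.  When the same
     * object file is linked into the main executable instead, the linker
     * relaxes the access to a fixed offset from the thread pointer
     * (%fs on x86-64), and the function call disappears. */
    return ++counter;
}
```

Compiling with `gcc -fPIC -c tls_demo.c` and running `objdump -dr tls_demo.o` should show the `__tls_get_addr` call site and its TLS relocation; after the object is linked into an executable, the call is gone and the access is a plain `%fs`-relative load.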
See more details on the benchmarks, more benchmarks, and the generated assembly code for thread-local storage access in astral-sh/python-build-standalone#592.
### Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

### Links to previous discussion of this feature:

No response