Feature or enhancement
Proposal:
On the free-threaded build, threading's concurrency primitives have a bunch of extra overhead across multiple threads due to reference count contention. For example:
import threading
import time

lock = threading.Lock()

def scale():
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()
vs
import threading
import time

def scale():
    lock = threading.Lock()
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()
Comparing the two on a 3.15t release build:
0.38904138099997 s
0.39082639699995525 s
0.4013638610001635 s
0.40917961700006344 s
0.526825904000134 s
0.5402126970000154 s
0.540466712999887 s
0.5586060919999909 s

3.425866439999936 s
3.5953266010001244 s
3.6094701500001065 s
3.667731437000157 s
4.458146230000011 s
4.466017671000145 s
4.499206339000011 s
4.50090869099995 s
That's roughly an 8-9x slowdown (put another way, about 90% of the time in the slower run is overhead), solely due to reference count contention.
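For anyone who wants to reproduce the comparison in one go, here is a rough sketch that runs both variants from a single script and prints the ratio. The helper names (run, scale_shared, scale_private, is_free_threaded) are made up for this example, and the default iteration count is scaled down from the 10000000 used above so it finishes quickly. The numbers will vary by machine and build.

```python
import sysconfig
import threading
import time

# Module-level lock shared by every thread, as in the first script.
shared_lock = threading.Lock()


def scale_shared(iterations, results, index):
    """Time lock.locked() calls on the lock shared by all threads."""
    start = time.perf_counter()
    for _ in range(iterations):
        shared_lock.locked()
    results[index] = time.perf_counter() - start


def scale_private(iterations, results, index):
    """Time lock.locked() calls on a lock owned by this thread only."""
    lock = threading.Lock()
    start = time.perf_counter()
    for _ in range(iterations):
        lock.locked()
    results[index] = time.perf_counter() - start


def run(target, n_threads=8, iterations=1_000_000):
    """Run target in n_threads threads and return the per-thread timings."""
    results = [0.0] * n_threads  # each thread writes only its own slot
    threads = [
        threading.Thread(target=target, args=(iterations, results, i))
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def is_free_threaded():
    """True if this interpreter was built with the GIL disabled."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


if __name__ == "__main__":
    print("free-threaded build:", is_free_threaded())
    shared = run(scale_shared)
    private = run(scale_private)
    print("shared lock, slowest thread: ", max(shared), "s")
    print("private lock, slowest thread:", max(private), "s")
    print("ratio:", max(shared) / max(private))
```

On a build with the GIL enabled, the two variants should come out much closer together, since the threads do not run the loop in parallel there.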
We can significantly reduce this overhead by enabling deferred reference counting (DRC) on these objects. This is already done for threading.local, but we can also do this for Lock and RLock. It would also be nice to do this for primitives like Event, which is implemented in Python, so that will require a (private) API to expose DRC to Python.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
N/A