On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:
package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			b.StopTimer() // exclude setup; restarted once the readers are released

			// Hold the write lock so that the reader goroutines all
			// block until the timer is running.
			var mu sync.RWMutex
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			// Split the b.N iterations evenly across ng goroutines;
			// the last goroutine takes whatever remains.
			n := b.N
			quota := n / ng
			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}
				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)
				n -= quota
			}
			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			mu.Unlock() // release the readers
			wg.Wait()
			b.StopTimer()
		})
	}
}
degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:
# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1         20000000    72.6 ns/op
BenchmarkRWMutex/1-4       20000000    72.4 ns/op
BenchmarkRWMutex/1-16      20000000    72.8 ns/op
BenchmarkRWMutex/1-64      20000000    72.5 ns/op
BenchmarkRWMutex/4         20000000    72.6 ns/op
BenchmarkRWMutex/4-4       20000000     105 ns/op
BenchmarkRWMutex/4-16      10000000     130 ns/op
BenchmarkRWMutex/4-64      20000000     160 ns/op
BenchmarkRWMutex/16        20000000    72.4 ns/op
BenchmarkRWMutex/16-4      10000000     125 ns/op
BenchmarkRWMutex/16-16     10000000     263 ns/op
BenchmarkRWMutex/16-64      5000000     287 ns/op
BenchmarkRWMutex/64        20000000    72.6 ns/op
BenchmarkRWMutex/64-4      10000000     137 ns/op
BenchmarkRWMutex/64-16      5000000     306 ns/op
BenchmarkRWMutex/64-64      3000000     517 ns/op
BenchmarkRWMutex/256       20000000    72.4 ns/op
BenchmarkRWMutex/256-4     20000000     137 ns/op
BenchmarkRWMutex/256-16     5000000     280 ns/op
BenchmarkRWMutex/256-64     3000000     602 ns/op
PASS
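The contention hypothesis is easy to probe in isolation: the fast paths of RLock and RUnlock are each essentially one atomic read-modify-write on the shared readerCount word, so every reader on every core is bouncing the same cache line. The following stand-alone benchmark, a sketch added here for illustration (it is not part of the original report), performs just that pair of atomic operations and should show the same scaling curve if the hypothesis is correct:

package benchmarks_test

import (
	"sync/atomic"
	"testing"
)

// BenchmarkSharedCounter strips RWMutex down to its suspected
// bottleneck: two atomic read-modify-writes per iteration on a single
// shared int32, analogous to the readerCount updates in RLock/RUnlock.
func BenchmarkSharedCounter(b *testing.B) {
	var readerCount int32
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.AddInt32(&readerCount, 1)  // like RLock's fast path
			atomic.AddInt32(&readerCount, -1) // like RUnlock's fast path
		}
	})
}

Running it with the same -test.cpu values allows a direct comparison of the two curves.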
A "control" test, calling a no-op function instead of RWMutex
methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.
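The control's source is not included in the report. A minimal sketch of what it might look like, assuming the same quota-splitting structure, with the lock operations replaced by a call to a hypothetical no-op function nop (marked //go:noinline so the call is not optimized away) and a channel standing in for the write-locked mutex as the start gate:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

//go:noinline
func nop() {} // hypothetical no-op; noinline keeps the call from being elided

// BenchmarkControl mirrors BenchmarkRWMutex but performs a function
// call instead of an RLock/RUnlock pair, isolating the cost of
// goroutine scheduling from the cost of the lock itself.
func BenchmarkControl(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			b.StopTimer()

			start := make(chan struct{})
			var wg sync.WaitGroup
			wg.Add(ng)

			n := b.N
			quota := n / ng
			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}
				go func(quota int) {
					<-start // block until the timer is running
					for i := 0; i < quota; i++ {
						nop()
					}
					wg.Done()
				}(quota)
				n -= quota
			}

			b.StartTimer()
			close(start)
			wg.Wait()
			b.StopTimer()
		})
	}
}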