Optimize std_detect's caching #908
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Amanieu (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
This allows modifications made to it, such as env overrides, to be handled properly (immediately).
For clarity, here are the things I still don't really like about the cache code, which I refer to in a few places above when I say "I have other things I'd like to do later":
(It feels like this code hasn't gotten a lot of love/attention, tbh, which is fine, but yeah, I'd like to help fix that.)
I think you can save an additional instruction by using 0 instead of -1 for the uninitialized state. At least ARM/AArch64 and RISC-V have specific instructions for "branch if zero/non-zero".
New asm:
So yes, this saves an instruction on aarch64 (though apparently not on 32-bit ARM, which looks like it still needs an explicit compare). I also think it's a little cleaner and more obvious to use 0 for uninitialized this way, but I didn't want to change it for no reason.
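For illustration, a minimal sketch of the zero-sentinel check; the names here are hypothetical, not `std_detect`'s actual code:

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical cache word: 0 means "uninitialized".
static CACHE: AtomicUsize = AtomicUsize::new(0);

// With 0 as the sentinel, this check is a plain zero test, which AArch64
// and RISC-V can emit as a single branch instruction (cbz/cbnz, beqz/bnez).
// With !0 as the sentinel, the compiler must first compare against -1.
#[inline]
fn is_initialized() -> bool {
    CACHE.load(Ordering::Relaxed) != 0
}
```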
(Hi, sorry if this is out of the blue. Also very sorry if I left something I meant to remove in there — I tried a lot of variations, and the diff looks clean to me, but I could easily just have selective blindness about it. I have a few more improvements I want to make after this too, and I'm even willing to do them as part of this patch set, but I wanted to test the water after doing this much).
Currently (prior to this patch), `std_detect`'s `Cache` uses `SeqCst` orderings for all operations. It notes that it could actually use `Relaxed`, but that `SeqCst` is safer. When testing a feature, it performs the following pattern of loads/stores:

- `load(SeqCst)` to test for initialization.
- If uninitialized: `store(SeqCst)` of the detected features.
- `load(SeqCst)` again to test for the feature.

This performs either two loads, or two loads and a store (well, two stores, because there are two `Cache`s even when `size_of::<usize>() == 8` — I hope to address this, but not in this patch). Ideally, this would only perform one load in the fast path. The slow (need-init) path also only needs one load and one store*.

Now it performs (sketched after this list):
- `load(Relaxed)`; if initialized, test for the feature using that same load.
- If uninitialized: `store(SeqCst)`, but test for the feature using the value we just stored in the cache (i.e., no need to load what was just written).
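A minimal sketch of that pattern, assuming a single cache word and hypothetical names (`check_feature`, `detect_features`, and the `INITIALIZED` marker bit are inventions of this sketch, not the real code; the real code splits the bits across two `Cache`s):

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// Sketch assumptions: one cache word, 0 = "uninitialized", and a marker
// bit set on every stored value so real values are never 0.
static CACHE: AtomicUsize = AtomicUsize::new(0);
const INITIALIZED: usize = 1 << (usize::BITS - 1);

#[inline]
pub fn check_feature(bit: u32) -> bool {
    let cached = CACHE.load(Ordering::Relaxed);
    if cached != 0 {
        // Fast path: the one Relaxed load answers both "initialized?"
        // and the feature query itself.
        return cached & (1 << bit) != 0;
    }
    // Slow path: detect once, store once with SeqCst, and test against
    // the value we just computed instead of re-loading what was written.
    let value = detect_features() | INITIALIZED;
    CACHE.store(value, Ordering::SeqCst);
    value & (1 << bit) != 0
}

fn detect_features() -> usize {
    0 // stub standing in for the real OS/CPU feature query
}
```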
Improving code size / branch prediction

In doing all this, I took a peek at the assembly for code that uses the feature-test macros. Unfortunately, the cache initialization code is both getting inlined and inserted into the hot part of the code.
I played around with it a bit and found that some restructuring of how that module is called was needed to fix the issue entirely. In particular, `detect::cache` now has a `#[cold]` `detect_and_initialize` function, which calls `detect::os::detect_features` directly. Previously this was passed in as a function argument to `cache::test`. I tried a lot of different ways of structuring this to keep `cache` from needing to know about `detect_features`, but ultimately couldn't (and in truth, this way doesn't seem that bad to me anyway; it just seemed like something the code was intentionally trying to keep separate).
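Roughly the restructuring described above, sketched with stub bodies and hypothetical names; the point is just the `#[cold]` attribute and the direct call versus a function argument:

```rust
fn detect_features() -> usize {
    0 // stub standing in for detect::os::detect_features
}

// Before (sketch): the detection routine was passed into cache::test as
// an argument, and LLVM tended to inline the cold initialization path
// into the hot caller.
fn test_before(bit: u32, detect_features: fn() -> usize) -> bool {
    // ...cache lookup elided; on a miss:
    detect_features() & (1 << bit) != 0
}

// After (sketch): the cache module calls the detection routine directly
// from a #[cold] function, which keeps the slow path out of the hot code.
#[cold]
fn detect_and_initialize() -> usize {
    detect_features()
}

fn test_after(bit: u32) -> bool {
    // ...cache lookup elided; on a miss:
    detect_and_initialize() & (1 << bit) != 0
}
```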
Results

This kind of change is almost impossible to benchmark with normal tools. The issue is that these orderings are almost entirely about the processor's cache-coherence protocol, so the cost only shows up when hammering the cache from a lot of threads**.
Additionally, for code this small, benchmark tools are better off measuring minimum cycle count (since the variation really should be quite small) rather than average wall time, in my experience, but there's no good tool for this in Rust currently. In lieu of benchmark numbers, I've provided before/after assembly for x86_64 and aarch64, and llvm-mca output for x86_64.
Assembly for x86_64/aarch64
All of the listings were generated from a secondary crate that depends on `std_detect`, for the following implementations:
After for `x86_64-unknown-linux-gnu`
Before for `x86_64-unknown-linux-gnu`
After for `aarch64-pc-windows-msvc`
Before for `aarch64-pc-windows-msvc`
But it's worth noting: if we used this from a context where inlining would be possible (eventually the stdlib?), the "before" becomes this (🙀):
Before for `aarch64-pc-windows-msvc` from inside `std_detect` (LLVM making bad inlining decisions)
(On the bright side, this is what made me realize I needed to call it from a separate crate to generate this assembly, which also showed me that, for some reason, a well-placed `core::intrinsics::unlikely` isn't sufficient 😔.)

llvm-mca output

`llvm-mca` is fairly arcane, but it does a pretty good job of telling you which of two small, hard-to-benchmark snippets is easier on the processor: https://llvm.org/docs/CommandGuide/llvm-mca.html

The `llvm-mca` output for this identifies the resource interference from the aggressive synchronization in the old code, and spits out some numbers about cycles/uOps (which are always hard to know what to make of, since it's modeling an out-of-order core without a model for branch prediction). That said, it's pretty confidently saying this is a decent improvement.
Old: Iterations: 100, Instructions: 2100, Total Cycles: 2789, Total uOps: 4800
New: Iterations: 100, Instructions: 1500, Total Cycles: 1497, Total uOps: 2600

Per iteration, that's 21 → 15 instructions, roughly 27.9 → 15.0 cycles, and 48 → 26 uOps.
* I don't care as much about the need-init path, and might consider re-loading with a stronger ordering to avoid any chance of re-querying the features. That said, in practice on most hardware this shouldn't happen, because the `SeqCst` writes in the store will effectively "publish" the cache lines containing the changes. (And to be clear: "in practice shouldn't happen" is probably good enough for a cache — if we needed to guarantee that only one initialization happens at a time, we'd need a lock anyway.)
** Some of the motivation here is wanting to use this in optimized functions that any/every thread will conceivably call (admittedly, the obvious contenders are in libcore rather than libstd, but one thing at a time...). Anyway, using overly strict orderings in those cases would absolutely cause performance issues.