-
Notifications
You must be signed in to change notification settings - Fork 7.9k
random: Optimize Randomizer::getBytes() #15228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch greatly improves the performance for the common case of using a 64-bit engine and requesting a length that is a multiple of 8. It does so by providing a fast path that will just `memcpy()` (which will be optimized out) the returned uint64_t directly into the output buffer, byteswapping it for big endian architectures. The existing byte-wise copying logic was mostly left alone. It only received an optimization of the shifting and masking that was previously applied to `Randomizer::getBytesFromString()` in 1fc2ddc. Co-authored-by: Saki Takamachi <saki@php.net>
} | ||
|
||
#ifdef WORDS_BIGENDIAN | ||
uint64_t swapped = ZEND_BYTES_SWAP64(result.result); | ||
memcpy(ZSTR_VAL(retval) + total_size, &swapped, 8); | ||
#else | ||
memcpy(ZSTR_VAL(retval) + total_size, &result.result, 8); | ||
#endif | ||
total_size += 8; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused, is it intended for this to run the second while loop again? Or should it skip it to return the value directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will enter the second loop if the requested length is not a multiple of 8 to handle the remaining 1-7 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of goto is smart.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you!
This is my attempt at optimizing
Randomizer::getBytes()
, building on the work done by @SakiTakamachi in #14891, but modifying it so much that the spirit of Saki's version is no longer recognizable. Thus I'm filing this as a separate PR.This patch greatly improves the performance for the common case of using a 64-bit engine and requesting a length that is a multiple of 8.
It does so by providing a fast path that will just
memcpy()
(which will be optimized out) the returned uint64_t directly into the output buffer, byteswapping it for big endian architectures.The existing byte-wise copying logic was mostly left alone. It only received an optimization of the shifting and masking that was previously applied to
Randomizer::getBytesFromString()
in 1fc2ddc.Benchmarks:
The baseline commit is d6a75e1. I've rebased both Saki's and my branch onto that commit.
gcc is configured as:
./configure --enable-zend-test --enable-option-checking=fatal --enable-phpdbg --enable-fpm
clang is configured as:
./configure --enable-zend-test --enable-option-checking=fatal --enable-phpdbg --enable-fpm --enable-werror CC=clang-16 CXX=clang++-16
Summary: Results are unexpectedly dependent on the engine chosen even if both engines are 64-bit engines. They are also heavily dependent on the compiler. There is no clear winner. My version appears to be better for 32 bits engines, whereas Saki's version appears to be better for short lengths that are not a multiple of 8. For the important cases of 16 bytes they are equal and for 32 bytes they are equal with gcc. For clang it depends on the engine.
For Mt19937 and 1024 bytes. My version is the fastest for both clang and gcc.
For PcgOneseq128XslRr64 and 1024 bytes. Saki's and my version are equal for both clang and gcc.
For Secure and 16 bytes. All of them are equal.
For Pcg + 16 bytes. Saki's version is faster for gcc, for clang they are equal.
For Mt19937 + 16 bytes. The baseline version is faster than both Saki's and my version for gcc. For clang my version is faster.
For Pcg + 20 bytes. Saki's version is faster.
For Pcg + 32 bytes. For gcc they are equal, for clang Saki's version is faster.
For Xoshiro256** + 32 bytes. For gcc they are equal, for clang my version is faster.
For Xoshiro + 16 bytes. They are equal.
For Mt19937 + 32 bytes. The baseline slightly beats my version for gcc. For clang my version is the fastest.
Closes #14891