[zend_hash]: Use AVX2 instructions for better performance #10858
Could you give a rough idea how much faster this is?
@staabm I do not have a benchmark for this, so this is only an estimate of a slight performance gain. Overall, compared to the previous SSE2 instructions, the AVX2 code should be slightly faster.
The code looks correct to me. Dmitry has done extensive performance optimization for PHP's hash table, so let's see what he thinks.
When we speak about "better performance", it's better to provide some benchmark results.
I know that using AVX512 instructions may lead to a CPU frequency drop (for all CPU cores) and therefore to performance degradation instead of improvement. I hope this is not the case here.
I don't think we will compile PHP with -mavx2 in the near future.
Anyway, this shouldn't do any harm, so it's better to accept it (after benchmarks).
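Because the AVX2 branch is chosen by the preprocessor (`__AVX2__`, which GCC and Clang define under `-mavx2`), a default x86-64 build keeps using the SSE2 path. A minimal sketch of reporting which variant was compiled in and whether the running CPU actually supports AVX2, using the GCC/Clang builtins `__builtin_cpu_init()` and `__builtin_cpu_supports()`; the enum and function names are hypothetical and not part of zend_hash.c:

```c
/* Hypothetical names, for illustration only. */
enum fill_variant { FILL_GENERIC, FILL_SSE2, FILL_AVX2 };

/* Which variant did the preprocessor select at build time? */
static enum fill_variant compiled_variant(void)
{
#if defined(__AVX2__)
	return FILL_AVX2;      /* only with -mavx2 (or an -march that implies it) */
#elif defined(__SSE2__)
	return FILL_SSE2;      /* the default on x86-64 */
#else
	return FILL_GENERIC;
#endif
}

/* Does the CPU we are running on support AVX2? (GCC/Clang builtins, x86 only) */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2") != 0;
#else
	return 0;
#endif
}

/* A -mavx2 binary is only safe on CPUs that report AVX2 support. */
static int build_matches_cpu(void)
{
	return compiled_variant() != FILL_AVX2 || cpu_has_avx2();
}
```

Note that in a real `-mavx2` build the compiler may emit AVX2 instructions anywhere, so a check like `build_matches_cpu()` can only report a mismatch, not prevent an illegal-instruction fault.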
Just to expand on that: it's sometimes very difficult to detect that "the exact thing you benchmark actually gets much faster, at the expense of everything else running on all other cores becoming slower". CloudFlare avoids AVX512 instructions for this reason (does anyone have a link to the blog post where a CloudFlare developer explains how entire-system throughput decreased with AVX512?)
@divinity76 From a quick Google search … Interesting read.
@dstogov Thanks, Dmitry, for the comments. As I don't have a practical benchmark for the AVX2 instructions, let me do a micro-benchmark to see how much we can gain with AVX2 vs. SSE2.
Benchmark Summary
Here is a simple benchmark program (bench.c, see the last section).

Benchmark build
```sh
gcc -mavx2 -o bench.avx2 bench.c
gcc -o bench.sse2 bench.c
```

Benchmark result
[Figures: Benchmark AVX2, Benchmark SSE2]

bench.c source code
```c
/* bench.c AVX2 vs. SSE2 */
#if defined(__AVX2__)
# include <immintrin.h>
#elif defined(__SSE2__)
# include <mmintrin.h>
# include <emmintrin.h>
#endif
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>

#define HT_HASH_EX(data, idx) ((uint32_t*)(data))[(int32_t)(idx)]

int main(int argc, char* argv[])
{
	uint32_t data[512 / 8 / sizeof(uint32_t)]; /* Total: 512 bits = 16 x uint32_t */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <iterations>\n", argv[0]);
		return 1;
	}
	int iter_count = atoi(argv[1]);

	for (int i = 0; i < iter_count; i++) {
#if defined(__AVX2__)
		/* Build an all-ones 256-bit register, then cover 16 slots with 2 stores. */
		__m256i ymm0 = _mm256_setzero_si256();
		ymm0 = _mm256_cmpeq_epi64(ymm0, ymm0);
		_mm256_storeu_si256((__m256i*)&HT_HASH_EX(data, 0), ymm0);
		_mm256_storeu_si256((__m256i*)&HT_HASH_EX(data, 8), ymm0);
#elif defined(__SSE2__)
		/* Same idea with 128-bit registers: 4 stores. */
		__m128i xmm0 = _mm_setzero_si128();
		xmm0 = _mm_cmpeq_epi8(xmm0, xmm0);
		_mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 0), xmm0);
		_mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 4), xmm0);
		_mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 8), xmm0);
		_mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 12), xmm0);
#endif
	}
	return 0;
}
```
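The build commands above pass no `-O` flag, so GCC compiles `bench.c` at its default `-O0`. With optimization enabled, stores into a buffer that is never read are candidates for dead-store elimination, which could leave little to measure. A minimal sketch of a timing harness that keeps the stores alive, assuming GCC or Clang (for the empty `asm` with a `"memory"` clobber) and using standard C `clock()`; `HT_SLOTS`, `fill_memset` and `time_fill` are hypothetical names:

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#define HT_SLOTS 16 /* 16 x uint32_t = 512 bits, same buffer size as bench.c */

typedef void (*fill_fn)(uint32_t *data);

/* Example fill under test: a fixed-size memset that the compiler
 * typically lowers to the widest vector stores it is allowed to use. */
static void fill_memset(uint32_t *data)
{
	memset(data, 0xFF, HT_SLOTS * sizeof *data);
}

/* Time `iters` calls of `fill` in seconds of CPU time.
 * The empty asm (GCC/Clang extension) tells the compiler the buffer
 * may be read, so the stores are not optimized away at -O2. */
static double time_fill(fill_fn fill, long iters, uint32_t *data)
{
	clock_t start = clock();
	for (long i = 0; i < iters; i++) {
		fill(data);
		__asm__ __volatile__("" : : "r"(data) : "memory");
	}
	clock_t end = clock();
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Building this at `-O2`, with and without `-mavx2`, and calling e.g. `time_fill(fill_memset, 100000000L, buf)` measures optimized code. The indirect call through `fill` adds per-iteration overhead unless the compiler inlines it, so compare variants only through the same harness.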
We prefer to use AVX2 instructions for code efficiency improvement:
1) Reduce instruction path length
   Generic x86 Instr: 16, SSE2: 6, AVX2: 4
2) Better ICache locality and density

To enable AVX2 instructions, compile with the '-mavx2' option via the CFLAGS environment variable or a command-line argument.
Note: the '-mavx' option still leads to using SSE2 instructions; _mm256_cmpeq_epi64() requires AVX2 (-mavx2).

Testing:
Building with and without '-mavx2', 'make TEST_PHP_ARGS=-j8 test' presented the same test report.

Signed-off-by: Tony Su <tao.su@intel.com>
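The instruction counts above are consistent with filling 16 32-bit hash slots (64 bytes) with all-ones (0xFFFFFFFF): 16 scalar stores for generic x86, or one zeroing plus one compare instruction to build an all-ones register followed by 4 (SSE2) or 2 (AVX2) unaligned stores, giving 6 and 4. A sketch of the three variants, assuming that layout and counting one instruction per intrinsic; the function names are hypothetical and a compiler may emit slightly different code:

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
# include <immintrin.h>
#elif defined(__SSE2__)
# include <emmintrin.h>
#endif

#define HT_SLOTS 16 /* 16 x uint32_t = 64 bytes */

/* Generic: one 32-bit store per slot (16 stores). */
static void fill_generic(uint32_t *data)
{
	for (int i = 0; i < HT_SLOTS; i++) {
		data[i] = UINT32_MAX;
	}
}

#if defined(__SSE2__)
/* SSE2: zero + compare-equal yields all-ones, then 4 x 16-byte stores. */
static void fill_sse2(uint32_t *data)
{
	__m128i v = _mm_setzero_si128();
	v = _mm_cmpeq_epi8(v, v);
	for (int i = 0; i < HT_SLOTS; i += 4) {
		_mm_storeu_si128((__m128i *)&data[i], v);
	}
}
#endif

#if defined(__AVX2__)
/* AVX2: same trick with 256-bit registers, then 2 x 32-byte stores. */
static void fill_avx2(uint32_t *data)
{
	__m256i v = _mm256_setzero_si256();
	v = _mm256_cmpeq_epi64(v, v);
	_mm256_storeu_si256((__m256i *)&data[0], v);
	_mm256_storeu_si256((__m256i *)&data[8], v);
}
#endif

/* Use the widest variant enabled at compile time. */
static void fill_best(uint32_t *data)
{
#if defined(__AVX2__)
	fill_avx2(data);
#elif defined(__SSE2__)
	fill_sse2(data);
#else
	fill_generic(data);
#endif
}

/* Returns 1 if the selected variant writes the same bytes as the generic one. */
static int fills_agree(void)
{
	uint32_t a[HT_SLOTS] = {0}, b[HT_SLOTS] = {0};
	fill_generic(a);
	fill_best(b);
	return memcmp(a, b, sizeof a) == 0;
}
```

Only the variants enabled at compile time are built, so `fill_avx2` is exercised only when compiling with `-mavx2`.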
Does anyone know how to trigger a recheck without a git force push?
Click "Details", then the "Re-Run Failed Jobs" button. I'm not sure you have the rights, so I did it for you.
@dstogov Thanks for your help.
Thanks @stkeke!