[zend_hash]: Use AVX2 instructions for better performance #10858

stkeke · 2023-03-15T10:46:38Z

We prefer to use AVX2 instructions for performance improvement

Reduce instruction path length
Generic x86 Instr: 16, SSE2: 6, AVX2: 4
Better ICache locality and density

To enable AVX2 instructions, compile with '-mavx2' option via CFLAGS environment variable or command line argument.

Note: '-mavx' option still leads to using SSE2 instructions.
_mm256_cmpeq_epi64() requires AVX2 (-mavx2).

Testing:
Build with and without '-mavx2', 'make TEST_PHP_ARGS=-j8 test'
presented the same test report.

staabm · 2023-03-15T13:58:33Z

Could you give a rough idea how much faster this is?

stkeke · 2023-03-15T14:10:06Z

Could you give a rough idea how much faster this is?

@staabm I do not have a benchmark for this. So only an estimation for a slightly performance gain.
For every hash initialization hitting AVX2 code, 1) save 2 cycles due to two instruction less, and 2) get better cache locality, hence faster, due to binary instruction length of executable being also slightly shorter.

Totally, compared to previous SSE2 instructions, AVX2 code can surely slightly be faster...

iluuu1994

The code looks correct to me. Dmitry has done extensive performance optimization for PHPs hash table, so let's see what he thinks.

Zend/zend_hash.c

dstogov

When we speak about "better performance", it's better to provide some benchmark results.

I know, usage of AVX512 instructions may lead to CPU frequency drop (for all CPU cores) and therefore lead to performance degradation instead of increase. I hope, this is not the case here.

I don't think we will compile PHP with -mavx2 in the near future.
Anyway, this shouldn't make any harm, so it's better to accept this (after benchmarks).

divinity76 · 2023-03-16T10:05:43Z

usage of AVX512 instructions may lead to CPU frequency drop (for all CPU cores) and therefore lead to performance degradation instead of increase

just to expand on that, it's sometimes very difficult to detect, "the exact thing you benchmark actually get much faster, at the expense of everything else running on all other cores becoming slower" - CloudFlare avoids AVX512 instructions for this reason (anyone have a link to the blogpost where a CloudFlare developer explains how entire system throughput decreased on AVX512?)

iluuu1994 · 2023-03-16T10:29:13Z

@divinity76 From a quick Google search, Interesting read.
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

stkeke · 2023-03-16T12:19:48Z

@dstogov Thanks Dmitry for the comments. As I don't have a practical benchmark for AVX2 instruction. Let's me do a micro-benchmark to see how much we can get with AVX2 vs. SSE2.

stkeke · 2023-03-16T13:31:12Z

Benchmark Summary

Here is a simple benchmark program (bench.c - see last section)
AVX2 code (3.17s) is faster than SSE2 code (4.17s) for a loop count = 1,000,000,000 and I do not see CPU frequency drop with AVX2 on my machine.

benchmark build

gcc -mavx2 -o bench.avx2 bench.c
gcc        -o bench.sse2 bench.c

benchmark result

Benchmark AVX2
time ./bench.avx2 1000000000
3.17user 0.00system 0:03.18elapsed 99%CPU (0avgtext+0avgdata 1276maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps

Benchmark SSE2
time ./bench.sse2 1000000000
4.17user 0.00system 0:04.17elapsed 100%CPU (0avgtext+0avgdata 1296maxresident)k
0inputs+0outputs (0major+64minor)pagefaults 0swaps

bench.c source code

To see disassembly

/* bench.c AVX2 vs. SSE2 */
#if defined(__AVX2__)
# include <immintrin.h>
#elif defined( __SSE2__)
# include <mmintrin.h>
# include <emmintrin.h>
#endif

#include <stdint.h>
#include <string.h>

int main(int argc, char* argv[])
{
#define HT_HASH_EX(data,idx) ((uint32_t*)(data))[(int32_t)(idx)]
        uint32_t data[ 512 / 8 / sizeof(uint32_t) ]; /* Total: 512 bits */

        int iter_count = atoi(argv[1]);
        for (int i=0; i<iter_count; i++)
        {
#if defined(__AVX2__)
                __m256i ymm0 = _mm256_setzero_si256();
                ymm0 = _mm256_cmpeq_epi64(ymm0, ymm0);
                _mm256_storeu_si256((__m256i*)&HT_HASH_EX(data,  0), ymm0);
                _mm256_storeu_si256((__m256i*)&HT_HASH_EX(data,  8), ymm0);
#elif defined (__SSE2__)
                __m128i xmm0 = _mm_setzero_si128();
                xmm0 = _mm_cmpeq_epi8(xmm0, xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  0), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  4), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  8), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 12), xmm0);
#endif
        }
        return 0;
}

We prefer to use AVX2 instructions for code efficiency improvement 1) Reduce instruction path length Generic x86 Instr: 16, SSE2: 6, AVX2: 4 2) Better ICache locality and density To enable AVX2 instructions, compile with '-mavx2' option via CFLAGS environment variable or command line argument. Note: '-mavx' option still leads to using SSE2 instructions. _mm256_cmpeq_epi64() requires AVX2 (-mavx2). Testing: Build with and without '-mavx2', 'make TEST_PHP_ARGS=-j8 test' presented the same test report. Signed-off-by: Tony Su <tao.su@intel.com>

stkeke · 2023-03-17T09:02:12Z

Who knows how to trigger a recheck without git force push?
Same PR, last time Windows X86 failed, this time Win X64 failed.
I would like try a recheck...

dstogov · 2023-03-17T09:06:57Z

Who knows how to trigger a recheck without git force push? Same PR, last time Windows X86 failed, this time Win X64 failed. I would like try a recheck...

Click on "Details" then "Re-Run Failed Jobs" button. I'm not sure if you have rights, so I made this for you.

stkeke · 2023-03-17T09:17:23Z

Who knows how to trigger a recheck without git force push? Same PR, last time Windows X86 failed, this time Win X64 failed. I would like try a recheck...

Click on "Details" then "Re-Run Failed Jobs" button. I'm not sure if you have rights, so I made this for you.

@dstogov Thanks for you help.
I just checked, but I don't have 'Re-Run' right. I can only view the check log.

iluuu1994 · 2023-03-17T09:54:22Z

Thanks @stkeke!

github-actions bot added the Category: Engine label Mar 15, 2023

iluuu1994 reviewed Mar 15, 2023

View reviewed changes

Zend/zend_hash.c Show resolved Hide resolved

iluuu1994 requested a review from dstogov March 15, 2023 15:32

dstogov reviewed Mar 16, 2023

View reviewed changes

dstogov self-requested a review March 17, 2023 09:07

dstogov approved these changes Mar 17, 2023

View reviewed changes

iluuu1994 merged commit d835de1 into php:master Mar 17, 2023

stkeke deleted the zend_hash_avx2 branch March 18, 2023 00:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[zend_hash]: Use AVX2 instructions for better performance #10858

[zend_hash]: Use AVX2 instructions for better performance #10858

Uh oh!

stkeke commented Mar 15, 2023

Uh oh!

staabm commented Mar 15, 2023

Uh oh!

stkeke commented Mar 15, 2023

Uh oh!

iluuu1994 left a comment

Uh oh!

Uh oh!

dstogov left a comment

Uh oh!

divinity76 commented Mar 16, 2023 •

edited

Loading

Uh oh!

iluuu1994 commented Mar 16, 2023

Uh oh!

stkeke commented Mar 16, 2023

Uh oh!

stkeke commented Mar 16, 2023

Uh oh!

stkeke commented Mar 17, 2023

Uh oh!

dstogov commented Mar 17, 2023

Uh oh!

stkeke commented Mar 17, 2023

Uh oh!

iluuu1994 commented Mar 17, 2023

Uh oh!

Uh oh!

[zend_hash]: Use AVX2 instructions for better performance #10858

[zend_hash]: Use AVX2 instructions for better performance #10858

Uh oh!

Conversation

stkeke commented Mar 15, 2023

Uh oh!

staabm commented Mar 15, 2023

Uh oh!

stkeke commented Mar 15, 2023

Uh oh!

iluuu1994 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dstogov left a comment

Choose a reason for hiding this comment

Uh oh!

divinity76 commented Mar 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iluuu1994 commented Mar 16, 2023

Uh oh!

stkeke commented Mar 16, 2023

Uh oh!

stkeke commented Mar 16, 2023

Benchmark Summary

benchmark build

benchmark result

bench.c source code

Uh oh!

stkeke commented Mar 17, 2023

Uh oh!

dstogov commented Mar 17, 2023

Uh oh!

stkeke commented Mar 17, 2023

Uh oh!

iluuu1994 commented Mar 17, 2023

Uh oh!

Uh oh!

divinity76 commented Mar 16, 2023 •

edited

Loading