[zend_hash]: Use AVX2 instructions for better performance #10858

Merged
merged 1 commit into php:master from zend_hash_avx2
Mar 17, 2023

Conversation

stkeke
Contributor

@stkeke stkeke commented Mar 15, 2023

We prefer to use AVX2 instructions for performance improvement

  1. Reduce instruction path length
    Generic x86 Instr: 16, SSE2: 6, AVX2: 4
  2. Better ICache locality and density

To enable AVX2 instructions, compile with '-mavx2' option via CFLAGS environment variable or command line argument.

Note: '-mavx' option still leads to using SSE2 instructions.
_mm256_cmpeq_epi64() requires AVX2 (-mavx2).

Testing:
Build with and without '-mavx2', 'make TEST_PHP_ARGS=-j8 test'
presented the same test report.
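
As a rough illustration of the three code paths described above, here is a minimal sketch that fills 16 `uint32_t` slots (64 bytes) with all-one bits. It uses the same intrinsics as the bench.c posted later in this thread. The names `fill_all_ones_64` and `fill_path_name` are hypothetical and are not taken from zend_hash.c. The path is chosen at compile time from the `__AVX2__` and `__SSE2__` macros, so building with only '-mavx' still selects the SSE2 branch.

```c
#include <stdint.h>
#include <string.h>

#if defined(__AVX2__)
# include <immintrin.h>
#elif defined(__SSE2__)
# include <emmintrin.h>
#endif

/* Hypothetical helper: set 16 uint32_t slots (64 bytes) to 0xFFFFFFFF. */
void fill_all_ones_64(uint32_t *slots)
{
#if defined(__AVX2__)
    /* Comparing a register with itself yields all-one bits. */
    __m256i ones = _mm256_setzero_si256();
    ones = _mm256_cmpeq_epi64(ones, ones);
    _mm256_storeu_si256((__m256i *)(slots + 0), ones);  /* bytes  0..31 */
    _mm256_storeu_si256((__m256i *)(slots + 8), ones);  /* bytes 32..63 */
#elif defined(__SSE2__)
    __m128i ones = _mm_setzero_si128();
    ones = _mm_cmpeq_epi8(ones, ones);
    _mm_storeu_si128((__m128i *)(slots +  0), ones);    /* bytes  0..15 */
    _mm_storeu_si128((__m128i *)(slots +  4), ones);    /* bytes 16..31 */
    _mm_storeu_si128((__m128i *)(slots +  8), ones);    /* bytes 32..47 */
    _mm_storeu_si128((__m128i *)(slots + 12), ones);    /* bytes 48..63 */
#else
    /* Generic fallback for targets without SSE2 or AVX2. */
    memset(slots, 0xFF, 16 * sizeof(uint32_t));
#endif
}

/* Hypothetical helper: report which branch was compiled in. */
const char *fill_path_name(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "generic";
#endif
}
```

Calling `fill_all_ones_64` on a 16-element array leaves every element equal to `UINT32_MAX` on all three paths; only the number of store instructions differs.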

@staabm
Contributor

staabm commented Mar 15, 2023

Could you give a rough idea how much faster this is?

@stkeke
Contributor Author

stkeke commented Mar 15, 2023

Could you give a rough idea how much faster this is?

@staabm I do not have a benchmark for this, so I can only estimate a slight performance gain.
For every hash initialization that hits the AVX2 code, we 1) save roughly 2 cycles because there are two fewer instructions, and 2) get better instruction cache locality, because the binary code is also slightly shorter.

Overall, compared to the previous SSE2 instructions, the AVX2 code should be slightly faster.

Member

@iluuu1994 iluuu1994 left a comment

The code looks correct to me. Dmitry has done extensive performance optimization for PHP's hash table, so let's see what he thinks.

@iluuu1994 iluuu1994 requested a review from dstogov March 15, 2023 15:32
Member

@dstogov dstogov left a comment

When we speak about "better performance", it's better to provide some benchmark results.

I know, usage of AVX512 instructions may lead to CPU frequency drop (for all CPU cores) and therefore lead to performance degradation instead of increase. I hope, this is not the case here.

I don't think we will compile PHP with -mavx2 in the near future.
Anyway, this shouldn't do any harm, so it's better to accept this (after benchmarks).

@divinity76
Contributor

divinity76 commented Mar 16, 2023

usage of AVX512 instructions may lead to CPU frequency drop (for all CPU cores) and therefore lead to performance degradation instead of increase

Just to expand on that: this is sometimes very difficult to detect, because the exact thing you benchmark actually gets much faster, at the expense of everything else running on all other cores becoming slower. Cloudflare avoids AVX512 instructions for this reason. (Does anyone have a link to the blog post where a Cloudflare developer explains how entire system throughput decreased with AVX512?)

@iluuu1994
Member

@divinity76 I found this with a quick Google search. It is an interesting read.
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

@stkeke
Contributor Author

stkeke commented Mar 16, 2023

@dstogov Thanks, Dmitry, for the comments. I don't have a practical benchmark for the AVX2 instructions, so let me do a micro-benchmark to see how much we can gain with AVX2 vs. SSE2.

@stkeke
Contributor Author

stkeke commented Mar 16, 2023

Benchmark Summary

Here is a simple benchmark program (bench.c - see last section).
The AVX2 code (3.17s) is faster than the SSE2 code (4.17s) for a loop count of 1,000,000,000, and I do not see a CPU frequency drop with AVX2 on my machine.

benchmark build

gcc -mavx2 -o bench.avx2 bench.c
gcc        -o bench.sse2 bench.c

benchmark result

Benchmark AVX2
time ./bench.avx2 1000000000
3.17user 0.00system 0:03.18elapsed 99%CPU (0avgtext+0avgdata 1276maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps

Benchmark SSE2
time ./bench.sse2 1000000000
4.17user 0.00system 0:04.17elapsed 100%CPU (0avgtext+0avgdata 1296maxresident)k
0inputs+0outputs (0major+64minor)pagefaults 0swaps
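
To put the two timings above in perspective, the following small sketch turns them into a speedup ratio and an average cost per loop iteration. The function names are hypothetical. Note that the build commands above pass no '-O' flag, so the per-iteration figure includes unoptimized loop overhead and is not a pure measure of the store instructions.

```c
/* Hypothetical helper: ratio of baseline time to optimized time.
 * A value above 1.0 means the optimized build is faster. */
double speedup(double baseline_seconds, double optimized_seconds)
{
    return baseline_seconds / optimized_seconds;
}

/* Hypothetical helper: average nanoseconds spent per loop iteration. */
double ns_per_iteration(double total_seconds, double iterations)
{
    return total_seconds * 1e9 / iterations;
}
```

With the reported numbers, `speedup(4.17, 3.17)` is about 1.32, and the average cost per iteration falls from about 4.17 ns to about 3.17 ns, that is, roughly 1 ns saved per iteration on the author's machine.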

bench.c source code

To see disassembly

/* bench.c AVX2 vs. SSE2 */
#if defined(__AVX2__)
# include <immintrin.h>
#elif defined(__SSE2__)
# include <emmintrin.h>
#endif

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> /* atoi */

int main(int argc, char* argv[])
{
#define HT_HASH_EX(data,idx) ((uint32_t*)(data))[(int32_t)(idx)]
        uint32_t data[ 512 / 8 / sizeof(uint32_t) ]; /* Total: 512 bits */

        /* The loop count is required as the first argument. */
        if (argc < 2) {
                fprintf(stderr, "usage: %s <iter_count>\n", argv[0]);
                return 1;
        }

        int iter_count = atoi(argv[1]);
        for (int i=0; i<iter_count; i++)
        {
#if defined(__AVX2__)
                /* All-one bits in one register, then two 256-bit stores */
                __m256i ymm0 = _mm256_setzero_si256();
                ymm0 = _mm256_cmpeq_epi64(ymm0, ymm0);
                _mm256_storeu_si256((__m256i*)&HT_HASH_EX(data,  0), ymm0);
                _mm256_storeu_si256((__m256i*)&HT_HASH_EX(data,  8), ymm0);
#elif defined (__SSE2__)
                /* All-one bits in one register, then four 128-bit stores */
                __m128i xmm0 = _mm_setzero_si128();
                xmm0 = _mm_cmpeq_epi8(xmm0, xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  0), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  4), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data,  8), xmm0);
                _mm_storeu_si128((__m128i*)&HT_HASH_EX(data, 12), xmm0);
#endif
        }
        return 0;
}
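
One caveat when reusing a loop like this: the stored data is never read, so an optimizing build (for example with '-O2') may remove the stores and the timing would then measure nothing. The sketch below shows one common way to keep the stores alive with a compiler barrier. It is an assumption-laden illustration, not part of the PR: `KEEP_STORES` and `run_fill_loop` are hypothetical names, the empty inline `asm` is a GCC/Clang extension, and `memset` stands in for the SIMD stores.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* Empty asm that claims to read memory through p, so the compiler
 * must finish the preceding stores before this point. */
# define KEEP_STORES(p) __asm__ volatile("" : : "r"(p) : "memory")
#else
/* Fallback for other compilers: a volatile read of the first byte. */
# define KEEP_STORES(p) ((void)*(volatile unsigned char *)(p))
#endif

/* Hypothetical benchmark body: fill 64 bytes with 0xFF on every
 * iteration and return the first slot so the result is observable. */
uint32_t run_fill_loop(long iterations)
{
    uint32_t data[16];
    memset(data, 0, sizeof data);

    for (long i = 0; i < iterations; i++) {
        memset(data, 0xFF, sizeof data); /* stand-in for the SIMD stores */
        KEEP_STORES(data);
    }
    return data[0];
}
```

The barrier itself emits no instructions, so it should not distort the measurement much, but that is worth confirming in the disassembly before trusting the numbers.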

We prefer to use AVX2 instructions for code efficiency improvement
1) Reduce instruction path length
   Generic x86 Instr: 16, SSE2: 6, AVX2: 4
2) Better ICache locality and density

To enable AVX2 instructions, compile with '-mavx2' option via CFLAGS
environment variable or command line argument.

Note: '-mavx' option still leads to using SSE2 instructions.
      _mm256_cmpeq_epi64() requires AVX2 (-mavx2).

Testing:
    Build with and without '-mavx2', 'make TEST_PHP_ARGS=-j8 test'
    presented the same test report.

Signed-off-by: Tony Su <tao.su@intel.com>
@stkeke
Contributor Author

stkeke commented Mar 17, 2023

Does anyone know how to trigger a recheck without a git force push?
It is the same PR; last time Windows X86 failed, and this time Win X64 failed.
I would like to try a recheck...

@dstogov
Member

dstogov commented Mar 17, 2023

Who knows how to trigger a recheck without git force push? Same PR, last time Windows X86 failed, this time Win X64 failed. I would like try a recheck...

Click on "Details", then the "Re-Run Failed Jobs" button. I'm not sure if you have the rights, so I did this for you.

@dstogov dstogov self-requested a review March 17, 2023 09:07
@stkeke
Contributor Author

stkeke commented Mar 17, 2023

Who knows how to trigger a recheck without git force push? Same PR, last time Windows X86 failed, this time Win X64 failed. I would like try a recheck...

Click on "Details" then "Re-Run Failed Jobs" button. I'm not sure if you have rights, so I made this for you.

@dstogov Thanks for your help.
I just checked, but I don't have the 'Re-Run' permission. I can only view the check log.

@iluuu1994 iluuu1994 merged commit d835de1 into php:master Mar 17, 2023
@iluuu1994
Member

Thanks @stkeke!

@stkeke stkeke deleted the zend_hash_avx2 branch March 18, 2023 00:29

5 participants