JIT buffer relocation and 2~3% PHP performance gain #8618

stkeke · 2022-05-24T01:48:59Z

This is a JIT buffer relocation inspired by this blog
https://v8.dev/blog/short-builtin-calls

For 64-bit applications, branch prediction performance can be
negatively impacted when the target of a branch is more than
4 GB away from the branch.

We try to allocate opcache/JIT buffer just prior to PHP .text
segments through mmap() with a calculated preferred memory address
while creating segments.

In our benchmark, we found that the PHP interpreter achieved a 2-3% performance
gain and much better branching performance with both 2MB huge pages and
ordinary 4KB pages.

Signed-off-by: Su, Tao <tao.su@intel.com>
Tested-by: Wang, Xue <xue1.wang@intel.com>
Reviewed-by: Chen, Hu <hu1.chen@intel.com>
Reviewed-by: You, Lizhen <Lizhen.You@intel.com>

cmb69 · 2022-05-24T11:25:09Z

@dstogov, do you think this is worth pursuing? (it can't work on Windows, but maybe on other systems)

dstogov · 2022-05-24T11:58:04Z

I read that blog. Unfortunately, this PR is just a very basic PoC.
In any case, thanks for trying this and sharing the benchmark results.

Currently opcache tries to allocate SHM in the low 2GB using MAP_32BIT.
The .text segment of the main binary is also usually placed in the low 2GB of memory, and as a result we are able to use "short" jumps in JIT code. Unfortunately, this doesn't work for PHP compiled as an Apache module.
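For readers following along, the "short" jump condition can be sketched as a range check. This is only an illustration, assuming x86-64, where `call rel32` and `jmp rel32` are 5 bytes long and encode a signed 32-bit displacement relative to the next instruction. The name `fits_rel32` is hypothetical and does not come from opcache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length in bytes of an x86-64 "call rel32" / "jmp rel32" instruction:
 * one opcode byte followed by a 32-bit signed displacement. */
#define REL32_INSN_LEN 5

/* Hypothetical helper: true if a call/jmp rel32 placed at `site` can reach
 * `target`. The displacement is measured from the next instruction. */
bool fits_rel32(uint64_t site, uint64_t target)
{
    /* Illustrative precondition: addresses below 2^62 keep the signed
     * subtraction below free of overflow. */
    assert(site < (UINT64_C(1) << 62) && target < (UINT64_C(1) << 62));

    int64_t disp = (int64_t)target - (int64_t)(site + REL32_INSN_LEN);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

When both the JIT buffer and the .text segment sit in the low 2GB, this check holds for every pair of addresses; when the binary is loaded far from the buffer, it fails and a longer indirect sequence is needed instead.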

We could probably improve this approach by analysing /proc/self/maps to find the best candidate address for SHM.
We already use /proc/self/maps to remap the .text segment into huge pages.
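As a rough illustration of what analysing /proc/self/maps involves, the sketch below parses one line of that file. It assumes the usual Linux layout `start-end perms offset dev inode path` with hexadecimal addresses. The type `map_region` and both function names are hypothetical and are not taken from opcache.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one mapping. */
typedef struct {
    uint64_t start;    /* first address of the mapping (inclusive) */
    uint64_t end;      /* one past the last address (exclusive) */
    char     perms[5]; /* permission field, e.g. "r-xp" */
} map_region;

/* Parse the address range and permissions from one maps line.
 * Returns false if the line does not match the expected layout. */
bool parse_maps_line(const char *line, map_region *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s",
               &out->start, &out->end, out->perms) != 3) {
        return false;
    }
    return out->start < out->end;
}

/* The third permission character is 'x' for executable mappings,
 * which is how a .text segment shows up. */
bool region_is_executable(const map_region *r)
{
    return r->perms[2] == 'x';
}
```

A caller would read the file line by line with `fgets()` and feed each line to `parse_maps_line()`.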

stkeke · 2022-05-25T00:51:49Z

@dstogov Thanks Dmitry for the comments and good hint for a better way.
@cmb69 @ramsey Thanks for taking care of this PoC PR.

I can work out a more workable patch. However, if you have the bandwidth to develop a quick, mergeable patch, feel free to go ahead of me.

Reason: I am quite new to PHP and still ramping up on the PHP source code, so development would probably take me a few months, and I would need to consult you experts from time to time.

stkeke · 2022-06-07T02:58:16Z

@dstogov @cmb69 @ramsey I just pushed the proposed patch, which is ready for review and merge.
Here is the main idea of the JIT buffer relocation. We parse the /proc/self/maps file and calculate the preferred starting address of the opcache/JIT buffer for mmap() in the create_segments() function. After this relocation, we got a 2.7%~3.1% performance gain in our benchmark and much better branching performance.
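A minimal sketch of the "preferred address" idea, assuming Linux: without MAP_FIXED the address passed to mmap() is only a hint, so the result has to be checked and the caller needs a fallback. The names `map_near` and `within_distance` and the `limit` parameter are hypothetical; this is not the code in the patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* True if the span covering both `anchor` and [addr, addr + size)
 * is at most `limit` bytes long. */
bool within_distance(uint64_t anchor, uint64_t addr, uint64_t size,
                     uint64_t limit)
{
    uint64_t lo = addr < anchor ? addr : anchor;
    uint64_t hi = addr + size > anchor ? addr + size : anchor;
    return hi - lo <= limit;
}

/* Ask the kernel for `size` bytes near `hint`. The kernel is free to ignore
 * the hint, so the mapping is released again if it landed too far from
 * `anchor`. Returns MAP_FAILED so the caller can fall back to its original
 * allocation path. */
void *map_near(void *hint, size_t size, const void *anchor, uint64_t limit)
{
    assert(size > 0);
    void *p = mmap(hint, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return MAP_FAILED;
    }
    if (!within_distance((uint64_t)(uintptr_t)anchor,
                         (uint64_t)(uintptr_t)p, size, limit)) {
        munmap(p, size);
        return MAP_FAILED;
    }
    return p;
}
```

Here `anchor` would be some address inside the PHP .text segment and `limit` the maximum distance the JIT can tolerate.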


This is a JIT buffer relocation inspired by this blog
https://v8.dev/blog/short-builtin-calls

For 64-bit applications, branch prediction performance can be
negatively impacted when the target of a branch is more than
4 GB away from the branch.

We try to allocate the opcache/JIT buffer within 4GB of the PHP .text
segment through mmap() with a calculated preferred memory address
while creating segments.

In our benchmark, we found that the PHP interpreter achieved a 2~3% performance
gain and much better branching performance with both 2MB huge pages and
ordinary 4KB pages.

Signed-off-by: Su, Tao <tao.su@intel.com>
Signed-off-by: Wang, Xue <xue1.wang@intel.com>
Tested-by: Wang, Xue <xue1.wang@intel.com>
Reviewed-by: Chen, Hu <hu1.chen@intel.com>
Reviewed-by: You, Lizhen <Lizhen.You@intel.com>

stkeke · 2022-06-23T03:03:27Z

@dstogov @cmb69 @ramsey @arnaud-lb A brand-new patch has been uploaded and passed all CI checks. Ready for review.
The key ideas in the new patch are:

1. Created a lighthouse function which helps us locate the PHP .text segment.
2. Search the /proc/self/maps file for any candidate buffer close enough to the PHP .text segment (within a 4GB chunk):
   a) combine adjacent and overlapping segments read from the maps file
   b) sort the combined segments by start address in ascending order
   c) search for suitable unused holes for our opcache/JIT buffer

Why do we only search holes BEFORE the PHP .text segment?

    /*  PHP segments diagram
               [seg0]     [seg1]      [php .text][heap]      [seg3]
        [hole0]     [hole1]     [hole2]                [hole3]
        We only search any candidates BEFORE PHP .text segment. E.g., [hole0],
        [hole1], and [hole2] might be good candidates. Here is why:
        in standalone PHP, we find that [heap] is closely following PHP .text
        segment and our buffer might block heap growth if we use [hole3].
    */

If our effort to move the JIT buffer closer to the PHP .text segment fails, PHP falls back to the original way of allocating the buffer. We won't get the performance improvement, but the program can go on.
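The search steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: the `seg` type, both function names, and the `MIN_ADDR` floor (64 KiB, a common `vm.mmap_min_addr` value) are hypothetical, and page or huge-page alignment of the result is ignored. The sketch sorts before merging so that it does not rely on the input order. In practice `text_start` would come from the lighthouse function's address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapped range [start, end). */
typedef struct { uint64_t start, end; } seg;

/* Assumed lowest usable address (typical vm.mmap_min_addr). */
#define MIN_ADDR UINT64_C(0x10000)

static int cmp_seg(const void *a, const void *b)
{
    const seg *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Sort by start address, then merge adjacent and overlapping segments
 * in place. Returns the number of segments left. */
size_t merge_segments(seg *s, size_t n)
{
    if (n == 0) {
        return 0;
    }
    qsort(s, n, sizeof *s, cmp_seg);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (s[i].start <= s[out].end) {
            if (s[i].end > s[out].end) {
                s[out].end = s[i].end;   /* extend the current segment */
            }
        } else {
            s[++out] = s[i];             /* start a new segment */
        }
    }
    return out + 1;
}

/* Scan the holes below text_start in merged, sorted segments. Returns the
 * start of a buffer of `size` bytes placed at the top of the highest hole
 * that fits and starts no more than `limit` bytes below text_start, or 0
 * if there is none (the caller then falls back to its usual allocation). */
uint64_t find_hole_before(const seg *s, size_t n, uint64_t text_start,
                          uint64_t size, uint64_t limit)
{
    assert(size > 0);
    uint64_t result = 0;
    uint64_t prev_end = MIN_ADDR;
    for (size_t i = 0; i < n && s[i].start <= text_start; i++) {
        if (s[i].start > prev_end && s[i].start - prev_end >= size) {
            uint64_t cand = s[i].start - size;
            if (text_start - cand <= limit) {
                result = cand;           /* later holes are closer to .text */
            }
        }
        prev_end = s[i].end;
    }
    return result;
}
```

Placing the buffer at the top of the hole keeps it as close to the .text segment as the hole allows.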

stkeke · 2022-06-23T03:45:43Z

Our benchmark with the latest patch shows a steady performance gain: 1) 4KB pages +2.6%, and 2) huge pages +3.0%

arnaud-lb · 2022-06-25T11:21:34Z

Thank you @stkeke. This was nice to review.

This looks good to me apart from a few questions and nit picks. (I want to see Dmitry's review as well)

For information, how do you run the benchmarks ?


stkeke · 2022-06-25T23:54:46Z

> Thank you @stkeke. This was nice to review.
>
> This looks good to me apart from a few questions and nit picks. (I want to see Dmitry's review as well)
>
> For information, how do you run the benchmarks ?

Thanks for the careful review and for the catches and questions. They are valuable for code quality and maintainability.

I have briefly answered a few of them and will give you a further update next Monday (Beijing time).

1) Fix bugs caught by arnaud-lb and by ourselves
2) Unify coding style and convert tabs to spaces
3) Remove unnecessary function declarations
4) Eliminate duplicated code
5) Clarify code with more comments

Arnaud-lb's comments: #8618

Signed-off-by: Su, Tao <tao.su@intel.com>
Reviewed-by: Wang, Xue <xue1.wang@intel.com>
Tested-by: Wang, Xue <xue1.wang@intel.com>

stkeke · 2022-06-27T07:39:21Z

@arnaud-lb I created a new patch which includes all the corrections/updates according to your comments and minor issues found by us. No big program logic changes.

As for our benchmark, we maintain a PHP benchmark framework based on WordPress/MediaWiki with our best-known PHP configurations. The performance indicator is TPS (transactions per second). Some of the benchmarks are already open-sourced here: https://github.com/intel/iodlr/tree/master/containers/wordpress/; more are on the way this year, and we are speeding up. Sorry for not being able to provide detailed performance data here; so far, we have not received approval from our legal department.

dstogov · 2022-06-27T09:29:49Z

In my tests .text segment of PHP binary is started at 0x00600000, there is also a single 2MB read-only data segment before. These addresses seem the default on x86_64 Linux. There are just 4MB of free space before, and it's usually not enough for SHM.
So the patch just doesn't work for this case (create_preferred_segments() returns MAP_FAILED) and then mmap() with MAP_32BIT allocates SHM after the heap.

stkeke · 2022-06-29T00:59:12Z

> In my tests .text segment of PHP binary is started at 0x00600000, there is also a single 2MB read-only data segment before. These addresses seem the default on x86_64 Linux. There are just 4MB of free space before, and it's usually not enough for SHM. So the patch just doesn't work for this case (create_preferred_segments() returns MAP_FAILED) and then mmap() with MAP_32BIT allocates SHM after the heap.

@dstogov Thanks for the information. We have also written a simple C test program and verified that mmap()'ing some memory immediately following [heap] does not block heap growth. So we can now confidently search both BEFORE and AFTER the PHP .text segment without worrying about the heap. We will update our patch and enhance the search soon...

…P-relative calls and jumps This implementation is based on php#8618

dstogov · 2022-06-29T14:11:32Z

@stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

stkeke · 2022-06-29T14:20:08Z

> @stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

Thanks @dstogov You are at least 5x faster than us 😀. Let me check your patch out tomorrow morning.

wxue1 · 2022-06-30T03:37:45Z

> @stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

Thank you. With this patch, our workload shows a performance gain of 1) +2.7% with 4KB pages, and 2) +2.8% with huge pages.

…P-relative calls and jumps (#8890) This implementation is based on #8618 developed by Su Tao, Wang Xue, Chen Hu and Lizhen Lizhen.

dstogov · 2022-06-30T07:53:58Z

The same idea is implemented via 17aa81a

stkeke mentioned this pull request May 24, 2022

Relocate JIT buffer close to PHP and achieve 2% more performance #8619

Closed

ramsey added Extension: opcache Category: JIT Proof of Concept labels May 24, 2022

stkeke marked this pull request as ready for review June 7, 2022 02:48

stkeke changed the title ~~JIT buffer relocation and 2% PHP performance gain~~ JIT buffer relocation and 2~3% PHP performance gain Jun 7, 2022

arnaud-lb reviewed Jun 7, 2022

dstogov requested changes Jun 14, 2022

github-actions bot added the Category: Engine label Jun 22, 2022

stkeke marked this pull request as draft June 22, 2022 10:11

stkeke marked this pull request as ready for review June 23, 2022 02:44

stkeke requested a review from dstogov June 23, 2022 03:03

arnaud-lb reviewed Jun 25, 2022

dstogov reviewed Jun 27, 2022

dstogov added a commit to dstogov/php-src that referenced this pull request Jun 29, 2022

Allocate JIT bufer close to PHP .text segment to allow using direct I…

41ffe38

…P-relative calls and jumps This implementation is based on php#8618

dstogov mentioned this pull request Jun 29, 2022

Allocate JIT bufer close to PHP .text segment to allow using direct IP-relative calls and jumps #8890

Merged

dstogov added a commit that referenced this pull request Jun 30, 2022

Allocate JIT bufer close to PHP .text segment to allow using direct I…

17aa81a

…P-relative calls and jumps (#8890) This implementation is based on #8618 developed by Su Tao, Wang Xue, Chen Hu and Lizhen Lizhen.

dstogov closed this Jun 30, 2022
