Skip to content

JIT buffer relocation and 2~3% PHP performance gain #8618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

JIT buffer relocation and 2~3% PHP performance gain #8618

wants to merge 2 commits into from

Conversation

stkeke
Copy link
Contributor

@stkeke stkeke commented May 24, 2022

This is a JIT buffer relocation inspired by this blog
https://v8.dev/blog/short-builtin-calls

For 64-bit applications, branch prediction performance can be
negatively impacted when the target of a branch is more than
4 GB away from the branch.

We try to allocate opcache/JIT buffer just prior to PHP .text
segments through mmap() with a calculated preferred memory address
while creating segments.

In our benchmark, we found PHP interpreter archieved 2-3% performance
and much better branching performance with both 2MB huge pages and
ordinary 4KB pages.

Signed-off-by: Su, Tao tao.su@intel.com
Tested-by: Wang, Xue xue1.wang@intel.com
Reviewed-by: Chen, Hu hu1.chen@intel.com
Reviewed-by: You, Lizhen Lizhen.You@intel.com

@cmb69
Copy link
Member

cmb69 commented May 24, 2022

@dstogov, do you think this is worth pursuing? (it can't work on Windows, but maybe on other systems)

@dstogov
Copy link
Member

dstogov commented May 24, 2022

I read that blog. Unfortunately, this PR is just a very basic PoC.
In any case, thanks for trying this and sharing the benchmark results.

Currently opcache tries to allocate SHM in the low 2GB using MAP_32BIT.
.text segment of the main binary is also usually placed in the low 2GB memory and as result we are able to use "short" jumps in JIT code. Unfortunately, this doesn't work for PHP compiled as Apache module.

Probably, we may improve this approach by searching for the best candidate for SHM analysing /proc/self/maps.
We already use /proc/self/maps to remap .text segment into huge pages.

@stkeke
Copy link
Contributor Author

stkeke commented May 25, 2022

@dstogov Thanks Dmitry for the comments and good hint for a better way.
@cmb69 @ramsey Thanks for taking care of this PoC PR.

I can work out a more workable patch. However, If you guys have the bandwidth to develop a quick mergeable patch, feel free to go ahead of me.

Reason: I am quite new to PHP and still ramping up PHP source code, so it probably takes me a few months for development and need to consult you experts from time to time.

@stkeke stkeke marked this pull request as ready for review June 7, 2022 02:48
@stkeke stkeke changed the title JIT buffer relocation and 2% PHP performance gain JIT buffer relocation and 2~3% PHP performance gain Jun 7, 2022
@stkeke
Copy link
Contributor Author

stkeke commented Jun 7, 2022

@dstogov @cmb69 @ramsey I just pushed the proposed patch which is ready for review and merge.
Here is the man idea for jit buffer relocation. We parse /proc/self/maps file and calculate the preferred starting address of opcache/jit buffer for mmap() in create_segments() function. After this relocation, we eventually got 2.7%~3.1% performance gain in our benchmark and much better branching performance.

@stkeke stkeke marked this pull request as draft June 22, 2022 10:11
This is a JIT buffer relocation inspired by this blog
https://v8.dev/blog/short-builtin-calls
  For 64-bit applications, branch prediction performance can be
  negatively impacted when the target of a branch is more than
  4 GB away from the branch.

We try to allocate opcache/JIT buffer just within 4GB of PHP text
segment through mmap() with a calculated preferred memory address
while creating segments.

In our benchmark, we found PHP interpreter archieved 2~3% performance
and much better branching performance for both 2MB huge pages and
ordinary 4KB pages.

Signed-off-by: Su, Tao     <tao.su@intel.com>
Signed-off-by: Wang, Xue   <xue1.wang@intel.com>
Tested-by:     Wang, Xue   <xue1.wang@intel.com>
Reviewed-by:   Chen, Hu    <hu1.chen@intel.com>
Reviewed-by:   You, Lizhen <Lizhen.You@intel.com>
@stkeke stkeke marked this pull request as ready for review June 23, 2022 02:44
@stkeke
Copy link
Contributor Author

stkeke commented Jun 23, 2022

@dstogov @cmb69 @ramsey @arnaud-lb A brand-new patch has been uploaded and passed all CI checks. Ready for review.
The key ideas in new patch are

  1. created a lighthouse function which helps us to locate PHP .text segment
  2. search /proc/self/maps file for any candidate buffer close enough to PHP .text segment (with 4GB chunk)
    a) combine adjacent and overlapped segments got from maps file
    b) sort combined segments by start address in ascending order
    c) search suitable unused holes as our opcache/jit buffer
  3. Why we only search holes BEFORE PHP .text segment?
    /*  PHP segments diagram
               [seg0]     [seg1]      [php .text][heap]      [seg3]
        [hole0]     [hole1]     [hole2]                [hole3]
        We only search any candidates BEFORE PHP .text segment. E.g., [hole0],
        [hole1], and [hole2] might be good candidates. Here is why:
        in standalone PHP, we find that [heap] is closely following PHP .text
        segment and our buffer might block heap growth if we use [hole3].
    */
  1. If our effort to move jit buffer closer to PHP .text segment fails, PHP will fall back the original way to allocate buffer. We won't get performance improvement, but program can go on.

@stkeke stkeke requested a review from dstogov June 23, 2022 03:03
@stkeke
Copy link
Contributor Author

stkeke commented Jun 23, 2022

Our benchmark with the latest patch shows steadily performance gain 1) 4kb pages +2.6%, and 2) huge pages +3.0%

@arnaud-lb
Copy link
Member

Thank you @stkeke. This was nice to review.

This looks good to me apart from a few questions and nit picks. (I want to see Dmitry's review as well)

For information, how do you run the benchmarks ?

@stkeke
Copy link
Contributor Author

stkeke commented Jun 25, 2022

Thank you @stkeke. This was nice to review.

This looks good to me apart from a few questions and nit picks. (I want to see Dmitry's review as well)

For information, how do you run the benchmarks ?

Thanks for the careful review and catches/questions. They are valuable to code quality and maintainability.

I simply answered a few of them, and will give you more update next Monday (Beijing time).

1) fix bugs captured by arnad-lb and by ourselves
2) unify coding style and convert tab to space
3) remove unnecessary function declaration
4) eliminate duplicated code
5) clarify code with more comments

Arnaud-lb's comments: #8618

Signed-off-by: Su, Tao <tao.su@intel.com>
Reviewed-by:   Wang, Xue xue1.wang@intel.com
Tested-by:     Wang, Xue xue1.wang@intel.com
@stkeke
Copy link
Contributor Author

stkeke commented Jun 27, 2022

@arnaud-lb I created a new patch which includes all the corrections/updates according to your comments and minor issues found by us. No big program logic changes.

As of our benchmark, we are actually maintaining a PHP benchmark framework based on Wordpress/MediaWiki with our best-known PHP configurations. The performance indicator is TPS (transaction per second). Some of benchmarks are already open sourced at here: https://github.com/intel/iodlr/tree/master/containers/wordpress/; more on the way this year, we are speeding up. Sorry for not being able to provide detailed performance data here. so far, we have not got legal department approval.

@dstogov
Copy link
Member

dstogov commented Jun 27, 2022

In my tests .text segment of PHP binary is started at 0x00600000, there is also a single 2MB read-only data segment before. These addresses seem the default on x86_64 Linux. There are just 4MB of free space before, and it's usually not enough for SHM.
So the patch just doesn't work for this case (create_preferred_segments() returns MAP_FAILED) and then mmap() with MAP_32BIT allocates SHM after the heap.

@stkeke
Copy link
Contributor Author

stkeke commented Jun 29, 2022

In my tests .text segment of PHP binary is started at 0x00600000, there is also a single 2MB read-only data segment before. These addresses seem the default on x86_64 Linux. There are just 4MB of free space before, and it's usually not enough for SHM. So the patch just doesn't work for this case (create_preferred_segments() returns MAP_FAILED) and then mmap() with MAP_32BIT allocates SHM after the heap.

@dstogov Thanks for the information. We have also written a simple test C program and verified that it will not block heap growth after mmap()'ing some memory immediately following [heap]. So we can now confidentially search BEFORE and AFTER PHP .text segment without worrying about heap things. We will update our patch and enhance searching soon...

dstogov added a commit to dstogov/php-src that referenced this pull request Jun 29, 2022
…P-relative calls and jumps

This implementation is based on php#8618
@dstogov
Copy link
Member

dstogov commented Jun 29, 2022

@stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

@stkeke
Copy link
Contributor Author

stkeke commented Jun 29, 2022

@stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

Thanks @dstogov You are at least 5x faster than us 😀. Let me check your patch out tomorrow morning.

@wxue1
Copy link
Contributor

wxue1 commented Jun 30, 2022

@stkeke I made an attempt to implement your ideas in a simpler way. Please take a look at #8890

Thank you~, With this patch our workload has performance gain 1) 4kb pages +2.7%, and 2) huge pages +2.8%

dstogov added a commit that referenced this pull request Jun 30, 2022
…P-relative calls and jumps (#8890)

This implementation is based on #8618 developed by Su Tao, Wang Xue, Chen Hu and Lizhen Lizhen.
@dstogov
Copy link
Member

dstogov commented Jun 30, 2022

The same idea is implemented via 17aa81a

@dstogov dstogov closed this Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants