
Proposal: Run the Control Flow Graph pass twice #5097


Closed
wants to merge 3 commits

Conversation

TysonAndre
Contributor

TysonAndre commented Jan 20, 2020

Run the Control Flow Graph (CFG) optimizations a second time,
after all of the type inference steps and optimizations (e.g. SCCP) have run.

The passes that run after the CFG pass eliminate various redundant conditions,
infer types and constants for values,
and can also change or remove opcodes.
Running the CFG optimizations again lets them
optimize the resulting code further,
and lets those later passes rely on transformations
implemented in the CFG code
without duplicating the implementation.

This is meant as a proof of concept,

  • I'm not sure of the best way to make the debug output
    (e.g. pass numbers) follow conventions.

  • I'm not familiar with the data structures used.
    I'm assuming it's safe to run pass 5 at that point.

  • There are at most 3 passes on the first run of CFG.
    (see `#define PASSES 3` in block_pass.c)

    Maybe the second run of CFG should have a lower limit.

  • I'm not sure if the passes are run only once for some other reason (e.g. correctness, reducing startup time, etc.).
    Hopefully that's not the case (the opcache shared memory cache and file cache help with performance).
    I didn't see any similar proposals after a quick search of PRs/issue trackers for "opcache"/"pass"/"cfg".

Tests were updated:

  • CFG can eliminate unnecessary jumps (left over after SCCP removes impossible conditions)
  • CFG merged consecutive calls to ECHO into a single ECHO of one string.
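The ECHO-merging idea can be sketched language-agnostically. The following Python uses a hypothetical `(op, operand)` tuple IR, not the real Zend opcode structures, so it only models the transformation, assuming adjacent ECHOs of string constants are safe to fold:

```python
# Toy sketch of the ECHO-merging optimization described above.
# These (op, operand) tuples are a hypothetical IR, not real Zend opcodes.

def merge_echo_constants(opcodes):
    """Merge runs of consecutive ECHOs of string constants into one ECHO."""
    result = []
    for op, operand in opcodes:
        if (op == "ECHO" and isinstance(operand, str)
                and result and result[-1][0] == "ECHO"
                and isinstance(result[-1][1], str)):
            # Fold this constant into the previous ECHO's constant.
            result[-1] = ("ECHO", result[-1][1] + operand)
        else:
            result.append((op, operand))
    return result

ops = [("ECHO", "Hello, "), ("ECHO", "world"), ("RETURN", None)]
print(merge_echo_constants(ops))  # [('ECHO', 'Hello, world'), ('RETURN', None)]
```

In the real pass this folding is only valid within one basic block, which is why it lives in the block pass.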

EDIT: INIT_FCALL seems to have the wrong stack size after this patch; investigating (https://ci.appveyor.com/project/php/php-src/builds/30241892/job/1swrsqwxnxkcuhkd/tests)
The sizes were previously 192 and 144, and now they're wrong.

317+ L2 (126):   INIT_FCALL 1 96 string("test_open_basedir_before")
318+ L3 (126):   SEND_VAR CV0($function) 1
319+ L4 (126):   DO_UCALL
320+ L5 (127):   INIT_FCALL 1 432 string("test_open_basedir_error")

@dstogov
Member

dstogov commented Jan 20, 2020

It may be better to extend the DFA pass to perform the missed optimizations.
This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

@nikic you had some thoughts about Optimizer improvement...?
I made some cleanups last year, but a lot more could be done.
Also, it would be great to move the optimizer and JIT into core, to make them available even without opcode caching.

@TysonAndre
Contributor Author

> It may be better to extend the DFA pass to perform the missed optimizations.
> This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

I'll look into that.

I found out why the tests fail: it's probably because pass 5 doesn't adjust all of the caller_init_opline addresses. It mallocs a new array of oplines, but doesn't update the pointers into the old one.

There may be other issues.

==1888== Invalid read of size 1
==1888==    at 0x6606563: zend_adjust_fcall_stack_size_graph (zend_optimizer.c:1348)
==1888==    by 0x6606C57: zend_optimize_script (zend_optimizer.c:1459)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888==    by 0x7FFCB6: execute_ex (zend_vm_execute.h:52196)
==1888==    by 0x8038DB: zend_execute (zend_vm_execute.h:55983)
==1888==    by 0x730EC5: zend_execute_scripts (zend.c:1666)
==1888==    by 0x69E528: php_execute_script (main.c:2584)
==1888==    by 0x80624B: do_cli (php_cli.c:959)
==1888==    by 0x80726B: main (php_cli.c:1350)
==1888==  Address 0x5fcf67c is 2,172 bytes inside a block of size 2,272 free'd
==1888==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1888==    by 0x6FA664: _efree_custom (zend_alloc.c:2425)
==1888==    by 0x6FA7A5: _efree (zend_alloc.c:2545)
==1888==    by 0x6612E30: assemble_code_blocks (block_pass.c:985)
==1888==    by 0x661609D: zend_optimize_cfg (block_pass.c:1893)
==1888==    by 0x6606AA1: zend_optimize_script (zend_optimizer.c:1432)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888==    by 0x7FFCB6: execute_ex (zend_vm_execute.h:52196)
==1888==    by 0x8038DB: zend_execute (zend_vm_execute.h:55983)
==1888==  Block was alloc'd at
==1888==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1888==    by 0x6FB667: __zend_malloc (zend_alloc.c:2975)
==1888==    by 0x6FA5FD: _malloc_custom (zend_alloc.c:2416)
==1888==    by 0x6FA72B: _emalloc (zend_alloc.c:2535)
==1888==    by 0x6612D45: assemble_code_blocks (block_pass.c:972)
==1888==    by 0x661609D: zend_optimize_cfg (block_pass.c:1893)
==1888==    by 0x660575B: zend_optimize (zend_optimizer.c:967)
==1888==    by 0x6606747: zend_optimize_script (zend_optimizer.c:1390)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888== 

@TysonAndre
Contributor Author

> It may be better to extend the DFA pass to perform the missed optimizations.
> This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

That would have some drawbacks that made me hesitant to propose doing it that way:

  • More effort to maintain; patches might get merged into one pass but not the other (depending on how it gets implemented)
  • More possibility of bugs (e.g. incorrect assumptions) in the duplicated definitions
  • Would need to add the shift list to the CFG phase to account for optimizations that increase the number of opcodes

Another option would be to add another ini directive, e.g. opcache.repeated_opt_passes=0x...,
and document that it increases startup time slightly in exchange for a slightly larger speedup after startup in some functions (or change the default of opcache.opt_level).

TysonAndre referenced this pull request Jan 20, 2020
Instead directly use return_value. Hopefully this also works
around the "may be uninitialized" warning...
@TysonAndre
Contributor Author

The remaining build failure doesn't look like it was caused by my PR itself, but by the fact that opcache had to recompile instead of using the file cache (or some other reason): the implementation of array_reduce changed.

`make test TESTS=ext/standard/tests/array/bug76778.phpt\ --show-diff\ -m\ --show-mem` fails locally for me on both master and this branch, after clearing the local opcache file cache.

@nikic
Member

nikic commented Jan 21, 2020

My general preference would be to move optimizations from block pass to SSA pass where possible. There are some things that need to happen before SSA construction (e.g. optimization of &&) because we get a better control flow graph from that, but I think many of the block pass optimizations shouldn't really be part of the block pass. For example, I see no reason why the "echo" merging optimization that seems to cause most of the diff here is part of that pass.

@dstogov
Member

dstogov commented Jan 21, 2020

  1. Optimizations in the block pass should be cheaper, because they don't require SSA reconstruction.
  2. The early optimization passes significantly reduce the cost of SSA construction.
  3. The DFA optimization pass has more limitations (it requires a reducible CFG and no catch/finally blocks).

So I don't think it makes sense to "move" optimizations; it would be better to "duplicate" them, or to factor out the valuable patterns and call them from both passes.

@nikic
Member

nikic commented Jan 21, 2020

@dstogov I don't remember the details, but IIRC the block pass can be much more expensive than the DFA passes, because the variable source tracking it does is so inefficient. Just zeroing out the source buffer for each block is very expensive if there are many temporaries. (We could compact temporaries earlier, but I think the block pass relies on unique temporaries for some of its optimizations.) That's why I think any optimizations related to variables should be in DFA. Of course, jump optimizations should stay in the block pass.

@TysonAndre
Contributor Author

TysonAndre commented Jan 22, 2020

> For example, I see no reason why the "echo" merging optimization that seems to cause most of the diff here is part of that pass.

I'm stuck on how that optimization would best be implemented outside of the block pass.

Is there a way to efficiently tell, during the SCCP phase, whether an opcode other than the preceding opcode can also jump to a given opcode? If so, is that information updated after no-ops are removed?

  • I guess a one-off pass could find all opcodes that perform jumps, if there's currently no way. EDIT: e.g. call zend_build_cfg from sccp to build the basic blocks.

I see mentions of predecessors and successors, but they look like they're for data flow, not control flow (and I'd guess IS_CONST operands wouldn't have data flow).

E.g. how would I tell whether the zend_op for L3 has more than one incoming jump (via control flow, goto, etc.)?

  • zend_optimize_block can do it because it knows the start opline and end opline of a zend_basic_block.
<?php
function main(bool $flag) {
    if ($flag) {
        echo "True";
    }
mid:
    echo "Suffix";
}
L0 (2):     CV0($flag) = RECV 1
L1 (3):     JMPZ CV0($flag) L3
L2 (4):     ECHO string("True")
L3 (7):     ECHO string("Suffix")
L4 (8):     RETURN null
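For intuition only, counting incoming control-flow edges per opline can be sketched as below (Python over a hypothetical `(opcode, jump_target)` IR, not the Zend CFG API). L3 above is reached both by fall-through from L2 and by the JMPZ at L1, which is exactly what makes blindly merging L2 and L3 unsafe:

```python
# Hypothetical sketch: count incoming control-flow edges per instruction.
# Each instruction is (opcode, jump_target_or_None); indices are opline numbers.

def predecessor_counts(instructions):
    counts = [0] * len(instructions)
    for i, (op, target) in enumerate(instructions):
        # Fall-through edge to the next instruction (unconditional jumps
        # and returns don't fall through).
        if op not in ("JMP", "RETURN") and i + 1 < len(instructions):
            counts[i + 1] += 1
        # Explicit jump edge (conditional jumps both fall through and jump).
        if target is not None:
            counts[target] += 1
    return counts

# The main() example above: JMPZ at L1 targets L3, and L2 falls through to L3.
code = [("RECV", None), ("JMPZ", 3), ("ECHO", None), ("ECHO", None), ("RETURN", None)]
print(predecessor_counts(code))  # [0, 1, 1, 2, 1]
```

Any opline with a predecessor count above one starts a new basic block, so a block-local optimization must not merge it with the instruction before it.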

@nikic
Member

nikic commented Jan 22, 2020

@TysonAndre The DFA pass also builds a CFG, so it should have access to the same information as the block pass. The CFG is stored here:

zend_cfg cfg; /* control flow graph */

TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Jan 26, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of php#5097
@TysonAndre
Contributor Author

Looking again, some of the optimizations in the block pass seem unsound to use the way this PR would start using them, such as merging ZEND_ECHO calls (not just less effective).

e.g. if $x = self::SOME_CONST, and echo $x; is used multiple times, then ZEND_OP1_LITERAL(last_op) might refer to the same literal id.

if (Z_TYPE(ZEND_OP1_LITERAL(opline)) != IS_STRING) {
	convert_to_string(&ZEND_OP1_LITERAL(opline));
}
if (Z_TYPE(ZEND_OP1_LITERAL(last_op)) != IS_STRING) {
	convert_to_string(&ZEND_OP1_LITERAL(last_op));
}
old_len = Z_STRLEN(ZEND_OP1_LITERAL(last_op));
l = old_len + Z_STRLEN(ZEND_OP1_LITERAL(opline));
if (!Z_REFCOUNTED(ZEND_OP1_LITERAL(last_op))) {
	zend_string *tmp = zend_string_alloc(l, 0);
	memcpy(ZSTR_VAL(tmp), Z_STRVAL(ZEND_OP1_LITERAL(last_op)), old_len);
	Z_STR(ZEND_OP1_LITERAL(last_op)) = tmp;
} else {
	Z_STR(ZEND_OP1_LITERAL(last_op)) = zend_string_extend(Z_STR(ZEND_OP1_LITERAL(last_op)), l, 0);
}

If I'm reading it right, that pass assumes that a given literal id for some opcode types is used in exactly one place, and that it's therefore safe to modify the constant's zval in place.
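The hazard is ordinary aliasing. As a minimal analogy (a Python list standing in for a shared zval literal; this is not the Zend data model), mutating a shared literal in place for one use silently changes the other use:

```python
# If two opcodes share one literal slot, mutating it in place while
# merging one ECHO silently changes what the other opcode prints.
shared_literal = ["True"]           # one literal slot, referenced twice
op_a = {"op": "ECHO", "lit": shared_literal}
op_b = {"op": "ECHO", "lit": shared_literal}

# "Merge" a following constant into op_a's literal in place...
op_a["lit"][0] += "Suffix"

# ...and op_b's operand has changed too.
print(op_b["lit"][0])  # TrueSuffix
```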

Also, I'm closing this PR because it isn't the way to go, and some of the work has already been started elsewhere. If you think any of these tasks from the PR discussion should get done, file an issue.

  • Converting echo's operand to a string is in progress
  • Combining echo calls after SCCP and other phases isn't started
  • Optimizing jumps/control flow isn't started
  • Figuring out which optimizations from the block pass (if it were run twice) would take effect on commonly used projects hasn't been done (e.g. comparing opcache.opt_debug_level=0x20000 output before/after this PR)

TysonAndre closed this Jan 26, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Jan 27, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of php#5097
php-pulls pushed a commit that referenced this pull request Jan 28, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of #5097 (on GitHub)

Closes GH-5118
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 1, 2020
Currently, it isn't possible to enable optimizations without enabling caching.
They should be orthogonal features - it's already possible to cache without
optimization passes (`opcache.optimization_level=0`)

Being forced to either enable caching or file_cache_only causes these problems:

- For file_cache_only, users would be forced to manage the file cache,
  and users may not be confident in properly handling all potential edge cases.

  (Running out of disk space, need to clear stale entries, concerns about opcode
  corruption not being fixed after restarting a process, etc)
- The shared memory block uses up a lot of RAM when individual PHP processes
  are known to be long-lived and don't have a common memory address space.
  (e.g. multiple long-lived php scripts managed by something that is not
  a php script (e.g. `supervisord`, daemons, tools run in the background in
  IDEs, etc))

  The opcodes are duplicated in shmem, which wastes memory in use cases
  where they're stored in the cache but never retrieved.

  The default for opcache.memory_consumption is 128M.
  Even when this isn't used, the virtual memory to track the shared
  memory (shmem) segment seems to add an extra 2MB per **independent** php process in
  "shared memory" segments, reducing the free RAM available for other processes.

  (Starting a large number of php CLI scripts that sleep() in a loop,
  `free` (the linux program to report free memory) reports that `shared`
  increases by 2MB per process without opcache.no_cache, but barely increases
  with opcache.no_cache=1.)

On an unrelated PR php#5097 (comment)
dstogov mentioned that

> Also, it would be great to move optimizer and JIT into core, to make them
> available even without opcode caching.

This PR is one potential approach for making the optimizer and JIT available
without opcode caching.

This is currently a proof of concept - feel free to point out combinations of
settings that should be rejected.

Potential areas for future work:

- Normally, opcache optimizes a file based only on that one file's contents.
  When `opcache.no_cache=1` is used, it may be possible to use all of the
  class, function, constant, etc. definitions parsed from previously parsed
  files (to eliminate dead code, inline function calls, etc).
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 16, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 16, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 20, 2020
Currently, it isn't possible to enable optimizations without enabling caching.
They should be orthogonal features - it's already possible to cache without
optimization passes (`opcache.optimization_level=0`)

(In use cases such as running a web server in apache,
it's very desirable to both cache and optimize opcodes.
For short-lived cli scripts, either file_cache or disabling opcache
(to reduce optimization overhead) is often useful.
In a few use cases, it's desirable to optimize without any caching)

Being forced to either enable shared memory caching or file_cache_only
causes these problems:

- **For file_cache_only, users would be forced to manage the file cache.**

  - Users may not be confident in properly handling all potential edge cases.
  - End users of applications may not be familiar with opcache.

  (Concerns include running out of disk space, needing to clear stale entries,
  concerns about opcode corruption or opcodes for the wrong file contents
  not being fixed after restarting a process, etc)
- **The shared memory block uses up a lot of RAM when individual PHP processes
  are known to be long-lived and don't have a common memory address space.**
  (e.g. multiple long-lived php scripts managed by something that is not
  a php script (e.g. `supervisord`, daemons, tools run in the background in
  IDEs, etc))

  The opcodes are duplicated in shmem, which wastes memory in use cases
  where opcodes are stored in the cache but never retrieved.

  The default for opcache.memory_consumption is 128M.
  Even when this isn't used, the virtual memory to track the shared
  memory (shmem) segment seems to add an extra 2MB per **independent** php process in
  "shared memory" segments, reducing the free RAM available for other processes.

  (Starting a large number of php CLI scripts that sleep() in a loop,
  `free` (the linux program to report free memory) reports that `shared`
  increases by 2MB per process without opcache.no_cache, but barely increases
  with opcache.no_cache=1.)

`opcache.no_cache` takes precedence over `opcache.file_cache*` settings.
(If enabled, neither the shared memory cache nor the file cache will be used.
It is an error to use both opcache.no_cache and opcache.preload.)

On an unrelated PR php#5097 (comment)
dstogov mentioned that

> Also, it would be great to move optimizer and JIT into core, to make them
> available even without opcode caching.

This PR is one potential approach for making the optimizer and JIT available
without opcode caching.

- Even if everything except caching were moved into core,
  it would still be useful to have a configurable way to disable opcode caching
  via a system ini setting (`php -d opcache.no_cache=1 my_script.php`)

This is currently a proof of concept - feel free to point out combinations of
settings that should be rejected.

Potential areas for future work:

- Normally, opcache optimizes a file based only on that one file's contents.
  When `opcache.no_cache=1` is used, it may be possible to use all of the
  class, function, constant, etc. definitions parsed from previously parsed
  files (to eliminate dead code, inline function calls, etc).