
Proposal: Run the Control Flow Graph pass twice #5097


Closed
wants to merge 3 commits

Conversation

TysonAndre
Contributor

TysonAndre commented Jan 20, 2020

Run the Control Flow Graph (CFG) optimizations a second time,
after all of the type inference steps and optimizations (e.g. SCCP) have run.

The passes that run after the CFG pass eliminate various redundant conditions,
infer types and constants for values,
and can also change or remove opcodes.
Running the CFG optimizations again lets them
optimize the resulting code further,
and lets those later passes rely on transformations
implemented in the CFG code
without duplicating the implementation.

This is meant as a proof of concept,

  • I'm not sure of the best way to make the debug output
    (e.g. pass numbers) follow conventions.

  • I'm not familiar with the data structures used.
    I'm assuming it's safe to run pass 5 at that point.

  • There are at most 3 passes on the first run of CFG.
    (see `#define PASSES 3` in block_pass.c)

    Maybe the second run of CFG should have a lower limit.

  • I'm not sure if the passes are run only once for some other reason (e.g. correctness, reducing startup time, etc.).
    Hopefully that's not the case (the opcache shared memory cache and file cache help with performance).
    I didn't see any similar proposals after a quick search of PRs/issue trackers for "opcache"/"pass"/"cfg".

Tests were updated:

  • CFG can eliminate unnecessary jumps (left over after SCCP removes impossible conditions)
  • CFG merged consecutive calls to ECHO into a single ECHO of one string.
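The ECHO-merging idea can be sketched language-agnostically. The following Python uses a hypothetical `(op, operand)` tuple IR, not the real Zend opcode structures, so it only models the transformation, assuming adjacent ECHOs of string constants are safe to fold:

```python
# Toy sketch of the ECHO-merging optimization described above.
# These (op, operand) tuples are a hypothetical IR, not real Zend opcodes.

def merge_echo_constants(opcodes):
    """Merge runs of consecutive ECHOs of string constants into one ECHO."""
    result = []
    for op, operand in opcodes:
        if (op == "ECHO" and isinstance(operand, str)
                and result and result[-1][0] == "ECHO"
                and isinstance(result[-1][1], str)):
            # Fold this constant into the previous ECHO's constant.
            result[-1] = ("ECHO", result[-1][1] + operand)
        else:
            result.append((op, operand))
    return result

ops = [("ECHO", "Hello, "), ("ECHO", "world"), ("RETURN", None)]
print(merge_echo_constants(ops))  # [('ECHO', 'Hello, world'), ('RETURN', None)]
```

In the real pass this folding is only valid within one basic block, which is why it lives in the block pass.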

EDIT: INIT_FCALL seems to have the wrong stack size after this patch; investigating (https://ci.appveyor.com/project/php/php-src/builds/30241892/job/1swrsqwxnxkcuhkd/tests)
The sizes were previously 192 and 144, and now they're wrong.

317+ L2 (126):   INIT_FCALL 1 96 string("test_open_basedir_before")
318+ L3 (126):   SEND_VAR CV0($function) 1
319+ L4 (126):   DO_UCALL
320+ L5 (127):   INIT_FCALL 1 432 string("test_open_basedir_error")

@dstogov
Member

dstogov commented Jan 20, 2020

It may be better to extend the DFA pass to perform the missed optimizations.
This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

@nikic you had some thoughts about Optimizer improvement...?
I made some cleanups last year, but a lot more could be done.
Also, it would be great to move the optimizer and JIT into core, to make them available even without opcode caching.

@TysonAndre
Contributor Author

> It may be better to extend the DFA pass to perform the missed optimizations.
> This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

I'll look into that.

I found out why the tests fail: it's probably because pass 5 doesn't adjust all of the caller_init_opline addresses. It mallocs a new array of oplines, but doesn't update the pointers into the old one.

There may be other issues.

==1888== Invalid read of size 1
==1888==    at 0x6606563: zend_adjust_fcall_stack_size_graph (zend_optimizer.c:1348)
==1888==    by 0x6606C57: zend_optimize_script (zend_optimizer.c:1459)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888==    by 0x7FFCB6: execute_ex (zend_vm_execute.h:52196)
==1888==    by 0x8038DB: zend_execute (zend_vm_execute.h:55983)
==1888==    by 0x730EC5: zend_execute_scripts (zend.c:1666)
==1888==    by 0x69E528: php_execute_script (main.c:2584)
==1888==    by 0x80624B: do_cli (php_cli.c:959)
==1888==    by 0x80726B: main (php_cli.c:1350)
==1888==  Address 0x5fcf67c is 2,172 bytes inside a block of size 2,272 free'd
==1888==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1888==    by 0x6FA664: _efree_custom (zend_alloc.c:2425)
==1888==    by 0x6FA7A5: _efree (zend_alloc.c:2545)
==1888==    by 0x6612E30: assemble_code_blocks (block_pass.c:985)
==1888==    by 0x661609D: zend_optimize_cfg (block_pass.c:1893)
==1888==    by 0x6606AA1: zend_optimize_script (zend_optimizer.c:1432)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888==    by 0x7FFCB6: execute_ex (zend_vm_execute.h:52196)
==1888==    by 0x8038DB: zend_execute (zend_vm_execute.h:55983)
==1888==  Block was alloc'd at
==1888==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1888==    by 0x6FB667: __zend_malloc (zend_alloc.c:2975)
==1888==    by 0x6FA5FD: _malloc_custom (zend_alloc.c:2416)
==1888==    by 0x6FA72B: _emalloc (zend_alloc.c:2535)
==1888==    by 0x6612D45: assemble_code_blocks (block_pass.c:972)
==1888==    by 0x661609D: zend_optimize_cfg (block_pass.c:1893)
==1888==    by 0x660575B: zend_optimize (zend_optimizer.c:967)
==1888==    by 0x6606747: zend_optimize_script (zend_optimizer.c:1390)
==1888==    by 0x65D23FE: cache_script_in_shared_memory (ZendAccelerator.c:1450)
==1888==    by 0x65D43D3: persistent_compile_file (ZendAccelerator.c:2135)
==1888==    by 0x79C027: zend_include_or_eval (zend_execute.c:4185)
==1888==    by 0x7A5796: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:3980)
==1888== 

@TysonAndre
Contributor Author

> It may be better to extend the DFA pass to perform the missed optimizations.
> This would require keeping the SSA form consistent, but it should still be cheaper than the whole CFG pass.

That would have some drawbacks that made me hesitant to propose doing it that way:

  • More effort to maintain; patches might get merged into one pass but not the other (depending on how it gets implemented)
  • More possibility of bugs (e.g. incorrect assumptions) in the duplicated definitions
  • Would need to add the shift list to the CFG phase to account for optimizations that increase the number of opcodes

Another option would be to add another ini directive, e.g. opcache.repeated_opt_passes=0x...,
and document that it increases startup time slightly in exchange for a slightly larger speedup after startup in some functions (or change the default of opcache.opt_level).

TysonAndre referenced this pull request Jan 20, 2020
Instead directly use return_value. Hopefully this also works
around the "may be uninitialized" warning...
@TysonAndre
Contributor Author

The remaining build failure doesn't look like it was caused by my PR itself, but by the fact that opcache had to recompile instead of using the file cache (or some other reason): the implementation of array_reduce changed.

`make test TESTS=ext/standard/tests/array/bug76778.phpt\ --show-diff\ -m\ --show-mem` fails locally for me on both master and this branch, after clearing the local opcache file cache.

@nikic
Member

nikic commented Jan 21, 2020

My general preference would be to move optimizations from block pass to SSA pass where possible. There are some things that need to happen before SSA construction (e.g. optimization of &&) because we get a better control flow graph from that, but I think many of the block pass optimizations shouldn't really be part of the block pass. For example, I see no reason why the "echo" merging optimization that seems to cause most of the diff here is part of that pass.

@dstogov
Member

dstogov commented Jan 21, 2020

  1. Optimizations in the block pass should be cheaper, because they don't require SSA reconstruction.
  2. The early optimization passes significantly reduce the cost of SSA construction.
  3. The DFA optimization pass has more limitations (it requires a reducible CFG and no catch/finally blocks).

So I don't think it makes sense to "move" optimizations; it would be better to "duplicate" them, or to factor out the valuable patterns and call them from both passes.

@nikic
Member

nikic commented Jan 21, 2020

@dstogov I don't remember the details, but IIRC the block pass can be much more expensive than the DFA passes, because the variable source tracking it does is so inefficient. Just zeroing out the source buffer for each block is very expensive if there are many temporaries. (We could compact temporaries earlier, but I think the block pass relies on unique temporaries for some of its optimizations.) That's why I think any optimizations related to variables should be in DFA. Of course, jump optimizations should stay in the block pass.

@TysonAndre
Contributor Author

TysonAndre commented Jan 22, 2020

> For example, I see no reason why the "echo" merging optimization that seems to cause most of the diff here is part of that pass.

I'm stuck on how that optimization would best be implemented outside of the block pass.

Is there a way to efficiently tell, during the SCCP phase, whether an opcode other than the preceding opcode can also jump to a given opcode? If so, is that information updated after no-ops are removed?

  • I guess a one-off pass could find all opcodes that perform jumps, if there's currently no way. EDIT: e.g. call zend_build_cfg from sccp to build the basic blocks.

I see mentions of predecessors and successors, but they look like they're for data flow, not control flow (and I'd guess IS_CONST operands wouldn't have data flow).

E.g. how would I tell whether the zend_op for L3 has more than one incoming jump (via control flow, goto, etc.)?

  • zend_optimize_block can do it because it knows the start opline and end opline of a zend_basic_block.
<?php
function main(bool $flag) {
    if ($flag) {
        echo "True";
    }
mid:
    echo "Suffix";
}
L0 (2):     CV0($flag) = RECV 1
L1 (3):     JMPZ CV0($flag) L3
L2 (4):     ECHO string("True")
L3 (7):     ECHO string("Suffix")
L4 (8):     RETURN null
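For intuition only, counting incoming control-flow edges per opline can be sketched as below (Python over a hypothetical `(opcode, jump_target)` IR, not the Zend CFG API). L3 above is reached both by fall-through from L2 and by the JMPZ at L1, which is exactly what makes blindly merging L2 and L3 unsafe:

```python
# Hypothetical sketch: count incoming control-flow edges per instruction.
# Each instruction is (opcode, jump_target_or_None); indices are opline numbers.

def predecessor_counts(instructions):
    counts = [0] * len(instructions)
    for i, (op, target) in enumerate(instructions):
        # Fall-through edge to the next instruction (unconditional jumps
        # and returns don't fall through).
        if op not in ("JMP", "RETURN") and i + 1 < len(instructions):
            counts[i + 1] += 1
        # Explicit jump edge (conditional jumps both fall through and jump).
        if target is not None:
            counts[target] += 1
    return counts

# The main() example above: JMPZ at L1 targets L3, and L2 falls through to L3.
code = [("RECV", None), ("JMPZ", 3), ("ECHO", None), ("ECHO", None), ("RETURN", None)]
print(predecessor_counts(code))  # [0, 1, 1, 2, 1]
```

Any opline with a predecessor count above one starts a new basic block, so a block-local optimization must not merge it with the instruction before it.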

@nikic
Member

nikic commented Jan 22, 2020

@TysonAndre The DFA pass also builds a CFG, so it should have access to the same information as the block pass. The CFG is stored here:

zend_cfg cfg; /* control flow graph */

TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Jan 26, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of php#5097
@TysonAndre
Contributor Author

Looking again, some of the optimizations in the block pass seem unsound to use the way this PR would start using them, such as merging ZEND_ECHO calls (not just less effective).

e.g. if $x = self::SOME_CONST, and echo $x; is used multiple times, then ZEND_OP1_LITERAL(last_op) might refer to the same literal id.

if (Z_TYPE(ZEND_OP1_LITERAL(opline)) != IS_STRING) {
	convert_to_string(&ZEND_OP1_LITERAL(opline));
}
if (Z_TYPE(ZEND_OP1_LITERAL(last_op)) != IS_STRING) {
	convert_to_string(&ZEND_OP1_LITERAL(last_op));
}
old_len = Z_STRLEN(ZEND_OP1_LITERAL(last_op));
l = old_len + Z_STRLEN(ZEND_OP1_LITERAL(opline));
if (!Z_REFCOUNTED(ZEND_OP1_LITERAL(last_op))) {
	zend_string *tmp = zend_string_alloc(l, 0);
	memcpy(ZSTR_VAL(tmp), Z_STRVAL(ZEND_OP1_LITERAL(last_op)), old_len);
	Z_STR(ZEND_OP1_LITERAL(last_op)) = tmp;
} else {
	Z_STR(ZEND_OP1_LITERAL(last_op)) = zend_string_extend(Z_STR(ZEND_OP1_LITERAL(last_op)), l, 0);
}

If I'm reading it right, that pass assumes that a given literal id for some opcode types is used in exactly one place, and that it's therefore safe to modify the constant's zval in place.
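The hazard is ordinary aliasing. As a minimal analogy (a Python list standing in for a shared zval literal; this is not the Zend data model), mutating a shared literal in place for one use silently changes the other use:

```python
# If two opcodes share one literal slot, mutating it in place while
# merging one ECHO silently changes what the other opcode prints.
shared_literal = ["True"]           # one literal slot, referenced twice
op_a = {"op": "ECHO", "lit": shared_literal}
op_b = {"op": "ECHO", "lit": shared_literal}

# "Merge" a following constant into op_a's literal in place...
op_a["lit"][0] += "Suffix"

# ...and op_b's operand has changed too.
print(op_b["lit"][0])  # TrueSuffix
```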

Also, I'm closing this PR because it isn't the way to go, and some of the work has already been started elsewhere. If you think any of these tasks from the PR discussion should get done, file an issue.

  • Converting echo's operand to a string is in progress
  • Combining echo calls after SCCP and other phases isn't started
  • Optimizing jumps/control flow isn't started
  • Figuring out which optimizations from the block pass (if it were run twice) would take effect on commonly used projects hasn't been done (e.g. comparing opcache.opt_debug_level=0x20000 output before/after this PR)

TysonAndre closed this Jan 26, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Jan 27, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of php#5097
php-pulls pushed a commit that referenced this pull request Jan 28, 2020
And filter out echoes of the empty string (e.g. false/null)

Split out of #5097 (on GitHub)

Closes GH-5118
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 1, 2020
Currently, it isn't possible to enable optimizations without enabling caching.
They should be orthogonal features - it's already possible to cache without
optimization passes (`opcache.optimization_level=0`)

Being forced to either enable caching or file_cache_only causes these problems:

- For file_cache_only, users would be forced to manage the file cache,
  and users may not be confident in properly handling all potential edge cases.

  (Running out of disk space, need to clear stale entries, concerns about opcode
  corruption not being fixed after restarting a process, etc)
- The shared memory block uses up a lot of RAM when individual PHP processes
  are known to be long-lived and don't have a common memory address space.
  (e.g. multiple long-lived php scripts managed by something that is not
  a php script (e.g. `supervisord`, daemons, tools run in the background in
  IDEs, etc))

  The opcodes are duplicated in shmem, which wastes memory in use cases
  where they're stored in the cache but never retrieved.

  The default for opcache.memory_consumption is 128M.
  Even when this isn't used, the virtual memory to track the shared
  memory (shmem) segment seems to add an extra 2MB per **independent** php process in
  "shared memory" segments, reducing the free RAM available for other processes.

  (Starting a large number of php CLI scripts that sleep() in a loop,
  `free` (the linux program to report free memory) reports that `shared`
  increases by 2MB per process without opcache.no_cache, but barely increases
  with opcache.no_cache=1.)

On an unrelated PR php#5097 (comment)
dstogov mentioned that

> Also, it would be great to move optimizer and JIT into core, to make them
> available even without opcode caching.

This PR is one potential approach for making the optimizer and JIT available
without opcode caching.

This is currently a proof of concept - feel free to point out combinations of
settings that should be rejected.

Potential areas for future work:

- Normally, opcache optimizes a file based only on that one file's contents.
  When `opcache.no_cache=1` is used, it may be possible to use all of the
  class, function, constant, etc. definitions parsed from previously parsed
  files (to eliminate dead code, inline function calls, etc).
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 16, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 16, 2020
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request May 20, 2020
Currently, it isn't possible to enable optimizations without enabling caching.
They should be orthogonal features - it's already possible to cache without
optimization passes (`opcache.optimization_level=0`)

(In use cases such as running a web server in apache,
it's very desirable to both cache and optimize opcodes.
For short-lived cli scripts, either file_cache or disabling opcache
(to reduce optimization overhead) is often useful.
In a few use cases, it's desirable to optimize without any caching)

Being forced to either enable shared memory caching or file_cache_only
causes these problems:

- **For file_cache_only, users would be forced to manage the file cache.**

  - Users may not be confident in properly handling all potential edge cases.
  - End users of applications may not be familiar with opcache.

  (Concerns include running out of disk space, needing to clear stale entries,
  concerns about opcode corruption or opcodes for the wrong file contents
  not being fixed after restarting a process, etc)
- **The shared memory block uses up a lot of RAM when individual PHP processes
  are known to be long-lived and don't have a common memory address space.**
  (e.g. multiple long-lived php scripts managed by something that is not
  a php script (e.g. `supervisord`, daemons, tools run in the background in
  IDEs, etc))

  The opcodes are duplicated in shmem, which wastes memory in use cases
  where opcodes are stored in the cache but never retrieved.

  The default for opcache.memory_consumption is 128M.
  Even when this isn't used, the virtual memory to track the shared
  memory (shmem) segment seems to add an extra 2MB per **independent** php process in
  "shared memory" segments, reducing the free RAM available for other processes.

  (Starting a large number of php CLI scripts that sleep() in a loop,
  `free` (the linux program to report free memory) reports that `shared`
  increases by 2MB per process without opcache.no_cache, but barely increases
  with opcache.no_cache=1.)

`opcache.no_cache` takes precedence over `opcache.file_cache*` settings.
(If enabled, neither the shared memory cache nor the file cache will be used.
It is an error to use both opcache.no_cache and opcache.preload.)

On an unrelated PR php#5097 (comment)
dstogov mentioned that

> Also, it would be great to move optimizer and JIT into core, to make them
> available even without opcode caching.

This PR is one potential approach for making the optimizer and JIT available
without opcode caching.

- Even if everything except caching were moved into core,
  it would still be useful to have a configurable way to disable opcode caching
  via a system ini setting (`php -d opcache.no_cache=1 my_script.php`)

This is currently a proof of concept - feel free to point out combinations of
settings that should be rejected.

Potential areas for future work:

- Normally, opcache optimizes a file based only on that one file's contents.
  When `opcache.no_cache=1` is used, it may be possible to use all of the
  class, function, constant, etc. definitions parsed from previously parsed
  files (to eliminate dead code, inline function calls, etc).