Skip to content

Fix GH-8065: opcache.consistency_checks > 0 causes segfaults in PHP >= 8.1.5 in fpm context (by reworking checksum skip list in a pragmatic way) #10624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

nielsdos
Copy link
Member

Fixes #8065 (see also for analysis)

/cc @iluuu1994 since you worked on this before.

Introduction

The class inheritance cache must be excluded from the checksum
computation because it may change (legally) in between the computation
and recomputation.
Furthermore the handler pointer in the zend_op structure may be modified
by the JIT, and the op_array->reserved[] could be changed as well.

Design

This approach is based on iluuu1994's original approach, with some
slight changes. First let's explain the main idea. The idea is to
keep a "skip list" of offsets to skip in the checksum computation.
This will include the offsets described above
(class inheritance cache, handler, reserved). These skips are of size
pointer_size (equals to the native pointer size).

It differs in the sense that this is a pragmatic approach. It was found
that in the original approach it was very difficult to figure out what
can be exactly changed when, and let to redundant computation of the CFG
for example. Furthermore, such a process is also error-prone and complex
to maintain. Instead of that, we just always skip the handler pointer
and the reserved area. The positive is that it leads to code which is
easier to understand and maintain. The negative is that we won't catch
corruptions in those memory areas. However, I believe this is acceptable
because of the following reasons:

  • We'd have to be very unlucky to only have corruptions in those
    areas. Usually, corruptions are not that well-targetted.
  • A checksum isn't a fully error-proof thing either, it's very possible
    that there is a corruption in other areas that the checksum doesn't
    capture.

Implementation

Let's now talk about the implementation. Keeping all those entries in a
skip list wouldn't be efficient: we'd need an entry for every opcode,
which would lead to many many entries. However, this approach actually
implements a "compression" scheme. Instead of just storing the offset,
we actually store a triple:
<offset, size of the area to be checked, amount of repetitions>.
The offset indicates which pointer needs to be skipped. If the number of
repetitions is greater than zero, it will do the following
repetitions times:

  • Checksum the bytes at
    [offset + pointer_size, offset + pointer_size + size of area to be checked]
  • Move the offset to after the area that was just checksummed +
    pointer_size.
    This allows us to only use one skip list entry for all the opcodes in an
    op_array.

The offsets also aren't checked in the zend_adler32() function itself,
the zend_accel_script_checksum() function will checksum in "blocks"
of bytes that should not be skipped, by setting the size for
zend_alder32() in such a way that it computes the checksum until the
next offset to skip. The main advantage of this is better performance,
since there are fewer checks, and the accelerated SSE2 (or future other
accelerated) routines can keep being used.

Finally some other minor differences:

  • We precompute the exact size needed for the skip list. This avoids
    reallocations and also fixes one issue where there was a warning about
    the computed size not being equals to the actual size.

Final words

I'm not sure about the last commit: it was said that the opcode handler can only be changed by the JIT, but what about extension? Can't they also change that? If so, we can just drop the last commit (which is the reason I put it in a separate commit)

Also documentation may need updating? (#8065 (comment))

@nielsdos nielsdos changed the title Fix GH-8065: Rework checksum skip list in a pragmatic way Fix GH-8065: opcache.consistency_checks > 0 causes segfaults in PHP >= 8.1.5 in fpm context (by reworking checksum skip list in a pragmatic way) Feb 19, 2023
@iluuu1994
Copy link
Member

@nielsdos Thank you for working on this! The approach looks like a good improvement over mine. I will look at this in detail tomorrow.

I'm not sure about the last commit: it was said that the opcode handler can only be changed by the JIT, but what about extension? Can't they also change that? If so, we can just drop the last commit (which is the reason I put it in a separate commit)

AFAIK, persistent script in shm was always considered immutable once it's shared with other processes (except for dynamic_members). This was completely true until inheritance cache and tracing JIT were added. I don't think there's technically anything stopping extensions from mutating shm if they lock it (and unprotect it if opcache.protect_memory is enabled), but in that case any memory is subject to being mutated, not just zend_op.handler.

Furthermore, the shm lock only protects from concurrent writes but not reads. This is because it is considered "add only", where reading existing scripts during a lock is fine as it should not be modified. Thus, any changes to existing scripts in shm would have to be atomic to avoid race conditions. So, any extension making changes to shm would have to be extremely careful.

I think the pragmatic approach for JIT (i.e. skipping opcodes completely) is good enough. This could be improved in the future if desired. At least with this approach the feature is not completely broken anymore.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 19, 2023

Just making sure we are on the same page: the opcodes are not skipped completely, only the handler field in zend_op is skipped (see also the section on compression in my opening post). So if there is corruption in any other field the checksum would still catch that :)

@iluuu1994
Copy link
Member

Ah great then 🙂 So we're skipping all handlers instead of just the ones potentially modified, that makes sense. That seems reasonable.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 19, 2023

Hmmmm there are failures in Windows CI: 4 FFI tests. Looking at them I don't see how those failures can be related to this PR though :/. Kinda hoping those are just some flaky tests... Sadly the assertion messages are not printed in that one job because it's compiled with ZEND_DEBUG=0, so I can't easily check that.

@iluuu1994
Copy link
Member

@nielsdos -1073741819 is STATUS_ACCESS_VIOLATION, so most likely legit. You can temporarily modify the CI config it that helps.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 19, 2023

@iluuu1994 Okay, I see. I pushed a commit to use ZEND_DEBUG=1 on Windows. Let's see what that gives, but that's for tomorrow to fix. It's probably something silly I missed.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 20, 2023

Looks like I hit an already-existing(?) bug on zts=1 opcache=1 debug=1 windows build, the test runner crashes with:

Assertion failed: (zval_gc_flags((class_name)->gc.u.type_info) & (1<<6)), file Zend\zend_enum.c, line 436

This is in the compilation step of that php script.
Makes it kinda difficult to debug my crash issue...

@nielsdos
Copy link
Member Author

I'm looking more into the assertion failure. What a weird assert to hit, there's only one code path towards that function and that one does intern the class name. I'm afraid this is some sort of memory corruption. I pushed a commit to at least print me the class name so I can get a clue if the string is corrupted.
In the worst case I'll set up a Windows VM ig :/

@iluuu1994
Copy link
Member

We still have an open PR for Windows on GitHub actions. Maybe we can integrate ASAN there. I'll look into that. @cmb69 might also be able to help but I think he's pretty busy lately.

@nielsdos
Copy link
Member Author

@iluuu1994 Great!

My previously pushed change adds a printf so that the class name is printed if the assert is hit. The assert is hit again, but the class name is not printed. So either printf is not outputting anything, or something else unexplainable is going on.

@iluuu1994
Copy link
Member

You might need an fflush.

@nielsdos
Copy link
Member Author

I used stderr instead of stdout and I get this now:

class name is not interned: ZendTestUnitEnum
class name is not interned: ZendTestUnitEnum
class name is not interned: ZendTestStringEnum
class name is not interned: ZendTestStringEnum
class name is not interned: ZendTestStringEnum
class name is not interned: ZendTestStringEnum

I commented out the assert and now I get a new one:

Assertion failed: !(zval_gc_flags((str)->gc.u.type_info) & (1<<7)), file Zend\zend_variables.c, line 66

I wonder if I really did something wrong in my PR which corrupts memory, or if this was already an issue somehow. It's my understanding though that at the point of these failures my added code isn't running yet.

@nielsdos
Copy link
Member Author

I pushed reverts of my fix to see if the problem already existed before my fixes.
Sorry for the spam and the messed up history, I'll clean the history later when we have the answers we need.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 20, 2023

Okay. The problem already existed apparently, because with my reverts the crash still occurs.
(link to report: https://ci.appveyor.com/project/php/php-src/builds/46287409/job/o63kdk2noux2s619)
We should make an issue report about this probably.
Now I also begin to wonder whether the original 4 failing tests are actually caused by my PR, or were also related to the Windows failures we are experiencing now.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 20, 2023

The non-opcache build shows the 3 of the 4 same failures as in my initial commit list. The difference is that now my commits are not applied at all. So the original problem also already existed (Although at least the memleak one might be only revealed due to being a debug build, whereas previously it was not a debug build. The other2 FFI tests are because of a missing DLL but that might be also debugging related?). Idk why this PR triggers it tho.

FFI failures this time:

FFI 301: FFI loading on Windows [C:\projects\php-src\ext\ffi\tests\301-win32.phpt]
FR #78270 (Usage of __vectorcall convention with FFI) [C:\projects\php-src\ext\ffi\tests\bug78270_2.phpt]
Bug #78714 (funcs returning pointer can't use call convention spec) [C:\projects\php-src\ext\ffi\tests\bug78714.phpt]

There's one other test that failed too this time.

@nielsdos nielsdos force-pushed the attempt-8065 branch 2 times, most recently from 007a4e6 to de48506 Compare February 22, 2023 08:57
@nielsdos
Copy link
Member Author

@iluuu1994 Good news: I set up a Windows VM and I managed to reproduce the FFI Windows crashes. They are stack corruptions apparently. Interestingly, they can already be triggered without my patches. I will spend some time investigating them.

@iluuu1994
Copy link
Member

@nielsdos Fantastic! Weird, not sure why this is triggered by your PR. As always, thank you for taking the time to look at this!

@nielsdos
Copy link
Member Author

@iluuu1994 I figured it out. The crash about stack corruption happens in the MSVC runtime libraries. However, this is a false positive error that was fixed in libffi upstream in 2021: libffi/libffi@f88add1.
Looking at the version of libffi we are using: https://github.com/winlibs/libffi this one is last updated in 2020, so it doesn't contain that fix. I cherry-picked that upstream commit on top of winlibs/libffi and every test works as expected now (it works both with and without this PR).
We should either cherry-pick that fix, or ideally update winlibs/libffi to the latest version of upstream libffi.

not sure why this is triggered by your PR

I'm not sure either. I know those stack corruption detections are probabilistic, so even things like stack address alignments do matter. I can consistently trigger the FFI test failures on my local machine both with and without my PR before I cherry-picked the upstream commit. After cherry-picking that commit, it all works perfectly.

@iluuu1994
Copy link
Member

Great! I think cherry-picking is enough for now. Unfortunately, I don't have access to that repo. Christoph does most of the Windows stuff, but he seems very busy lately. I'd try opening a PR with the cherry-picked commit and see if he responds.

@nielsdos
Copy link
Member Author

For reference, I created the PR: winlibs/libffi#6

Copy link
Member

@iluuu1994 iluuu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. This looks very good to me!

@nielsdos
Copy link
Member Author

nielsdos commented Mar 1, 2023

Sorry for the late review. This looks very good to me!

No worries, thanks for reviewing :)

I have a question though: what happens if a user upgrades their PHP 8.1 to the PHP version that will contain this PR? Won't the binary layout for the file cache for example change and cause problems? Or is there something in PHP that ensures that the caches are invalidated when updating PHP?
Note: it's been a while since I worked on this so maybe my question doesn't make sense :p

@iluuu1994
Copy link
Member

@nielsdos I think those are different code paths. We're just changing the layout in shm which can't survive different versions of PHP. I don't think we're modifying the layout of the file cache, although I have never looked at how that works in detail. Dmitry can verify in a final review.

nielsdos added 5 commits March 1, 2023 20:50
…P >= 8.1.5 in fpm context

Fixes phpGH-8065 (see also for analysis)

The class inheritance cache must be excluded from the checksum
computation because it may change (legally) in between the computation
and recomputation.
Furthermore the handler pointer in the zend_op structure may be modified
by the JIT, and the op_array->reserved[] could be changed as well.

This approach is based on iluuu1994's original approach, with some
slight changes. First let's explain the main idea. The idea is to
keep a "skip list" of offsets to skip in the checksum computation.
This will include the offsets described above
(class inheritance cache, handler, reserved). These skips are of size
pointer_size (equals to the native pointer size).

It differs in the sense that this is a pragmatic approach. It was found
that in the original approach it was very difficult to figure out what
can be exactly changed when, and let to redundant computation of the CFG
for example. Furthermore, such a process is also error-prone and complex
to maintain. Instead of that, we just always skip the handler pointer
and the reserved area. The positive is that it leads to code which is
easier to understand and maintain. The negative is that we won't catch
corruptions in those memory areas. However, I believe this is acceptable
because of the following reasons:
* We'd have to be very unlucky to *only* have corruptions in *those*
  areas. Usually, corruptions are not that well-targetted.
* A checksum isn't a fully error-proof thing either, it's very possible
  that there is a corruption in other areas that the checksum doesn't
  capture.

Let's now talk about the implementation. Keeping all those entries in a
skip list wouldn't be efficient: we'd need an entry for every opcode,
which would lead to many many entries. However, this approach actually
implements a "compression" scheme. Instead of just storing the offset,
we actually store a triple:
<offset, size of the area to be checked, amount of repetitions>.
The offset indicates which pointer needs to be skipped. If the number of
repetitions is greater than zero, it will do the following
`repetitions` times:
* Checksum the bytes at
  [offset + pointer_size, offset + pointer_size + size of area to be checked]
* Move the offset to after the area that was just checksummed +
  pointer_size.
This allows us to only use one skip list entry for all the opcodes in an
op_array.

The offsets also aren't checked in the zend_adler32() function itself,
the zend_accel_script_checksum() function will checksum in "blocks"
of bytes that should not be skipped, by setting the size for
zend_alder32() in such a way that it computes the checksum until the
next offset to skip. The main advantage of this is better performance,
since there are fewer checks, and the accelerated SSE2 (or future other
accelerated) routines can keep being used.

Finally some other minor differences:
* We precompute the exact size needed for the skip list. This avoids
  reallocations and also fixes one issue where there was a warning about
  the computed size not being equals to the actual size.
…nd on

The JIT is the only place where the handlers are overwritten.
@nielsdos
Copy link
Member Author

nielsdos commented Mar 1, 2023

Pushed the requested changes, thanks.

Copy link
Member

@iluuu1994 iluuu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thank you! Let's see what Dmitry thinks 🙂

@iluuu1994 iluuu1994 requested a review from dstogov March 1, 2023 20:19
@dstogov
Copy link
Member

dstogov commented Mar 2, 2023

Thanks. I'll play the with the patch and make a deep review on Monday.


static zend_accel_skip_list_entry *mem_checksum_skip_list_persist(void)
{
zend_sort(ZCG(mem_checksum_skip_list), ZCG(mem_checksum_skip_list_count), sizeof(zend_accel_skip_list_entry), (compare_func_t) mem_checksum_skip_list_compare, (swap_func_t) mem_checksum_skip_list_swap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need sorting? zend_persist.c stores data in a preallocated linear space, so the initial skip-list should be already sorted. I may be wrong. Please check this.

@iluuu1994
Copy link
Member

I found one more issue.

make test TESTS="-j18 -g FAIL,BORK -d opcache.enable_cli=1 -d opcache.consistency_checks=1 -d opcache.log_verbosity_level=2 -d opcache.protect_memory=1 ext/opcache"

Warning Internal error: wrong size calculation: %sinit_fcall_002.php

This happens for many of the JIT tests.

@@ -607,7 +622,9 @@ uint32_t zend_accel_script_persist_calc(zend_persistent_script *new_persistent_s
zend_persist_warnings_calc(
new_persistent_script->num_warnings, new_persistent_script->warnings);

new_persistent_script->corrupted = 0;
ADD_SIZE(ZCG(mem_checksum_skip_list_count) * sizeof(zend_accel_skip_list_entry));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should lead to request of memory amount to keep uncompressed skip list together with the script. Right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems, the compression doesn't reduce SHM memory consumption.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is the reason of "wrong size calculation" you just found.

@@ -1282,6 +1340,13 @@ zend_persistent_script *zend_accel_script_persist(zend_persistent_script *script
{
Bucket *p;

/* The skip list count is still set by the persist_calc routines, which always precedes a call to this function. */
ZCG(mem_checksum_skip_list) = safe_emalloc(ZCG(mem_checksum_skip_list_count), sizeof(zend_accel_skip_list_entry), 0);
Copy link
Member

@dstogov dstogov Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand the purpose of this allocation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we may store entries directly in SHM.

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping opcache.consistency_checks completely is going to be safer.

We may also try to make opcache.consistency_checks to be incompatible with JIT. This won't solve inheritance cache problem, but this could be fixed by turining zend_class_etry.zend_inheritance_cache_entry into zend_inheritance_cache_entry **inheritance_cache; and allocating the actual pointer in the memory area that is not included in checksum calculation.

@@ -123,6 +136,10 @@ typedef struct _zend_persistent_script {
void *mem; /* shared memory area used by script structures */
size_t size; /* size of used shared memory */

/* Some bytes must be skipped in the checksum calculation because they could've changed (legally)
* since the last checksum calculation. */
zend_accel_skip_list_entry *mem_checksum_skip_list;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this in the previous review, I don't think we can even do this in a patch due to the ABI break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean zend_inheritance_cache_entry **inheritance_cache;? We at least may try this for master. Fix for PHP-8.1 should keep binary compatibility, but I agree, it's dangerous. On the other hand targeting this PR to PHP-8.1 is also dangerous, because this is not a simple fix and it starts making changes even if opcache.consistency_checks == 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this is breaking since sizeof(zend_persistent_script) and offsetof(zend_persistent_script, dynamic_members) change. Maybe that's not the case for dynamic libraries, I don't know the details of what dynamic linking does.

@dstogov
Copy link
Member

dstogov commented Mar 6, 2023

I suggest disabling opcache.consistency_checks in PHP-8.1 and above.
Now it leads to crashes and it's too risky to commit complex changes like this into old stable branch.

if opcache.consistency_checks is set above zero, php should emit warning message and reset opcache.consistency_checks to zero.

Probably, we may provide some partial fix only for master or may decide to completely remove opcache.consistency_checks at all.

@nielsdos
Copy link
Member Author

nielsdos commented Mar 6, 2023

First of all, thank you both for the extensive review!
I agree with your comments that this is not acceptable given the ABI break, the old version and the complex changes.

I suggest disabling opcache.consistency_checks in PHP-8.1 and above. Now it leads to crashes and it's too risky to commit complex changes like this into old stable branch.

if opcache.consistency_checks is set above zero, php should emit warning message and reset opcache.consistency_checks to zero.

I completely agree with this. This feature is causing issues for end users and I think the protect_memory feature can be used in place of this to uphold some guarantee of consistency. It's probably best to disable this feature in the way you described.

Probably, we may provide some partial fix only for master or may decide to completely remove opcache.consistency_checks at all.

Personally I believe I would drop the feature as it would be either complex to fix, or would only be partially fixed (i.e. no JIT compatibility).

@nielsdos
Copy link
Member Author

nielsdos commented Mar 6, 2023

Also...
The file cache consistency checks are not broken as far as I know. So we could keep supporting that use case.

@iluuu1994
Copy link
Member

The file cache consistency checks are not broken as far as I know. So we could keep supporting that use case.

That makes sense, since the check is performed right after loading the info and the content into memory, before it gets a chance to be changed / get corrupted. I think this check is more useful than the shm one because other things on the system can actually modify the files and thus corrupt them. Corruption from PHP itself is still unlikely we're just reading but not mapping the file into memory.

@dstogov
Copy link
Member

dstogov commented Mar 6, 2023

@iluuu1994 @nielsdos Could you please take care about this disabling patch.

@iluuu1994
Copy link
Member

@dstogov Sure! @nielsdos Let me know if you're still interested in working on this. If not, I can take over (giving you credit for your work still, of course).

@nielsdos
Copy link
Member Author

nielsdos commented Mar 6, 2023

@iluuu1994 I'll give it a shot.

nielsdos added a commit to nielsdos/php-src that referenced this pull request Mar 6, 2023
This feature does not work right now and leads to memory leaks and other
problems. For analysis and discussion see phpGH-8065. In phpGH-10624 it was
decided to disable the feature to prevent problems for end users.
If end users which to get some consistency guarantees, they can rely on
opcache.protect_memory.
@nielsdos
Copy link
Member Author

nielsdos commented Mar 6, 2023

Replaced by GH-10798.

@nielsdos nielsdos closed this Mar 6, 2023
nielsdos added a commit to nielsdos/php-src that referenced this pull request Mar 6, 2023
This feature does not work right now and leads to memory leaks and other
problems. For analysis and discussion see phpGH-8065. In phpGH-10624 it was
decided to disable the feature to prevent problems for end users.
If end users which to get some consistency guarantees, they can rely on
opcache.protect_memory.
nielsdos added a commit that referenced this pull request Mar 7, 2023
…= 8.1.5 in fpm context

Disable opcache.consistency_checks.

This feature does not work right now and leads to memory leaks and other
problems. For analysis and discussion see GH-8065. In GH-10624 it was
decided to disable the feature to prevent problems for end users.
If end users which to get some consistency guarantees, they can rely on
opcache.protect_memory.

Closes GH-10798.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants