-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix GH-8065: opcache.consistency_checks > 0 causes segfaults in PHP >= 8.1.5 in fpm context (by reworking checksum skip list in a pragmatic way) #10624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c498f20
to
33481dd
Compare
@nielsdos Thank you for working on this! The approach looks like a good improvement over mine. I will look at this in detail tomorrow.
AFAIK, persistent script in shm was always considered immutable once it's shared with other processes (except for Furthermore, the shm lock only protects from concurrent writes but not reads. This is because it is considered "add only", where reading existing scripts during a lock is fine as it should not be modified. Thus, any changes to existing scripts in shm would have to be atomic to avoid race conditions. So, any extension making changes to shm would have to be extremely careful. I think the pragmatic approach for JIT (i.e. skipping opcodes completely) is good enough. This could be improved in the future if desired. At least with this approach the feature is not completely broken anymore. |
Just making sure we are on the same page: the opcodes are not skipped completely, only the handler field in zend_op is skipped (see also the section on compression in my opening post). So if there is corruption in any other field the checksum would still catch that :) |
Ah great then 🙂 So we're skipping all handlers instead of just the ones potentially modified, that makes sense. That seems reasonable. |
Hmmmm there are failures in Windows CI: 4 FFI tests. Looking at them I don't see how those failures can be related to this PR though :/. Kinda hoping those are just some flaky tests... Sadly the assertion messages are not printed in that one job because it's compiled with ZEND_DEBUG=0, so I can't easily check that. |
@nielsdos |
@iluuu1994 Okay, I see. I pushed a commit to use ZEND_DEBUG=1 on Windows. Let's see what that gives, but that's for tomorrow to fix. It's probably something silly I missed. |
cff606e
to
e9abad5
Compare
Looks like I hit an already-existing(?) bug on zts=1 opcache=1 debug=1 windows build, the test runner crashes with:
This is in the compilation step of that php script. |
I'm looking more into the assertion failure. What a weird assert to hit, there's only one code path towards that function and that one does intern the class name. I'm afraid this is some sort of memory corruption. I pushed a commit to at least print me the class name so I can get a clue if the string is corrupted. |
We still have an open PR for Windows on GitHub actions. Maybe we can integrate ASAN there. I'll look into that. @cmb69 might also be able to help but I think he's pretty busy lately. |
@iluuu1994 Great! My previously pushed change adds a printf so that the class name is printed if the assert is hit. The assert is hit again, but the class name is not printed. So either printf is not outputting anything, or something else unexplainable is going on. |
You might need an fflush. |
I used stderr instead of stdout and I get this now:
I commented out the assert and now I get a new one:
I wonder if I really did something wrong in my PR which corrupts memory, or if this was already an issue somehow. It's my understanding though that at the point of these failures my added code isn't running yet. |
I pushed reverts of my fix to see if the problem already existed before my fixes. |
Okay. The problem already existed apparently, because with my reverts the crash still occurs. |
The non-opcache build shows the 3 of the 4 same failures as in my initial commit list. The difference is that now my commits are not applied at all. So the original problem also already existed (Although at least the memleak one might be only revealed due to being a debug build, whereas previously it was not a debug build. The other2 FFI tests are because of a missing DLL but that might be also debugging related?). Idk why this PR triggers it tho. FFI failures this time:
There's one other test that failed too this time. |
007a4e6
to
de48506
Compare
@iluuu1994 Good news: I set up a Windows VM and I managed to reproduce the FFI Windows crashes. They are stack corruptions apparently. Interestingly, they can already be triggered without my patches. I will spend some time investigating them. |
@nielsdos Fantastic! Weird, not sure why this is triggered by your PR. As always, thank you for taking the time to look at this! |
@iluuu1994 I figured it out. The crash about stack corruption happens in the MSVC runtime libraries. However, this is a false positive error that was fixed in libffi upstream in 2021: libffi/libffi@f88add1.
I'm not sure either. I know those stack corruption detections are probabilistic, so even things like stack address alignments do matter. I can consistently trigger the FFI test failures on my local machine both with and without my PR before I cherry-picked the upstream commit. After cherry-picking that commit, it all works perfectly. |
Great! I think cherry-picking is enough for now. Unfortunately, I don't have access to that repo. Christoph does most of the Windows stuff, but he seems very busy lately. I'd try opening a PR with the cherry-picked commit and see if he responds. |
For reference, I created the PR: winlibs/libffi#6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. This looks very good to me!
No worries, thanks for reviewing :) I have a question though: what happens if a user upgrades their PHP 8.1 to the PHP version that will contain this PR? Won't the binary layout for the file cache for example change and cause problems? Or is there something in PHP that ensures that the caches are invalidated when updating PHP? |
@nielsdos I think those are different code paths. We're just changing the layout in shm which can't survive different versions of PHP. I don't think we're modifying the layout of the file cache, although I have never looked at how that works in detail. Dmitry can verify in a final review. |
…P >= 8.1.5 in fpm context Fixes phpGH-8065 (see also for analysis) The class inheritance cache must be excluded from the checksum computation because it may change (legally) in between the computation and recomputation. Furthermore the handler pointer in the zend_op structure may be modified by the JIT, and the op_array->reserved[] could be changed as well. This approach is based on iluuu1994's original approach, with some slight changes. First let's explain the main idea. The idea is to keep a "skip list" of offsets to skip in the checksum computation. This will include the offsets described above (class inheritance cache, handler, reserved). These skips are of size pointer_size (equals to the native pointer size). It differs in the sense that this is a pragmatic approach. It was found that in the original approach it was very difficult to figure out what can be exactly changed when, and let to redundant computation of the CFG for example. Furthermore, such a process is also error-prone and complex to maintain. Instead of that, we just always skip the handler pointer and the reserved area. The positive is that it leads to code which is easier to understand and maintain. The negative is that we won't catch corruptions in those memory areas. However, I believe this is acceptable because of the following reasons: * We'd have to be very unlucky to *only* have corruptions in *those* areas. Usually, corruptions are not that well-targetted. * A checksum isn't a fully error-proof thing either, it's very possible that there is a corruption in other areas that the checksum doesn't capture. Let's now talk about the implementation. Keeping all those entries in a skip list wouldn't be efficient: we'd need an entry for every opcode, which would lead to many many entries. However, this approach actually implements a "compression" scheme. Instead of just storing the offset, we actually store a triple: <offset, size of the area to be checked, amount of repetitions>. The offset indicates which pointer needs to be skipped. If the number of repetitions is greater than zero, it will do the following `repetitions` times: * Checksum the bytes at [offset + pointer_size, offset + pointer_size + size of area to be checked] * Move the offset to after the area that was just checksummed + pointer_size. This allows us to only use one skip list entry for all the opcodes in an op_array. The offsets also aren't checked in the zend_adler32() function itself, the zend_accel_script_checksum() function will checksum in "blocks" of bytes that should not be skipped, by setting the size for zend_alder32() in such a way that it computes the checksum until the next offset to skip. The main advantage of this is better performance, since there are fewer checks, and the accelerated SSE2 (or future other accelerated) routines can keep being used. Finally some other minor differences: * We precompute the exact size needed for the skip list. This avoids reallocations and also fixes one issue where there was a warning about the computed size not being equals to the actual size.
…nd on The JIT is the only place where the handlers are overwritten.
Pushed the requested changes, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thank you! Let's see what Dmitry thinks 🙂
Thanks. I'll play the with the patch and make a deep review on Monday. |
|
||
static zend_accel_skip_list_entry *mem_checksum_skip_list_persist(void) | ||
{ | ||
zend_sort(ZCG(mem_checksum_skip_list), ZCG(mem_checksum_skip_list_count), sizeof(zend_accel_skip_list_entry), (compare_func_t) mem_checksum_skip_list_compare, (swap_func_t) mem_checksum_skip_list_swap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need sorting? zend_persist.c
stores data in a preallocated linear space, so the initial skip-list should be already sorted. I may be wrong. Please check this.
I found one more issue.
This happens for many of the JIT tests. |
@@ -607,7 +622,9 @@ uint32_t zend_accel_script_persist_calc(zend_persistent_script *new_persistent_s | |||
zend_persist_warnings_calc( | |||
new_persistent_script->num_warnings, new_persistent_script->warnings); | |||
|
|||
new_persistent_script->corrupted = 0; | |||
ADD_SIZE(ZCG(mem_checksum_skip_list_count) * sizeof(zend_accel_skip_list_entry)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should lead to request of memory amount to keep uncompressed skip list together with the script. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems, the compression doesn't reduce SHM memory consumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is the reason of "wrong size calculation" you just found.
@@ -1282,6 +1340,13 @@ zend_persistent_script *zend_accel_script_persist(zend_persistent_script *script | |||
{ | |||
Bucket *p; | |||
|
|||
/* The skip list count is still set by the persist_calc routines, which always precedes a call to this function. */ | |||
ZCG(mem_checksum_skip_list) = safe_emalloc(ZCG(mem_checksum_skip_list_count), sizeof(zend_accel_skip_list_entry), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand the purpose of this allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, we may store entries directly in SHM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping opcache.consistency_checks
completely is going to be safer.
We may also try to make opcache.consistency_checks
to be incompatible with JIT. This won't solve inheritance cache problem, but this could be fixed by turining zend_class_etry.zend_inheritance_cache_entry
into zend_inheritance_cache_entry **inheritance_cache;
and allocating the actual pointer in the memory area that is not included in checksum calculation.
@@ -123,6 +136,10 @@ typedef struct _zend_persistent_script { | |||
void *mem; /* shared memory area used by script structures */ | |||
size_t size; /* size of used shared memory */ | |||
|
|||
/* Some bytes must be skipped in the checksum calculation because they could've changed (legally) | |||
* since the last checksum calculation. */ | |||
zend_accel_skip_list_entry *mem_checksum_skip_list; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this in the previous review, I don't think we can even do this in a patch due to the ABI break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean zend_inheritance_cache_entry **inheritance_cache;
? We at least may try this for master. Fix for PHP-8.1 should keep binary compatibility, but I agree, it's dangerous. On the other hand targeting this PR to PHP-8.1 is also dangerous, because this is not a simple fix and it starts making changes even if opcache.consistency_checks == 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is breaking since sizeof(zend_persistent_script)
and offsetof(zend_persistent_script, dynamic_members)
change. Maybe that's not the case for dynamic libraries, I don't know the details of what dynamic linking does.
I suggest disabling if Probably, we may provide some partial fix only for master or may decide to completely remove |
First of all, thank you both for the extensive review!
I completely agree with this. This feature is causing issues for end users and I think the
Personally I believe I would drop the feature as it would be either complex to fix, or would only be partially fixed (i.e. no JIT compatibility). |
Also... |
That makes sense, since the check is performed right after loading the info and the content into memory, before it gets a chance to be changed / get corrupted. I think this check is more useful than the shm one because other things on the system can actually modify the files and thus corrupt them. Corruption from PHP itself is still unlikely we're just reading but not mapping the file into memory. |
@iluuu1994 @nielsdos Could you please take care about this disabling patch. |
@iluuu1994 I'll give it a shot. |
This feature does not work right now and leads to memory leaks and other problems. For analysis and discussion see phpGH-8065. In phpGH-10624 it was decided to disable the feature to prevent problems for end users. If end users which to get some consistency guarantees, they can rely on opcache.protect_memory.
Replaced by GH-10798. |
This feature does not work right now and leads to memory leaks and other problems. For analysis and discussion see phpGH-8065. In phpGH-10624 it was decided to disable the feature to prevent problems for end users. If end users which to get some consistency guarantees, they can rely on opcache.protect_memory.
…= 8.1.5 in fpm context Disable opcache.consistency_checks. This feature does not work right now and leads to memory leaks and other problems. For analysis and discussion see GH-8065. In GH-10624 it was decided to disable the feature to prevent problems for end users. If end users which to get some consistency guarantees, they can rely on opcache.protect_memory. Closes GH-10798.
Fixes #8065 (see also for analysis)
/cc @iluuu1994 since you worked on this before.
Introduction
The class inheritance cache must be excluded from the checksum
computation because it may change (legally) in between the computation
and recomputation.
Furthermore the handler pointer in the zend_op structure may be modified
by the JIT, and the op_array->reserved[] could be changed as well.
Design
This approach is based on iluuu1994's original approach, with some
slight changes. First let's explain the main idea. The idea is to
keep a "skip list" of offsets to skip in the checksum computation.
This will include the offsets described above
(class inheritance cache, handler, reserved). These skips are of size
pointer_size (equals to the native pointer size).
It differs in the sense that this is a pragmatic approach. It was found
that in the original approach it was very difficult to figure out what
can be exactly changed when, and let to redundant computation of the CFG
for example. Furthermore, such a process is also error-prone and complex
to maintain. Instead of that, we just always skip the handler pointer
and the reserved area. The positive is that it leads to code which is
easier to understand and maintain. The negative is that we won't catch
corruptions in those memory areas. However, I believe this is acceptable
because of the following reasons:
areas. Usually, corruptions are not that well-targetted.
that there is a corruption in other areas that the checksum doesn't
capture.
Implementation
Let's now talk about the implementation. Keeping all those entries in a
skip list wouldn't be efficient: we'd need an entry for every opcode,
which would lead to many many entries. However, this approach actually
implements a "compression" scheme. Instead of just storing the offset,
we actually store a triple:
<offset, size of the area to be checked, amount of repetitions>.
The offset indicates which pointer needs to be skipped. If the number of
repetitions is greater than zero, it will do the following
repetitions
times:[offset + pointer_size, offset + pointer_size + size of area to be checked]
pointer_size.
This allows us to only use one skip list entry for all the opcodes in an
op_array.
The offsets also aren't checked in the zend_adler32() function itself,
the zend_accel_script_checksum() function will checksum in "blocks"
of bytes that should not be skipped, by setting the size for
zend_alder32() in such a way that it computes the checksum until the
next offset to skip. The main advantage of this is better performance,
since there are fewer checks, and the accelerated SSE2 (or future other
accelerated) routines can keep being used.
Finally some other minor differences:
reallocations and also fixes one issue where there was a warning about
the computed size not being equals to the actual size.
Final words
I'm not sure about the last commit: it was said that the opcode handler can only be changed by the JIT, but what about extension? Can't they also change that? If so, we can just drop the last commit (which is the reason I put it in a separate commit)
Also documentation may need updating? (#8065 (comment))