-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Add retrying mechanism for IO tests in run-tests.php #10892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@bukka Thanks for the review! I'll keep this open for another day just in case somebody else has feedback. |
3131b67
to
50b9e92
Compare
This PR might decrease the CI quality. Imagine a test with failure 5% failure rate (or 10k tests with 5ppm failure rate). If they are retried twice, instead of such random failure being detected in 1 in 20 runs, in will be detected in 1 in 400 tests. I know it can display WARN log, but not many people read it, one example is #10838 which noone noticed/reported for 1/2 year. My question is if "timeout CI failures" like https://github.com/php/php-src/actions/runs/4502125622/jobs/7923508402#step:13:132 can be fixed better. From my experience, they can, like using |
Unless there's memory corruption (for which he have ASAN and MSAN) such a test is highly unlikely and would be a useless test regardless of a retry mechanism.
It's not just curl, these kinds of issues can happen with any IO test (db, file access, network, etc). If you can find a universal solution then great! But for the time being I don't want to spend more time on this, but I also don't want to keep checking nightly for failures that aren't actually failures. |
Alternative to phpGH-10892. I'm not sure if this actually helps.
Alternative to phpGH-10892. This is somewhat unfortunate since these are also the slow tests. I'm also not sure if this actually helps.
I've now added a |
I checked the CI history and most of the timeouts problems are gone, but there is other problem - https://github.com/php/php-src/actions/runs/4757306600/jobs/8453959070#step:13:136 - And https://github.com/php/php-src/actions/runs/4761790709/jobs/8463358838#step:13:129 can indicate looping problems in copy optimization for example. And in the past #9650 problem for example was occuring about in 1 in 10k runs, any retry mechanism (even with |
If you look through the last nightly runs, there are still many failing jobs due to spurious errors (up to 4 jobs per workflow). I don't understand where the 404 errors are coming from, I can't reproduce them locally.
I don't understand what you mean, can you elaborate? |
I looked into the nightly logs and it seems there are many problems.
Let say there is a problem in 1 of 10k runs. With any retry mechanism, even with 1 retry, the problem will be effectively supressed. |
How, when the problem is not reproducible locally and very rarely in CI?
A test that is failing 1/10,000 times is already completely useless in detecting said error. |
adjust run-tests.php to log strace for each test and save strace in CI artifacts in case a test failed
1/10k problem when present in many tests is still detectable currently, but you get the idea, sensitivity with 1 retry would decrease by a square. |
This approach should minimize false negatives since we only allow retrying tests that contain "timed out" or "connection refused", and just once. We're also marking them as
WARN
to catch false negatives if we notice that the same test keeps being repeated. One downside to this approach for file IO is that most of those contain aCLEAN
section, but those tests are not repeatable. I'm not sure if we could allow this and guarantee running theCLEAN
section, which should be written to be executable under any condition anyway.@bukka Let me know what you think.