
Add retrying mechanism for IO tests in run-tests.php #10892


Closed · wants to merge 1 commit

Conversation

iluuu1994 (Member) commented Mar 20, 2023

This approach should minimize false negatives, since we only allow retrying tests whose output contains "timed out" or "connection refused", and only once. We also mark retried tests as WARN so we can catch false negatives if we notice the same test being retried over and over. One downside of this approach for file IO is that most of those tests contain a CLEAN section, and such tests are not repeatable. I'm not sure whether we could allow retrying them while still guaranteeing that the CLEAN section runs; it should be written to be executable under any condition anyway.

@bukka Let me know what you think.
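
For illustration, a minimal sketch of the kind of check described above, assuming a hypothetical helper inside run-tests.php (the function and variable names are invented here, not taken from the actual patch):

```php
<?php

// Hypothetical helper: decide whether a failed test may be re-run once.
// $output is the output of the failed run; $alreadyRetried tracks whether
// this test was retried before.
function test_may_be_retried(string $output, bool $alreadyRetried): bool
{
    if ($alreadyRetried) {
        return false; // allow a single retry only
    }
    // Only transient IO-style failures qualify for a retry.
    return stripos($output, 'timed out') !== false
        || stripos($output, 'connection refused') !== false;
}

// Usage sketch (pseudo-flow): on failure, retry once and downgrade a passing
// retry to WARN so repeated flakiness stays visible in the summary.
//
//   if ($result === 'FAIL' && test_may_be_retried($output, $alreadyRetried)) {
//       $alreadyRetried = true;
//       $result = run_test_again($testFile);   // hypothetical re-run call
//       if ($result === 'PASS') {
//           $result = 'WARN';                  // flag as flaky, not silently green
//       }
//   }
```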

iluuu1994 requested a review from bukka on March 20, 2023 23:48
iluuu1994 (Member, Author)

@bukka Thanks for the review! I'll keep this open for another day just in case somebody else has feedback.

iluuu1994 changed the base branch from master to PHP-8.1 on March 22, 2023 13:36
mvorisek (Contributor)

This PR might decrease CI quality. Imagine a test with a 5% failure rate (or 10k tests with a 5 ppm failure rate each). If a failed test is run twice, such a random failure goes from being detected in 1 in 20 runs to 1 in 400 runs. I know the retry can display a WARN in the log, but not many people read it; one example is #10838, which no one noticed or reported for half a year.

My question is whether "timeout CI failures" like https://github.com/php/php-src/actions/runs/4502125622/jobs/7923508402#step:13:132 can be fixed in a better way. From my experience they can, e.g. by using nice to give the (web) server a higher CPU priority than the client.
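
For what it's worth, a minimal sketch of that idea, assuming a test that starts the built-in web server itself; since raising a process's priority (a negative nice value) requires root, the sketch lowers the client's priority instead, which has the same relative effect. All paths and the port are illustrative:

```php
<?php

// Start a throwaway built-in web server for the test.
$server = proc_open(
    [PHP_BINARY, '-S', '127.0.0.1:8964', '-t', __DIR__ . '/docroot'],
    [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $serverPipes
);
usleep(200_000); // give the server a moment to bind

// Run the client half of the test at a lower CPU priority (nice +10),
// so the server is effectively favoured when the CI runner is under load.
$client = proc_open(
    ['nice', '-n', '10', PHP_BINARY, __DIR__ . '/client_request.php'],
    [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $clientPipes
);

echo stream_get_contents($clientPipes[1]);
proc_close($client);
proc_terminate($server);
```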

iluuu1994 (Member, Author)

Imagine a test with a 5% failure rate (or 10k tests with a 5 ppm failure rate each)

Unless there's memory corruption (for which we have ASAN and MSAN), such a test is highly unlikely, and it would be a useless test regardless of a retry mechanism.

My question is whether "timeout CI failures" like https://github.com/php/php-src/actions/runs/4502125622/jobs/7923508402#step:13:132 can be fixed in a better way. From my experience they can, e.g. by using nice to give the (web) server a higher CPU priority than the client.

It's not just curl; these kinds of issues can happen with any IO test (db, file access, network, etc.). If you can find a universal solution, great! For the time being I don't want to spend more time on this, but I also don't want to keep checking the nightly runs for failures that aren't actually failures.

iluuu1994 added a commit to iluuu1994/php-src that referenced this pull request Mar 27, 2023
Alternative to phpGH-10892. I'm not sure if this actually helps.
iluuu1994 added a commit to iluuu1994/php-src that referenced this pull request Mar 27, 2023
Alternative to phpGH-10892. This is somewhat unfortunate since these are also the
slow tests. I'm also not sure if this actually helps.
iluuu1994 added a commit that referenced this pull request Mar 27, 2023
Alternative to GH-10892. This is somewhat unfortunate since these are also the
slow tests. I'm also not sure if this actually helps.

Closes GH-10953
iluuu1994 (Member, Author)

I've now added an ext/curl/tests/CONFLICTS file containing all, and there are still spurious failures. I don't see any solution other than adding a retry mechanism, so I will most likely merge this next week.
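
For reference, such a CONFLICTS file is just a plain-text list of conflict keys read by run-tests.php, and, as far as I understand it, the all key prevents the tests in that directory from running in parallel with any other test. The file at ext/curl/tests/CONFLICTS would then contain just:

```
all
```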

mvorisek (Contributor) commented Apr 30, 2023

I checked the CI history and most of the timeout problems are gone, but there is another problem - https://github.com/php/php-src/actions/runs/4757306600/jobs/8453959070#step:13:136 - 404: Page Not Found.

And https://github.com/php/php-src/actions/runs/4761790709/jobs/8463358838#step:13:129 can indicate looping problems in copy optimization for example.

And in the past, the #9650 problem, for example, was occurring in about 1 in 10k runs; any retry mechanism (even one limited to "timed out" / "connection refused" output) will hide such a fairly major problem quite reliably. So this PR should really not be integrated, as CI quality could be degraded.

iluuu1994 (Member, Author)

I checked the CI history and most of the timeout problems are gone

If you look through the last nightly runs, there are still many failing jobs due to spurious errors (up to 4 jobs per workflow). I don't understand where the 404 errors are coming from; I can't reproduce them locally.

can indicate looping problems in copy optimization for example.

I don't understand what you mean, can you elaborate?

mvorisek (Contributor) commented May 1, 2023

If you look through the last nightly runs, there are still many failing jobs due to spurious errors (up to 4 jobs per workflow). I don't understand where the 404 errors are coming from; I can't reproduce them locally.

strace should be able to explain the problems

I looked into the nightly logs and it seems there are many problems.

I don't understand what you mean, can you elaborate?

Let's say there is a problem in 1 of 10k runs. With any retry mechanism, even with a single retry, the problem will be effectively suppressed.

iluuu1994 (Member, Author)

strace should be able to explain the problems

How, when the problem is not reproducible locally and only occurs very rarely in CI?

Let's say there is a problem in 1 of 10k runs. With any retry mechanism, even with a single retry, the problem will be effectively suppressed.

A test that fails 1 in 10,000 times is already completely useless for detecting said error.

mvorisek (Contributor) commented May 1, 2023

How, when the problem is not reproducible locally and only occurs very rarely in CI?

Adjust run-tests.php to log strace for each test and save the strace output as a CI artifact when a test fails.
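
A rough sketch of how that could be wired up, assuming a hypothetical wrapper around the per-test invocation; the paths and the idea of invoking the .phpt file directly are simplifications, not the real run-tests.php flow:

```php
<?php

// Hypothetical sketch: wrap a single test run with strace and keep the
// trace only if the test fails, so the CI workflow can upload it.
$testFile  = $argv[1];                       // e.g. ext/curl/tests/some_test.phpt
$traceFile = sys_get_temp_dir() . '/' . basename($testFile) . '.strace';

$cmd = 'strace -f -tt -o ' . escapeshellarg($traceFile) . ' '
     . escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($testFile);

exec($cmd, $outputLines, $exitCode);

if ($exitCode === 0) {
    @unlink($traceFile);                       // passing test: drop the trace
} else {
    rename($traceFile, $testFile . '.strace'); // keep it next to the .diff/.out files
}
```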

A test that is failing 1/10,000 times is already completely useless in detecting said error.

A 1-in-10k problem, when present in many tests, is still detectable currently, but you get the idea: with one retry, sensitivity would decrease quadratically.
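
To put numbers on that quadratic loss of sensitivity (illustrative figures only):

```php
<?php

// Chance that a rare flaky failure is still reported, with and without a retry
// that only counts a test as failed when it fails twice in a row.
$p = 1 / 10_000;                 // probability of a single run failing

printf("without retry:  1 in %.0f runs\n", 1 / $p);        // 1 in 10,000
printf("with one retry: 1 in %.0f runs\n", 1 / ($p * $p)); // 1 in 100,000,000
```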
