
Add retrying mechanism for IO tests in run-tests.php #10892


Closed · wants to merge 1 commit

Conversation

iluuu1994 (Member) commented Mar 20, 2023

This approach should minimize false negatives, since we only allow retrying tests whose output contains "timed out" or "connection refused", and only once. We also mark retried tests as WARN so we can catch false negatives if we notice the same test being retried over and over. One downside of this approach for file IO is that most of those tests contain a CLEAN section, and such tests are not repeatable. I'm not sure whether we could allow retrying them while still guaranteeing that the CLEAN section runs; it should be written to be executable under any condition anyway.

@bukka Let me know what you think.
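
For illustration, a minimal sketch of the kind of check described above, assuming a hypothetical helper inside run-tests.php (the function and variable names are invented here, not taken from the actual patch):

```php
<?php

// Hypothetical helper: decide whether a failed test may be re-run once.
// $output is the output of the failed run; $alreadyRetried tracks whether
// this test was retried before.
function test_may_be_retried(string $output, bool $alreadyRetried): bool
{
    if ($alreadyRetried) {
        return false; // allow a single retry only
    }
    // Only transient IO-style failures qualify for a retry.
    return stripos($output, 'timed out') !== false
        || stripos($output, 'connection refused') !== false;
}

// Usage sketch (pseudo-flow): on failure, retry once and downgrade a passing
// retry to WARN so repeated flakiness stays visible in the summary.
//
//   if ($result === 'FAIL' && test_may_be_retried($output, $alreadyRetried)) {
//       $alreadyRetried = true;
//       $result = run_test_again($testFile);   // hypothetical re-run call
//       if ($result === 'PASS') {
//           $result = 'WARN';                  // flag as flaky, not silently green
//       }
//   }
```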

iluuu1994 requested a review from bukka on March 20, 2023 23:48
iluuu1994 (Member, Author)

@bukka Thanks for the review! I'll keep this open for another day just in case somebody else has feedback.

iluuu1994 changed the base branch from master to PHP-8.1 on March 22, 2023 13:36
mvorisek (Contributor)

This PR might decrease CI quality. Imagine a test with a 5% failure rate (or 10k tests with a 5 ppm failure rate each). If a failed test is run twice, such a random failure goes from being detected in 1 in 20 runs to 1 in 400 runs. I know the retry can display a WARN in the log, but not many people read it; one example is #10838, which no one noticed or reported for half a year.

My question is whether "timeout CI failures" like https://github.com/php/php-src/actions/runs/4502125622/jobs/7923508402#step:13:132 can be fixed in a better way. From my experience they can, e.g. by using nice to give the (web) server a higher CPU priority than the client.
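
For what it's worth, a minimal sketch of that idea, assuming a test that starts the built-in web server itself; since raising a process's priority (a negative nice value) requires root, the sketch lowers the client's priority instead, which has the same relative effect. All paths and the port are illustrative:

```php
<?php

// Start a throwaway built-in web server for the test.
$server = proc_open(
    [PHP_BINARY, '-S', '127.0.0.1:8964', '-t', __DIR__ . '/docroot'],
    [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $serverPipes
);
usleep(200_000); // give the server a moment to bind

// Run the client half of the test at a lower CPU priority (nice +10),
// so the server is effectively favoured when the CI runner is under load.
$client = proc_open(
    ['nice', '-n', '10', PHP_BINARY, __DIR__ . '/client_request.php'],
    [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $clientPipes
);

echo stream_get_contents($clientPipes[1]);
proc_close($client);
proc_terminate($server);
```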

iluuu1994 (Member, Author)

Imagine a test with a 5% failure rate (or 10k tests with a 5 ppm failure rate each)

Unless there's memory corruption (for which we have ASAN and MSAN), such a test is highly unlikely, and it would be a useless test regardless of a retry mechanism.

My question is whether "timeout CI failures" like https://github.com/php/php-src/actions/runs/4502125622/jobs/7923508402#step:13:132 can be fixed in a better way. From my experience they can, e.g. by using nice to give the (web) server a higher CPU priority than the client.

It's not just curl; these kinds of issues can happen with any IO test (db, file access, network, etc.). If you can find a universal solution, great! For the time being I don't want to spend more time on this, but I also don't want to keep checking the nightly runs for failures that aren't actually failures.

iluuu1994 added a commit to iluuu1994/php-src that referenced this pull request Mar 27, 2023
Alternative to phpGH-10892. I'm not sure if this actually helps.
iluuu1994 added a commit to iluuu1994/php-src that referenced this pull request Mar 27, 2023
Alternative to phpGH-10892. This is somewhat unfortunate since these are also the
slow tests. I'm also not sure if this actually helps.
iluuu1994 added a commit that referenced this pull request Mar 27, 2023
Alternative to GH-10892. This is somewhat unfortunate since these are also the
slow tests. I'm also not sure if this actually helps.

Closes GH-10953
iluuu1994 (Member, Author)

I've now added an ext/curl/tests/CONFLICTS file containing all, and there are still spurious failures. I don't see any solution other than adding a retry mechanism, so I will most likely merge this next week.
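
For reference, such a CONFLICTS file is just a plain-text list of conflict keys read by run-tests.php, and, as far as I understand it, the all key prevents the tests in that directory from running in parallel with any other test. The file at ext/curl/tests/CONFLICTS would then contain just:

```
all
```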

mvorisek (Contributor) commented Apr 30, 2023

I checked the CI history and most of the timeout problems are gone, but there is another problem - https://github.com/php/php-src/actions/runs/4757306600/jobs/8453959070#step:13:136 - 404: Page Not Found.

And https://github.com/php/php-src/actions/runs/4761790709/jobs/8463358838#step:13:129 can indicate looping problems in copy optimization for example.

And in the past, the #9650 problem, for example, was occurring in about 1 in 10k runs; any retry mechanism (even one limited to "timed out" / "connection refused" output) will hide such a fairly major problem quite reliably. So this PR should really not be integrated, as CI quality could be degraded.

iluuu1994 (Member, Author)

I checked the CI history and most of the timeout problems are gone

If you look through the last nightly runs, there are still many failing jobs due to spurious errors (up to 4 jobs per workflow). I don't understand where the 404 errors are coming from; I can't reproduce them locally.

can indicate looping problems in copy optimization for example.

I don't understand what you mean, can you elaborate?

mvorisek (Contributor) commented May 1, 2023

If you look through the last nightly runs, there are still many failing jobs due to spurious errors (up to 4 jobs per workflow). I don't understand where the 404 errors are coming from; I can't reproduce them locally.

strace should be able to explain the problems

I looked into the nightly logs and it seems there are many problems.

I don't understand what you mean, can you elaborate?

Let's say there is a problem in 1 of 10k runs. With any retry mechanism, even with a single retry, the problem will be effectively suppressed.

iluuu1994 (Member, Author)

strace should be able to explain the problems

How, when the problem is not reproducible locally and only occurs very rarely in CI?

Let's say there is a problem in 1 of 10k runs. With any retry mechanism, even with a single retry, the problem will be effectively suppressed.

A test that fails 1 in 10,000 times is already completely useless for detecting said error.

mvorisek (Contributor) commented May 1, 2023

How, when the problem is not reproducible locally and only occurs very rarely in CI?

Adjust run-tests.php to log strace for each test and save the strace output as a CI artifact when a test fails.
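
A rough sketch of how that could be wired up, assuming a hypothetical wrapper around the per-test invocation; the paths and the idea of invoking the .phpt file directly are simplifications, not the real run-tests.php flow:

```php
<?php

// Hypothetical sketch: wrap a single test run with strace and keep the
// trace only if the test fails, so the CI workflow can upload it.
$testFile  = $argv[1];                       // e.g. ext/curl/tests/some_test.phpt
$traceFile = sys_get_temp_dir() . '/' . basename($testFile) . '.strace';

$cmd = 'strace -f -tt -o ' . escapeshellarg($traceFile) . ' '
     . escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($testFile);

exec($cmd, $outputLines, $exitCode);

if ($exitCode === 0) {
    @unlink($traceFile);                       // passing test: drop the trace
} else {
    rename($traceFile, $testFile . '.strace'); // keep it next to the .diff/.out files
}
```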

A test that is failing 1/10,000 times is already completely useless in detecting said error.

A 1-in-10k problem, when present in many tests, is still detectable currently, but you get the idea: with one retry, sensitivity would decrease quadratically.
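
To put numbers on that quadratic loss of sensitivity (illustrative figures only):

```php
<?php

// Chance that a rare flaky failure is still reported, with and without a retry
// that only counts a test as failed when it fails twice in a row.
$p = 1 / 10_000;                 // probability of a single run failing

printf("without retry:  1 in %.0f runs\n", 1 / $p);        // 1 in 10,000
printf("with one retry: 1 in %.0f runs\n", 1 / ($p * $p)); // 1 in 100,000,000
```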
