
Remove nightly builds of external extensions #16100


Closed
wants to merge 1 commit

Conversation

cmb69
Member

@cmb69 cmb69 commented Sep 28, 2024

While the idea of noticing inadvertent API breaks early by building some external extensions is nice in theory, it doesn't make much sense in practice:

  • we would need to check many more extensions (e.g. parallel and uopz come to mind)
  • we should also check external SAPIs for better coverage
  • we should also run these builds on platforms other than Linux for better coverage
  • if we introduce a deliberate API break, we need to disable the breaking extensions, and to re-enable the builds once the issue is fixed downstream (e.g. the redis build was disabled in January, but apparently it has since been fixed without us noticing)
  • all the extensions we currently test have their own GH actions set up; these jobs are likely better suited to notice API breaks, since they can run the respective tests
  • if an extension build breaks due to downstream issues, we're probably puzzled about what's going on, while downstream maintainers likely know what broke and how to fix it
  • we would need nightly runs which are at least occasionally green; otherwise it is not very likely that anyone even looks at individual issues there (the last green nightly was more than a month ago)

All in all, the costs of actually doing these builds seem to outweigh the benefits. As such, we drop the nightly PECL builds.


Note that I'm not sure which branch this needs to target. But maybe we want to keep these builds anyway, in which case it wouldn't matter.

@cmb69 cmb69 requested a review from TimWolla as a code owner September 28, 2024 14:13
@cmb69 cmb69 requested review from iluuu1994 and removed request for TimWolla September 28, 2024 14:14
@iluuu1994
Member

iluuu1994 commented Sep 29, 2024

I don't have a big preference either way.

we would need to check many more extensions (e.g. parallel and uopz come to mind)

Sure, but "there could be more tests" is not a great argument for dropping the existing tests.

all the extensions we currently test have their own GH actions set up; these jobs are likely better suited to notice API breaks, since they can run the respective tests

That's true, but do they test against master? They might not even test against RC.

we would need nightly runs which are at least occasionally green

It was until we branched 8.4. I already notified Derick of the failure when it happened (he's on holiday until next week), and asked if we can prevent it from occurring every year.

@cmb69
Member Author

cmb69 commented Sep 29, 2024

Sure, but "there could be more tests" is not a great argument for dropping the existing tests.

Fair enough.

That's true, but do they test against master? They might not even test against RC.

Ah, right, very good point regarding testing dev (RC should be done by extensions, though).

It was until we branched 8.4. I already notified Derick of the failure when it happened (he's on holiday until next week), and asked if we can prevent it from occurring every year.

Thank you! However, with this I was referring to the overall nightly workflow (not the PECL job in particular). The more jobs fail, the less likely it is that anyone looks into them, I'm afraid.

After reconsideration, I don't think this single job is a real problem, so I'm closing this PR.

@cmb69 cmb69 closed this Sep 29, 2024
@cmb69 cmb69 deleted the cmb/drop-pecl-builds branch September 29, 2024 09:47
@iluuu1994
Member

However, with this I was referring to the overall nightly workflow (not the PECL job in particular). The more jobs fail, the less likely it is that anyone looks into them, I'm afraid.

I completely agree. I've essentially been struggling with keeping nightly green since it was moved to GitHub. Apart from the obvious issues, the macOS build sometimes fails with this error:

The hosted runner: GitHub Actions 7 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I haven't found a solution to this problem, and unfortunately it's something GitHub support couldn't help me with. I already freed one CPU core, so that shouldn't be the cause. It might be a particular test, or a set thereof, hogging memory or network. Unfortunately, the runner crashes without the output log being available, so it's not easy to identify which test is to blame.

Anyway, long story short, I look at nightly daily because of these intermittent errors. If we can find a solution for them, I would of course be very happy. It's also the reason I'm currently not logging this to some public channel on Slack. If the build were more reliable, that would be an option, so more people could look at it and I wouldn't have to ping them manually. 🙂

@cmb69
Member Author

cmb69 commented Sep 29, 2024

I haven't found a solution to this problem, and unfortunately it's something GitHub support couldn't help me with. I already freed one CPU core, so that shouldn't be the cause. It might be a particular test, or a set thereof, hogging memory or network. Unfortunately, the runner crashes without the output log being available, so it's not easy to identify which test is to blame.

Looks like this is a common problem with self-hosted runners, but maybe it is a GH issue even for hosted runners. On the other hand, does this also happen for our push workflow? If not, maybe we're running too many nightly jobs, or it's just a bad time.

Anyhow, I'm not too concerned about an occasionally failing job (of course, that's annoying, though). But when I looked at yesterday's nightlies, I was a bit shocked to see so many failing jobs. PR #16107 and PR #16097 should be a good step. Now I was looking at https://github.com/php/php-src/actions/runs/11088570760, and was surprised that only a few jobs had failed. On closer investigation I noticed that only branches with new commits are tested. In my opinion, that's not the best idea, since jobs may fail for other reasons (see e.g. PR #16107), and when a new release of a security branch is made, all of a sudden CI may be completely red. In my opinion, we need to keep CI green even for security branches.

And I'm completely aware that the two paragraphs are contradictory ("too many jobs" vs. "always run everything"). Maybe a solution is to have alternating nightlies (run one half of the jobs on one day, and the other half the next day); that might still be good enough to catch regressions.
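
For illustration, a rough sketch of how alternating nightlies could look, assuming a single nightly workflow with two cron schedules and each job guarded by which schedule fired (the job names and build commands are made up, not taken from the actual workflow):

```yaml
# hypothetical excerpt of a nightly workflow
on:
  schedule:
    - cron: '0 1 * * 1,3,5'   # Mon/Wed/Fri: first half of the jobs
    - cron: '0 1 * * 2,4,6'   # Tue/Thu/Sat: second half of the jobs

jobs:
  linux-debug:
    # github.event.schedule contains the cron expression that triggered the run
    if: github.event.schedule == '0 1 * * 1,3,5'
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: ./buildconf && ./configure --enable-debug && make -j"$(nproc)"

  macos-debug:
    if: github.event.schedule == '0 1 * * 2,4,6'
    runs-on: macos-13
    steps:
      - uses: actions/checkout@v4
      - run: ./buildconf && ./configure --enable-debug && make -j"$(sysctl -n hw.ncpu)"
```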

@iluuu1994
Member

On the other hand, does this also happen for our push workflow?

I've seen it on occasion, but it's more noticeable in nightly because it runs many more jobs.

In my opinion, that's not the best idea, since jobs may fail for other reasons (see e.g. PR #16107), and when a new release of a security branch is made, all of a sudden CI may be completely red

Nightly is already a ton of work. Keeping security branches green all the time (not just when we do a security release) essentially means 5 years of support per branch (1 year of development, 2 of active support, 2 of security support). Since CI branches often diverge, it's not just a matter of merging from one branch lower. This is especially problematic when actual code changes are required to keep the build green, as we're not supposed to do that in security branches. Even Ubuntu LTS support will run out for every other PHP release.

Also note that with security support now being extended, we would essentially be going from 3 branches to 5 tested in nightly, increasing the number of jobs by 66%.

@cmb69
Member Author

cmb69 commented Sep 30, 2024

That's true, but do they test against master? They might not even test against RC.

Ah, right, very good point regarding testing dev (RC should be done by extensions, though).

I've noticed that none of the extensions run CI for PHP 8.4 on Windows (some don't even for other OSs); I've submitted a PR about that.

Nightly is already a ton of work.

I know. Thanks for working on that! However, there were 16:0 votes on https://wiki.php.net/rfc/travis_ci, and 34:0 on https://wiki.php.net/rfc/release_cycle_update, and apparently almost none of the voters care about that. Sad.

@iluuu1994
Member

I wasn't aware of the Travis CI RFC/policy. It does say:

The policy would apply for all branches in active development or maintenance, starting with PHP 5.5.

Which doesn't seem to include security branches.

Don't get me wrong, I can see your point. Of course, it would be optimal to have all branches that still get releases passing. I'm just afraid this is going to be a significant pain point.

I would say that I'd object to running all branches nightly; I don't think that's necessary, and it would be wasteful. If we want to keep security branches green, it should be enough to run them weekly. That would reduce build times by a lot. After some time, I can check in my work logs how much time I invested in it.
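
For what it's worth, a weekly run for security branches could be a separate scheduled workflow on the default branch that checks out each security branch via a matrix; a rough sketch, where the branch list and build steps are assumptions rather than the actual setup:

```yaml
# hypothetical weekly workflow for security branches
on:
  schedule:
    - cron: '0 2 * * 6'        # Saturdays, 02:00 UTC

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        branch: ['PHP-8.1']    # whichever branches are currently in security support
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
        with:
          # scheduled workflows run on the default branch, so the branch
          # to test has to be selected explicitly here
          ref: ${{ matrix.branch }}
      - run: ./buildconf && ./configure --enable-debug && make -j"$(nproc)"
      - run: make TEST_PHP_ARGS="-q -j$(nproc)" test
```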

We might also try to get to the bottom of #16100 (comment) by logging to some external service, so that the logs persist. It's annoying, but I don't see any other way to track this down, especially since it occurs intermittently.
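
Not sure how viable this is, but one way to make the logs survive a runner crash would be to stream the test output to an external endpoint while the job runs, instead of uploading it afterwards; a rough sketch with a made-up endpoint:

```yaml
# hypothetical step in the nightly workflow; https://logs.example.invalid
# is a placeholder, not an actual service
- name: Run tests (output streamed to an external log store)
  run: |
    # tee the output to an external endpoint as it is produced, so it
    # persists even if the runner later loses contact with GitHub
    ./sapi/cli/php run-tests.php -P -q -j"$(nproc)" 2>&1 \
      | tee >(curl -sS -T - "https://logs.example.invalid/php-nightly/${GITHUB_RUN_ID}")
```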

I would appreciate help on these though.

@cmb69
Member Author

cmb69 commented Oct 1, 2024

If we want to keep security branches green, it should be enough to run them weekly.

After reconsideration, I think it would even be okay to have workflow_dispatch triggers for the push actions of all branches. These could be run from time to time for those branches which don't get any commits (in practice, only security branches), and that should be sufficient to detect the most serious issues with CI. See #16148.
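
Roughly, this amounts to adding a manual trigger next to the existing ones in the push workflow; a minimal sketch (the exact shape of #16148 may differ, and the branch filters shown are assumptions):

```yaml
# hypothetical excerpt of the push workflow's trigger block
on:
  push:
    branches:
      - master
      - 'PHP-*'
  # allow running the same jobs manually for any branch, e.g. for
  # security branches that rarely receive pushes
  workflow_dispatch: ~
```

A run can then be started from the Actions tab or with `gh workflow run <workflow> --ref PHP-8.1`, for example.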

I'll try to help with getting nightly green; last night's run doesn't look too bad. And I doubt that we can resolve all these intermittent test failures anyway, especially when nightly runs so many jobs on hosted runners.

@cmb69
Member Author

cmb69 commented Oct 1, 2024

@iluuu1994, the nightly MSAN issue should now be fixed via 4a16940.

@iluuu1994
Member

Thank you!
