Description
Describe the Bug
Doing an oc apply of the attached YAML for a RayCluster (broken-raycluster.yaml.txt), where the head pod has a user-provided imagePullSecret, results in an endless reconciliation loop in which the head pod is rapidly deleted and recreated. It looks to me like the logic added in #601 is not acting as intended.
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Running RHOAI 2.16 on OpenShift 4.14.
Steps to Reproduce the Bug
- oc apply the yaml for the RayCluster
- observe an endless cascade of head pods being rapidly deleted
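For readers without the attachment, a manifest of the shape described would look roughly like the sketch below. It is illustrative only and is not the contents of broken-raycluster.yaml.txt: the cluster name, secret name, and image are placeholders, and it assumes the ray.io/v1 RayCluster API.

```yaml
# Hypothetical minimal RayCluster; every name and the image are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-raycluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        # User-provided pull secret on the head pod, the field this report is about.
        imagePullSecrets:
          - name: my-registry-secret
        containers:
          - name: ray-head
            image: registry.example.com/ray:placeholder
```

Removing the imagePullSecrets entry from a manifest like this corresponds to the workaround described under "What Have You Already Tried to Debug the Issue?".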
What Have You Already Tried to Debug the Issue?
Have recreated the problem multiple times. Happens reliably.
Have verified that removing the imagePullSecret from the head pod specification results in the RayCluster being created successfully, as expected.
Expected Behavior
Users should be able to provide imagePullSecrets for the head node of a RayCluster to enable the use of private registries.
Screenshots, Console Output, Logs, etc.
I've attached the relevant log snippets from the codeflare-operator (codeflare-log.txt) and the kuberay-operator (kuberay.txt).