Owls88781 continue retrying to start a namespace in non-fullcheck case after hitting 403 #2315
DomainNamespaces keeps the status of each namespace that the operator manages, including a flag that indicates whether the namespace is already starting. During a regular domain recheck cycle (fullCheck = false), the isNamespaceStarting flag (an AtomicBoolean) is set using the getAndSet method, and once it is set, the operator never unsets it until the namespace is removed from the list of namespaces that the operator manages.
As a result, when an "unrecoverable" failure occurs while the operator starts a namespace, such as the API server returning HTTP status code 403, the namespace is skipped in the subsequent regular retry cycles because it is considered to be starting already. A skipped namespace may eventually be started in a fullCheck recheck. However, the NamespaceWatchingStarted event is generated only in the regular recheck cycle when the flag is not set (because it needs to be created only once for each namespace), so the event can be missed if there is a temporary failure condition.
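The skipping behavior described above can be reproduced with a minimal sketch. The class and method names here (NamespaceStatusSketch, shouldStart, StickyFlagDemo) are hypothetical and are not the operator's actual code; only the use of AtomicBoolean.getAndSet comes from the description.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the per-namespace status kept by DomainNamespaces.
class NamespaceStatusSketch {
    private final AtomicBoolean isNamespaceStarting = new AtomicBoolean(false);

    // Returns true only for the first caller; after that the flag stays set.
    boolean shouldStart() {
        return !isNamespaceStarting.getAndSet(true);
    }
}

class StickyFlagDemo {
    // Simulates regular recheck cycles in which every start attempt fails
    // (for example with a 403) and counts how many attempts are made.
    static int countStartAttempts(int cycles) {
        NamespaceStatusSketch status = new NamespaceStatusSketch();
        int attempts = 0;
        for (int i = 0; i < cycles; i++) {
            if (status.shouldStart()) {
                attempts++; // the attempt fails, but nothing clears the flag
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println(countStartAttempts(3)); // prints 1
    }
}
```

Only the first cycle attempts the start. Later cycles see the flag already set and skip the namespace, even though the first attempt failed.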
The change in this PR clears the isNamespaceStarting flag when the creation of the NamespaceWatchingStarted event fails, and sets it after the creation of the event succeeds. As a result, the operator attempts to start a previously failed namespace, and to generate the NamespaceWatchingStarted event, in each regular recheck cycle.
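A rough sketch of the flag handling after this change, under simplifying assumptions: RetryableNamespaceStart, recheck, and the BooleanSupplier standing in for event creation are all hypothetical names, and the real operator code performs the event creation through asynchronous Kubernetes API calls rather than a synchronous callback.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical model of one namespace's start state after the fix.
class RetryableNamespaceStart {
    private final AtomicBoolean isNamespaceStarting = new AtomicBoolean(false);

    // Runs one regular recheck cycle. createEvent is an assumed callback that
    // returns true when the NamespaceWatchingStarted event was created.
    // Returns true if a start attempt was made in this cycle.
    boolean recheck(BooleanSupplier createEvent) {
        if (isNamespaceStarting.getAndSet(true)) {
            return false; // already started, nothing to do
        }
        if (!createEvent.getAsBoolean()) {
            isNamespaceStarting.set(false); // failure: clear so the next cycle retries
        }
        return true;
    }

    boolean isStarted() {
        return isNamespaceStarting.get();
    }
}

class RetryDemo {
    // Feeds one outcome per cycle (true = event created) and counts attempts.
    static int countAttempts(boolean[] outcomes) {
        RetryableNamespaceStart ns = new RetryableNamespaceStart();
        int attempts = 0;
        for (boolean outcome : outcomes) {
            if (ns.recheck(() -> outcome)) {
                attempts++;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Two failures, then success, then one more cycle that is skipped.
        System.out.println(countAttempts(new boolean[] {false, false, true, true})); // prints 3
    }
}
```

In this sketch the flag stays set only after a successful event creation, so a failed namespace is retried in every regular cycle until it succeeds.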
The PR also adds a new type of event in the operator's namespace to record that the creation of the NamespaceWatchingStarted event failed in a domain's namespace with a 403 error.
Integration test run on Jenkins (hit a known intermittent issue in ItMonitoringExporter): https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4734/.
A rerun of ItMonitoringExporter has passed: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4739/.
A full rerun has passed: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4742/.