Owls88781 continue retry start a namespace in non-fullcheck case after hitting 403 #2315

Merged
merged 5 commits into main from owls88781-retry-on-403
Apr 19, 2021

@doxiao doxiao commented Apr 15, 2021

DomainNamespaces keeps the status of each namespace that the operator manages, including a flag that indicates whether the namespace is already being started. During a regular domain recheck cycle (fullCheck = false), the isNamespaceStarting flag (an AtomicBoolean) is set using the getAndSet method, and once it is set, the operator never clears it until the namespace is removed from the list of namespaces that the operator manages.

As a result, when the operator hits an "unrecoverable" failure while starting a namespace, such as the API server returning HTTP status code 403, the namespace is skipped in subsequent regular retry cycles because it is already considered to be starting. A skipped namespace may eventually be started in a fullCheck recheck. However, the NamespaceWatchingStarted event is only generated in the regular recheck cycle when the flag is not set (because we only need to create it once for each namespace), so the event can be missed if there is a temporary failure condition.
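
To illustrate the problem described above, here is a minimal, single-threaded sketch of a flag that is set with getAndSet and never cleared. The class `StickyFlagSketch`, the method `regularRecheck`, and its `apiServerReturns403` parameter are hypothetical names made up for this illustration; this is not the operator's actual code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the behavior before this PR.
class StickyFlagSketch {
  // Stand-in for the per-namespace isNamespaceStarting flag.
  private final AtomicBoolean isNamespaceStarting = new AtomicBoolean(false);
  private boolean started = false;

  // Models one regular recheck cycle (fullCheck = false).
  // Returns true if this cycle attempted to start the namespace.
  boolean regularRecheck(boolean apiServerReturns403) {
    // getAndSet returns the previous value, so only the first cycle sees false.
    if (isNamespaceStarting.getAndSet(true)) {
      return false; // skipped: the namespace is considered to be starting already
    }
    // The start succeeds only if the API server does not reject the request.
    started = !apiServerReturns403;
    // Nothing resets the flag on failure, so later cycles never reach this point.
    return true;
  }

  boolean isStarted() {
    return started;
  }

  public static void main(String[] args) {
    StickyFlagSketch ns = new StickyFlagSketch();
    System.out.println(ns.regularRecheck(true));  // attempted, but hit the 403
    System.out.println(ns.regularRecheck(false)); // skipped, even though the 403 is gone
    System.out.println(ns.isStarted());           // the namespace was never started
  }
}
```

In this sketch the second cycle is skipped although the failure condition has cleared, which mirrors the skipped namespace and the missed event.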

The change in this PR clears the isNamespaceStarting flag after the creation of the NamespaceWatchingStarted event fails, and sets it after the creation of the event succeeds. The operator will therefore attempt to start a previously failed namespace, and to generate the NamespaceWatchingStarted event, in each regular recheck cycle.
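
The intended behavior can be sketched roughly as follows, assuming a simplified single-threaded model in which the outcome of the event creation is passed in as an HTTP status code. The names `RetryOn403Sketch`, `regularRecheck`, and `eventCreationStatus` are hypothetical and are not taken from the PR's diff.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the behavior after this PR.
class RetryOn403Sketch {
  // Stand-in for the per-namespace isNamespaceStarting flag.
  private final AtomicBoolean isNamespaceStarting = new AtomicBoolean(false);
  private int eventCreationAttempts = 0;

  // Models one regular recheck cycle (fullCheck = false).
  // eventCreationStatus simulates the HTTP status returned by the API server
  // when the NamespaceWatchingStarted event is created.
  // Returns true if the event was created in this cycle.
  boolean regularRecheck(int eventCreationStatus) {
    if (isNamespaceStarting.get()) {
      return false; // the event was already created in an earlier cycle
    }
    eventCreationAttempts++;
    boolean created = eventCreationStatus >= 200 && eventCreationStatus < 300;
    // Set the flag on success; clear it on failure so the next cycle retries.
    isNamespaceStarting.set(created);
    return created;
  }

  int getEventCreationAttempts() {
    return eventCreationAttempts;
  }

  public static void main(String[] args) {
    RetryOn403Sketch ns = new RetryOn403Sketch();
    ns.regularRecheck(403); // fails, flag stays cleared
    ns.regularRecheck(403); // retried in the next regular cycle
    ns.regularRecheck(201); // succeeds, flag is now set
    ns.regularRecheck(201); // skipped, nothing left to do
    System.out.println(ns.getEventCreationAttempts());
  }
}
```

The point of the sketch is only that a failed attempt leaves the flag cleared, so each regular cycle retries until the event creation succeeds.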

The PR also adds a new type of event in the operator's namespace to record that the creation of the NamespaceWatchingStarted event failed in a domain's namespace with a 403 error.

Integration test run on Jenkins (hit a known intermittent issue in ItMonitoringExporter): https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4734/.
A rerun of ItMonitoringExporter has passed: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4739/.
A full rerun has passed: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/4742/.

@russgold russgold left a comment

Looks good, other than one issue, which needs refactoring.

@doxiao doxiao requested a review from russgold April 19, 2021 13:22
@rjeberhard rjeberhard merged commit dd4ecaa into main Apr 19, 2021
@rjeberhard rjeberhard deleted the owls88781-retry-on-403 branch January 31, 2022 14:29

3 participants