docs: stop or not operator at startup in case of informer errors #1577

Merged
merged 2 commits into from
Oct 31, 2022
20 changes: 18 additions & 2 deletions docs/documentation/patterns-best-practices.md
@@ -84,7 +84,7 @@ possible to completely deactivate the feature, though we advise against it. The
configure automatic retries for your `Reconciler` is due to the fact that errors occur quite
often due to the distributed nature of Kubernetes: transient network errors can be easily dealt
with by automatic retries. Similarly, resources can be modified by different actors at the same
time so it's not unheard of to get conflicts when working with Kubernetes resources. Such
time, so it's not unheard of to get conflicts when working with Kubernetes resources. Such
conflicts can usually be quite naturally resolved by reconciling the resource again. If it's
done automatically, the whole process can be completely transparent.

@@ -94,7 +94,7 @@ Thanks to the declarative nature of Kubernetes resources, operators that deal on
Kubernetes resources can operate in a stateless fashion, i.e. they do not need to maintain
information about the state of these resources, as it should be possible to completely rebuild
the resource state from its representation (that's what declarative means, after all).
However, this usually doesn't hold true anymore when dealing with external resources and it
However, this usually doesn't hold true anymore when dealing with external resources, and it
might be necessary for the operator to keep track of this external state so that it is available
when another reconciliation occurs. While such state could be put in the primary resource's
status sub-resource, this could become quickly difficult to manage if a lot of state needs to be
@@ -105,3 +105,19 @@ advised to put such state into a separate resource meant for this purpose such a
Kubernetes Secret or ConfigMap or even a dedicated Custom Resource, whose structure can be more
easily validated.

## Stopping (or not) Operator in case of Informer Errors

Whether the operator should stop when an informer error happens on startup can be
[configured](https://github.com/java-operator-sdk/java-operator-sdk/blob/2cb616c4c4fd0094ee6e3a0ef2a0ea82173372bf/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L168-L168).
By default, if there is an error on startup, for example because the informer has no permission to list the target
resources (whether the primary resource or secondary resources), the operator stops immediately. This behavior can be
altered by setting the mentioned flag to `false`, so that the operator starts even if some informers could not be
started. In this case, just as when an informer starts successfully but experiences problems later, the connection is
retried indefinitely with an exponential backoff. The operator then stops only if there is a fatal
error, [currently](https://github.com/java-operator-sdk/java-operator-sdk/blob/0e55c640bf8be418bc004e51a6ae2dcf7134c688/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java#L64-L66)
that is when a resource cannot be deserialized. The typical use case for changing this flag is when a controller
watches a list of namespaces. It is better to start the operator, so that it can handle the other namespaces, even
though there might be a permission issue for some resources in another namespace.
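
The behavior described above can be modeled with a small, self-contained sketch. It is not the SDK's implementation and does not call the SDK: the class `InformerStartupSketch`, the `Outcome` enum, and both method names are hypothetical, and the backoff values used in the usage note (1 second initial delay, multiplier 2, 10 second cap) are assumptions chosen for illustration only.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the startup decision; names are illustrative, not SDK API.
final class InformerStartupSketch {

  enum Outcome { STARTED, STARTED_WITH_RETRIES, STOPPED }

  /**
   * Decides what happens at startup.
   *
   * @param informerStartedOk per-namespace flag: true if the informer for that namespace started
   * @param stopOnStartupError models the flag discussed above (assumed to default to true)
   */
  static Outcome startup(Map<String, Boolean> informerStartedOk, boolean stopOnStartupError) {
    boolean anyFailed = informerStartedOk.containsValue(false);
    if (!anyFailed) {
      return Outcome.STARTED;
    }
    // With the flag set to false the operator keeps running and the
    // failed informers are retried in the background.
    return stopOnStartupError ? Outcome.STOPPED : Outcome.STARTED_WITH_RETRIES;
  }

  /** Computes the first {@code attempts} delays of a capped exponential backoff. */
  static List<Duration> backoffDelays(
      Duration initial, double multiplier, Duration max, int attempts) {
    List<Duration> delays = new ArrayList<>();
    long currentMillis = initial.toMillis();
    for (int i = 0; i < attempts; i++) {
      delays.add(Duration.ofMillis(currentMillis));
      // Grow the delay, but never beyond the configured maximum.
      currentMillis = Math.min((long) (currentMillis * multiplier), max.toMillis());
    }
    return delays;
  }
}
```

With two watched namespaces where only one informer fails, `startup(Map.of("team-a", true, "team-b", false), false)` yields `STARTED_WITH_RETRIES` in this model, while passing `true` yields `STOPPED`. The retries never give up in the scenario described above; the sketch only lists the first few delays to show how the wait grows and then levels off at the cap.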


Original file line number Diff line number Diff line change
@@ -22,8 +22,8 @@
import static org.junit.jupiter.api.Assertions.assertThrows;

/**
* The test relies on a special minikube configuration: "min-request-timeout" to have a very low
* value, see: "minikube start --extra-config=apiserver.min-request-timeout=3"
* The test relies on a special api server configuration: "min-request-timeout" to have a very low
* value, use: "minikube start --extra-config=apiserver.min-request-timeout=3"
*
* <p>
* This is important when tests are affected by permission changes, since the watch permissions are