Document restarts #478

Merged on Nov 2, 2023 (11 commits): 54 changes, 51 additions and 3 deletions in `modules/concepts/pages/operations/cluster_operations.adoc`.

Stackable operators offer different cluster operations to control the reconciliation…
* `reconciliationPaused` - Stop the operator from reconciling the cluster spec. The status will still be updated.
* `stopped` - Stop all running pods but keep updating all deployed resources like `ConfigMaps`, `Services` and the cluster status.

If not specified, `clusterOperation.reconciliationPaused` and `clusterOperation.stopped` default to `false`.

== Example

[source,yaml]
----
include::example$cluster-operations.yaml[]
----
<1> The `clusterOperation.reconciliationPaused` flag set to `true` stops the operator from reconciling any changes to the cluster spec. The cluster status is still updated.
<2> The `clusterOperation.stopped` flag set to `true` stops all pods in the cluster. This is done by setting all deployed `StatefulSet` replicas to 0.
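
To check that a stopped stacklet has actually been scaled down, one option is to read `spec.replicas` from every StatefulSet of the stacklet. The following is a minimal sketch, not part of the Stackable tooling: the function name `stacklet_is_stopped` is hypothetical, and it assumes the operator labels the stacklet's StatefulSets with `app.kubernetes.io/instance=<stacklet name>`.

[source,shell]
----
# Hypothetical helper: succeeds only if every StatefulSet of the stacklet
# is scaled to 0 replicas.
# Assumption: the StatefulSets carry the label app.kubernetes.io/instance=<name>.
stacklet_is_stopped() {
  local instance="$1" replicas r
  replicas=$(kubectl get statefulset \
    --selector "app.kubernetes.io/instance=${instance}" \
    -o jsonpath='{.items[*].spec.replicas}')
  # No matching StatefulSets (or kubectl failed): report "not stopped".
  [ -n "$replicas" ] || return 1
  # Any non-zero replica count means at least one role is still running.
  for r in $replicas; do
    [ "$r" -eq 0 ] || return 1
  done
  return 0
}
----

For example, `stacklet_is_stopped dumbo && echo "dumbo is stopped"`.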

== Notes

IMPORTANT: When `clusterOperation.reconciliationPaused` and `clusterOperation.stopped` are both set to `true` in the same step, `clusterOperation.reconciliationPaused` takes precedence. The operator stops reconciling immediately, so the `stopped` field is ignored. To avoid this, first stop the cluster and then pause reconciliation.
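
The stop-then-pause order can be scripted with `kubectl patch`. This is a minimal sketch under several assumptions: the function name `stop_then_pause` is hypothetical, both flags are assumed to live under `spec.clusterOperation` of the custom resource (whose kind, for example `hdfscluster`, is passed in), the StatefulSets are assumed to carry the `app.kubernetes.io/instance` label, and the installed `kubectl` must support `kubectl wait --for=jsonpath=...`. The 10-minute timeout is an arbitrary choice.

[source,shell]
----
# Hypothetical helper: stop a stacklet, wait until the operator has scaled
# its StatefulSets down, and only then pause reconciliation.
# Assumption: the flags live under spec.clusterOperation of the custom resource.
stop_then_pause() {
  local kind="$1" name="$2"
  # Step 1: request the stop; the operator scales the StatefulSets to 0 replicas.
  kubectl patch "$kind" "$name" --type merge \
    -p '{"spec":{"clusterOperation":{"stopped":true}}}' || return 1
  # Step 2: wait until every StatefulSet of the stacklet reports 0 replicas,
  # i.e. the operator has applied the stop.
  kubectl wait statefulset --for=jsonpath='{.spec.replicas}'=0 \
    --selector "app.kubernetes.io/instance=${name}" --timeout=10m || return 1
  # Step 3: pause reconciliation now that the stop has taken effect.
  kubectl patch "$kind" "$name" --type merge \
    -p '{"spec":{"clusterOperation":{"reconciliationPaused":true}}}'
}
----

For the HDFS example used further below, this would be `stop_then_pause hdfscluster dumbo`. If the wait step fails or times out, the function returns early and does not pause reconciliation.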

== Service Restarts

=== Manual Restarts

Sometimes it is necessary to restart services deployed in Kubernetes. A service restart should induce as little disruption as possible, ideally none.

Most operators create StatefulSet objects for the products they manage and Kubernetes offers a rollout mechanism to restart them. You can use `kubectl rollout restart statefulset` to restart a StatefulSet previously created by an operator.

To illustrate how to use the command line to restart one or more Pods, we will assume you used the Stackable HDFS Operator to deploy an HDFS stacklet called `dumbo`.

Among other things, this stacklet consists of three StatefulSets, one for each HDFS role: `namenode`, `datanode` and `journalnode`. Let's list them:

[source,shell]
----
❯ kubectl get sts -l app.kubernetes.io/instance=dumbo
NAME                        READY   AGE
dumbo-datanode-default      2/2     4m41s
dumbo-journalnode-default   1/1     4m41s
dumbo-namenode-default      2/2     4m41s
----

To restart the HDFS data node Pods, run:

[source,shell]
----
❯ kubectl rollout restart statefulset dumbo-datanode-default
statefulset.apps/dumbo-datanode-default restarted
----
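
`kubectl rollout restart` does not delete Pods directly. It writes the current time into the `kubectl.kubernetes.io/restartedAt` annotation of the StatefulSet's Pod template; because the template changed, the StatefulSet controller then replaces the Pods in a rolling update (assuming the default `RollingUpdate` update strategy). The following minimal sketch reads that annotation back to confirm when a StatefulSet was last restarted; the function name `last_restart` is hypothetical.

[source,shell]
----
# Hypothetical helper: print the time of the last `kubectl rollout restart`
# of a StatefulSet, read from its Pod template annotation.
# Prints nothing if the StatefulSet has never been restarted this way.
last_restart() {
  kubectl get statefulset "$1" \
    -o jsonpath='{.spec.template.metadata.annotations.kubectl\.kubernetes\.io/restartedAt}'
}
----

For example, `last_restart dumbo-datanode-default`.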

Sometimes you want to restart all Pods of a stacklet, not just those of individual roles. This works the same way, using a label selector instead of StatefulSet names. Continuing the example above, to restart all HDFS Pods, run:

[source,shell]
----
❯ kubectl rollout restart statefulset --selector app.kubernetes.io/instance=dumbo
----

To wait for all Pods to be running again:

[source,shell]
----
❯ kubectl rollout status statefulset --selector app.kubernetes.io/instance=dumbo
----

Here the label `app.kubernetes.io/instance=dumbo` selects all StatefulSets, and therefore all Pods, that belong to a specific HDFS stacklet. The operator sets this label, and `dumbo` is the name of the HDFS stacklet as specified in the custom resource. You can add more labels to the selector for finer-grained restarts.
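
For example, restarting a single role and waiting for it to come back can be combined into one step with a second label. This is a minimal sketch that assumes the operator also labels each StatefulSet with `app.kubernetes.io/component=<role>` (for example `datanode`); check the actual labels with `kubectl get sts --show-labels` before relying on it. The function name `restart_role` and the 10-minute timeout are hypothetical choices.

[source,shell]
----
# Hypothetical helper: restart the StatefulSets of one role of a stacklet
# and block until the rollout has finished.
# Assumption: the operator labels StatefulSets with
#   app.kubernetes.io/instance=<stacklet> and app.kubernetes.io/component=<role>.
restart_role() {
  local instance="$1" role="$2"
  # A comma-separated selector matches only objects carrying both labels.
  local selector="app.kubernetes.io/instance=${instance},app.kubernetes.io/component=${role}"
  kubectl rollout restart statefulset --selector "$selector" || return 1
  # Wait for the restarted StatefulSets to become ready again.
  kubectl rollout status statefulset --selector "$selector" --timeout=10m
}
----

For example, `restart_role dumbo datanode`.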

=== Automatic Restarts

The Commons Operator of the Stackable Platform may restart Pods automatically, for purposes such as ensuring that security certificates are up to date. For details, see the xref:commons-operator:index.adoc[Commons Operator documentation].