Add ADR on Pod disruptions #452
Merged
Commits (19):
- 12be3a6 Add ADR on Pod disruptions (sbernauer)
- 8865326 WIP (sbernauer)
- e0b6042 docs (sbernauer)
- d546256 WIP (sbernauer)
- a20dfa1 WIP (sbernauer)
- 01d49ee typo (sbernauer)
- cb08ab3 todo (sbernauer)
- 2efcd8c Update ADR030-reduce-pod-disruptions.adoc (sbernauer)
- 7e292c0 Merge remote-tracking branch 'origin/main' into adr-pdb (sbernauer)
- 4c0e74e fixes (sbernauer)
- 713b5ea Add to nav (sbernauer)
- 4d5de77 WIP (sbernauer)
- ff692c7 WIP (sbernauer)
- 04d4247 updated ADR (fhennig)
- 1d4836d renamed options (fhennig)
- 25dba57 Update modules/contributor/pages/adr/ADR030-reduce-pod-disruptions.adoc (sbernauer)
- 8474bf8 Rename to "Allowed Pod disruptions" (sbernauer)
- ba3a65f Merge remote-tracking branch 'origin/main' into adr-pdb (sbernauer)
- ca7abd3 Link from concepts apge to ADR (sbernauer)
File changed: modules/contributor/pages/adr/ADR030-allowed-pod-disruptions.adoc (351 additions, 0 deletions)

= ADR030: Allowed Pod disruptions
Sebastian Bernauer <sebastian.bernauer.tech>
v0.1, 2023-09-15
:status: accepted

* Status: {status}
* Deciders:
** Felix Hennig
** Lars Francke
** Sascha Lautenschläger
** Sebastian Bernauer
** Sönke Liebau
* Date: 2023-09-15

== Context and problem statement

Downtime of products is always bad, but sometimes Pods need to be restarted to roll out updates or renew certificates.
To prevent services from becoming unavailable, we need to make sure that a certain number of Pods always stays online while Pods are being restarted.
Kubernetes has a concept called https://kubernetes.io/docs/tasks/run-application/configure-pdb/[PodDisruptionBudget] (PDB) to define either the number of Pods
that need to be kept online or the number of Pods that can safely be taken offline.
We want to use this functionality to either prevent service outages entirely or keep them to a minimum.
PDBs are not defined on a StatefulSet or Deployment, but with a selector over labels, so they can also span Pods from multiple StatefulSets.
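
For illustration, a minimal PDB that allows at most one of the selected Pods to be voluntarily disrupted at a time could look like the following sketch (the object name and label values are made up for this example):

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes # illustrative name
spec:
  maxUnavailable: 1 # alternatively, minAvailable can be specified
  selector:
    # Pods are selected purely by labels, independent of any single StatefulSet
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
----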

=== Example use-cases

1. As a user I want an HDFS cluster, and neither it nor parts of it should be disturbed by planned Pod evictions (for example for a certificate renewal). I expect this to be the default behaviour.
2. As a user I want to configure maxUnavailable on the role (e.g. datanode) across all role groups (e.g. with a dfs replication factor of 3, only a single datanode is allowed to go down, regardless of the number of role groups), so that no datanode is a single point of failure. Similarly for ZooKeeper, I want to define PDBs at role level, as the ZooKeeper quorum is independent of role groups.
3. As a user I want to override the defaults to trade some availability for faster rollout times in rolling redeployments; for example a Trino cluster could take more than 6 hours to redeploy in a rolling fashion, as the graceful shutdown of Trino workers takes a considerable amount of time, depending on the queries being executed.
4. As a user I want to configure maxUnavailable on role groups individually, because I e.g. have some fast datanodes using SSDs and some slow datanodes using HDDs. I always want X fast datanodes online for performance reasons.
5. As a user I want Superset/NiFi/Kafka clusters, and they (or parts of them) should not be disturbed by planned Pod evictions.
6. As a user I might want to define PDBs across roles or on other specific Pod selections; in that case I want to be able to disable the Stackable-generated PDBs (see the sketch below this list).
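
To illustrate use-case 6, a hand-written PDB spanning two roles might look like the following sketch (label values and numbers are made up; the generated PDBs would have to be disabled so that no Pod is covered by more than one PDB):

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-namenodes-and-journalnodes # illustrative, user-managed PDB
spec:
  minAvailable: 4 # arbitrary selectors only allow an integer minAvailable, see the technical considerations below
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
    matchExpressions:
      - key: app.kubernetes.io/component
        operator: In
        values:
          - namenode
          - journalnode
----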

We expect the majority of users to either use default PDB settings or define PDBs at a role level. Role group configuration like in use-case 4 has merit but seems like a more niche usage scenario.

=== Technical considerations

We have the following constraints:

If we use https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors[arbitrary workloads and arbitrary selectors] (for example when selecting Pods from multiple StatefulSets), then

* only `.spec.minAvailable` can be used, not `.spec.maxUnavailable`, and
* only an integer value can be used with `.spec.minAvailable`, not a percentage.

This means that if we select any Pods that are not part of a StatefulSet, Deployment or similar workload resource, we are bound by these constraints. Preliminary testing showed that `.spec.maxUnavailable` does work when selecting Pods from multiple StatefulSets.

The Kubernetes documentation also notes: "You can use a selector which selects a subset or superset of the pods belonging to a workload resource. The eviction API will disallow eviction of any pod covered by multiple PDBs, so most users will want to avoid overlapping selectors."

We need to create PDBs in such a way that every Pod is only selected once. This is easiest if a selector is defined per role or for all role groups individually. Excluding certain labels is also possible by using match expressions, but we did not test whether this conflicts with the first constraint about arbitrary selectors.
To support users creating their own custom PDBs, we need to support disabling PDB generation to prevent overlapping selectors.
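
To illustrate the overlap problem (all values below are made up): if we generated both a role level and a role group level PDB, a datanode Pod in the `hdd` role group would be covered by two PDBs, and the eviction API would refuse to evict it.

[source,yaml]
----
# Both PDBs below select the datanode Pods of role group "hdd".
# A Pod covered by more than one PDB cannot be evicted via the eviction API,
# so we must never generate such an overlapping pair.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes # role level
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdd # role group level
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
----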

== Decision drivers

* Common use-cases should be easy to configure.
* Principle of least surprise: CRD configuration settings and their interactions in case of multiple settings need to be easy to comprehend to prevent user error.
* Extendable design, so that we can later add new functionality in a non-breaking way, such as the option to also configure PDBs at role group level.
* Simple implementation (far less important)

== Decision outcome

Option 1 was picked.

== Considered options

=== Option 1 - PDB in a new `roleConfig` at role level

Introduce a new `roleConfig` section at role level and put the PDB settings in there. Only role level PDBs are supported; for role group level PDBs, the generated PDBs have to be disabled and the user needs to create their own PDBs manually. The `roleConfig` section is introduced to avoid putting the PDB setting directly into the role.

[source,yaml]
----
spec:
  nameNodes:
    roleConfig: # <<<
      podDisruptionBudget: # optional
        enabled: true # optional, defaults to true
        maxUnavailable: 1 # optional, defaults to our "smart" calculation
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    # use pdb defaults
    roleGroups:
      default:
        replicas: 2
----
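
For the `nameNodes` example above, the generated role level PDB could then look roughly like the following sketch (the object name is illustrative); because the selector contains no role group label, it spans the Pods of all role groups of the role:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-namenodes # illustrative name
spec:
  maxUnavailable: 1
  selector:
    matchLabels: # no rolegroup label, so the PDB covers every role group of the role
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: namenode
----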

==== Pros

* Simple to understand
* Covers the majority of use cases
* Still leaves the option to disable the generated PDBs and roll your own

==== Cons

* Yet another "config" (config, clusterConfig and now roleConfig as well)
** That is, however, how the real world works: there are some things you can configure at cluster level (e.g. LDAP), at role level (PDBs) and at role group level (resources). This structure models that most closely.
* It's not possible to define PDBs on role groups without the user deploying their own PDBs.

NOTE: During the discussion, placing the PDB settings directly in the role without a `roleConfig` was briefly considered, but rejected as too messy, so it is not listed as an explicit option here.

=== Option 2 - PDB in `config`, but only at role level

Instead of inventing a new `roleConfig` setting, put the PDB settings in `config`. This might seem better at first, but settings in `config` can usually also be set at role group level, and in this case that would not be true.

[source,yaml]
----
spec:
  nameNodes:
    config: # <<<
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 1
    roleGroups:
      default:
        replicas: 2
        config: {}
        # no such field as podDisruptionBudget
----

==== Pros

* Everything configurable is below `config`, no new `roleConfig`
* Like Option 1, covers configuration of the most important use cases

==== Cons

* `spec.nameNodes.config` is *not* similar to `spec.nameNodes.roleGroups.default.config` => confusing to the user
** Thinking about it further, it might be confusing that the setting is not "copied" to all role groups like other settings such as resources or affinities.
* Still no option to configure role group level PDBs
* Possibly complicated to implement, due to `config` usually being identical at role and role group level

=== Option 3 - PDB in `config` with an elaborate merge mechanism

Similar to Option 2, the PDB setting is located in `config`, but it is actually possible to use it at both role and role group level.
We would develop a semantic merge mechanism that prevents overlapping PDBs.

.CRD Example
[%collapsible]
====
[source,yaml]
----
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  image:
    productVersion: 3.3.4
  clusterConfig:
    zookeeperConfigMapName: simple-hdfs-znode
  dataNodes:
    config:
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
        config:
          podDisruptionBudget:
            maxUnavailable: 4
      ssd:
        replicas: 8
        config:
          podDisruptionBudget:
            enabled: false
      in-memory:
        replicas: 4
----

would end up with something like

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-not-hdds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
    matchExpressions:
      - key: app.kubernetes.io/rolegroup
        operator: NotIn
        values:
          - hdd
      - key: app.kubernetes.io/rolegroup
        operator: NotIn
        values:
          - ssd
----
====

==== Pros

* Fits into the existing config structure
* Allows configuring role group level PDBs and even hybrid configurations

==== Cons

* The complex merge mechanism is possibly difficult to understand and therefore easy to use the wrong way
* The complex mechanism is also not trivial to implement

=== Option 4 - PDB in `config` with normal "shared role group config" behaviour

Again, we put the PDB in the `config` section, but simply use the normal "copy" behaviour for this setting.
This would be simple and easy to understand, but does not allow for true role level PDBs.

.CRD Example
[%collapsible]
====
[source,yaml]
----
spec:
  dataNodes:
    config:
      podDisruptionBudget:
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
      ssd:
        replicas: 8
      in-memory:
        replicas: 4
----

would end up with

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-ssds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: ssd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-in-memory
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: in-memory
----

[source,yaml]
----
spec:
  dataNodes:
    config:
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
        config:
          podDisruptionBudget:
            maxUnavailable: 4
      ssd:
        replicas: 8
        config:
          podDisruptionBudget:
            enabled: false
      in-memory:
        replicas: 4
----

would end up with

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-in-memory
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: in-memory
----
====

==== Pros

* Easy to understand
* Easy to implement
* Works the same as all other config

==== Cons

* Does not support the common use case of role level PDBs