Downtime of products is always bad, but sometimes Pods need to be restarted to roll out updates or renew certificates.
To prevent services from becoming unavailable, we need to make sure that a certain number of Pods always stays online while Pods are being restarted.
Kubernetes has a concept called https://kubernetes.io/docs/tasks/run-application/configure-pdb/[PodDisruptionBudget] (PDB) to define either the number of Pods that need to be kept online or the number of Pods that can safely be taken offline.
We want to use this functionality to either prevent service outages entirely or keep them to a minimum.
PDBs are defined not on a StatefulSet or Deployment, but with a selector over labels, so they can also span Pods from multiple StatefulSets.
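For illustration, a minimal PDB that selects Pods purely by labels could look like the following sketch (the name and label values are placeholders, not something our operators generate):

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb # placeholder name
spec:
  maxUnavailable: 1 # at most one of the selected Pods may be voluntarily evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: example-product # selects Pods by label, regardless of the owning StatefulSet
----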

=== Example use-cases

1. As a user I want an HDFS cluster, and it (or parts of it) should not be disturbed by planned Pod evictions (for example for a certificate renewal). I expect this to be the default behaviour.
2. As a user I want to configure maxUnavailable on the role (e.g. datanode) across all role groups (e.g. with a dfs replication of 3, only a single datanode is allowed to go down, regardless of the number of role groups), so that no datanode is a single point of failure. Similarly for ZooKeeper, I want to define PDBs at role level, as the ZooKeeper quorum is independent of role groups.
3. As a user I want to override the defaults to trade some availability for faster rollout times in rolling redeployments; for example, a Trino cluster can take more than 6 hours to redeploy, as the graceful shutdown of Trino workers takes a considerable amount of time, depending on the queries being executed.
4. As a user I want to configure maxUnavailable on role groups individually, as I e.g. have some fast datanodes using SSDs and some slow datanodes using HDDs. I always want X fast datanodes online for performance reasons.
5. As a user I want a Superset/NiFi/Kafka cluster, and it (or parts of it) should not be disturbed by planned Pod evictions.
6. As a user I might want to define PDBs across roles or on other specific Pod selections; in that case I want to be able to disable the Stackable-generated PDBs.

We expect the majority of users to either use the default PDB settings or define PDBs at role level. Role group configuration as in use-case 4 has merit, but seems like a more niche scenario.

=== Technical considerations

If we use https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors[arbitrary workloads and arbitrary selectors] (for example when selecting Pods from multiple StatefulSets), we have the following constraints:

* Only `.spec.minAvailable` can be used, not `.spec.maxUnavailable`.
* Only an integer value can be used with `.spec.minAvailable`, not a percentage.

This means that if we select any Pods that are not part of a single StatefulSet, Deployment etc., we are bound by these constraints. Preliminary testing showed, however, that `.spec.maxUnavailable` does work with Pods from multiple StatefulSets.

The Kubernetes documentation also notes that a selector may select a subset or superset of the Pods belonging to a workload resource, but the eviction API will disallow eviction of any Pod covered by multiple PDBs, so overlapping selectors should be avoided.

We therefore need to create PDBs in such a way that every Pod is selected exactly once. This is easiest if a selector is defined per role or for every role group individually. Excluding certain labels is also possible by using match expressions, but we did not test whether this conflicts with the first constraint about arbitrary selectors.
To support users creating their own custom PDBs, we need to support disabling PDB generation, so that selectors do not overlap.
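To illustrate use-case 6, a user-defined PDB spanning two roles might look like the sketch below. The cluster name, label values and the `minAvailable` number are hypothetical; because this is an arbitrary selector, the sketch sticks to the documented constraint of an integer `.spec.minAvailable`. The Stackable-generated PDBs for the affected roles would have to be disabled so that no Pod is covered by more than one PDB.

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-name-and-journalnodes # hypothetical user-managed PDB
spec:
  minAvailable: 3 # integer value, as required for arbitrary selectors
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
    matchExpressions:
      - key: app.kubernetes.io/component
        operator: In
        values: ["namenode", "journalnode"]
----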

== Decision drivers

* Common use-cases should be easy to configure.
* Principle of least surprise: CRD configuration settings, and their interactions when multiple settings are combined, need to be easy to comprehend to prevent user error.
* Extendable design, so that we can later add new functionality in a non-breaking way, such as the option to also configure PDBs at role group level.
* Simple implementation (far less important).

== Decision outcome

Chosen option: *Option 1: `roleConfig` at role level*.
Most users probably either don't know what PDBs are or are fine with the default values our operators deploy based on our knowledge of the products, and Option 1 covers exactly these cases with a simple configuration surface.

== Considered options

=== Option 1: `roleConfig` at role level

Introduce a new `roleConfig` section at role level and put the PDB settings in there.
Only role level PDBs are supported; for role group level PDBs, the generated PDBs have to be disabled and the user needs to create their own PDBs manually.
The `roleConfig` section is introduced so that the PDB settings do not sit directly in the role.

[source,yaml]
----
spec:
  nameNodes:
    roleConfig: # <<<
      podDisruptionBudget: # optional
        enabled: true # optional, defaults to true
        maxUnavailable: 1 # optional, defaults to our "smart" calculation
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    # use pdb defaults
    roleGroups:
      default:
        replicas: 2
----
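For reference, with the defaults shown above the operator would create one PDB per role, roughly like the following sketch (assuming a cluster named `simple-hdfs` and the standard Stackable labels; the concrete `maxUnavailable` values come from the "smart" calculation and are only illustrative here):

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-namenodes
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: namenode # selects the whole role, across all role groups
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes
spec:
  maxUnavailable: 2 # assuming dfs replication 3
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
----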

==== Pros

* Simple to understand
* Covers the majority of use cases
* Still leaves the option to disable the generated PDBs and roll your own

==== Cons

* Yet another "config" (`config`, `clusterConfig` and now `roleConfig` as well)
** Then again, that is how the real world works: some things are configured at cluster level (e.g. LDAP), some at role level (PDBs) and some at role group level (resources). This models that most closely.
* It is not possible to define PDBs on role groups without the user deploying their own PDBs.

NOTE: Putting the PDB settings directly into the role without a `roleConfig` was briefly discussed but dismissed as too messy, so it is not listed as an explicit option here.

=== Option 2a: PDB in `config`, but only at role level

Instead of inventing a new `roleConfig` setting, put the PDB in the `config`.
This might seem better at first, but usually settings in `config` can also be set at role group level, and in this case that would not be true.

[source,yaml]
----
spec:
  nameNodes:
    config: # <<<
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 1
    roleGroups:
      default:
        replicas: 2
        config: {}
          # no such field as podDisruptionBudget
----

==== Pros

* Everything configurable is below `config`, no new `roleConfig`
* Like Option 1, covers configuration of the most important use cases

==== Cons

* `spec.nameNodes.config` is *not* analogous to `spec.nameNodes.roleGroups.default.config` => confusing for the user
** It might also be confusing that the setting is not "copied" to all role groups like other settings such as resources or affinities.
* Still no option to configure role group level PDBs
* Possibly complicated to implement, because `config` is usually identical at role and role group level
126
+
=== Option 2b: PDB in config with elaborate merge mechanism
127
+
128
+
Similar to Option 2a, the PDB setting is located in the `config` but it is actually possible to use it at both role and role group level.
129
+
We develop a semantic merge mechanism that would prevent overlapping PDBs.
130
+
131
+
.CRD Example
64
132
[%collapsible]
65
133
====
66
134
[source,yaml]
----
# ...
----
====

==== Pros

* Fits into the existing config structure
* Allows configuring role group level PDBs and even hybrid configurations

==== Cons

* The complex merge mechanism is possibly difficult to understand and therefore easy to use the wrong way
* The complex mechanism is also not trivial to implement

=== Option 2c: PDB in `config` with the normal "shared role group config" behaviour

Again we put the PDB in the `config` section, but simply use the normal "copy" behaviour for this setting: the role level value is merged into every role group, so each role group ends up with its own PDB.
This would be simple and easy to understand, but does not allow for true role level PDBs.

.CRD Example
[%collapsible]
====
[source,yaml]
----
spec:
  # ...
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: in-memory
----
====

==== Pros

* Easy to understand
* Easy to implement
* Works the same as all other config settings

==== Cons

* Does not support the common use case of role level PDBs