
Commit bf48280

sbernauer and fhennig authored: Add ADR on Pod disruptions (#452)

* Add ADR on Pod disruptions
* WIP
* docs
* WIP
* WIP
* typo
* todo
* Update ADR030-reduce-pod-disruptions.adoc
* fixes
* Add to nav
* WIP
* WIP
* updated ADR
* renamed options
* Update modules/contributor/pages/adr/ADR030-reduce-pod-disruptions.adoc
* Rename to "Allowed Pod disruptions"
* Link from concepts apge to ADR

Co-authored-by: Felix Hennig <mail@felixhennig.com>

1 parent b6a3c7a commit bf48280

File tree

3 files changed: +353 -1 lines changed

modules/concepts/pages/operations/pod_disruptions.adoc

Lines changed: 1 addition & 1 deletion
@@ -91,4 +91,4 @@ spec:
 This PDB allows only one Pod out of all the Namenodes and Journalnodes to be down at one time.

 == Details
-Have a look at <<< TODO: link ADR on Pod Disruptions once merged >>> for the implementation details.
+Have a look at the xref:contributor:adr/ADR030-allowed-pod-disruptions.adoc[ADR on Allowed Pod disruptions] for the implementation details.
modules/contributor/pages/adr/ADR030-allowed-pod-disruptions.adoc

Lines changed: 351 additions & 0 deletions
@@ -0,0 +1,351 @@
= ADR030: Allowed Pod disruptions
Sebastian Bernauer <sebastian.bernauer.tech>
v0.1, 2023-09-15
:status: accepted

* Status: {status}
* Deciders:
** Felix Hennig
** Lars Francke
** Sascha Lautenschläger
** Sebastian Bernauer
** Sönke Liebau
* Date: 2023-09-15

== Context and problem statement

Downtime of products is always bad, but sometimes Pods need to be restarted to roll out updates or renew certificates.
To prevent services from becoming unavailable, we need to make sure that a certain number of Pods is always still online while Pods are being restarted.
Kubernetes has a concept called https://kubernetes.io/docs/tasks/run-application/configure-pdb/[PodDisruptionBudget] (PDB) to define the number of Pods that need to be kept online or the number of Pods that can safely be taken offline.
We want to use this functionality to either prevent service outages entirely or at least keep them to a minimum.
PDBs are not defined on a StatefulSet or Deployment, but via a selector over labels, so they can also span Pods from multiple StatefulSets.
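
For reference, a minimal PDB as described above could look like this; a generic sketch (not specific to any Stackable operator), with the name and label values chosen purely for illustration:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                 # hypothetical name
spec:
  maxUnavailable: 1                # at most one selected Pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
----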

=== Example use-cases

1. As a user I want an HDFS cluster, and it (or parts of it) should not be disturbed by planned Pod evictions (for example for a certificate renewal). I expect this to be the default behaviour.
2. As a user I want to configure maxUnavailable on the role (e.g. datanode) across all role groups (e.g. a dfs replication factor of 3 and only a single datanode allowed to go down, regardless of the number of role groups), so that no datanode is a single point of failure. Similarly for ZooKeeper, I want to define PDBs at role level, as the ZK quorum is independent of role groups.
3. As a user I want to override the defaults to maybe have less availability but faster rollout times in rolling redeployments; for example a Trino cluster could take more than 6 hours to redeploy in a rolling fashion, as the graceful shutdown of Trino workers takes a considerable amount of time, depending on the queries being executed.
4. As a user I want to configure maxUnavailable on role groups individually, as I e.g. have some fast datanodes using SSDs and some slow datanodes using HDDs. For performance reasons I always want a certain number of fast datanodes online.
5. As a user I want a Superset/NiFi/Kafka cluster, and they (or parts of them) should not be disturbed by planned Pod evictions.
6. As a user I might want to define PDBs across roles or on other specific Pod selections; in that case I want to be able to disable the Stackable-generated PDBs.

We expect the majority of users to either use the default PDB settings or define PDBs at role level. Role group configuration as in use-case 4 has merit, but seems like a more niche usage scenario.

=== Technical considerations

If we use https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors[arbitrary workloads and arbitrary selectors] (for example when selecting Pods from multiple StatefulSets), we have the following constraints:

* only `.spec.minAvailable` can be used, not `.spec.maxUnavailable`.
* only an integer value can be used with `.spec.minAvailable`, not a percentage.

This means that if we select any Pods that are not part of a StatefulSet, Deployment etc., then we are bound by these constraints. Preliminary testing showed, however, that `.spec.maxUnavailable` works with multiple StatefulSets.
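
To illustrate the documented constraint: a PDB spanning Pods from multiple StatefulSets (for example HDFS NameNodes and JournalNodes together) would be limited to an integer `.spec.minAvailable`. The following is only a sketch; the object name, label values and the concrete number are assumptions:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-namenodes-journalnodes
spec:
  minAvailable: 4                  # must be an absolute number here, not a percentage
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
    matchExpressions:
      - key: app.kubernetes.io/component
        operator: In
        values:
          - namenode
          - journalnode
----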

You can use a selector which selects a subset or superset of the Pods belonging to a workload resource. The eviction API will disallow the eviction of any Pod covered by multiple PDBs, so most users will want to avoid overlapping selectors.

We need to create the PDBs in such a way that every Pod is only selected once. This is easiest if a selector is defined per role or for all role groups individually. Excluding certain labels is also possible by using match expressions, but we did not test whether this conflicts with the constraints on arbitrary selectors mentioned above.
To support users creating their own custom PDBs, we need to support disabling PDB generation, to prevent overlapping selectors.

== Decision drivers

* Common use-cases should be easy to configure.
* Principle of least surprise: CRD configuration settings and their interactions (in case multiple settings are given) need to be easy to comprehend, to prevent user error.
* Extendable design, so that we can later add new functionality in a non-breaking way, such as the option to also configure PDBs at role group level.
* Simple implementation (far less important)

== Decision outcome

Option 1 was picked.

== Considered options

=== Option 1

Introduce a new `roleConfig` section at role level and put the PDB settings in there. Only role level PDBs are supported; for role group level PDBs, the generated PDBs have to be disabled and the user needs to create their own PDBs manually. The `roleConfig` is introduced so that the PDB setting does not sit directly in the role.

[source,yaml]
----
spec:
  nameNodes:
    roleConfig: # <<<
      podDisruptionBudget: # optional
        enabled: true # optional, defaults to true
        maxUnavailable: 1 # optional, defaults to our "smart" calculation
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    # use pdb defaults
    roleGroups:
      default:
        replicas: 2
----
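
For illustration, the role-level PDB generated from the `nameNodes` role above could look roughly as follows. This is only a sketch following the label scheme used in the examples of the other options; the object name and the cluster name `simple-hdfs` are assumptions:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-namenodes       # assumed naming scheme
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: namenode
      # no rolegroup label, so the PDB spans all role groups of the role
----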

==== Pros

* simple to understand
* covers the majority of use cases
* still leaves the option to disable the generated PDBs and roll your own

==== Cons

* Yet another "config" (`config`, `clusterConfig` and now `roleConfig` as well)
** That's kind of the way the real world is: there are some things you can configure at cluster level (e.g. LDAP), role level (PDBs) and role group level (resources). This models that most closely.
* It is not possible to define PDBs on role groups without the user deploying their own PDBs (see the sketch below).
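
Such a hand-written PDB for a single role group (as in use-case 4, with the generated PDBs disabled for that role) could look like this; a sketch only, with all names and label values being assumptions:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-ssd-custom   # user-chosen name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: ssd
----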

NOTE: Putting the PDB settings directly into the role without a `roleConfig` was briefly discussed, but dismissed as too messy, so it is not listed as an explicit option here.

=== Option 2 - PDB in `config`, but only at role level

Instead of inventing a new `roleConfig` setting, put the PDB in the `config`. This might seem better at first, but settings in `config` can usually also be set at role group level, and in this case that would not be true.

[source,yaml]
----
spec:
  nameNodes:
    config: # <<<
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 1
    roleGroups:
      default:
        replicas: 2
        config: {}
        # no such field as podDisruptionBudget
----

==== Pros

* Everything configurable is below `config`, no new `roleConfig`
* Like Option 1, covers configuration of the most important use cases

==== Cons

* `spec.nameNodes.config` is *not* similar to `spec.nameNodes.roleGroups.default.config` => confusing to the user
** It might also be confusing that this setting is not "copied" to all role groups like other settings such as resources or affinities.
* Still no option to configure role group level PDBs
* Possibly complicated to implement, due to `config` usually being identical at role and role group level

=== Option 3: PDB in config with elaborate merge mechanism

Similar to Option 2, the PDB setting is located in the `config`, but it is actually possible to use it at both role and role group level.
We would develop a semantic merge mechanism that prevents overlapping PDBs.

.CRD Example
[%collapsible]
====
[source,yaml]
----
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  image:
    productVersion: 3.3.4
  clusterConfig:
    zookeeperConfigMapName: simple-hdfs-znode
  dataNodes:
    config:
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
        config:
          podDisruptionBudget:
            maxUnavailable: 4
      ssd:
        replicas: 8
        config:
          podDisruptionBudget:
            enabled: false
      in-memory:
        replicas: 4
----

would end up with something like:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-not-hdds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
    matchExpressions:
      - key: app.kubernetes.io/rolegroup
        operator: NotIn
        values:
          - hdd
      - key: app.kubernetes.io/rolegroup
        operator: NotIn
        values:
          - ssd
----
====

==== Pros

* Fits into the existing config structure
* Allows configuring role group level PDBs and even hybrid configurations

==== Cons

* The complex merge mechanism is possibly difficult to understand and therefore easy to use the wrong way
* The complex mechanism is also not trivial to implement

=== Option 4 - PDB in config with normal "shared role group config" behaviour

Again we put the PDB in the `config` section, but simply use the normal "copy" behaviour for this setting.
This would be simple and easy to understand, but does not allow for true role level PDBs.

.CRD Example
[%collapsible]
====
[source,yaml]
----
spec:
  dataNodes:
    config:
      podDisruptionBudget:
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
      ssd:
        replicas: 8
      in-memory:
        replicas: 4
----

would end up with:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-ssds
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: ssd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-in-memory
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: in-memory
----

And this configuration:

[source,yaml]
----
spec:
  dataNodes:
    config:
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 2
    roleGroups:
      hdd:
        replicas: 16
        config:
          podDisruptionBudget:
            maxUnavailable: 4
      ssd:
        replicas: 8
        config:
          podDisruptionBudget:
            enabled: false
      in-memory:
        replicas: 4
----

would end up with:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-hdds
spec:
  maxUnavailable: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: hdd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-hdfs-datanodes-in-memory
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: simple-hdfs
      app.kubernetes.io/component: datanode
      app.kubernetes.io/rolegroup: in-memory
----
====

==== Pros

* easy to understand
* easy to implement
* works the same as all other config

==== Cons

* Does not support the common use case of role level PDBs

modules/contributor/partials/current_adrs.adoc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,3 +26,4 @@
2626
**** xref:adr/ADR027-status.adoc[]
2727
**** xref:adr/ADR028-automatic-stackable-version.adoc[]
2828
**** xref:adr/ADR029-database-connection.adoc[]
29+
**** xref:adr/ADR030-allowed-pod-disruptions.adoc[]
