Description
WHY
Scenario (observed in practice):
Initial state:
- A pod `pod-1` is `Running` on node `x` and is one of the minimum number of members required, as set in the MCAD scheduling spec.
- The names of all the pods in the group are determined by the pod template (within the AppWrapper) and/or the corresponding controller (i.e. the pod spec may be included directly in the `genericitem` list, or it may be embedded in a CRD such as a `PyTorchJob`). The important point is that the pod name does not change between MCAD's initial dispatch and subsequent dispatches.
Problematic observed sequence of transitions and states:
- Node `x` becomes unschedulable*.
- The pod `pod-1` enters a `Terminating` state.
- The pod `pod-1` remains in a `Terminating` state despite attempts to force-terminate the pod (automated within MCAD or manual with `…delete --force=true --grace-period=0…`). We have observed pods stuck like this for over 24 hours.
- When MCAD observes that the number of `Running` pods is less than `minMembers`, it removes the group and requeues the AppWrapper. This is expected.
- Later, MCAD re-dispatches the AppWrapper while the pod `pod-1` is still in a `Terminating` state.
- Because `pod-1` still cannot be transitioned out of a `Terminating` state and is still considered a member of the group, MCAD removes the group and requeues the AppWrapper for the same reason as before (insufficiently many members are `Running`).
- This process repeats for as long as the pod is stuck in a `Terminating` state (which, again, we have observed to last for hours, even over a day) and the job fails to make progress; a detection sketch follows this list.
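For illustration, a pod stuck like this is identifiable from the pod object alone, without knowing which controller owns it: a pod that is being deleted carries a `deletionTimestamp`, and one that has been terminating far longer than its grace period is effectively stuck. The sketch below uses client-go; the 15-minute margin, package name, and function name are assumptions for illustration, not existing MCAD behaviour.

```go
package stuckpod

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckMargin is an assumed cutoff for illustration; MCAD defines no such value today.
const stuckMargin = 15 * time.Minute

// isStuckTerminating reports whether a pod has been "Terminating" well past its
// grace period. "Terminating" is not a pod phase; it is the state of having a
// deletion timestamp set while the pod object still exists.
func isStuckTerminating(pod *corev1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return false // deletion has not been requested
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	// Stuck if deletion was requested long ago and the grace period, plus a
	// generous margin, has elapsed without the object actually going away.
	return now.Sub(pod.DeletionTimestamp.Time) > grace+stuckMargin
}
```

A check along these lines could be applied to the same pod list MCAD already inspects when comparing against `minMembers`, which is what would keep it independent of the owning job controller.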
The cost:
- The job fails to make progress, and may be delayed in restarting even after the pod finally terminates, due to e.g. exponential backoff.
- Resources may idle when in fact there is work available.
Existing workaround:
- Currently, once the above is manually identified to be the case for a given job, manually renaming the job/pods provides a workaround to get the job going again; however, this takes time, and in the meantime (e.g. overnight) the job does not make progress and resources may idle.
Call for solution ideas:
- This is not an MCAD “bug” per se; it is an issue for any job/pod controller which keeps pod names constant between job restarts (of which there are multiple).
- It is an opportunity for improved job resilience from MCAD in a job-controller-independent manner.
- *Note: we have observed a variety of reasons why a node might become unschedulable in practice, from unexpected hardware failures to intentional cordoning. The sought-for solution here is independent of the underlying cause of failure.
WHAT
What is being asked for?
Ideas/proposals for an MCAD-level (pod/job-controller-independent) solution.
HOW
Suggestions for how this may be solved. [Optional]
TESTS
List of related tests
The above description could be captured in even a single-pod, single-node test.
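One way such a test could simulate the stuck `Terminating` state without real node failure (an assumed approach, not something the description above prescribes) is to put a finalizer on the pod before deleting it: the pod then keeps its `deletionTimestamp` and stays visible as `Terminating` until the finalizer is removed, which is close enough to the observed behaviour to exercise MCAD's requeue/re-dispatch path. The package, helper name, and finalizer string below are placeholders.

```go
package e2e

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// simulateStuckTerminating adds a placeholder finalizer to the named pod and then
// deletes it. With the finalizer present, the pod keeps its deletionTimestamp and
// remains in a Terminating state until the finalizer is removed, mimicking the
// stuck pod described above.
func simulateStuckTerminating(ctx context.Context, t *testing.T, c kubernetes.Interface, ns, name string) {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		t.Fatalf("get pod: %v", err)
	}
	pod.Finalizers = append(pod.Finalizers, "example.test/stuck-terminating")
	if _, err := c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		t.Fatalf("add finalizer: %v", err)
	}
	if err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		t.Fatalf("delete pod: %v", err)
	}
	// The test can now assert MCAD's requeue / re-dispatch behaviour while the
	// pod is stuck, then remove the finalizer so the pod can actually go away.
}
```

Cleanup is simply removing the finalizer again, after which the pod is deleted normally.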
DONE
Bullet point items for what should be completed