
Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

Open
@MEllis-github

Description


WHY

Scenario (observed in practice):

Initial state:

  • a Pod pod-1 is Running on node x and is one of the minimum number of group members required, as set in the MCAD scheduling spec.
  • the names of all the pods in the group are determined by the pod template (within the AppWrapper) and/or the corresponding controller (i.e. the pod spec may be included directly in the genericitem list, or it may be embedded in a CRD such as a PyTorchJob). The important point is that the pod names do not change between MCAD’s initial dispatch and subsequent dispatches.

Problematic observed sequence of transitions and states:

  • Node x becomes unschedulable*.
  • The pod pod-1 enters a Terminating state.
  • The pod pod-1 remains in a Terminating state despite attempts to force-terminate it (automated within MCAD or manual with … delete --force=true --grace-period=0 …). We have observed pods stuck like this for over 24 hours (a sketch of how such a stuck pod might be detected follows this list).
  • When MCAD observes the number of Running pods is less than minMembers, it removes the group and requeues the AppWrapper. This is expected.
  • Later, MCAD re-dispatches the AppWrapper while the pod pod-1 is still in a Terminating state.
  • Because pod-1 still cannot be transitioned out of the Terminating state and is still counted as a member of the group, MCAD removes the group and requeues the AppWrapper for the same reason as before (too few members are Running).
  • This process repeats for as long as the pod is stuck in a Terminating state (which, again, we have observed to be hours, even over a day) and the job fails to make progress.
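For concreteness, the "stuck in Terminating" condition above is visible from the pod object alone: deletion has been requested (deletionTimestamp is set), the grace period has long expired, yet the object still exists in the API server. A minimal detection sketch follows; the helper name and threshold are assumptions for illustration, not existing MCAD code.

```go
// Sketch only (not MCAD code): one way to detect the "stuck in Terminating"
// condition described above. The package name, helper name, and threshold
// value are assumptions made for illustration.
package stuckpods

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckTerminatingThreshold is a hypothetical cutoff; in practice it would be
// configurable.
const stuckTerminatingThreshold = 10 * time.Minute

// isStuckTerminating reports whether deletion of the pod was requested
// (deletionTimestamp is set, i.e. the pod shows as Terminating), the grace
// period has expired, and the object has still not gone away well past the
// threshold.
func isStuckTerminating(pod *corev1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return false // deletion has not been requested; pod is not Terminating
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	deadline := pod.DeletionTimestamp.Time.Add(grace)
	return now.Sub(deadline) > stuckTerminatingThreshold
}
```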

The cost:

  • The job fails to make progress, and its restart may be delayed even after the pod finally terminates, due to, e.g., exponential backoff
  • Resources may idle when in fact there is work available

Existing workaround

  • Currently, once the above is manually identified for a given job, manually renaming the job/pods provides a workaround that gets the job going again. However, this takes time, and in the meantime (e.g. overnight) the job does not make progress and resources may idle.

Call for solution ideas:

  • This is not an MCAD “bug” per se; it is an issue for any job/pod controller that keeps pod names constant between job restarts (and there are multiple such controllers)

  • It is an opportunity for improved job resilience from MCAD in a job-controller-independent manner

  • Note: we have observed a variety of reasons why a node might become unschedulable in practice, from unexpected hardware failures to intentional cordoning. The solution sought here is independent of the underlying cause of failure.

WHAT

What is being asked for?

Ideas/proposals for an MCAD-level (pod/job-controller-independent) solution.

HOW

Suggestions for how this may be solved. [Optional]
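One possible direction, sketched below purely as an illustration of the idea (not an existing MCAD API or a committed design): before re-dispatching an AppWrapper, MCAD could check whether any pod left over from the previous dispatch is still stuck in Terminating, and hold the AppWrapper in its requeue/backoff state until those pods are gone, rather than dispatching into a name collision and immediately failing the minMembers check again. All names below are hypothetical; it builds on the isStuckTerminating helper sketched in the WHY section.

```go
// Sketch of the dispatch guard described above. All names are hypothetical;
// this is the same assumed package as the earlier detection sketch.
package stuckpods

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldDelayDispatch returns true if any pod from the AppWrapper's previous
// dispatch is still stuck in Terminating. The caller would keep the AppWrapper
// queued (with backoff) instead of dispatching it immediately.
func shouldDelayDispatch(previousPods []*corev1.Pod, now time.Time) bool {
	for _, p := range previousPods {
		if isStuckTerminating(p, now) {
			return true
		}
	}
	return false
}
```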

TESTS

List of related tests
The above description could be captured in even a single-pod, single-node test.
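As a complement to such an end-to-end test, the hypothetical detection helper sketched in the WHY section could be unit-tested in isolation. A minimal sketch, assuming that helper exists:

```go
// Unit-test sketch for the hypothetical isStuckTerminating helper; an
// end-to-end single-pod, single-node test would still be needed to cover the
// full dispatch/requeue loop described in this issue.
package stuckpods

import (
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestIsStuckTerminating(t *testing.T) {
	now := time.Now()
	grace := int64(30)

	// Deletion was requested an hour ago with a 30s grace period: stuck.
	deleted := metav1.NewTime(now.Add(-1 * time.Hour))
	stuck := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		DeletionTimestamp:          &deleted,
		DeletionGracePeriodSeconds: &grace,
	}}
	if !isStuckTerminating(stuck, now) {
		t.Error("expected a pod Terminating for an hour to be reported as stuck")
	}

	// No deletionTimestamp: the pod is not Terminating at all.
	running := &corev1.Pod{}
	if isStuckTerminating(running, now) {
		t.Error("expected a pod with no deletionTimestamp not to be reported as stuck")
	}
}
```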

DONE

Bullet point items for what should be completed
