
Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

Open
@MEllis-github

Description


WHY

Scenario (observed in practice):

Initial state:

  • a Pod pod-1 is Running on node x and is one of the minimum number of group members required, as set in the MCAD scheduling spec.
  • the names of all the pods in the group are determined by the pod template (within the AppWrapper) and/or the corresponding controller (i.e. the pod spec may be included directly in the genericitem list, or it may be embedded in a CRD such as a PyTorchJob). The important point is that the pod names do not change between MCAD’s initial dispatch and subsequent dispatches.

Problematic observed sequence of transitions and states:

  • Node x becomes unschedulable*.
  • The pod pod-1 enters a Terminating state.
  • The pod pod-1 remains in a Terminating state despite attempts to force-terminate it (automated within MCAD or manual with … delete --force=true --grace-period=0 …). We have observed pods stuck like this for over 24 hours (a sketch of how such a stuck pod might be detected follows this list).
  • When MCAD observes the number of Running pods is less than minMembers, it removes the group and requeues the AppWrapper. This is expected.
  • Later, MCAD re-dispatches the AppWrapper while the pod pod-1 is still in a Terminating state.
  • Because pod-1 still cannot be transitioned out of the Terminating state and is still counted as a member of the group, MCAD removes the group and requeues the AppWrapper for the same reason as before (too few members are Running).
  • This process repeats for as long as the pod is stuck in a Terminating state (which, again, we have observed to be hours, even over a day) and the job fails to make progress.
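For concreteness, the "stuck in Terminating" condition above is visible from the pod object alone: deletion has been requested (deletionTimestamp is set), the grace period has long expired, yet the object still exists in the API server. A minimal detection sketch follows; the helper name and threshold are assumptions for illustration, not existing MCAD code.

```go
// Sketch only (not MCAD code): one way to detect the "stuck in Terminating"
// condition described above. The package name, helper name, and threshold
// value are assumptions made for illustration.
package stuckpods

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckTerminatingThreshold is a hypothetical cutoff; in practice it would be
// configurable.
const stuckTerminatingThreshold = 10 * time.Minute

// isStuckTerminating reports whether deletion of the pod was requested
// (deletionTimestamp is set, i.e. the pod shows as Terminating), the grace
// period has expired, and the object has still not gone away well past the
// threshold.
func isStuckTerminating(pod *corev1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return false // deletion has not been requested; pod is not Terminating
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	deadline := pod.DeletionTimestamp.Time.Add(grace)
	return now.Sub(deadline) > stuckTerminatingThreshold
}
```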

The cost:

  • The job fails to make progress, and its restart may be delayed even after the pod finally terminates, due to, e.g., exponential backoff
  • Resources may idle when in fact there is work available

Existing workaround

  • Currently, once the above is manually identified for a given job, manually renaming the job/pods provides a workaround that gets the job going again. However, this takes time, and in the meantime (e.g. overnight) the job does not make progress and resources may idle.

Call for solution ideas:

  • This is not an MCAD “bug” per se; it is an issue for any job/pod controller that keeps pod names constant between job restarts (and there are multiple such controllers)

  • It is an opportunity for improved job resilience from MCAD in a job-controller-independent manner

  • Note: we have observed a variety of reasons why a node might become unschedulable in practice, from unexpected hardware failures to intentional cordoning. The solution sought here is independent of the underlying cause of failure.

WHAT

What is being asked for?

Ideas/proposals for an MCAD-level (pod/job-controller-independent) solution.

HOW

Suggestions for how this may be solved. [Optional]
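One possible direction, sketched below purely as an illustration of the idea (not an existing MCAD API or a committed design): before re-dispatching an AppWrapper, MCAD could check whether any pod left over from the previous dispatch is still stuck in Terminating, and hold the AppWrapper in its requeue/backoff state until those pods are gone, rather than dispatching into a name collision and immediately failing the minMembers check again. All names below are hypothetical; it builds on the isStuckTerminating helper sketched in the WHY section.

```go
// Sketch of the dispatch guard described above. All names are hypothetical;
// this is the same assumed package as the earlier detection sketch.
package stuckpods

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldDelayDispatch returns true if any pod from the AppWrapper's previous
// dispatch is still stuck in Terminating. The caller would keep the AppWrapper
// queued (with backoff) instead of dispatching it immediately.
func shouldDelayDispatch(previousPods []*corev1.Pod, now time.Time) bool {
	for _, p := range previousPods {
		if isStuckTerminating(p, now) {
			return true
		}
	}
	return false
}
```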

TESTS

List of related tests
The above description could be captured in even a single-pod, single-node test.
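As a complement to such an end-to-end test, the hypothetical detection helper sketched in the WHY section could be unit-tested in isolation. A minimal sketch, assuming that helper exists:

```go
// Unit-test sketch for the hypothetical isStuckTerminating helper; an
// end-to-end single-pod, single-node test would still be needed to cover the
// full dispatch/requeue loop described in this issue.
package stuckpods

import (
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestIsStuckTerminating(t *testing.T) {
	now := time.Now()
	grace := int64(30)

	// Deletion was requested an hour ago with a 30s grace period: stuck.
	deleted := metav1.NewTime(now.Add(-1 * time.Hour))
	stuck := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		DeletionTimestamp:          &deleted,
		DeletionGracePeriodSeconds: &grace,
	}}
	if !isStuckTerminating(stuck, now) {
		t.Error("expected a pod Terminating for an hour to be reported as stuck")
	}

	// No deletionTimestamp: the pod is not Terminating at all.
	running := &corev1.Pod{}
	if isStuckTerminating(running, now) {
		t.Error("expected a pod with no deletionTimestamp not to be reported as stuck")
	}
}
```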

DONE

Bullet point items for what should be completed
