
[quota management] rapid preemptions until it gets to steady state  #360

Open
@asm582

Description


Rapid preemptions were observed for a high-priority AW until the system reached a steady state. Consider the following leaf node in the quota tree:

    - name: namespace-2
      quotas:
        hardLimit: false
        requests:
          cpu: 3000
          memory: 6Gi

The root had a CPU quota of 24000m and memory of 32Gi. Three AWs were submitted with priority 100 and resource requests of (CPU: 2000m, memory: 24Gi), (CPU: 2000m, memory: 4Gi), and (CPU: 2000m, memory: 4Gi). Later, a high-priority AW with priority 1000, consuming quota from the same node, was submitted with CPU: 22000m and memory: 22Gi. This caused the AWs using the namespace-2 quota to be deleted repeatedly for quite some time. Below is some history of pods getting deleted:
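To make the arithmetic behind the preemptions concrete, here is an illustrative sketch (not MCAD's actual implementation) of the CPU side of the scenario. It assumes a soft limit (hardLimit: false) lets namespace-2 borrow up to the root's capacity, and it ignores memory and any usage by other leaves (e.g. batch-job-11's leaf) for simplicity:

```python
# Illustrative quota arithmetic for the scenario above (assumed semantics,
# not MCAD source). Leaf namespace-2 has hardLimit: false, so its AWs may
# borrow beyond their 3000m guarantee, up to the root's capacity.

ROOT_CPU_M = 24000                  # root quota: 24000m CPU

# Three priority-100 AWs in namespace-2, each requesting 2000m CPU
low_priority_cpu = [2000, 2000, 2000]

# High-priority (1000) AW requesting 22000m CPU from the same leaf
high_priority_cpu = 22000

used = sum(low_priority_cpu)        # 6000m
free = ROOT_CPU_M - used            # 18000m < 22000m -> cannot fit without preemption
print(f"free before preemption: {free}m")

# Preempt low-priority borrowers one by one until the request fits
victims = []
for req in low_priority_cpu:
    if free >= high_priority_cpu:
        break
    victims.append(req)
    free += req
print(f"victims preempted: {len(victims)}, free now: {free}m")
```

Under these assumptions, two of the three low-priority AWs must be preempted before the 22000m request fits, which is consistent with the churn seen in the pod listings below.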

(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS        RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running       0          47m
batch-job-12-bngz9   0/1     Pending       0          27s
batch-job-2-q9hnr    1/1     Terminating   0          48s
batch-job-3-ch5s6    1/1     Running       0          14m
batch-job-4-nqs8k    1/1     Terminating   0          44s
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS        RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running       0          47m
batch-job-12-bngz9   0/1     Pending       0          28s
batch-job-2-q9hnr    1/1     Terminating   0          49s
batch-job-3-ch5s6    1/1     Running       0          14m
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running   0          47m
batch-job-12-bngz9   0/1     Pending   0          29s
batch-job-3-ch5s6    1/1     Running   0          14m
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running   0          47m
batch-job-12-bngz9   0/1     Pending   0          32s
batch-job-3-ch5s6    1/1     Running   0          14m
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running   0          47m
batch-job-12-bngz9   0/1     Pending   0          33s
batch-job-3-ch5s6    1/1     Running   0          14m
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running   0          47m
batch-job-3-ch5s6    1/1     Running   0          14m
(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods -o wide
NAME                 READY   STATUS              RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
batch-job-11-j4ldg   1/1     Running             0          47m   10.128.21.56   ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>
batch-job-3-ch5s6    1/1     Running             0          14m   10.128.21.65   ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>
batch-job-4-wl9px    0/1     ContainerCreating   0          2s    <none>         ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>

(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP              NODE                                         NOMINATED NODE   READINESS GATES
batch-job-11-j4ldg   1/1     Running   0          12h   10.128.21.56    ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>
batch-job-2-zlpww    1/1     Running   0          9h    10.128.20.104   ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>
batch-job-3-ch5s6    1/1     Running   0          11h   10.128.21.65    ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>
batch-job-5-nzs5h    1/1     Running   0          9h    10.128.20.103   ip-10-0-148-128.us-east-2.compute.internal   <none>           <none>

Batch job batch-job-11 is never preempted, since it uses quota from another leaf node that has hardLimit: true.
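The hardLimit distinction can be sketched as a victim-selection rule (assumed semantics, not MCAD source — the leaf names, quota values, and helper below are hypothetical): an AW is only a preemption candidate if its leaf has hardLimit: false and that leaf is using more than its guaranteed quota.

```python
# Illustrative sketch of why hardLimit: true shields batch-job-11 from
# preemption. All names and numbers here are hypothetical except those
# taken from the issue (namespace-2, 3000m guarantee, hardLimit flags).

from dataclasses import dataclass

@dataclass
class Leaf:
    name: str
    hard_limit: bool
    cpu_quota_m: int        # guaranteed CPU quota for this leaf

@dataclass
class AppWrapper:
    name: str
    leaf: Leaf
    cpu_m: int

def preemption_candidates(aws, leaf_usage_m):
    """AWs in soft-limit leaves whose leaf usage exceeds its guarantee."""
    return [
        aw for aw in aws
        if not aw.leaf.hard_limit
        and leaf_usage_m[aw.leaf.name] > aw.leaf.cpu_quota_m
    ]

ns2 = Leaf("namespace-2", hard_limit=False, cpu_quota_m=3000)
other = Leaf("namespace-1", hard_limit=True, cpu_quota_m=2000)  # hypothetical leaf
aws = [
    AppWrapper("batch-job-2", ns2, 2000),
    AppWrapper("batch-job-3", ns2, 2000),
    AppWrapper("batch-job-11", other, 2000),
]
usage = {"namespace-2": 4000, "namespace-1": 2000}

print([aw.name for aw in preemption_candidates(aws, usage)])
# batch-job-11 is never returned: its leaf has hardLimit: true and is
# within its guarantee, so only the namespace-2 borrowers are victims.
```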

The current behavior is unclear as to when the preemption cycle stops and what the criteria for stopping preemption are. In our experiment, when left running long enough, preemption eventually stopped but the high-priority AW was never dispatched:

(base) abhishekmalvankar@Abhisheks-MacBook-Pro multi-cluster-app-dispatcher % oc get pods 
NAME                 READY   STATUS    RESTARTS   AGE
batch-job-11-j4ldg   1/1     Running   0          12h
batch-job-2-zlpww    1/1     Running   0          10h
batch-job-3-ch5s6    1/1     Running   0          12h
batch-job-5-nzs5h    1/1     Running   0          10h
Status:
  Conditions:
    Last Transition Micro Time:  2023-05-10T21:59:28.781563Z
    Last Update Micro Time:      2023-05-10T21:59:28.781563Z
    Status:                      True
    Type:                        Init
    Last Transition Micro Time:  2023-05-10T21:59:28.781678Z
    Last Update Micro Time:      2023-05-10T21:59:28.781678Z
    Reason:                      AwaitingHeadOfLine
    Status:                      True
    Type:                        Queueing
    Last Transition Micro Time:  2023-05-11T00:12:19.685738Z
    Last Update Micro Time:      2023-05-11T00:12:19.685738Z
    Reason:                      AppWrapperRunnable
    Status:                      True
    Type:                        Dispatched
    Last Transition Micro Time:  2023-05-11T00:12:56.241814Z
    Last Update Micro Time:      2023-05-11T00:12:56.241813Z
    Message:                     Pods failed scheduling failed=1, running=0.
    Reason:                      PodsFailedScheduling
    Status:                      True
    Type:                        PreemptCandidate
    Last Transition Micro Time:  2023-05-10T21:59:55.190754Z
    Last Update Micro Time:      2023-05-10T21:59:55.190754Z
    Message:                     Pods failed scheduling failed=1, running=0.
    Reason:                      PreemptionTriggered
    Status:                      True
    Type:                        Backoff
    Last Transition Micro Time:  2023-05-10T22:00:15.192484Z
    Last Update Micro Time:      2023-05-10T22:00:15.192484Z
    Reason:                      FrontOfQueue.
    Status:                      True
    Type:                        HeadOfLine
    Last Transition Micro Time:  2023-05-11T00:13:16.262124Z
    Last Update Micro Time:      2023-05-11T00:13:16.262124Z
    Message:                     Insufficient quota to dispatch AppWrapper.
    Reason:                      AppWrapperNotRunnable.  Failed to allocate quota on quota designation 'quota_context'
    Status:                      True
    Type:                        Backoff
  Controllerfirsttimestamp:      2023-05-10T21:59:28.781432Z
  Filterignore:                  true
  Pendingpodconditions:
    Conditions:
      Message:     0/2 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 Insufficient cpu. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
      Reason:      Unschedulable
      Status:      False
      Type:        PodScheduled
    Podname:       batch-job-12-dxvqd
  Queuejobstate:   HeadOfLine
  Sender:          before ScheduleNext - setHOL
  State:           Pending
  Systempriority:  1000
Events:            <none>
