fix double counting when AWs don't have any pods running #415


Merged · 14 commits · Jun 28, 2023
Conversation

@asm582 (Member) commented Jun 16, 2023

When AWs are running with no pods under them, resources are double counted.
[Screenshot taken 2023-06-16 at 1:46:32 PM]

As seen in the screenshot, Minikube resources are inflated in MCAD's accounting when AWs are dispatched but do not have any pods running. This causes queued AWs to be dispatched that will never run; the real resources are shown in the terminal section of the screenshot.

@asm582 requested review from metalcycling and z103cb on June 16, 2023 19:38
@asm582 (Member, Author) commented Jun 16, 2023

@z103cb The scheduleNext thread runs every 0 seconds and the cache is updated every 1 sec. The method below clones the cache state; should we add a lock to this method?

func (qjm *XController) getAggregatedAvailableResourcesPriority(unallocatedClusterResources *clusterstateapi.

@z103cb (Contributor) commented Jun 19, 2023

> @z103cb The scheduleNext thread runs every 0 seconds and the cache is updated every 1 sec. The method below clones the cache state; should we add a lock to this method?
>
> func (qjm *XController) getAggregatedAvailableResourcesPriority(unallocatedClusterResources *clusterstateapi.

I do not think so. The r := unallocatedClusterResources.Clone() at line 875 is superfluous, as the value passed into the function you mentioned is already a copy made in a thread-safe manner at this line: resources, proposedPreemptions := qjm.getAggregatedAvailableResourcesPriority(. The GetUnallocatedResources() method seems to be thread safe.
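For context, here is a minimal sketch of the pattern being discussed, assuming a cache type guarded by its own mutex; the type name and fields are illustrative, not the actual MCAD code:

```go
// Illustrative sketch only; type and field names approximate the pattern
// discussed above, not the actual MCAD implementation.
package cache

import "sync"

type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      int64
}

func (r *Resource) Clone() *Resource {
	c := *r
	return &c
}

type ClusterStateCache struct {
	mu          sync.RWMutex
	unallocated *Resource
}

// GetUnallocatedResources returns a copy of the unallocated resources,
// taken under the cache's read lock. Because the caller already receives
// a private copy, cloning it again inside the callee is superfluous.
func (c *ClusterStateCache) GetUnallocatedResources() *Resource {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.unallocated.Clone()
}
```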

@asm582 marked this pull request as ready for review on June 19, 2023 13:33
@z103cb previously approved these changes Jun 19, 2023
@asm582 self-assigned this Jun 24, 2023
@asm582 (Member, Author) commented Jun 25, 2023

To get the number of running pods associated with an AW, this PR uses an existing informer; that informer updates the AW status whenever a pod's status changes. The performance penalty for counting the pods' resources is not significant compared to previous versions, since the existing version already uses the informer to get counts of pods in different phases (Running, Failed, Completed, etc.) owned by the AW.
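As a rough illustration of informer-based counting of an AW's pods by phase (the owner label name and the helper are assumptions for this sketch, not the actual MCAD code):

```go
// Illustrative sketch: count pods by phase for a given AppWrapper using a
// shared pod informer's lister. The "appwrapper.mcad.ibm.com" owner label
// is an assumption made for this example.
package controller

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

type podCounts struct {
	Running, Pending, Failed, Succeeded int
}

func countPodsForAW(podLister listersv1.PodLister, namespace, awName string) (podCounts, error) {
	var counts podCounts
	// Select only pods owned by this AppWrapper (label name is hypothetical).
	selector := labels.SelectorFromSet(labels.Set{"appwrapper.mcad.ibm.com": awName})
	pods, err := podLister.Pods(namespace).List(selector)
	if err != nil {
		return counts, err
	}
	for _, pod := range pods {
		switch pod.Status.Phase {
		case v1.PodRunning:
			counts.Running++
		case v1.PodPending:
			counts.Pending++
		case v1.PodFailed:
			counts.Failed++
		case v1.PodSucceeded:
			counts.Succeeded++
		}
	}
	return counts, nil
}
```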

For this PR, the histogram, which was only implemented for GPUs, is turned off; this will increase the number of optimistic dispatches when fragmented resources are available in the cluster. An optimistic-dispatch logger line is added in the PR, which should help in determining the fragmented GPU resources in the cluster. I think an optimistic dispatch is better than not dispatching an AW at all, which is what happened in previous MCAD versions. The instances where double counting was happening are now removed from MCAD bookkeeping (on a best-effort basis).
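Purely as a hypothetical illustration of what such an optimistic-dispatch log line could look like; the wording, verbosity level, and helper are not the actual line added by this PR:

```go
// Hypothetical illustration only: the actual log line added by the PR may
// differ in wording and verbosity level.
package controller

import "k8s.io/klog/v2"

// logOptimisticDispatch records that an AppWrapper was dispatched even though
// per-node (fragmented) GPU availability could not be verified, because the
// GPU histogram is disabled.
func logOptimisticDispatch(namespace, name string) {
	klog.V(4).Infof("Optimistic dispatch of AppWrapper %s/%s: aggregated resources appear sufficient, but fragmented GPU availability is unknown (histogram disabled)", namespace, name)
}
```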

@asm582 (Member, Author) commented Jun 25, 2023

Fixes #429

@asm582 (Member, Author) commented Jun 25, 2023

Submitted 1000 CPU AppWrapper jobs; all of them completed. The completion time is in the ballpark of previous runs, actually a bit faster; previous runs typically completed in about 3000 seconds. (FYI, the VMs used in this test have different specs from the ones used in previous runs.)

Total amount of time for 1000 appwrappers is: 2541 seconds

@z103cb (Contributor) left a comment

Updates to the CRD(s) are preventing approval of this PR.

@z103cb previously approved these changes Jun 26, 2023
@z103cb (Contributor) left a comment

LGTM, but I would not merge it unless we are confident that all of the e2e tests are passing.

@asm582 (Member, Author) commented Jun 27, 2023

Testing a bit more: all bets are off if the status does not have totalcpu, totalmemory, or totalGPU resources populated. This will lead to false dispatches.
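A minimal sketch of the kind of guard this implies, assuming hypothetical field and type names for the status totals:

```go
// Illustrative guard sketch: detect when the AppWrapper status totals are
// not populated, so the accounting can avoid a false dispatch. Field and
// type names are assumptions for this example, not the actual MCAD code.
package controller

// AppWrapperStatus is a reduced, hypothetical stand-in for the fields
// discussed above (totalcpu, totalmemory, totalGPU).
type AppWrapperStatus struct {
	TotalCPU    float64
	TotalMemory float64
	TotalGPU    int64
}

// statusTotalsPopulated reports whether any of the status totals were set.
// If all are zero, subtracting them from the available-resource accounting
// would understate usage and could cause a false dispatch.
func statusTotalsPopulated(s AppWrapperStatus) bool {
	return s.TotalCPU > 0 || s.TotalMemory > 0 || s.TotalGPU > 0
}
```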

@asm582 (Member, Author) commented Jun 27, 2023

@z103cb @tardieu One test failed due to out-of-order AW dispatch, since the accounting changed. We need to add barriers in the tests to fix the out-of-order dispatch, or redesign the tests. Please take a look at this failure: https://app.travis-ci.com/github/project-codeflare/multi-cluster-app-dispatcher/builds/264124859

@asm582 (Member, Author) commented Jun 28, 2023

Merging, thanks everyone for the feedback.
