fix double counting when AWs don't have any pods running #415


Merged · 14 commits · Jun 28, 2023
Conversation

@asm582 (Member) commented Jun 16, 2023

When AWs are running with no pods under them, resources are double counted.
[Screenshot taken 2023-06-16 at 1:46:32 PM]

As seen in the screenshot, Minikube resources are inflated in MCAD's accounting when AWs are dispatched but do not have any pods running. This causes queued AWs to be dispatched that will never run; the real resources are shown in the terminal section of the screenshot.

@asm582 requested review from metalcycling and z103cb on June 16, 2023 19:38
@asm582 (Member, Author) commented Jun 16, 2023

@z103cb The scheduleNext thread runs every 0 seconds and the cache is updated every 1 sec. The method below clones the cache state; should we add a lock to this method?

func (qjm *XController) getAggregatedAvailableResourcesPriority(unallocatedClusterResources *clusterstateapi.

@z103cb (Contributor) commented Jun 19, 2023

> @z103cb The scheduleNext thread runs every 0 seconds and the cache is updated every 1 sec. The method below clones the cache state; should we add a lock to this method?
>
> func (qjm *XController) getAggregatedAvailableResourcesPriority(unallocatedClusterResources *clusterstateapi.

I do not think so. The r := unallocatedClusterResources.Clone() at line 875 is superfluous, as the value passed into the function you mentioned is already a copy made in a thread-safe manner at this line: resources, proposedPreemptions := qjm.getAggregatedAvailableResourcesPriority(. The GetUnallocatedResources() method seems to be thread safe.
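For context, here is a minimal sketch of the pattern being discussed, assuming a cache type guarded by its own mutex; the type name and fields are illustrative, not the actual MCAD code:

```go
// Illustrative sketch only; type and field names approximate the pattern
// discussed above, not the actual MCAD implementation.
package cache

import "sync"

type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      int64
}

func (r *Resource) Clone() *Resource {
	c := *r
	return &c
}

type ClusterStateCache struct {
	mu          sync.RWMutex
	unallocated *Resource
}

// GetUnallocatedResources returns a copy of the unallocated resources,
// taken under the cache's read lock. Because the caller already receives
// a private copy, cloning it again inside the callee is superfluous.
func (c *ClusterStateCache) GetUnallocatedResources() *Resource {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.unallocated.Clone()
}
```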

@asm582 marked this pull request as ready for review on June 19, 2023 13:33
@z103cb previously approved these changes Jun 19, 2023
@asm582 self-assigned this Jun 24, 2023
@asm582 (Member, Author) commented Jun 25, 2023

To get the number of running pods associated with an AW, this PR uses an existing informer; that informer updates the AW status whenever a pod's status changes. The performance penalty for counting the pods' resources is not significant compared to previous versions, since the existing version already uses the informer to get counts of pods in different phases (Running, Failed, Completed, etc.) owned by the AW.
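As a rough illustration of informer-based counting of an AW's pods by phase (the owner label name and the helper are assumptions for this sketch, not the actual MCAD code):

```go
// Illustrative sketch: count pods by phase for a given AppWrapper using a
// shared pod informer's lister. The "appwrapper.mcad.ibm.com" owner label
// is an assumption made for this example.
package controller

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

type podCounts struct {
	Running, Pending, Failed, Succeeded int
}

func countPodsForAW(podLister listersv1.PodLister, namespace, awName string) (podCounts, error) {
	var counts podCounts
	// Select only pods owned by this AppWrapper (label name is hypothetical).
	selector := labels.SelectorFromSet(labels.Set{"appwrapper.mcad.ibm.com": awName})
	pods, err := podLister.Pods(namespace).List(selector)
	if err != nil {
		return counts, err
	}
	for _, pod := range pods {
		switch pod.Status.Phase {
		case v1.PodRunning:
			counts.Running++
		case v1.PodPending:
			counts.Pending++
		case v1.PodFailed:
			counts.Failed++
		case v1.PodSucceeded:
			counts.Succeeded++
		}
	}
	return counts, nil
}
```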

For this PR, the histogram, which was only implemented for GPUs, is turned off; this will increase the number of optimistic dispatches when fragmented resources are available in the cluster. An optimistic-dispatch logger line is added in the PR, which should help in determining the fragmented GPU resources in the cluster. I think an optimistic dispatch is better than not dispatching an AW at all, which is what happened in previous MCAD versions. The instances where double counting was happening are now removed from MCAD bookkeeping (on a best-effort basis).
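Purely as a hypothetical illustration of what such an optimistic-dispatch log line could look like; the wording, verbosity level, and helper are not the actual line added by this PR:

```go
// Hypothetical illustration only: the actual log line added by the PR may
// differ in wording and verbosity level.
package controller

import "k8s.io/klog/v2"

// logOptimisticDispatch records that an AppWrapper was dispatched even though
// per-node (fragmented) GPU availability could not be verified, because the
// GPU histogram is disabled.
func logOptimisticDispatch(namespace, name string) {
	klog.V(4).Infof("Optimistic dispatch of AppWrapper %s/%s: aggregated resources appear sufficient, but fragmented GPU availability is unknown (histogram disabled)", namespace, name)
}
```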

@asm582 (Member, Author) commented Jun 25, 2023

Fixes #429

@asm582 (Member, Author) commented Jun 25, 2023

Submitted 1000 CPU AppWrapper jobs; all of them completed. The completion time is in the ballpark of previous runs, actually a bit faster; previous runs typically completed in about 3000 seconds. (FYI, the VMs used in this test have different specs from the ones used in previous runs.)

Total amount of time for 1000 appwrappers is: 2541 seconds

@z103cb (Contributor) left a comment

Updates to the CRD(s) are preventing approval of this PR.

@z103cb previously approved these changes Jun 26, 2023
@z103cb (Contributor) left a comment

LGTM, but I would not merge it unless we are confident that all of the e2e tests are passing.

@asm582 (Member, Author) commented Jun 27, 2023

Testing a bit more: all bets are off if the status does not have totalcpu, totalmemory, or totalGPU resources populated. This will lead to false dispatches.
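A minimal sketch of the kind of guard this implies, assuming hypothetical field and type names for the status totals:

```go
// Illustrative guard sketch: detect when the AppWrapper status totals are
// not populated, so the accounting can avoid a false dispatch. Field and
// type names are assumptions for this example, not the actual MCAD code.
package controller

// AppWrapperStatus is a reduced, hypothetical stand-in for the fields
// discussed above (totalcpu, totalmemory, totalGPU).
type AppWrapperStatus struct {
	TotalCPU    float64
	TotalMemory float64
	TotalGPU    int64
}

// statusTotalsPopulated reports whether any of the status totals were set.
// If all are zero, subtracting them from the available-resource accounting
// would understate usage and could cause a false dispatch.
func statusTotalsPopulated(s AppWrapperStatus) bool {
	return s.TotalCPU > 0 || s.TotalMemory > 0 || s.TotalGPU > 0
}
```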

@asm582 (Member, Author) commented Jun 27, 2023

@z103cb @tardieu One test failed due to out-of-order AW dispatch, since the accounting changed. We need to add barriers in the tests to fix the out-of-order dispatch, or redesign the tests. Please take a look at this failure: https://app.travis-ci.com/github/project-codeflare/multi-cluster-app-dispatcher/builds/264124859

@asm582 (Member, Author) commented Jun 28, 2023

Merging, thanks everyone for the feedback.
