fix double counting when AWs don't have any pods running #415
Conversation
@z103cb The scheduleNext thread runs every 0 seconds and the cache is updated every 1 second; the method below clones the cache state. Should we add a lock to this method?
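For context, a minimal sketch of what guarding the cloned cache state with a lock could look like; this is not code from this PR, and the names (ClusterStateCache, refresh, Snapshot) are illustrative, not MCAD's actual API:

```go
// Hypothetical sketch: a read/write lock so a frequently running dispatch loop
// and a periodic cache refresher do not race on the same state.
package cache

import "sync"

type ClusterStateCache struct {
	mu        sync.RWMutex
	resources map[string]int64 // e.g. "cpu", "memory", "nvidia.com/gpu" -> quantity
}

// refresh is called roughly every second by a background goroutine.
func (c *ClusterStateCache) refresh(latest map[string]int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.resources = latest
}

// Snapshot clones the cache state for the dispatch (scheduleNext) loop.
// The read lock keeps the clone consistent while refresh runs concurrently.
func (c *ClusterStateCache) Snapshot() map[string]int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	clone := make(map[string]int64, len(c.resources))
	for k, v := range c.resources {
		clone[k] = v
	}
	return clone
}
```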
I do not think so. The
To get the number of running pods associated with an AW, this PR uses an existing informer; that informer updates the AW status whenever a pod's status changes. The performance penalty for counting the pods' resources is not significant compared with previous versions, since the existing code already uses the informer to count the pods in different phases (Running, Failed, Completed, etc.) owned by an AW. For this PR the histogram, which was only implemented for GPUs, is turned off; this will increase the number of optimistic dispatches when fragmented resources are available in the cluster. An optimistic-dispatch log line is added in this PR, which should help in determining the fragmented GPU resources in the cluster. I think an optimistic dispatch is better than not dispatching an AW at all, which is what happened in previous MCAD versions. The instances where double counting was happening are now removed from MCAD bookkeeping (on a best-effort basis).
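As a rough illustration of the informer-based counting described above (not MCAD's actual code), this sketch tallies an AW's pods per phase through a shared informer's lister; the label key used to associate pods with an AppWrapper is an assumption made for the example:

```go
// Illustrative sketch: count pods owned by an AppWrapper per phase using the
// informer cache, so the controller never lists pods directly from the API server.
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// countPodPhases lists cached pods labelled with the AppWrapper name and
// tallies them by phase (Pending, Running, Succeeded, Failed).
func countPodPhases(podLister corelisters.PodLister, namespace, awName string) (map[corev1.PodPhase]int, error) {
	// The label key below is hypothetical, used only to tie pods to an AW in this sketch.
	selector := labels.SelectorFromSet(labels.Set{"appwrapper.example/name": awName})
	pods, err := podLister.Pods(namespace).List(selector)
	if err != nil {
		return nil, err
	}
	counts := map[corev1.PodPhase]int{}
	for _, pod := range pods {
		counts[pod.Status.Phase]++
	}
	return counts, nil
}
```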
Fixes #429
Submitted 1000 CPU AppWrapper jobs; all of them completed. The completion time is in the same ballpark as previous runs, actually a bit faster; previous runs typically completed in about 3000 seconds (FYI, the VMs used in this test have different specs from the previous runs).
Updates to the CRD(s) are preventing approval of this PR.
LGTM, but I would not merge it unless we are confident all of the e2e tests are passing.
Testing a bit more: all bets are off if the status does not have the totalcpu, totalmemory, or totalGPU resources populated. This will lead to a false dispatch.
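As a hedged sketch of that concern (not the actual CRD or controller code), a guard like the one below could refuse to trust unpopulated totals instead of treating the missing values as zero and dispatching on bad accounting; the field names are assumptions:

```go
// Hypothetical guard: skip (or fall back) when the AppWrapper status carries
// no aggregate resource totals, rather than dispatching as if it needed nothing.
package controller

type AppWrapperStatus struct {
	TotalCPU    float64 // assumed field names, for illustration only
	TotalMemory float64
	TotalGPU    int64
}

// hasUsableTotals reports whether the status totals can be trusted for
// dispatch accounting; all-zero totals are treated as "not populated".
func hasUsableTotals(s AppWrapperStatus) bool {
	return s.TotalCPU > 0 || s.TotalMemory > 0 || s.TotalGPU > 0
}
```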
@z103cb @tardieu One test failed due to out-of-order AW dispatch since the accounting changed. We need to add barriers in the tests to fix the out-of-order dispatch, or redesign the tests. Please take a look at this issue: https://app.travis-ci.com/github/project-codeflare/multi-cluster-app-dispatcher/builds/264124859
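A minimal sketch of the kind of barrier meant here: the test blocks until one AW reaches a given state before submitting the next, so dispatch order is deterministic. The getState callback stands in for whatever the e2e framework uses to read AW status; nothing below is MCAD's actual test utility:

```go
// Illustrative test barrier: poll until an AppWrapper reaches the wanted
// state or the surrounding context times out.
package e2e

import (
	"context"
	"fmt"
	"time"
)

func waitForState(ctx context.Context, getState func() (string, error), want string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		state, err := getState()
		if err == nil && state == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for state %q (last: %q, err: %v)", want, state, err)
		case <-ticker.C:
		}
	}
}
```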
Merging, thanks everyone for the feedback.
When AWs are running with no pods under them, there is double counting of resources.

As seen in the screenshot, minikube resources are increased in MCAD accounting when AWs are dispatched and do not have any pods running. This will cause queued AWs to be dispatched and never run; the real resources are shown in the terminal section of the screenshot.