Skip to content

Adding GPUs to Kind cluster #494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 25, 2023

Conversation

metalcycling
Copy link
Collaborator

This update patches the nodes created by kind by extending the resources to include GPUs. These GPUs are just there as a count and do now represent real GPUs. This version of the build system will allow tests to include GPUs even when the cluster doesn't have physical GPUs.

Copy link
Contributor

@z103cb z103cb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change you are bypassing the whole e2e test. Why is this necessary ?

@metalcycling
Copy link
Collaborator Author

with this change you are bypassing the whole e2e test. Why is this necessary ?

Yes, that was intentional. I'm testing something that involves patching the nodes created with kind and I wanted to make sure the Travis update shows what I want. I'm not trying to merge this yet, so that is why I haven't requested anyone's reviews.

@z103cb
Copy link
Contributor

z103cb commented Jul 19, 2023

Thanks for the explanation, draft PR(s) are great for communicating WIP.

@metalcycling
Copy link
Collaborator Author

I thought about a draft PR, but I didn't know if it would trigger the Travis build which is what I need. If it does, then I don't have any problems making this PR a draft while I do my work.

@z103cb
Copy link
Contributor

z103cb commented Jul 19, 2023

I think it will work jus fine: #482 is still draft and CI build ran.

@metalcycling metalcycling marked this pull request as draft July 19, 2023 14:58
@metalcycling
Copy link
Collaborator Author

Thanks @z103cb ! I made the PR a draft PR while I work on it.

@metalcycling metalcycling marked this pull request as ready for review July 22, 2023 22:49
@metalcycling metalcycling requested review from asm582 and z103cb July 22, 2023 22:49
Copy link
Contributor

@z103cb z103cb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a very good step in the right direction, but I can't approve this PR for these reasons:

  1. There are missing error checking steps when the node are being extended
  2. I think that the extend-resources function should not exist, some of its behaviours should be moved to other functions:
  • The cluster enhancements should be moved to the setup-mcad-env function, generally that is where the cluster cusimizations are done
  • Executing the new tests should be moved into the kuttl-test function (adding the yaml to the array, should be sufficient).

z103cb
z103cb previously approved these changes Jul 25, 2023
Copy link
Contributor

@z103cb z103cb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asside from the small nitpick around the use of = instead of ==, this is looking good. It can be merged once the correct operator is used.

Copy link
Contributor

@z103cb z103cb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be merged once the PR check passes.

@asm582
Copy link
Member

asm582 commented Jul 25, 2023

@metalcycling Thanks for this PR, I would like to understand a bit more, who is mutating the node object when an AW with GPU resource requirement is dispatched?

@metalcycling
Copy link
Collaborator Author

@metalcycling Thanks for this PR, I would like to understand a bit more, who is mutating the node object when an AW with GPU resource requirement is dispatched?

The node is already mutated by the time the AW comes in. The Kubernetes patching adds an extended resource to the list of allocatable resources that the scheduler sees when making the decision of where to place the pods. It is not a physical resource and it is only used by the scheduler.

@metalcycling metalcycling merged commit a4d16e1 into project-codeflare:main Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants