[WIP]: create 100 node test for DRA #3368
base: master
Conversation
Signed-off-by: Alay Patel <alayp@nvidia.com>
Skipping CI for Draft Pull Request.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
I have run this test on a 5-node cluster, and I get the following measurements:
I've skimmed through it and it's going in the direction I wanted it to go - thanks!
```go
action, err := util.GetString(config.Params, "action")
if err != nil {
	return nil, err
}
masterIP, err := util.GetStringOrDefault(config.Params, "masterIP", config.ClusterFramework.GetClusterConfig().GetMasterIP())

endpoint, err := util.GetStringOrDefault(config.Params, "endpoint", "localhost:10259")
```
localhost may not be a correct assumption at least in some cases.
Can you clarify what problem you faced with the current implementation?
And in any case - I suggest moving the changes to scheduler_latency.go to a separate PR.
In my tests, I'm using a 5-node kind provider cluster.
I was getting a 401 Unauthorized error during the scheduler metrics collection phase. The ToT version of scheduler metrics collection uses the apiserver to proxy the call to the scheduler metrics endpoint, which wasn't working.
As a workaround, to get unblocked immediately, I started using kubectl port-forward with a custom service account token to get access to the metrics, hence the localhost. My intention is to drop this commit before merge; I have kept it here only in case anyone wants to try this PR out.
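For reference, a minimal sketch of that workaround in Go; the package and function names, token handling, and RBAC setup are my assumptions, not the PR's actual code. It presumes a kubectl port-forward session to the scheduler pod is already running:

```go
package schedmetrics

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// fetchSchedulerMetrics scrapes the scheduler's /metrics endpoint through a
// local port-forward (e.g. `kubectl -n kube-system port-forward
// pod/kube-scheduler-<node> 10259:10259`), authenticating with a service
// account bearer token that is bound to a role allowed to read metrics.
func fetchSchedulerMetrics(endpoint, tokenPath string) (string, error) {
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return "", fmt.Errorf("reading token: %w", err)
	}
	req, err := http.NewRequest(http.MethodGet, "https://"+endpoint+"/metrics", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	// The scheduler serves metrics over HTTPS with a self-signed cert, so
	// verification is skipped here; acceptable only for local debugging.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}
```

A call like `fetchSchedulerMetrics("localhost:10259", "/tmp/scheduler-token")` would then match the `endpoint` default in the diff above.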
Got it - thanks!
If it's for testing purposes - it's fine. If at some point you decide that you need it (or you want to simplify stuff, or make it work out-of-the-box for your setup or sth) - please open a separate PR for it.
```go
}

// InstallChart installs a Helm chart using the provided options
func InstallChart(opts *InstallOptions) (*release.Release, error) {
```
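For context, a rough sketch of what such a helper typically looks like with the Helm v3 Go SDK; the InstallOptions fields below are guesses for illustration, not necessarily the PR's actual struct:

```go
package helm

import (
	"log"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart/loader"
	"helm.sh/helm/v3/pkg/cli"
	"helm.sh/helm/v3/pkg/release"
)

// InstallOptions is a hypothetical options struct for illustration.
type InstallOptions struct {
	ChartPath   string                 // local path to the chart
	ReleaseName string                 // Helm release name
	Namespace   string                 // target namespace
	Values      map[string]interface{} // chart value overrides
}

// InstallChart installs a Helm chart using the provided options.
func InstallChart(opts *InstallOptions) (*release.Release, error) {
	settings := cli.New()
	actionConfig := new(action.Configuration)
	// "secret" is Helm v3's default release-storage driver.
	if err := actionConfig.Init(settings.RESTClientGetter(), opts.Namespace,
		"secret", log.Printf); err != nil {
		return nil, err
	}

	install := action.NewInstall(actionConfig)
	install.ReleaseName = opts.ReleaseName
	install.Namespace = opts.Namespace

	chart, err := loader.Load(opts.ChartPath)
	if err != nil {
		return nil, err
	}
	return install.Run(chart, opts.Values)
}
```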
I would really like to avoid a dependency on helm if possible, for multiple different reasons.
However - looking at the dra-example-driver repo - it seems helm is the only supported option for installing (there are no provided ready-to-use manifests, right?)
Not at the moment, but we could consider changing that if desired. The problem is that we have quite a few different options that influence the deployment. With Helm, those are configurable. Without it, a ready-to-use manifest would have to pick some options upfront. We might need several of them...
cc @nojnhuh
Is kustomize, envsubst, or something else like that any more acceptable? Or somehow generating a manifest on the fly with helm template before the main flow?
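A minimal sketch of that last idea, shelling out to helm template to render the chart into a plain manifest before the main flow; the chart reference, release name, namespace, and values file are placeholders:

```go
package rendermanifest

import (
	"fmt"
	"os"
	"os/exec"
)

// renderChart runs `helm template` to produce a static manifest that the
// test can apply later without a runtime helm dependency.
func renderChart(chartRef, valuesFile, outPath string) error {
	cmd := exec.Command("helm", "template", "dra-example-driver", chartRef,
		"--namespace", "dra-example-driver",
		"--values", valuesFile)
	out, err := cmd.Output()
	if err != nil {
		return fmt.Errorf("helm template failed: %w", err)
	}
	// Write the rendered manifest; it can be applied directly or committed.
	return os.WriteFile(outPath, out, 0o644)
}
```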
@wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?
In the future, I see the need to add more customized installations of the driver to extend the scale tests to other DRA-related features that are currently in alpha/beta.
My approach with this PR was to use something that works now and to collect more requirements for the test driver, to make it better in future iterations. I can add removing the helm dependency to that list.
Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?
> @wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?

helm is a fairly complex tool, and depending on it means that the maintainers have to understand it (e.g. for debugging purposes). So the simpler the thing we use - the better.
From that perspective - envsubst seems interesting, as a native linux tool.
I understand that in the generic case there are many knobs that we need. But in a single particular test - is that still the case? Can we maybe generate the templates using helm statically and commit them to our repo?
[That means they would need periodic updating, but given that we effectively want to scale-test the DRA support in core, not the driver itself - that may actually not be needed frequently.]
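As an illustration of the envsubst flavor, a small sketch that does in Go what envsubst does natively, expanding ${VAR} placeholders in a committed manifest template from the process environment; the file paths and names are made up:

```go
package manifesttmpl

import "os"

// expandManifest replaces ${VAR} (and $VAR) references in a committed
// manifest template with values from the environment, like envsubst.
func expandManifest(templatePath, outPath string) error {
	raw, err := os.ReadFile(templatePath)
	if err != nil {
		return err
	}
	expanded := os.Expand(string(raw), os.Getenv)
	return os.WriteFile(outPath, []byte(expanded), 0o644)
}
```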
> Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?

Having something in a list is easy to skip later :) I'm not saying it's a hard no, but I would like to explore the other potential options first.
What type of PR is this?
What this PR does / why we need it:
This PR adds a scale test for DRA. At a high level, it does the following:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: