[WIP]: create 100 node test for DRA #3368


Draft · wants to merge 7 commits into base: master

Conversation

@alaypatel07 commented May 28, 2025

What type of PR is this?

This PR adds a scale test for DRA. At a high level, it does the following:

  1. Install the example DRA driver. The driver creates 8 GPU devices per node; with 100 nodes, that is 800 GPU devices.
  2. Create long-running jobs to fill up 90% of the available capacity, i.e. 720 jobs with 1 pod each.
  3. Take scheduling-metrics measurements and pod startup times.
  4. Create additional short-lived jobs, with job.spec.parallelism being 10% of the capacity and job.spec.completions being 10 * job.spec.parallelism. This gives 10% capacity churn 10 times over, i.e. 800 additional pods, 80 at a time (see the sketch after this list).
  5. Take scheduling metrics and startup times for the short-lived churn pods.
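
For concreteness, a minimal sketch of the sizing arithmetic behind these numbers (an illustrative standalone snippet, not code from this PR):

```go
package main

import "fmt"

func main() {
	const (
		nodes          = 100
		devicesPerNode = 8 // the example driver advertises 8 GPU devices per node
		churnWaves     = 10
	)
	capacity := nodes * devicesPerNode      // 800 GPU devices in total
	steadyJobs := capacity * 90 / 100       // 720 long-running jobs, 1 pod each
	parallelism := capacity * 10 / 100      // job.spec.parallelism: 80 pods per wave
	completions := churnWaves * parallelism // job.spec.completions: 800 short-lived pods

	fmt.Println(capacity, steadyJobs, parallelism, completions) // 800 720 80 800
}
```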

What this PR does / why we need it:

  • Needed for DRA GA.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • Creating a draft PR for early review of the direction; some rough edges still need to be addressed.
  • Commit cc6ea1c is a hack to make scheduler metrics work in my local kind environment. It needs to be dropped before this PR merges.

@k8s-ci-robot (Contributor):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added labels on May 28, 2025: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), and needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD).
@k8s-ci-robot (Contributor):

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot requested review from mborsz and wojtek-t on May 28, 2025 04:26.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on May 28, 2025.
@alaypatel07 (Author):

cc @pohly @wojtek-t

I have run this test on a 5-node cluster and get the following measurements:

```
$ ls /tmp/clusterloader2-results/
...
...
PodStartupLatency_ChurnPodStartupLatency_dra-steady-state_2025-05-28T02:53:11Z.json
PodStartupLatency_FastFillPodStartupLatency_dra-steady-state_2025-05-28T02:52:05Z.json
SchedulingMetrics_ChurnSchedulingMetrics_dra-steady-state_2025-05-28T02:53:11Z.json
SchedulingMetrics_FastFillSchedulingMetrics_dra-steady-state_2025-05-28T02:52:05Z.json
```

@wojtek-t (Member) left a comment:


I've skimmed through it and it's going in the direction I wanted it to go - thanks!

```go
action, err := util.GetString(config.Params, "action")
if err != nil {
	return nil, err
}
masterIP, err := util.GetStringOrDefault(config.Params, "masterIP", config.ClusterFramework.GetClusterConfig().GetMasterIP())

endpoint, err := util.GetStringOrDefault(config.Params, "endpoint", "localhost:10259")
```
Member:


localhost may not be a correct assumption, at least in some cases.

Can you clarify what problem you faced with the current implementation?
In any case - I suggest moving the changes to scheduler_latency.go to a separate PR.

Author:


In my tests, I'm using a 5-node kind provider cluster.

I was getting a 401 Unauthorized error during the scheduler metrics collection phase. The ToT version of scheduler metrics collection uses the apiserver to proxy the call to the scheduler metrics endpoint, which wasn't working.

As a workaround to get unblocked immediately, I started using kubectl port-forward with a custom service account token to get access to the metrics, hence the localhost (a sketch of this workaround follows below). My intention is to drop this commit before merge; I have kept it here only in case anyone wants to try this PR out.
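
A minimal sketch of what that workaround looks like, assuming `kubectl port-forward` is already running against the scheduler's secure port; the token path is illustrative and this is not the code from commit cc6ea1c:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumes: kubectl port-forward <scheduler-pod> 10259:10259 -n kube-system
	// and a service-account token authorized to read the metrics endpoint.
	token, err := os.ReadFile("/tmp/scheduler-metrics-token") // illustrative path
	if err != nil {
		panic(err)
	}

	// The scheduler serves metrics over HTTPS with a self-signed certificate,
	// so verification is skipped here (a test-only shortcut).
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", "https://localhost:10259/metrics", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```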

Member:


Got it - thanks!

If it's for testing purposes - it's fine. If at some point you decide that you need it (or you want to simplify stuff, or make it work out-of-the-box for your setup or sth) - please open a separate PR for it.

```go
}

// InstallChart installs a Helm chart using the provided options
func InstallChart(opts *InstallOptions) (*release.Release, error) {
```
Member:


I would really like to avoid a dependency on helm if possible, for multiple different reasons.
However, looking at the dra-example-driver repo, it seems helm is the only supported option for installing it (there are no ready-to-use manifests provided, right?)

Contributor:


Not at the moment, but we could consider changing that if desired. The problem is that we have quite a few different options that influence the deployment. With Helm, those are configurable. Without it, a ready-to-use manifest would have to pick some options upfront. We might need several of them...

cc @nojnhuh

Reply:

Is kustomize, envsubst, or something else like that any more acceptable? Or somehow generating a manifest on the fly with helm template before the main flow?
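
For reference, a minimal sketch of the envsubst-style idea in Go (`os.Expand` mirrors what envsubst does; the manifest fields and variable name are illustrative, not taken from the driver):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// A tiny stand-in for envsubst: expand ${VAR} references in a manifest
	// template from the environment.
	template := `apiVersion: v1
kind: ConfigMap
metadata:
  name: dra-driver-config
data:
  devicesPerNode: "${DEVICES_PER_NODE}"
`
	os.Setenv("DEVICES_PER_NODE", "8")
	rendered := os.Expand(template, os.Getenv)
	fmt.Print(rendered)
}
```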

Author:


@wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?

In the future, I see the need for more customized installations of the driver to extend the scale tests to other DRA-related features that are currently in alpha/beta.

My approach with this PR was to use something that works now and collect more requirements for the test driver, to make it better in future iterations. I can add removing the helm dependency to that list.

Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?

Member:


> @wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?

helm is a fairly complex tool, and depending on it means that the maintainers have to understand it (e.g. for debugging purposes).
So the simpler the thing we use, the better.
From that perspective, envsubst seems interesting as a native Linux tool.

I understand that in the generic case there are many knobs that we need. But in a single particular test, is that still the case? Can we maybe generate the templates using helm statically and commit them to our repo?
[That means the need for periodically updating them, but given that we effectively want to scale-test the DRA support in core, not the driver itself, that may actually not be needed frequently.]

> Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?

Having something in a list is easy to skip later :) I'm not saying it's a hard no, but I would like to explore the other potential options first.
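
To illustrate the statically-generated-templates idea, a minimal sketch that shells out to `helm template` once and writes the rendered manifest for committing (the chart path, value name, and output path are all illustrative assumptions):

```go
package main

import (
	"os"
	"os/exec"
)

// Render the dra-example-driver chart once with the values this test needs,
// so the resulting manifest can be committed and applied without helm at runtime.
func main() {
	out, err := exec.Command("helm", "template", "dra-example-driver",
		"./charts/dra-example-driver", // illustrative chart path
		"--set", "devicesPerNode=8", // illustrative value name
	).Output()
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("manifests/dra-example-driver.yaml", out, 0o644); err != nil {
		panic(err)
	}
}
```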
