[WIP]: create 100 node test for DRA #3368


Draft · wants to merge 7 commits into base: master

Conversation

@alaypatel07 commented May 28, 2025

What type of PR is this?

This PR adds a scale test for DRA. At a high level, it does the following:

  1. Install the example DRA driver. The driver creates 8 GPU devices per node; with 100 nodes, that is 800 GPU devices.
  2. Create long-running jobs to fill up 90% of the available capacity, i.e. 720 jobs with 1 pod each.
  3. Take scheduling-metrics measurements and pod startup times.
  4. Create additional short-lived jobs, with job.spec.parallelism being 10% of the capacity and job.spec.completions being 10 * job.spec.parallelism. This gives 10% capacity churn 10 times over, i.e. 800 additional pods, 80 at a time (see the sketch after this list).
  5. Take scheduling metrics and startup times for the short-lived churn pods.
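
For concreteness, a minimal sketch of the sizing arithmetic behind these numbers (an illustrative standalone snippet, not code from this PR):

```go
package main

import "fmt"

func main() {
	const (
		nodes          = 100
		devicesPerNode = 8 // the example driver advertises 8 GPU devices per node
		churnWaves     = 10
	)
	capacity := nodes * devicesPerNode      // 800 GPU devices in total
	steadyJobs := capacity * 90 / 100       // 720 long-running jobs, 1 pod each
	parallelism := capacity * 10 / 100      // job.spec.parallelism: 80 pods per wave
	completions := churnWaves * parallelism // job.spec.completions: 800 short-lived pods

	fmt.Println(capacity, steadyJobs, parallelism, completions) // 800 720 80 800
}
```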

What this PR does / why we need it:

  • Needed for DRA GA.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • Creating a draft PR for early review of the direction; some rough edges still need to be addressed.
  • Commit cc6ea1c is a hack to make scheduler metrics work in my local kind environment. It needs to be dropped before this PR merges.

@k8s-ci-robot (Contributor):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added labels on May 28, 2025: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), and needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD).
@k8s-ci-robot (Contributor):

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot requested review from mborsz and wojtek-t on May 28, 2025 04:26.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on May 28, 2025.
@alaypatel07 (Author):

cc @pohly @wojtek-t

I have run this test on a 5-node cluster and get the following measurements:

```
$ ls /tmp/clusterloader2-results/
...
...
PodStartupLatency_ChurnPodStartupLatency_dra-steady-state_2025-05-28T02:53:11Z.json
PodStartupLatency_FastFillPodStartupLatency_dra-steady-state_2025-05-28T02:52:05Z.json
SchedulingMetrics_ChurnSchedulingMetrics_dra-steady-state_2025-05-28T02:53:11Z.json
SchedulingMetrics_FastFillSchedulingMetrics_dra-steady-state_2025-05-28T02:52:05Z.json
```

@wojtek-t (Member) left a comment:


I've skimmed through it and it's going in the direction I wanted it to go - thanks!

```go
action, err := util.GetString(config.Params, "action")
if err != nil {
	return nil, err
}
masterIP, err := util.GetStringOrDefault(config.Params, "masterIP", config.ClusterFramework.GetClusterConfig().GetMasterIP())

endpoint, err := util.GetStringOrDefault(config.Params, "endpoint", "localhost:10259")
```
Member:


localhost may not be a correct assumption, at least in some cases.

Can you clarify what problem you faced with the current implementation?
In any case - I suggest moving the changes to scheduler_latency.go to a separate PR.

Author:


In my tests, I'm using a 5-node kind provider cluster.

I was getting a 401 Unauthorized error during the scheduler metrics collection phase. The ToT version of scheduler metrics collection uses the apiserver to proxy the call to the scheduler metrics endpoint, which wasn't working.

As a workaround to get unblocked immediately, I started using kubectl port-forward with a custom service account token to get access to the metrics, hence the localhost (a sketch of this workaround follows below). My intention is to drop this commit before merge; I have kept it here only in case anyone wants to try this PR out.
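
A minimal sketch of what that workaround looks like, assuming `kubectl port-forward` is already running against the scheduler's secure port; the token path is illustrative and this is not the code from commit cc6ea1c:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumes: kubectl port-forward <scheduler-pod> 10259:10259 -n kube-system
	// and a service-account token authorized to read the metrics endpoint.
	token, err := os.ReadFile("/tmp/scheduler-metrics-token") // illustrative path
	if err != nil {
		panic(err)
	}

	// The scheduler serves metrics over HTTPS with a self-signed certificate,
	// so verification is skipped here (a test-only shortcut).
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", "https://localhost:10259/metrics", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```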

Member:


Got it - thanks!

If it's for testing purposes - it's fine. If at some point you decide that you need it (or you want to simplify stuff, or make it work out-of-the-box for your setup or sth) - please open a separate PR for it.

```go
}

// InstallChart installs a Helm chart using the provided options
func InstallChart(opts *InstallOptions) (*release.Release, error) {
```
Member:


I would really like to avoid a dependency on helm if possible, for multiple different reasons.
However, looking at the dra-example-driver repo, it seems helm is the only supported option for installing it (there are no ready-to-use manifests provided, right?)

Contributor:


Not at the moment, but we could consider changing that if desired. The problem is that we have quite a few different options that influence the deployment. With Helm, those are configurable. Without it, a ready-to-use manifest would have to pick some options upfront. We might need several of them...

cc @nojnhuh

Reply:

Is kustomize, envsubst, or something else like that any more acceptable? Or somehow generating a manifest on the fly with helm template before the main flow?
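
For reference, a minimal sketch of the envsubst-style idea in Go (`os.Expand` mirrors what envsubst does; the manifest fields and variable name are illustrative, not taken from the driver):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// A tiny stand-in for envsubst: expand ${VAR} references in a manifest
	// template from the environment.
	template := `apiVersion: v1
kind: ConfigMap
metadata:
  name: dra-driver-config
data:
  devicesPerNode: "${DEVICES_PER_NODE}"
`
	os.Setenv("DEVICES_PER_NODE", "8")
	rendered := os.Expand(template, os.Getenv)
	fmt.Print(rendered)
}
```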

Author:


@wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?

In the future, I see the need for more customized installations of the driver to extend the scale tests to other DRA-related features that are currently in alpha/beta.

My approach with this PR was to use something that works now and collect more requirements for the test driver, to make it better in future iterations. I can add removing the helm dependency to that list.

Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?

Member:


> @wojtek-t just for my understanding, what are some of the key reasons to avoid a helm dependency?

helm is a fairly complex tool, and depending on it means that the maintainers have to understand it (e.g. for debugging purposes).
So the simpler the thing we use, the better.
From that perspective, envsubst seems interesting as a native Linux tool.

I understand that in the generic case there are many knobs that we need. But in a single particular test, is that still the case? Can we maybe generate the templates using helm statically and commit them to our repo?
[That means the need for periodically updating them, but given that we effectively want to scale-test the DRA support in core, not the driver itself, that may actually not be needed frequently.]

> Does that approach work? Can we start out with a helm install and start collecting numbers? Or is this a strict no-go to begin with?

Having something in a list is easy to skip later :) I'm not saying it's a hard no, but I would like to explore the other potential options first.
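
To illustrate the statically-generated-templates idea, a minimal sketch that shells out to `helm template` once and writes the rendered manifest for committing (the chart path, value name, and output path are all illustrative assumptions):

```go
package main

import (
	"os"
	"os/exec"
)

// Render the dra-example-driver chart once with the values this test needs,
// so the resulting manifest can be committed and applied without helm at runtime.
func main() {
	out, err := exec.Command("helm", "template", "dra-example-driver",
		"./charts/dra-example-driver", // illustrative chart path
		"--set", "devicesPerNode=8", // illustrative value name
	).Output()
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("manifests/dra-example-driver.yaml", out, 0o644); err != nil {
		panic(err)
	}
}
```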
