Description
ClusterConfiguration should support tolerations
Run Ray clusters (especially the worker pods) on tainted nodes.
Description of Problem the Feature Should Solve
The CodeFlare SDK currently offers no way to create a Ray cluster with tolerations:
cluster = Cluster(ClusterConfiguration(...))
Nodes are often tainted to keep unwanted workloads off them, and this is especially common for GPU nodes. In addition, different nodes may have GPUs of different sizes, and taints are also used there to make sure the right workers land on the right nodes.
You might also want to add a toleration to the headGroupSpec.
Describe the Solution You Would Like to See
Add worker_tolerations and head_tolerations as optional parameters for ClusterConfiguration:
cluster = Cluster(ClusterConfiguration(
    head_tolerations=[key, operator, effect],
    worker_tolerations=[key, operator, effect]
))
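To make the proposed parameters more concrete, here is a minimal sketch of one possible shape for them. Everything in it is hypothetical: `Toleration` and `ClusterConfigurationSketch` are stand-ins invented for illustration and are not part of the CodeFlare SDK. The validation is also deliberately stricter than Kubernetes itself, which permits an empty effect or key in some cases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Simplified sets for this sketch; Kubernetes also allows an empty effect.
VALID_OPERATORS = {"Exists", "Equal"}
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Toleration:
    """Hypothetical helper mirroring the fields of a Kubernetes toleration."""

    key: str
    operator: str = "Exists"
    effect: str = "NoSchedule"
    value: Optional[str] = None

    def __post_init__(self) -> None:
        if self.operator not in VALID_OPERATORS:
            raise ValueError(f"unsupported operator: {self.operator!r}")
        if self.effect not in VALID_EFFECTS:
            raise ValueError(f"unsupported effect: {self.effect!r}")
        if self.operator == "Equal" and self.value is None:
            raise ValueError("operator 'Equal' requires a value")
        if self.operator == "Exists" and self.value is not None:
            raise ValueError("operator 'Exists' must not carry a value")

    def to_dict(self) -> Dict[str, str]:
        """Return the mapping that would be written under `tolerations:`."""
        out = {"key": self.key, "operator": self.operator, "effect": self.effect}
        if self.value is not None:
            out["value"] = self.value
        return out


@dataclass
class ClusterConfigurationSketch:
    """Stand-in for ClusterConfiguration; NOT the real CodeFlare SDK class."""

    name: str
    head_tolerations: List[Toleration] = field(default_factory=list)
    worker_tolerations: List[Toleration] = field(default_factory=list)


# Both parameters are optional and default to "no tolerations".
config = ClusterConfigurationSketch(
    name="demo",
    worker_tolerations=[Toleration(key="nvidia.com/gpu")],
)
worker_tolerations_rendered = [t.to_dict() for t in config.worker_tolerations]
```

Keeping the two lists separate lets the head pod stay off GPU nodes while the workers tolerate the GPU taint. Whether the SDK should accept plain dicts, tuples, or a typed helper like this is an open design question.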
Describe Alternatives You Have Considered
Editing the YAML file and using KubeRay directly. Currently you can manually edit an AppWrapper YAML to include a toleration for these taints:
workerGroupSpecs:
- spec:
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
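
Until the SDK supports this, the manual edit above could be scripted. The sketch below is only an illustration: `add_worker_tolerations` is a hypothetical helper, and it assumes the manifest has already been loaded into a Python dict and that each worker group's pod spec sits under `template.spec`, as in a RayCluster manifest. Adjust the path if your AppWrapper nests it differently.

```python
import copy
from typing import Any, Dict, List


def add_worker_tolerations(
    ray_cluster_spec: Dict[str, Any],
    tolerations: List[Dict[str, str]],
) -> Dict[str, Any]:
    """Return a copy of the spec with tolerations added to every worker group.

    Hypothetical helper; the `template.spec` path is an assumption.
    """
    patched = copy.deepcopy(ray_cluster_spec)  # leave the caller's dict untouched
    for group in patched.get("workerGroupSpecs", []):
        pod_spec = group.setdefault("template", {}).setdefault("spec", {})
        existing = pod_spec.setdefault("tolerations", [])
        for toleration in tolerations:
            if toleration not in existing:  # avoid duplicates on repeated runs
                existing.append(dict(toleration))
    return patched


# Minimal stand-in for the relevant part of a loaded manifest.
spec = {
    "workerGroupSpecs": [
        {"groupName": "gpu-workers", "template": {"spec": {"containers": []}}},
    ]
}
gpu_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Exists",
    "effect": "NoSchedule",
}
patched_spec = add_worker_tolerations(spec, [gpu_toleration])
```

The same approach would apply to the headGroupSpec, which is a single mapping and not a list.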