Skip to content

ClusterConfiguration should support tolerations #413

Open
@cfchase

Description

@cfchase

ClusterConfiguration should support tolerations

Run Ray clusters (especially the worker pods) on tainted nodes.

Description of Problem the Feature Should Solve

You cannot create a Ray cluster with tolerations using the CodeFlare SDK
cluster = Cluster(ClusterConfiguration(...))

Often, machine nodes are tainted to prevent unwanted workloads. This is especially the case in GPU nodes which are often tainted. In addition different nodes will have different sized gpus, which would also use taints to make sure the correct workers land on the correct nodes.

You might also want to add a toleration to the headGroupSpec

Describe the Solution You Would Like to See

Add worker_tolerations and head_tolderations as optional parameters for ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    head_tolerations=[key, operator, effect],
    worker_tolerations=[key, operator, effect]
))

Describe Alternatives You Have Considered

Editing the yaml file and just using kuberay directly. You can currently manually edit an AppWrapper yaml to include a toleration for these taints.

workerGroupSpecs:
  - spec:
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions