
Ray cluster headgroup resources #190

Closed
@tedhtchang


Should the head group resources be configurable (for development on a laptop with 8 CPUs and 16 GB of RAM)? The current values are:

    limits:
      cpu: 2
      memory: "8G"
      nvidia.com/gpu: 0
    requests:
      cpu: 2
      memory: "8G"
      nvidia.com/gpu: 0
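To make the request concrete, here is a minimal sketch of what a configurable head group could look like. The helper `head_group_resources` and its parameters are hypothetical and not part of the codeflare-sdk API. It only shows the defaults above being turned into overridable arguments.

```python
def head_group_resources(cpu=2, memory="8G", gpu=0):
    """Build the limits/requests block for the Ray head group.

    Hypothetical helper: the defaults mirror the values quoted above,
    and limits are set equal to requests, as in those values.
    """
    block = {"cpu": cpu, "memory": memory, "nvidia.com/gpu": gpu}
    # Return separate copies so callers can change limits and requests independently.
    return {"limits": dict(block), "requests": dict(block)}


# A smaller head node for a laptop-sized cluster (illustrative values only).
laptop_head = head_group_resources(cpu=1, memory="4G")
print(laptop_head)
```

With something like this, a development setup could shrink the head node without editing the template by hand.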

The resource allocation with only the codeflare-stack installed (without any ODH components) was:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2445m (31%)   1100m (14%)
  memory             8442Mi (55%)  768Mi (5%)

Creating a cluster on OpenShift Local on an 8-CPU, 16 GB workstation fails with insufficient resources:
cluster = Cluster(ClusterConfiguration(namespace="default", name="torch", min_worker=1, max_worker=1, min_cpus=1, max_cpus=1, min_memory=1, max_memory=1, gpu=0, instascale=False))

I0630 22:41:52.423038       1 queuejob_controller_ex.go:1009] [getAggAvaiResPri] cpu 5365.00, memory 7098806272.00, GPU 0 available resources to schedule
I0630 22:41:52.423066       1 queuejob_controller_ex.go:1260] [ScheduleNext] XQJ torch with resources cpu 3000.00, memory 9000000000.00, GPU 0 to be scheduled on aggregated idle resources cpu 5365.00, memory 7098806272.00, GPU 0
I0630 22:41:52.423204       1 queuejob_controller_ex.go:1336] [ScheduleNext] HOL Blocking by torch for 163.595µs activeQ=false Unsched=true &qj=0xc0007df900 Version=61385 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:9 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-06-30 22:40:12.010146 +0000 UTC ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010149 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.01015 +0000 UTC Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010603 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.010605 +0000 UTC Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-06-30 22:40:12.082065 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.082067 +0000 UTC Reason:FrontOfQueue. Message:} {Type:Backoff Status:True LastUpdateMicroTime:2023-06-30 22:40:32.322476 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:32.322478 +0000 UTC Reason:AppWrapperNotRunnable. Message:Insufficient resources to dispatch AppWrapper.}] PendingPodConditions:[]}
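The numbers in these log lines can be checked by hand. The sketch below assumes the log reports CPU in millicores and memory in bytes, and that the requested 3000 / 9000000000 is the head group (2 CPUs, 8G) plus one worker (1 CPU, 1G). Under those assumptions, CPU fits but memory does not.

```python
# Values copied from the [ScheduleNext] log line above.
# Assumed units: CPU in millicores, memory in bytes.
requested = {"cpu_milli": 3000.0, "memory_bytes": 9_000_000_000.0, "gpu": 0}
available = {"cpu_milli": 5365.0, "memory_bytes": 7_098_806_272.0, "gpu": 0}


def shortfalls(requested, available):
    """Return {resource: amount missing} for every resource that does not fit."""
    return {
        name: requested[name] - available[name]
        for name in requested
        if requested[name] > available[name]
    }


missing = shortfalls(requested, available)
print(missing)  # {'memory_bytes': 1901193728.0}
```

So the AppWrapper is short by roughly 1.9 GB of memory, and most of the request comes from the 8G head group.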

My OpenShift Local config:

crc config view
- consent-telemetry                     : yes
- cpus                                  : 8
- disk-size                             : 80
- memory                                : 16000
- network-mode                          : user
- pull-secret-file                      : /home/tedchang/secret.json
