GPU node group may fail to trigger scale-up from 0 #8123

The relevant code (cluster-autoscaler-1.26.6):

<img width="896" alt="Image" src="https://github.com/user-attachments/assets/9c15f013-27c4-4033-9889-58f9bc989f80" />

Assume the NVIDIA device plugin (nvdp) cannot start, for example because its image is not found. When a GPU node then joins the node group, `p.nodeInfoCache` caches a node without `nvidia.com/gpu`. If the node group is scaled down to 0 at this moment, the cache item still exists in cluster-autoscaler.

When the next scale-up is triggered, it cannot proceed because of this cache item, even if nvdp is healthy by then. Describing the pending pod shows:

<img width="1254" alt="Image" src="https://github.com/user-attachments/assets/5ffa180d-7c66-4455-b853-a8b24034ca60" />
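
To make the sequence easier to follow, here is a minimal sketch of the scenario in Go. It is a simplified model, not the actual cluster-autoscaler code: `nodeTemplate`, `templateCache`, `observeNode`, `templateFor`, `fits`, and `simulate` are hypothetical names, the group name `gpu-group` and the resource numbers are made up, and the `clearOnScaleToZero` switch exists only to contrast the two outcomes, not to propose a specific patch.

```go
package main

import "fmt"

const gpuResource = "nvidia.com/gpu"

// nodeTemplate is a hypothetical, simplified stand-in for the node info
// the autoscaler uses to predict what a new node in a group would offer.
type nodeTemplate struct {
	allocatable map[string]int64
}

// templateCache is a hypothetical stand-in for p.nodeInfoCache,
// keyed by node group ID.
type templateCache map[string]nodeTemplate

// observeNode caches a template built from a live node in the group.
func (c templateCache) observeNode(group string, allocatable map[string]int64) {
	cp := make(map[string]int64, len(allocatable))
	for name, qty := range allocatable {
		cp[name] = qty
	}
	c[group] = nodeTemplate{allocatable: cp}
}

// templateFor prefers the cached entry and only falls back to the
// provider-supplied template when the group has no cache entry.
func (c templateCache) templateFor(group string, fromProvider nodeTemplate) nodeTemplate {
	if t, ok := c[group]; ok {
		return t
	}
	return fromProvider
}

// fits reports whether every resource the pod requests is available
// on a node built from the given template.
func fits(requests map[string]int64, t nodeTemplate) bool {
	for name, want := range requests {
		if t.allocatable[name] < want {
			return false
		}
	}
	return true
}

// simulate replays the scenario and reports whether a scale-up of the
// GPU group would be considered helpful for the pending pod.
func simulate(clearOnScaleToZero bool) bool {
	cache := templateCache{}
	provider := nodeTemplate{allocatable: map[string]int64{"cpu": 8, gpuResource: 1}}

	// 1. A GPU node joins while nvdp is down, so it advertises no nvidia.com/gpu.
	cache.observeNode("gpu-group", map[string]int64{"cpu": 8})

	// 2. The group is scaled down to 0. In the reported behavior the entry survives.
	if clearOnScaleToZero {
		delete(cache, "gpu-group")
	}

	// 3. A pod requesting one GPU is pending and a scale-up is evaluated.
	pod := map[string]int64{"cpu": 1, gpuResource: 1}
	return fits(pod, cache.templateFor("gpu-group", provider))
}

func main() {
	fmt.Println("stale entry kept, pod fits:", simulate(false))
	fmt.Println("entry cleared, pod fits:", simulate(true))
}
```

In this model the stale entry makes the pending pod look unschedulable on any new node of the group, which matches the symptom above. Without the entry, the check falls back to a template that does advertise the GPU.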



