GPU node group may fail to trigger scale-up from 0 #8123

Open
@suqinglee

Description

Look at this code (cluster-autoscaler-1.26.6):

[Image: screenshot of the referenced cluster-autoscaler code]

Assume nvdp (the NVIDIA device plugin) cannot start, for example because its image is not found. When a GPU node then joins the node group, p.nodeInfoCache caches a node without nvidia.com/gpu. If the node group is scaled down to 0 at this moment, the cache item still exists in cluster-autoscaler.

When the next scale-up is triggered, even though nvdp is now healthy, the stale cache item prevents the scale-up. Describing the pending pod shows:

[Image: output of describing the pending pod]
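To make the failure mode concrete, here is a minimal sketch of the scenario in Go. It is not the cluster-autoscaler code: `templateCache`, `observe`, `invalidate`, `templateFor`, `fits`, and the `gpu-group` name are all hypothetical, and the only assumption is that a template cached from a real node is preferred over the cloud provider's template when the node group is empty.

```go
package main

import "fmt"

// nodeTemplate is a hypothetical, simplified stand-in for the node info the
// autoscaler uses to predict what a new node in a group would offer.
type nodeTemplate struct {
	allocatable map[string]int64
}

// templateCache models a per-node-group cache of templates built from real nodes.
type templateCache struct {
	items map[string]nodeTemplate
}

func newTemplateCache() *templateCache {
	return &templateCache{items: make(map[string]nodeTemplate)}
}

// observe caches a template built from a real node seen in the group.
func (c *templateCache) observe(group string, allocatable map[string]int64) {
	copied := make(map[string]int64, len(allocatable))
	for name, qty := range allocatable {
		copied[name] = qty
	}
	c.items[group] = nodeTemplate{allocatable: copied}
}

// invalidate drops the cached template for a group (hypothetical mitigation).
func (c *templateCache) invalidate(group string) {
	delete(c.items, group)
}

// templateFor prefers the cached template and falls back to the provider's.
func (c *templateCache) templateFor(group string, provider nodeTemplate) nodeTemplate {
	if cached, ok := c.items[group]; ok {
		return cached
	}
	return provider
}

// fits reports whether every requested resource is available on the template.
func fits(t nodeTemplate, requests map[string]int64) bool {
	for name, qty := range requests {
		if t.allocatable[name] < qty {
			return false
		}
	}
	return true
}

// simulate replays the scenario from this issue and returns whether the pending
// GPU pod fits the template before and after dropping the stale cache entry.
func simulate() (withStaleCache, afterInvalidate bool) {
	const group = "gpu-group"
	provider := nodeTemplate{allocatable: map[string]int64{"cpu": 8, "nvidia.com/gpu": 1}}
	podRequests := map[string]int64{"nvidia.com/gpu": 1}

	cache := newTemplateCache()
	// The device plugin never started, so the real node advertised no GPU.
	cache.observe(group, map[string]int64{"cpu": 8})
	// The group then scales down to 0, but the cache entry stays.

	withStaleCache = fits(cache.templateFor(group, provider), podRequests)
	cache.invalidate(group)
	afterInvalidate = fits(cache.templateFor(group, provider), podRequests)
	return withStaleCache, afterInvalidate
}

func main() {
	stale, fresh := simulate()
	fmt.Println("fits with stale cache:", stale)
	fmt.Println("fits after invalidation:", fresh)
}
```

In this model the pod is rejected only because the cached entry shadows the provider template. Dropping the entry when the group reaches 0 nodes is one possible direction, shown here purely as an illustration.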
