Description
Given the following list of node groups from a cluster config:
node_groups:
  - name: A
    instance_type: t3.medium
  - name: B
    instance_type: c5.xlarge
  - name: C
    instance_type: c5.4xlarge
Assume that no APIs are deployed and that there is one live node from each node group. Since the order of the node groups dictates their priority, we expect the node groups to be filled in priority order: A first, then B, then C.
Situation
Except that there are 2 problems:
- When there are enough nodes in each node group, the order only dictates the likelihood of a pod being scheduled onto a specific node group. The more node groups there are, the lower that likelihood becomes.
- Increasing the number of API replicas only fills the node groups evenly (again, across the live nodes only). We don't want that, because the B and C node groups end up only slightly filled when those API replicas could instead have been scheduled onto A. Had they been packed onto A, the cluster-autoscaler could have removed the extra nodes and reduced overall costs.
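To illustrate why ordering behaves as a likelihood rather than a guarantee, here is a minimal sketch of how node group priority might be expressed on a pod. It assumes priority is implemented as preferred node affinity with descending weights, which the description above does not confirm. The `node-group` label key, the pod name, and the image are hypothetical. Because this is a soft preference, its score is combined with the scheduler's other scoring plugins, and the default `NodeResourcesFit` strategy (`LeastAllocated`) favours emptier nodes, which pulls replicas toward B and C.

```yaml
# Sketch only: assumes priority is expressed via preferred node affinity.
# The "node-group" label key, pod name, and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: api-replica
spec:
  affinity:
    nodeAffinity:
      # Soft preference: a higher weight adds more to a node's score,
      # but other scoring plugins can still outweigh it.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100            # highest priority: node group A
          preference:
            matchExpressions:
              - key: node-group
                operator: In
                values: ["A"]
        - weight: 50             # second priority: node group B
          preference:
            matchExpressions:
              - key: node-group
                operator: In
                values: ["B"]
        - weight: 1              # lowest priority: node group C
          preference:
            matchExpressions:
              - key: node-group
                operator: In
                values: ["C"]
  containers:
    - name: api
      image: registry.example.com/api:latest
```

Weights must fall in the range 1 to 100. Even with the largest possible gap between A and C, the affinity score is still only one input to the final node ranking.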
The above 2 problems need to be addressed. To do that, we need to either modify the existing k8s scheduler or create a new one for our workloads. More on that here https://kubernetes.io/docs/reference/scheduling/config/ and here https://kubernetes.io/docs/reference/scheduling/policies/.
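As a starting point for the second problem, a separate scheduler profile could score nodes so that fuller nodes are preferred (bin packing) instead of emptier ones. The sketch below uses the `MostAllocated` scoring strategy of the `NodeResourcesFit` plugin from the scheduler configuration API. The profile name `bin-packing-scheduler` and the equal resource weights are assumptions for illustration, not a tested setup.

```yaml
# Sketch only: a scheduler profile that prefers already-utilized nodes.
# The profile name and the resource weights are hypothetical.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            # MostAllocated gives higher scores to nodes with more
            # resources already requested; the default is LeastAllocated.
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

API pods would opt in by setting `spec.schedulerName: bin-packing-scheduler`. On managed control planes where the default scheduler's configuration cannot be changed, this would likely mean running a second scheduler alongside the default one. Note also that recent Kubernetes versions rely on scheduler configuration rather than scheduling policies.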
This becomes serious when there's a major scale-down event in the cluster and only a fraction of the API replicas are brought back afterwards. The likelihood of this happening on a production cluster is high to very high.
Update
This also applies to clusters with a single node group.