Version

`cortex version` reports `cli version: 0.42.0`.
Description
`cortex up` fails with "timeout has occurred when validating your cortex cluster". This happens consistently. The failure only occurs with `region: us-west-2`; when the region is set to `us-east-2`, `cortex up` succeeds.
Configuration
`cortex up` fails when using this `cluster.yml`:

```yaml
cluster_name: this-config-fails
region: us-west-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
```
while this `cluster.yml` succeeds:

```yaml
cluster_name: this-config-works
region: us-east-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
```
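For what it's worth, the two files differ only in `cluster_name` and `region`; a quick diff (filenames below are illustrative) confirms this:

```bash
# compare the failing and working configs (filenames are illustrative)
diff cluster-us-west-2.yml cluster-us-east-2.yml
# 1,2c1,2
# < cluster_name: this-config-fails
# < region: us-west-2
# ---
# > cluster_name: this-config-works
# > region: us-east-2
```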
Steps to reproduce
- run `cortex up` on the `cluster.yml` specified above, using `us-west-2` as the region (command sketch below)
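For concreteness, the actual invocation (visible at the top of the trace below) and the cleanup step look roughly like this; the config path is illustrative:

```bash
# bring the cluster up with the failing config (path is illustrative)
cortex cluster up ./cluster.yml

# after the validation timeout, tear the cluster down before retrying,
# as the CLI itself suggests
cortex cluster down
```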
Expected behavior
`cortex up` to complete successfully, just as it does when the region is `us-east-2`.
Actual behavior
`cortex up` exits nonzero and reports a failure.
Stack traces
failure trace:

```text
cortex cluster up ./<MASKED>/cluster.yaml
using aws credentials with access key <MASKED>
verifying your configuration ...
aws resource cost per hour
1 eks cluster $0.10
nodegroup tmp: 1-5 m5.large instances $0.102 each
2 t3.medium instances (cortex system) $0.088 total
1 t3.medium instance (prometheus) $0.05
2 network load balancers $0.045 total
your cluster will cost $0.38 - $0.79 per hour based on cluster size
cortex will also create an s3 bucket (this-config-fails-36f0f6ff) and a cloudwatch log group (this-config-fails)
would you like to continue? (y/n): y
○ creating a new s3 bucket: this-config-fails-36f0f6ff ✓
○ creating a new cloudwatch log group: this-config-fails ✓
○ spinning up the cluster (this will take about 30 minutes) ...
2022-02-23 19:01:51 [ℹ] eksctl version 0.67.0
2022-02-23 19:01:51 [ℹ] using region us-west-2
2022-02-23 19:01:51 [ℹ] subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2022-02-23 19:01:51 [ℹ] subnets for us-west-2b - public:192.168.32.0/19 private:192.168.128.0/19
2022-02-23 19:01:51 [ℹ] subnets for us-west-2c - public:192.168.64.0/19 private:192.168.160.0/19
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-operator. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-operator" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-prometheus. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-prometheus" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-wd-tmp. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-wd-tmp" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [ℹ] using Kubernetes version 1.21
2022-02-23 19:01:51 [ℹ] creating EKS cluster "this-config-fails" in "us-west-2" region with un-managed nodes
2022-02-23 19:01:51 [ℹ] 3 nodegroups (cx-operator, cx-prometheus, cx-wd-tmp) were included (based on the include/exclude rules)
2022-02-23 19:01:51 [ℹ] will create a CloudFormation stack for cluster itself and 3 nodegroup stack(s)
2022-02-23 19:01:51 [ℹ] will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2022-02-23 19:01:51 [ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ] CloudWatch logging will not be enabled for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ] you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ] 2 sequential tasks: { create cluster control plane "this-config-fails", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, tag cluster }, 1 task: { create addons }, 3 parallel sub-tasks: { create nodegroup "cx-operator", create nodegroup "cx-prometheus", create nodegroup "cx-wd-tmp" } } }
2022-02-23 19:01:51 [ℹ] building cluster stack "eksctl-this-config-fails-cluster"
2022-02-23 19:01:52 [ℹ] deploying stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:22 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:03:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:04:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:05:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:06:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:07:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:08:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:09:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:10:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:11:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:12:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:13:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:14:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:16:54 [✔] tagged EKS cluster (cortex.dev/cluster-name=this-config-fails)
2022-02-23 19:18:55 [!] OIDC is disabled but policies are required/specified for this addon. Users are responsible for attaching the policies to all nodegroup roles
2022-02-23 19:18:55 [ℹ] creating addon
2022-02-23 19:23:25 [ℹ] addon "vpc-cni" active
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-wd-tmp, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-operator, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-prometheus, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:42 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:57 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:59 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:02 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:16 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:16 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:20 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:33 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:36 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:40 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:51 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:55 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:10 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:11 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:11 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:25 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:28 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:30 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:50 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:00 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:00 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:10 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:17 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:20 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:25 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:33 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:40 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:41 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:51 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:56 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:57 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:57 [ℹ] waiting for the control plane availability...
2022-02-23 19:26:57 [✔] saved kubeconfig as "/root/.kube/config"
2022-02-23 19:26:57 [ℹ] 1 task: { suspend ASG processes for nodegroup cx-wd-tmp }
2022-02-23 19:26:58 [ℹ] suspended ASG processes [AZRebalance] for cx-wd-tmp
2022-02-23 19:26:58 [✔] all EKS cluster resources for "this-config-fails" have been created
2022-02-23 19:26:58 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-OG2YBA75HYPE" to auth ConfigMap
2022-02-23 19:26:58 [ℹ] nodegroup "cx-operator" has 0 node(s)
2022-02-23 19:26:58 [ℹ] waiting for at least 2 node(s) to become ready in "cx-operator"
2022-02-23 19:27:30 [ℹ] nodegroup "cx-operator" has 2 node(s)
2022-02-23 19:27:30 [ℹ] node "ip-192-168-20-129.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ] node "ip-192-168-88-85.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-75KAP1SXG8SQ" to auth ConfigMap
2022-02-23 19:27:30 [ℹ] nodegroup "cx-prometheus" has 0 node(s)
2022-02-23 19:27:30 [ℹ] waiting for at least 1 node(s) to become ready in "cx-prometheus"
2022-02-23 19:28:32 [ℹ] nodegroup "cx-prometheus" has 1 node(s)
2022-02-23 19:28:32 [ℹ] node "ip-192-168-54-32.us-west-2.compute.internal" is ready
2022-02-23 19:28:32 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-JLK3EF72JQAV" to auth ConfigMap
2022-02-23 19:28:32 [ℹ] nodegroup "cx-wd-tmp" has 0 node(s)
2022-02-23 19:28:32 [ℹ] waiting for at least 1 node(s) to become ready in "cx-wd-tmp"
2022-02-23 19:31:00 [ℹ] nodegroup "cx-wd-tmp" has 1 node(s)
2022-02-23 19:31:00 [ℹ] node "ip-192-168-72-237.us-west-2.compute.internal" is ready
2022-02-23 19:33:01 [ℹ] kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2022-02-23 19:33:01 [✔] EKS cluster "this-config-fails" in "us-west-2" region is ready
○ updating cluster configuration ✓
○ configuring networking (this will take a few minutes) ✓
○ configuring autoscaling ✓
○ configuring async gateway ✓
○ configuring logging ✓
○ configuring metrics ✓
○ configuring gpu support (for nodegroups that may require it) ✓
○ configuring inf support (for nodegroups that may require it) ✓
○ starting operator ✓
○ starting controller manager ✓
○ waiting for load balancers .............................................................................................................................................................................................................................................................................................................................................
timeout has occurred when validating your cortex cluster
debugging info:
operator pod name: pod/operator-controller-manager-6f8bb85b96-clqxf
operator pod is ready: true
operator endpoint: <MASKED>.elb.us-west-2.amazonaws.com
operator curl response:
{}
additional networking events:
LAST SEEN TYPE REASON OBJECT MESSAGE
30m Normal EnsuringLoadBalancer service/ingressgateway-apis Ensuring load balancer
30m Normal EnsuredLoadBalancer service/ingressgateway-apis Ensured load balancer
30m Normal EnsuringLoadBalancer service/ingressgateway-operator Ensuring load balancer
30m Normal EnsuredLoadBalancer service/ingressgateway-operator Ensured load balancer
30m Normal Scheduled pod/ingressgateway-apis-69465f9956-gzxtf Successfully assigned istio-system/ingressgateway-apis-69465f9956-gzxtf to ip-192-168-20-129.us-west-2.compute.internal
30m Normal Pulling pod/ingressgateway-operator-7b54fcf5cd-gsvcb Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m Normal Pulling pod/ingressgateway-apis-69465f9956-gzxtf Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m Normal Pulled pod/ingressgateway-operator-7b54fcf5cd-gsvcb Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 4.987000991s
30m Normal Created pod/ingressgateway-operator-7b54fcf5cd-gsvcb Created container istio-proxy
30m Normal Started pod/ingressgateway-operator-7b54fcf5cd-gsvcb Started container istio-proxy
30m Normal Pulled pod/ingressgateway-apis-69465f9956-gzxtf Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 6.764940388s
30m Normal Started pod/ingressgateway-apis-69465f9956-gzxtf Started container istio-proxy
30m Normal Created pod/ingressgateway-apis-69465f9956-gzxtf Created container istio-proxy
30m Warning Unhealthy pod/ingressgateway-apis-69465f9956-gzxtf Readiness probe failed: Get "http://192.168.3.27:15021/healthz/ready": dial tcp 192.168.3.27:15021: connect: connection refused
please run `cortex cluster down` to delete the cluster before trying to create this cluster again
```
Additional context
I've only tested `us-west-2` and `us-east-2` so far. I've repeated the experiment a number of times, and I see consistent failure when the region is `us-west-2` and consistent success when it is `us-east-2`.
A search in the Slack channel for "timeout has occurred when validating your cortex cluster" shows that this issue is pretty common; I see four or five reports of it in the last year.
`us-west-2` is my default region.
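Happy to collect more debugging info from the failing cluster while it is still up. These are the kinds of checks I can run; the commands below are a sketch, and the hostname and ARN are placeholders:

```bash
# the istio ingress gateway services that the validation appears to wait on,
# along with the NLB hostnames they were assigned
kubectl get svc -n istio-system ingressgateway-operator ingressgateway-apis -o wide

# the load balancers and their target health in us-west-2
aws elbv2 describe-load-balancers --region us-west-2
aws elbv2 describe-target-health --region us-west-2 \
  --target-group-arn <target-group-arn>  # placeholder

# hit the operator load balancer directly (hostname is masked in the trace above)
curl -v http://<operator-endpoint>.elb.us-west-2.amazonaws.com
```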