Version

`cortex version` reports `cli version: 0.42.0`.
Description
`cortex up` fails with "timeout has occurred when validating your cortex cluster". This happens consistently. The failure only occurs with `region: us-west-2`; when the region is set to `us-east-2`, `cortex up` succeeds.
Configuration
`cortex up` fails when using this `cluster.yml`:

```yaml
cluster_name: this-config-fails
region: us-west-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
```
while this `cluster.yml` succeeds:

```yaml
cluster_name: this-config-works
region: us-east-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
```
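For what it's worth, the two files differ only in `cluster_name` and `region`; a quick diff (filenames below are illustrative) confirms this:

```bash
# compare the failing and working configs (filenames are illustrative)
diff cluster-us-west-2.yml cluster-us-east-2.yml
# 1,2c1,2
# < cluster_name: this-config-fails
# < region: us-west-2
# ---
# > cluster_name: this-config-works
# > region: us-east-2
```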
Steps to reproduce
- run `cortex up` on the `cluster.yml` specified above, using `us-west-2` as the region (command sketch below)
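For concreteness, the actual invocation (visible at the top of the trace below) and the cleanup step look roughly like this; the config path is illustrative:

```bash
# bring the cluster up with the failing config (path is illustrative)
cortex cluster up ./cluster.yml

# after the validation timeout, tear the cluster down before retrying,
# as the CLI itself suggests
cortex cluster down
```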
Expected behavior
`cortex up` to complete successfully, just as it does when the region is `us-east-2`.
Actual behavior
`cortex up` exits nonzero and reports a failure.
Stack traces
failure trace:

```text
cortex cluster up ./<MASKED>/cluster.yaml
using aws credentials with access key <MASKED>
verifying your configuration ...
aws resource cost per hour
1 eks cluster $0.10
nodegroup tmp: 1-5 m5.large instances $0.102 each
2 t3.medium instances (cortex system) $0.088 total
1 t3.medium instance (prometheus) $0.05
2 network load balancers $0.045 total
your cluster will cost $0.38 - $0.79 per hour based on cluster size
cortex will also create an s3 bucket (this-config-fails-36f0f6ff) and a cloudwatch log group (this-config-fails)
would you like to continue? (y/n): y
○ creating a new s3 bucket: this-config-fails-36f0f6ff ✓
○ creating a new cloudwatch log group: this-config-fails ✓
○ spinning up the cluster (this will take about 30 minutes) ...
2022-02-23 19:01:51 [ℹ] eksctl version 0.67.0
2022-02-23 19:01:51 [ℹ] using region us-west-2
2022-02-23 19:01:51 [ℹ] subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2022-02-23 19:01:51 [ℹ] subnets for us-west-2b - public:192.168.32.0/19 private:192.168.128.0/19
2022-02-23 19:01:51 [ℹ] subnets for us-west-2c - public:192.168.64.0/19 private:192.168.160.0/19
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-operator. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-operator" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-prometheus. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-prometheus" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!] Custom AMI detected for nodegroup cx-wd-tmp. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ] nodegroup "cx-wd-tmp" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [ℹ] using Kubernetes version 1.21
2022-02-23 19:01:51 [ℹ] creating EKS cluster "this-config-fails" in "us-west-2" region with un-managed nodes
2022-02-23 19:01:51 [ℹ] 3 nodegroups (cx-operator, cx-prometheus, cx-wd-tmp) were included (based on the include/exclude rules)
2022-02-23 19:01:51 [ℹ] will create a CloudFormation stack for cluster itself and 3 nodegroup stack(s)
2022-02-23 19:01:51 [ℹ] will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2022-02-23 19:01:51 [ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ] CloudWatch logging will not be enabled for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ] you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ] 2 sequential tasks: { create cluster control plane "this-config-fails", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, tag cluster }, 1 task: { create addons }, 3 parallel sub-tasks: { create nodegroup "cx-operator", create nodegroup "cx-prometheus", create nodegroup "cx-wd-tmp" } } }
2022-02-23 19:01:51 [ℹ] building cluster stack "eksctl-this-config-fails-cluster"
2022-02-23 19:01:52 [ℹ] deploying stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:22 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:03:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:04:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:05:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:06:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:07:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:08:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:09:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:10:52 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:11:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:12:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:13:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:14:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:16:54 [✔] tagged EKS cluster (cortex.dev/cluster-name=this-config-fails)
2022-02-23 19:18:55 [!] OIDC is disabled but policies are required/specified for this addon. Users are responsible for attaching the policies to all nodegroup roles
2022-02-23 19:18:55 [ℹ] creating addon
2022-02-23 19:23:25 [ℹ] addon "vpc-cni" active
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-wd-tmp, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-operator, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ] building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:25 [!] Custom AMI detected for nodegroup cx-prometheus, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ] deploying stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:26 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:42 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:57 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:59 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:02 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:16 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:16 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:20 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:33 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:36 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:40 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:51 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:53 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:55 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:10 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:11 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:11 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:25 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:28 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:30 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:44 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:50 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:00 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:00 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:10 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:17 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:20 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:25 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:33 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:40 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:41 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:51 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:56 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:57 [ℹ] waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:57 [ℹ] waiting for the control plane availability...
2022-02-23 19:26:57 [✔] saved kubeconfig as "/root/.kube/config"
2022-02-23 19:26:57 [ℹ] 1 task: { suspend ASG processes for nodegroup cx-wd-tmp }
2022-02-23 19:26:58 [ℹ] suspended ASG processes [AZRebalance] for cx-wd-tmp
2022-02-23 19:26:58 [✔] all EKS cluster resources for "this-config-fails" have been created
2022-02-23 19:26:58 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-OG2YBA75HYPE" to auth ConfigMap
2022-02-23 19:26:58 [ℹ] nodegroup "cx-operator" has 0 node(s)
2022-02-23 19:26:58 [ℹ] waiting for at least 2 node(s) to become ready in "cx-operator"
2022-02-23 19:27:30 [ℹ] nodegroup "cx-operator" has 2 node(s)
2022-02-23 19:27:30 [ℹ] node "ip-192-168-20-129.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ] node "ip-192-168-88-85.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-75KAP1SXG8SQ" to auth ConfigMap
2022-02-23 19:27:30 [ℹ] nodegroup "cx-prometheus" has 0 node(s)
2022-02-23 19:27:30 [ℹ] waiting for at least 1 node(s) to become ready in "cx-prometheus"
2022-02-23 19:28:32 [ℹ] nodegroup "cx-prometheus" has 1 node(s)
2022-02-23 19:28:32 [ℹ] node "ip-192-168-54-32.us-west-2.compute.internal" is ready
2022-02-23 19:28:32 [ℹ] adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-JLK3EF72JQAV" to auth ConfigMap
2022-02-23 19:28:32 [ℹ] nodegroup "cx-wd-tmp" has 0 node(s)
2022-02-23 19:28:32 [ℹ] waiting for at least 1 node(s) to become ready in "cx-wd-tmp"
2022-02-23 19:31:00 [ℹ] nodegroup "cx-wd-tmp" has 1 node(s)
2022-02-23 19:31:00 [ℹ] node "ip-192-168-72-237.us-west-2.compute.internal" is ready
2022-02-23 19:33:01 [ℹ] kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2022-02-23 19:33:01 [✔] EKS cluster "this-config-fails" in "us-west-2" region is ready
○ updating cluster configuration ✓
○ configuring networking (this will take a few minutes) ✓
○ configuring autoscaling ✓
○ configuring async gateway ✓
○ configuring logging ✓
○ configuring metrics ✓
○ configuring gpu support (for nodegroups that may require it) ✓
○ configuring inf support (for nodegroups that may require it) ✓
○ starting operator ✓
○ starting controller manager ✓
○ waiting for load balancers .............................................................................................................................................................................................................................................................................................................................................
timeout has occurred when validating your cortex cluster
debugging info:
operator pod name: pod/operator-controller-manager-6f8bb85b96-clqxf
operator pod is ready: true
operator endpoint: <MASKED>.elb.us-west-2.amazonaws.com
operator curl response:
{}
additional networking events:
LAST SEEN TYPE REASON OBJECT MESSAGE
30m Normal EnsuringLoadBalancer service/ingressgateway-apis Ensuring load balancer
30m Normal EnsuredLoadBalancer service/ingressgateway-apis Ensured load balancer
30m Normal EnsuringLoadBalancer service/ingressgateway-operator Ensuring load balancer
30m Normal EnsuredLoadBalancer service/ingressgateway-operator Ensured load balancer
30m Normal Scheduled pod/ingressgateway-apis-69465f9956-gzxtf Successfully assigned istio-system/ingressgateway-apis-69465f9956-gzxtf to ip-192-168-20-129.us-west-2.compute.internal
30m Normal Pulling pod/ingressgateway-operator-7b54fcf5cd-gsvcb Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m Normal Pulling pod/ingressgateway-apis-69465f9956-gzxtf Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m Normal Pulled pod/ingressgateway-operator-7b54fcf5cd-gsvcb Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 4.987000991s
30m Normal Created pod/ingressgateway-operator-7b54fcf5cd-gsvcb Created container istio-proxy
30m Normal Started pod/ingressgateway-operator-7b54fcf5cd-gsvcb Started container istio-proxy
30m Normal Pulled pod/ingressgateway-apis-69465f9956-gzxtf Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 6.764940388s
30m Normal Started pod/ingressgateway-apis-69465f9956-gzxtf Started container istio-proxy
30m Normal Created pod/ingressgateway-apis-69465f9956-gzxtf Created container istio-proxy
30m Warning Unhealthy pod/ingressgateway-apis-69465f9956-gzxtf Readiness probe failed: Get "http://192.168.3.27:15021/healthz/ready": dial tcp 192.168.3.27:15021: connect: connection refused
please run `cortex cluster down` to delete the cluster before trying to create this cluster again
```
Additional context
I've only tested `us-west-2` and `us-east-2` so far. I've repeated the experiment a number of times, and I see consistent failure when the region is `us-west-2` and consistent success when it is `us-east-2`.
A search in the Slack channel for "timeout has occurred when validating your cortex cluster" shows that this issue is pretty common; I see four or five reports of it in the last year.
`us-west-2` is my default region.
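Happy to collect more debugging info from the failing cluster while it is still up. These are the kinds of checks I can run; the commands below are a sketch, and the hostname and ARN are placeholders:

```bash
# the istio ingress gateway services that the validation appears to wait on,
# along with the NLB hostnames they were assigned
kubectl get svc -n istio-system ingressgateway-operator ingressgateway-apis -o wide

# the load balancers and their target health in us-west-2
aws elbv2 describe-load-balancers --region us-west-2
aws elbv2 describe-target-health --region us-west-2 \
  --target-group-arn <target-group-arn>  # placeholder

# hit the operator load balancer directly (hostname is masked in the trace above)
curl -v http://<operator-endpoint>.elb.us-west-2.amazonaws.com
```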