From 3852f220e5db6828b9c32cd389097b1d2a6af213 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Thu, 5 Oct 2023 13:13:49 -0700
Subject: [PATCH 01/16] Add graceful recovery test

---
 tests/graceful-recovery.md                | 97 +++++++++++++++++++++++
 tests/results/graceful-recover-results.md | 15 ++++
 2 files changed, 112 insertions(+)
 create mode 100644 tests/graceful-recovery.md
 create mode 100644 tests/results/graceful-recover-results.md

diff --git a/tests/graceful-recovery.md b/tests/graceful-recovery.md
new file mode 100644
index 0000000000..1e35a1fc8f
--- /dev/null
+++ b/tests/graceful-recovery.md
@@ -0,0 +1,97 @@
+# Graceful recovery from restarts
+
+## Description
+When one of the containers in the NGF Pod fails, the Gateway API configuration
+should be reapplied to the new container so that when either container
+fails, NGF should be able to recover without any assistance.
+
+## Goal
+Ensure that NGF can recover gracefully from container failures without any intervention.
+
+## Cluster Details
+
+- GKE 1.27.3-gke.100
+- us-central1-c
+- Machine type of node is e2-medium
+- 3 nodes
+
+## Setup
+
+1. Setup GKE Cluster.
+2. Clone the repo and change into the nginx-gateway-fabric directory.
+3. Check out the latest tag (unless you are installing the edge version from the main branch).
+4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
+5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
+to deploy NGINX Gateway Fabric.
+6. In a separate terminal track NGF logs by running `kubectl -n nginx-gateway logs -f deploy/nginx-gateway`
+7. In a separate terminal track nginx container logs by running
+`kubectl -n nginx-gateway logs -f -c nginx`
+8. Exec into the nginx container inside of the NGF pod by running
+`kubectl exec -it -n nginx-gateway --container nginx -- bin/sh`
+9. Inside the nginx container, navigate to `/etc/nginx/conf.d` and ensure that
+`http.conf` and `config-version.conf` look correct.
+10. In a different terminal, deploy the
+[https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
+11. Inside the nginx container, check `http.conf` and `config-version.config` to see
+if the configuration and version were correctly updated.
+12. Send traffic through the example application and ensure it is working correctly.
+
+## Testing when nginx-gateway container restarts
+
+1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+2. Insert ephemeral container in NGF Pod and kill the nginx-gateway process.
+    1. `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway`
+    2. run `ps -A`
+    3. run `kill ` (Command should start with `/usr/bin/gateway`)
+3. Check for errors in the NGF and nginx-container logs.
+4. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
+5. Open up the NGF and nginx container logs and check for errors.
+6. Inside the nginx container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
+7. Send traffic through the example application and ensure it is working correctly.
+8. Check that NGF can still update statuses of resources.
+    1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
+    2. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    `config-version.conf` were correctly updated.
+    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+    4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination`
+    5. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    `config-version.conf` were correctly updated.
+    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+
+## Testing when nginx container restarts
+
+1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+2. Insert ephemeral container in NGF Pod and kill the nginx-master process.
+    1. If there isn't already an ephemeral container inserted, run:
+    `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway`
+    2. run `ps -A`
+    3. run `kill ` (Command should start with `nginx: master process`)
+3. When nginx container is back up, ensure traffic flows through the example application correctly.
+4. Open up the nginx-container logs and check for errors.
+5. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
+
+## Testing when the NGF Pod restarts through a docker container restart with a graceful exit on node
+
+1. Switch over to a one-node Kind cluster. Can run `make create-kind-cluster` from main directory.
+2. Run steps 4-12 of the Setup section above using [this guide]
+(https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind.
+3. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+4. Drain the node of its resources by running `kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data`
+5. Delete the node by running `kubectl delete node kind-control-plane`
+6. Restart the docker container by running `docker restart kind-control-plane`
+7. Open up both NGF and nginx-container logs and check for errors.
+8. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
+9. Send traffic through the example application and ensure it is working correctly.
+10. Check that NGF can still update statuses of resources.
+    1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
+    2. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    `config-version.conf` were correctly updated.
+    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+    4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination`
+    5. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    `config-version.conf` were correctly updated.
+    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+
+## Testing when the NGF Pod restarts through a docker container restart without cleaning up the node prior
+
+1. Repeat the above test but remove steps 4-5 which include draining and deleting the node.
diff --git a/tests/results/graceful-recover-results.md b/tests/results/graceful-recover-results.md
new file mode 100644
index 0000000000..82b8d4fb6a
--- /dev/null
+++ b/tests/results/graceful-recover-results.md
@@ -0,0 +1,15 @@
+# Test Results
+
+## Testing when nginx-gateway container restarts
+Passes test with no errors.
+
+## Testing when nginx container restarts
+Passes test with no errors.
+
+## Testing when the NGF Pod restarts through a docker container restart with a graceful exit on node
+Passes test with no errors.
+
+## Testing when the NGF Pod restarts through a docker container restart without cleaning up the node prior
+Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`.
+NGF Pod is not able to recover as the nginx container logs show this error:
+`bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`.
From 4766ee3b448870cccd0a07a48a630909d1ca762d Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Thu, 5 Oct 2023 13:20:06 -0700
Subject: [PATCH 02/16] Change naming of test

---
 tests/graceful-recovery.md                | 4 ++--
 tests/results/graceful-recover-results.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/graceful-recovery.md b/tests/graceful-recovery.md
index 1e35a1fc8f..0b33dda7ea 100644
--- a/tests/graceful-recovery.md
+++ b/tests/graceful-recovery.md
@@ -70,7 +70,7 @@ if the configuration and version were correctly updated.
 4. Open up the nginx-container logs and check for errors.
 5. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
 
-## Testing when the NGF Pod restarts through a docker container restart with a graceful exit on node
+## Testing when the NGF Pod restarts through node shutdown with cleaning up of resources
 
 1. Switch over to a one-node Kind cluster. Can run `make create-kind-cluster` from main directory.
 2. Run steps 4-12 of the Setup section above using [this guide]
@@ -92,6 +92,6 @@ if the configuration and version were correctly updated.
     `config-version.conf` were correctly updated.
     6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
-## Testing when the NGF Pod restarts through a docker container restart without cleaning up the node prior
+## Testing when the NGF Pod restarts through node shutdown without cleaning up of resources
 
 1. Repeat the above test but remove steps 4-5 which include draining and deleting the node.
diff --git a/tests/results/graceful-recover-results.md b/tests/results/graceful-recover-results.md
index 82b8d4fb6a..50370d6385 100644
--- a/tests/results/graceful-recover-results.md
+++ b/tests/results/graceful-recover-results.md
@@ -6,10 +6,10 @@ Passes test with no errors.
 ## Testing when nginx container restarts
 Passes test with no errors.
 
-## Testing when the NGF Pod restarts through a docker container restart with a graceful exit on node
+## Testing when the NGF Pod restarts through node shutdown with cleaning up of resources
 Passes test with no errors.
 
-## Testing when the NGF Pod restarts through a docker container restart without cleaning up the node prior
+## Testing when the NGF Pod restarts through node shutdown without cleaning up of resources
 Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`.
 NGF Pod is not able to recover as the nginx container logs show this error:
 `bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`.

From 930656964701f57c4ab0deb78b233852070ba820 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 14:56:32 -0700
Subject: [PATCH 03/16] Add separate directory for specific test

---
 tests/{ => graceful-recovery}/graceful-recovery.md                | 0
 tests/{ => graceful-recovery}/results/graceful-recover-results.md | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename tests/{ => graceful-recovery}/graceful-recovery.md (100%)
 rename tests/{ => graceful-recovery}/results/graceful-recover-results.md (100%)

diff --git a/tests/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
similarity index 100%
rename from tests/graceful-recovery.md
rename to tests/graceful-recovery/graceful-recovery.md
diff --git a/tests/results/graceful-recover-results.md b/tests/graceful-recovery/results/graceful-recover-results.md
similarity index 100%
rename from tests/results/graceful-recover-results.md
rename to tests/graceful-recovery/results/graceful-recover-results.md

From f86d072a88bd84a52ae143762cda8f827a2863e2 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 14:57:29 -0700
Subject: [PATCH 04/16] Remove Description and adjust Goal

---
 tests/graceful-recovery/graceful-recovery.md | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 0b33dda7ea..918cc365cf 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -1,12 +1,7 @@
 # Graceful recovery from restarts
 
-## Description
-When one of the containers in the NGF Pod fails, the Gateway API configuration
-should be reapplied to the new container so that when either container
-fails, NGF should be able to recover without any assistance.
-
 ## Goal
-Ensure that NGF can recover gracefully from container failures without any intervention.
+Ensure that NGF can recover gracefully from container failures without any user intervention.
 
 ## Cluster Details
 

From 79a0e699811decc5aa503a9883e5b36bdc4c1e22 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 15:10:14 -0700
Subject: [PATCH 05/16] Adjust naming of sections

---
 tests/graceful-recovery/graceful-recovery.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 918cc365cf..533a42590e 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -31,7 +31,9 @@ to deploy NGINX Gateway Fabric.
 if the configuration and version were correctly updated.
 12. Send traffic through the example application and ensure it is working correctly.
 
-## Testing when nginx-gateway container restarts
+## Tests
+
+### Restart nginx-gateway container
 
 1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
 2. Insert ephemeral container in NGF Pod and kill the nginx-gateway process.
@@ -53,7 +55,7 @@ if the configuration and version were correctly updated.
     `config-version.conf` were correctly updated.
     6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
-## Testing when nginx container restarts
+### Restart NGINX container
 
 1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
 2. Insert ephemeral container in NGF Pod and kill the nginx-master process.
@@ -65,7 +67,7 @@ if the configuration and version were correctly updated.
 4. Open up the nginx-container logs and check for errors.
 5. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
 
-## Testing when the NGF Pod restarts through node shutdown with cleaning up of resources
+### Restart Node with draining
 
 1. Switch over to a one-node Kind cluster. Can run `make create-kind-cluster` from main directory.
 2. Run steps 4-12 of the Setup section above using [this guide]
@@ -87,6 +89,6 @@ if the configuration and version were correctly updated.
     `config-version.conf` were correctly updated.
     6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
-## Testing when the NGF Pod restarts through node shutdown without cleaning up of resources
+### Restart Node without draining
 
 1. Repeat the above test but remove steps 4-5 which include draining and deleting the node.

From 82ae192f213265ccf362a10f0cbf8770662685c5 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 15:14:18 -0700
Subject: [PATCH 06/16] Change nginx to NGINX and adjust test naming

---
 tests/graceful-recovery/graceful-recovery.md  | 38 +++++++++----------
 .../results/graceful-recover-results.md       | 10 ++---
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 533a42590e..f576b487b6 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -19,15 +19,15 @@ Ensure that NGF can recover gracefully from container failures without any user
 5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
 to deploy NGINX Gateway Fabric.
 6. In a separate terminal track NGF logs by running `kubectl -n nginx-gateway logs -f deploy/nginx-gateway`
-7. In a separate terminal track nginx container logs by running
+7. In a separate terminal track NGINX container logs by running
 `kubectl -n nginx-gateway logs -f -c nginx`
-8. Exec into the nginx container inside of the NGF pod by running
+8. Exec into the NGINX container inside of the NGF pod by running
 `kubectl exec -it -n nginx-gateway --container nginx -- bin/sh`
-9. Inside the nginx container, navigate to `/etc/nginx/conf.d` and ensure that
+9. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and ensure that
 `http.conf` and `config-version.conf` look correct.
 10. In a different terminal, deploy the
 [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
-11. Inside the nginx container, check `http.conf` and `config-version.config` to see
+11. Inside the NGINX container, check `http.conf` and `config-version.config` to see
 if the configuration and version were correctly updated.
 12. Send traffic through the example application and ensure it is working correctly.
 
@@ -35,57 +35,57 @@ if the configuration and version were correctly updated.
 
 ### Restart nginx-gateway container
 
-1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
 2. Insert ephemeral container in NGF Pod and kill the nginx-gateway process.
     1. `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway`
     2. run `ps -A`
     3. run `kill ` (Command should start with `/usr/bin/gateway`)
-3. Check for errors in the NGF and nginx-container logs.
+3. Check for errors in the NGF and NGINX container logs.
 4. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
-5. Open up the NGF and nginx container logs and check for errors.
-6. Inside the nginx container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
+5. Open up the NGF and NGINX container logs and check for errors.
+6. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
 7. Send traffic through the example application and ensure it is working correctly.
 8. Check that NGF can still update statuses of resources.
     1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
-    2. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    2. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.
     3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
     4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination`
-    5. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    5. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.
     6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
 ### Restart NGINX container
 
-1. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
 2. Insert ephemeral container in NGF Pod and kill the nginx-master process.
     1. If there isn't already an ephemeral container inserted, run:
     `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway`
     2. run `ps -A`
     3. run `kill ` (Command should start with `nginx: master process`)
-3. When nginx container is back up, ensure traffic flows through the example application correctly.
-4. Open up the nginx-container logs and check for errors.
-5. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
+3. When NGINX container is back up, ensure traffic flows through the example application correctly.
+4. Open up the NGINX container logs and check for errors.
+5. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
 
 ### Restart Node with draining
 
 1. Switch over to a one-node Kind cluster. Can run `make create-kind-cluster` from main directory.
 2. Run steps 4-12 of the Setup section above using [this guide]
 (https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind.
-3. Ensure NGF and nginx container logs are set up and traffic flows through the example application correctly.
+3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
 4. Drain the node of its resources by running `kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data`
 5. Delete the node by running `kubectl delete node kind-control-plane`
 6. Restart the docker container by running `docker restart kind-control-plane`
-7. Open up both NGF and nginx-container logs and check for errors.
-8. Exec back into the nginx container and check that `http.conf` and `config-version.conf` were not changed.
+7. Open up both NGF and NGINX container logs and check for errors.
+8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
 9. Send traffic through the example application and ensure it is working correctly.
 10. Check that NGF can still update statuses of resources.
     1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
-    2. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    2. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.
     3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
     4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination`
-    5. Inside the terminal which is inside the nginx container, check that `http.conf` and
+    5. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.
     6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
 
diff --git a/tests/graceful-recovery/results/graceful-recover-results.md b/tests/graceful-recovery/results/graceful-recover-results.md
index 50370d6385..b619bfb4bd 100644
--- a/tests/graceful-recovery/results/graceful-recover-results.md
+++ b/tests/graceful-recovery/results/graceful-recover-results.md
@@ -1,15 +1,15 @@
 # Test Results
 
-## Testing when nginx-gateway container restarts
+## Restart nginx-gateway container
 Passes test with no errors.
 
-## Testing when nginx container restarts
+## Restart NGINX container
 Passes test with no errors.
 
-## Testing when the NGF Pod restarts through node shutdown with cleaning up of resources
+## Restart Node with draining
 Passes test with no errors.
 
-## Testing when the NGF Pod restarts through node shutdown without cleaning up of resources
+## Restart Node without draining
 Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`.
-NGF Pod is not able to recover as the nginx container logs show this error:
+NGF Pod is not able to recover as the NGINX container logs show this error:
 `bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`.

From 3479d27338b75b11761fcf2577aa0a808f43d157 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 15:29:21 -0700
Subject: [PATCH 07/16] Add details on exposing NGF and change from bin/sh to sh

---
 tests/graceful-recovery/graceful-recovery.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index f576b487b6..2151c2991a 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -17,12 +17,12 @@ Ensure that NGF can recover gracefully from container failures without any user
 3. Check out the latest tag (unless you are installing the edge version from the main branch).
 4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
 5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
-to deploy NGINX Gateway Fabric.
+to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
 6. In a separate terminal track NGF logs by running `kubectl -n nginx-gateway logs -f deploy/nginx-gateway`
 7. In a separate terminal track NGINX container logs by running
 `kubectl -n nginx-gateway logs -f -c nginx`
 8. Exec into the NGINX container inside of the NGF pod by running
-`kubectl exec -it -n nginx-gateway --container nginx -- bin/sh`
+`kubectl exec -it -n nginx-gateway --container nginx -- sh`
 9. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and ensure that
 `http.conf` and `config-version.conf` look correct.
 10. In a different terminal, deploy the

From d7a55d8c58d333ca6a6c46be188f986646238a1d Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 16:14:26 -0700
Subject: [PATCH 08/16] Change naming of results and add version

---
 .../{graceful-recover-results.md => 1.0.0/results.md} | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
 rename tests/graceful-recovery/results/{graceful-recover-results.md => 1.0.0/results.md} (71%)

diff --git a/tests/graceful-recovery/results/graceful-recover-results.md b/tests/graceful-recovery/results/1.0.0/results.md
similarity index 71%
rename from tests/graceful-recovery/results/graceful-recover-results.md
rename to tests/graceful-recovery/results/1.0.0/results.md
index b619bfb4bd..bce21c70f1 100644
--- a/tests/graceful-recovery/results/graceful-recover-results.md
+++ b/tests/graceful-recovery/results/1.0.0/results.md
@@ -1,15 +1,17 @@
 # Test Results
 
-## Restart nginx-gateway container
+## Version 1.0.0
+
+### Restart nginx-gateway container
 Passes test with no errors.
 
-## Restart NGINX container
+### Restart NGINX container
 Passes test with no errors.
 
-## Restart Node with draining
+### Restart Node with draining
 Passes test with no errors.
 
-## Restart Node without draining
+### Restart Node without draining
 Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`.
 NGF Pod is not able to recover as the NGINX container logs show this error:
 `bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`.
From f0cfd6adbd7b3f830d2d7a404eefa1b0fa7ee210 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 16:29:22 -0700
Subject: [PATCH 09/16] Add filed issue to result

---
 tests/graceful-recovery/graceful-recovery.md     | 8 +++-----
 tests/graceful-recovery/results/1.0.0/results.md | 2 ++
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 2151c2991a..3bec27ae90 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -23,13 +23,11 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan
 `kubectl -n nginx-gateway logs -f -c nginx`
 8. Exec into the NGINX container inside of the NGF pod by running
 `kubectl exec -it -n nginx-gateway --container nginx -- sh`
-9. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and ensure that
-`http.conf` and `config-version.conf` look correct.
-10. In a different terminal, deploy the
+9. In a different terminal, deploy the
 [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
-11. Inside the NGINX container, check `http.conf` and `config-version.config` to see
+10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.config` to see
 if the configuration and version were correctly updated.
-12. Send traffic through the example application and ensure it is working correctly.
+11. Send traffic through the example application and ensure it is working correctly.
 
 ## Tests
 
diff --git a/tests/graceful-recovery/results/1.0.0/results.md b/tests/graceful-recovery/results/1.0.0/results.md
index bce21c70f1..a478144668 100644
--- a/tests/graceful-recovery/results/1.0.0/results.md
+++ b/tests/graceful-recovery/results/1.0.0/results.md
@@ -15,3 +15,5 @@ Passes test with no errors.
 Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`.
 NGF Pod is not able to recover as the NGINX container logs show this error:
 `bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`.
+
+Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108

From 023bac39c29d623b9ec40adab65e9a8037279256 Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Fri, 6 Oct 2023 16:52:23 -0700
Subject: [PATCH 10/16] Add explanation for runAsNonRoot adjustment and change wording

---
 tests/graceful-recovery/graceful-recovery.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 3bec27ae90..228474a5cc 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -16,6 +16,7 @@ Ensure that NGF can recover gracefully from container failures without any user
 2. Clone the repo and change into the nginx-gateway-fabric directory.
 3. Check out the latest tag (unless you are installing the edge version from the main branch).
 4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
+This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container.
 5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
 to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
 6. In a separate terminal track NGF logs by running `kubectl -n nginx-gateway logs -f deploy/nginx-gateway`
@@ -43,7 +44,7 @@ if the configuration and version were correctly updated.
 5. Open up the NGF and NGINX container logs and check for errors.
 6. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
 7. Send traffic through the example application and ensure it is working correctly.
-8. Check that NGF can still update statuses of resources.
+8. Check that NGF can still process changes of resources.
     1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
     2. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.
@@ -77,7 +78,7 @@ if the configuration and version were correctly updated.
 7. Open up both NGF and NGINX container logs and check for errors.
 8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
 9. Send traffic through the example application and ensure it is working correctly.
-10. Check that NGF can still update statuses of resources.
+10. Check that NGF can still process changes of resources.
     1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination`
     2. Inside the terminal which is inside the NGINX container, check that `http.conf` and
     `config-version.conf` were correctly updated.

From b996d938396ab7956a3ccaf11338d660cd8ddd2d Mon Sep 17 00:00:00 2001
From: Benjamin Jee
Date: Tue, 10 Oct 2023 09:46:22 -0700
Subject: [PATCH 11/16] Fix commands to run from curr, refactor killing nginx container, rest is WIP

---
 tests/graceful-recovery/graceful-recovery.md | 85 ++++++++++++++------
 1 file changed, 61 insertions(+), 24 deletions(-)

diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md
index 228474a5cc..641ea510e5 100644
--- a/tests/graceful-recovery/graceful-recovery.md
+++ b/tests/graceful-recovery/graceful-recovery.md
@@ -1,5 +1,19 @@
 # Graceful recovery from restarts
 
+This document describes how we test graceful recovery from restarts on NGF.
+ + +- [Graceful recovery from restarts](#graceful-recovery-from-restarts) + - [Goal](#goal) + - [Cluster Details](#cluster-details) + - [Setup](#setup) + - [Tests](#tests) + - [Restart nginx-gateway container](#restart-nginx-gateway-container) + - [Restart NGINX container](#restart-nginx-container) + - [Restart Node with draining](#restart-node-with-draining) + - [Restart Node without draining](#restart-node-without-draining) + + ## Goal Ensure that NGF can recover gracefully from container failures without any user intervention. @@ -19,11 +33,24 @@ Ensure that NGF can recover gracefully from container failures without any user This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container. 5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md) to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service. -6. In a separate terminal track NGF logs by running `kubectl -n nginx-gateway logs -f deploy/nginx-gateway` +6. In a separate terminal track NGF logs by running + + ```console + kubectl -n nginx-gateway logs -f deploy/nginx-gateway + ``` + 7. In a separate terminal track NGINX container logs by running -`kubectl -n nginx-gateway logs -f -c nginx` + + ```console + kubectl -n nginx-gateway logs -f -c nginx + ``` + 8. Exec into the NGINX container inside of the NGF pod by running -`kubectl exec -it -n nginx-gateway --container nginx -- sh` + + ```console + kubectl exec -it -n nginx-gateway --container nginx -- sh + ``` + 9. In a different terminal, deploy the [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination). 10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.config` to see @@ -36,20 +63,28 @@ if the configuration and version were correctly updated. 1. 
Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. 2. Insert ephemeral container in NGF Pod and kill the nginx-gateway process. - 1. `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway` - 2. run `ps -A` - 3. run `kill ` (Command should start with `/usr/bin/gateway`) -3. Check for errors in the NGF and NGINX container logs. -4. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly. -5. Open up the NGF and NGINX container logs and check for errors. -6. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`. -7. Send traffic through the example application and ensure it is working correctly. -8. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination` + + ```console + kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway + ``` + +3. Kill nginx-gateway process (Command should start with `/usr/bin/gateway`) + + ```console + kill + ``` + +4. Check for errors in the NGF and NGINX container logs. +5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly. +6. Open up the NGF and NGINX container logs and check for errors. +7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`. +8. Send traffic through the example application and ensure it is working correctly. +9. Check that NGF can still process changes of resources. + 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. 
Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination` + 4. Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. @@ -57,14 +92,16 @@ if the configuration and version were correctly updated. ### Restart NGINX container 1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. -2. Insert ephemeral container in NGF Pod and kill the nginx-master process. - 1. If there isn't already an ephemeral container inserted, run: - `kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway` - 2. run `ps -A` - 3. run `kill ` (Command should start with `nginx: master process`) -3. When NGINX container is back up, ensure traffic flows through the example application correctly. -4. Open up the NGINX container logs and check for errors. -5. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. +2. If the terminal inside the NGINX container is no longer running, Exec back into the NGINX container. +3. Inside the NGINX container, kill the nginx-master process (Command should start with `nginx: master process`). + + ```console + kill + ``` + +4. When NGINX container is back up, ensure traffic flows through the example application correctly. +5. Open up the NGINX container logs and check for errors. +6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 
### Restart Node with draining @@ -79,11 +116,11 @@ if the configuration and version were correctly updated. 8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 9. Send traffic through the example application and ensure it is working correctly. 10. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources by running `kubectl delete -f cafe-routes.yaml` in `/examples/https-termination` + 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources by running `kubectl apply -f cafe-routes.yaml` in `/examples/https-termination` + 4. Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. From fd720c882c8d58eb61d8bfb02490cc9722d1d45b Mon Sep 17 00:00:00 2001 From: Benjamin Jee Date: Tue, 10 Oct 2023 14:18:47 -0700 Subject: [PATCH 12/16] Add versioning and use SIGKILL --- tests/graceful-recovery/graceful-recovery.md | 19 ++++++++--- .../results/1.0.0/results.md | 33 +++++++++++++++++-- 2 files changed, 45 insertions(+), 7 deletions(-) diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md index 641ea510e5..180d54dea7 100644 --- a/tests/graceful-recovery/graceful-recovery.md +++ b/tests/graceful-recovery/graceful-recovery.md @@ -62,16 +62,16 @@ if the configuration and version were correctly updated. 
### Restart nginx-gateway container 1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. -2. Insert ephemeral container in NGF Pod and kill the nginx-gateway process. +2. Insert ephemeral container in NGF Pod ```console kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway ``` -3. Kill nginx-gateway process (Command should start with `/usr/bin/gateway`) +3. Kill nginx-gateway process through SIGKILL (Command should start with `/usr/bin/gateway`) ```console - kill + kill -9 ``` 4. Check for errors in the NGF and NGINX container logs. @@ -93,15 +93,24 @@ if the configuration and version were correctly updated. 1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. 2. If the terminal inside the NGINX container is no longer running, Exec back into the NGINX container. -3. Inside the NGINX container, kill the nginx-master process (Command should start with `nginx: master process`). +3. Inside the NGINX container, kill the nginx-master process through SIGKILL (Command should start with `nginx: master process`). ```console - kill + kill -9 ``` 4. When NGINX container is back up, ensure traffic flows through the example application correctly. 5. Open up the NGINX container logs and check for errors. 6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. +7. Check that NGF can still process changes of resources. + 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` + 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and + `config-version.conf` were correctly updated. + 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. + 4. 
Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` + 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and + `config-version.conf` were correctly updated. + 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. ### Restart Node with draining diff --git a/tests/graceful-recovery/results/1.0.0/results.md b/tests/graceful-recovery/results/1.0.0/results.md index a478144668..3ecf7dd856 100644 --- a/tests/graceful-recovery/results/1.0.0/results.md +++ b/tests/graceful-recovery/results/1.0.0/results.md @@ -1,6 +1,35 @@ -# Test Results +# Results for v1.0.0 + +## Versions + +NGF version: + +```text +commit: 72b6c6ef8915c697626eeab88fdb6a3ce15b8da0 +date: 2023-10-02T13:13:08Z +version: edge +``` + +with NGINX: + +```text +nginx/1.25.2 +built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10) +OS: Linux 5.15.49-linuxkit-pr +``` + + +Kubernetes: + +```text +Server Version: version.Info{Major:"1", Minor:"28", +GitVersion:"v1.28.0", +GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d", +GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z", +GoVersion:"go1.20.7", Compiler:"gc", +Platform:"linux/arm64"} +``` -## Version 1.0.0 ### Restart nginx-gateway container Passes test with no errors. From d87c28ea2e0aaa52b2b1698e5de3f4df9f14d85c Mon Sep 17 00:00:00 2001 From: Benjamin Jee Date: Tue, 10 Oct 2023 14:31:51 -0700 Subject: [PATCH 13/16] Refactor test to use code block --- tests/graceful-recovery/graceful-recovery.md | 83 +++++++++++++++----- 1 file changed, 64 insertions(+), 19 deletions(-) diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md index 180d54dea7..ce73a7600a 100644 --- a/tests/graceful-recovery/graceful-recovery.md +++ b/tests/graceful-recovery/graceful-recovery.md @@ -80,11 +80,21 @@ if the configuration and version were correctly updated. 7. 
Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`. 8. Send traffic through the example application and ensure it is working correctly. 9. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` + 1. Delete the HTTPRoute resources + + ```console + kubectl delete -f ../../examples/https-termination/cafe-routes.yaml + ``` + 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` + 4. Apply the HTTPRoute resources + + ```console + kubectl apply -f ../../examples/https-termination/cafe-routes.yaml + ``` + 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. @@ -103,37 +113,72 @@ if the configuration and version were correctly updated. 5. Open up the NGINX container logs and check for errors. 6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 7. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` - 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. - 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. 
Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` - 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. - 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. + 1. Delete the HTTPRoute resources + + ```console + kubectl delete -f ../../examples/https-termination/cafe-routes.yaml + ``` + + 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and + `config-version.conf` were correctly updated. + 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. + 4. Apply the HTTPRoute resources + + ```console + kubectl apply -f ../../examples/https-termination/cafe-routes.yaml + ``` + + 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and + `config-version.conf` were correctly updated. + 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. ### Restart Node with draining -1. Switch over to a one-node Kind cluster. Can run `make create-kind-cluster` from main directory. +1. Switch over to a one-Node Kind cluster. Can run `make create-kind-cluster` from main directory. 2. Run steps 4-12 of the Setup section above using [this guide] (https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind. 3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. -4. Drain the node of its resources by running `kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data` -5. Delete the node by running `kubectl delete node kind-control-plane` -6. Restart the docker container by running `docker restart kind-control-plane` +4. 
Drain the Node of its resources + + ```console + kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data + ``` + +5. Delete the Node + + ```console + kubectl delete node kind-control-plane + ``` + +6. Restart the Docker container + + ```console + docker restart kind-control-plane + ``` + 7. Open up both NGF and NGINX container logs and check for errors. 8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 9. Send traffic through the example application and ensure it is working correctly. 10. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources by running `kubectl delete -f ../../examples/https-termination/cafe-routes.yaml` + 1. Delete the HTTPRoute resources + + ```console + kubectl delete -f ../../examples/https-termination/cafe-routes.yaml + ``` + 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources by running `kubectl apply -f ../../examples/https-termination/cafe-routes.yaml` + 4. Apply the HTTPRoute resources + + ```console + kubectl apply -f ../../examples/https-termination/cafe-routes.yaml + ``` + 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. ### Restart Node without draining -1. Repeat the above test but remove steps 4-5 which include draining and deleting the node. +1. Repeat the above test but remove steps 4-5 which include draining and deleting the Node. 
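The kill steps in the tests above identify the target by the start of its command line (`/usr/bin/gateway` or `nginx: master process`) in the `ps -A` listing. As a minimal sketch of that lookup, the helper below prints the PID of the first process whose command starts with a given prefix. The function name `find_pid_by_command` is hypothetical and not part of the NGF repo, and it assumes the `ps` output has a header row with `PID` and `COMMAND` (or `CMD`) columns, which may differ between `ps` implementations.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not part of the NGF repo).
# Reads `ps` output on stdin and prints the PID of the first process whose
# command line starts with the prefix given as the first argument.
# Assumes a header row naming a PID column and a COMMAND (or CMD) column.
find_pid_by_command() {
  awk -v prefix="$1" '
    NR == 1 {
      # Work out which whitespace-separated fields hold the PID and the command.
      for (i = 1; i <= NF; i++) {
        if ($i == "PID") pid_field = i
        if ($i == "COMMAND" || $i == "CMD") cmd_field = i
      }
      next
    }
    pid_field > 0 && cmd_field > 0 {
      # Strip the fields before the command so that spaces inside the
      # command itself (e.g. "nginx: master process") are preserved.
      line = $0
      for (i = 1; i < cmd_field; i++) sub(/^[ \t]*[^ \t]+/, "", line)
      sub(/^[ \t]+/, "", line)
      if (index(line, prefix) == 1) {
        print $pid_field
        exit
      }
    }
  '
}
```

Inside the ephemeral or NGINX container, this could be combined with the existing steps as `ps -A | find_pid_by_command '/usr/bin/gateway'`, checking that the printed PID is non-empty before passing it to `kill -9`.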
From 4eae98ea0419cadb4bacce5383dc7f8f2e1c99f8 Mon Sep 17 00:00:00 2001 From: Benjamin Jee Date: Tue, 10 Oct 2023 15:41:16 -0700 Subject: [PATCH 14/16] Align test document to match others --- tests/graceful-recovery/graceful-recovery.md | 86 +++++++++---------- .../results/1.0.0/{results.md => 1.0.0.md} | 0 2 files changed, 42 insertions(+), 44 deletions(-) rename tests/graceful-recovery/results/1.0.0/{results.md => 1.0.0.md} (100%) diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md index ce73a7600a..aec04c588c 100644 --- a/tests/graceful-recovery/graceful-recovery.md +++ b/tests/graceful-recovery/graceful-recovery.md @@ -5,26 +5,29 @@ This document describes how we test graceful recovery from restarts on NGF. - [Graceful recovery from restarts](#graceful-recovery-from-restarts) - [Goal](#goal) - - [Cluster Details](#cluster-details) - - [Setup](#setup) - - [Tests](#tests) - - [Restart nginx-gateway container](#restart-nginx-gateway-container) - - [Restart NGINX container](#restart-nginx-container) - - [Restart Node with draining](#restart-node-with-draining) - - [Restart Node without draining](#restart-node-without-draining) + - [Test Environment](#test-environment) + - [Steps](#steps) + - [Setup](#setup) + - [Run the tests](#run-the-tests) + - [Restart nginx-gateway container](#restart-nginx-gateway-container) + - [Restart NGINX container](#restart-nginx-container) + - [Restart Node with draining](#restart-node-with-draining) + - [Restart Node without draining](#restart-node-without-draining) ## Goal + Ensure that NGF can recover gracefully from container failures without any user intervention. -## Cluster Details +## Test Environment + +- A Kubernetes cluster with 3 nodes on GKE + - Node: e2-medium (2 vCPU, 4GB memory) +- A Kind cluster -- GKE 1.27.3-gke.100 -- us-central1-c -- Machine type of node is e2-medium -- 3 nodes +## Steps -## Setup +### Setup 1. Setup GKE Cluster. 2. 
Clone the repo and change into the nginx-gateway-fabric directory. @@ -57,18 +60,18 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan if the configuration and version were correctly updated. 11. Send traffic through the example application and ensure it is working correctly. -## Tests +### Run the tests -### Restart nginx-gateway container +#### Restart nginx-gateway container 1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. -2. Insert ephemeral container in NGF Pod +2. Insert ephemeral container in NGF Pod. ```console kubectl debug -it -n nginx-gateway --image=busybox:1.28 --target=nginx-gateway ``` -3. Kill nginx-gateway process through SIGKILL (Command should start with `/usr/bin/gateway`) +3. Kill nginx-gateway process through a SIGKILL signal (Process command should start with `/usr/bin/gateway`). ```console kill -9 @@ -80,30 +83,29 @@ if the configuration and version were correctly updated. 7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`. 8. Send traffic through the example application and ensure it is working correctly. 9. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources + 1. Delete the HTTPRoute resources. ```console kubectl delete -f ../../examples/https-termination/cafe-routes.yaml ``` - 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources + 4. Apply the HTTPRoute resources. ```console kubectl apply -f ../../examples/https-termination/cafe-routes.yaml ``` - 5. 
Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. -### Restart NGINX container +#### Restart NGINX container 1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. 2. If the terminal inside the NGINX container is no longer running, Exec back into the NGINX container. -3. Inside the NGINX container, kill the nginx-master process through SIGKILL (Command should start with `nginx: master process`). +3. Inside the NGINX container, kill the nginx-master process through a SIGKILL signal +(Process command should start with `nginx: master process`). ```console kill -9 @@ -113,44 +115,42 @@ if the configuration and version were correctly updated. 5. Open up the NGINX container logs and check for errors. 6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 7. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources + 1. Delete the HTTPRoute resources. ```console kubectl delete -f ../../examples/https-termination/cafe-routes.yaml ``` - 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources + 4. Apply the HTTPRoute resources. ```console kubectl apply -f ../../examples/https-termination/cafe-routes.yaml ``` - 5. 
Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. -### Restart Node with draining +#### Restart Node with draining 1. Switch over to a one-Node Kind cluster. Can run `make create-kind-cluster` from main directory. -2. Run steps 4-12 of the Setup section above using [this guide] -(https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind. +2. Run steps 4-11 of the [Setup](#setup) section above using +[this guide](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind. 3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly. -4. Drain the Node of its resources +4. Drain the Node of its resources. ```console kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data ``` -5. Delete the Node +5. Delete the Node. ```console kubectl delete node kind-control-plane ``` -6. Restart the Docker container +6. Restart the Docker container. ```console docker restart kind-control-plane @@ -160,25 +160,23 @@ if the configuration and version were correctly updated. 8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed. 9. Send traffic through the example application and ensure it is working correctly. 10. Check that NGF can still process changes of resources. - 1. Delete the HTTPRoute resources + 1. Delete the HTTPRoute resources. ```console kubectl delete -f ../../examples/https-termination/cafe-routes.yaml ``` - 2. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 2. 
Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 3. Send traffic through the example application using the updated resources and ensure traffic does not flow. - 4. Apply the HTTPRoute resources + 4. Apply the HTTPRoute resources. ```console kubectl apply -f ../../examples/https-termination/cafe-routes.yaml ``` - 5. Inside the terminal which is inside the NGINX container, check that `http.conf` and - `config-version.conf` were correctly updated. + 5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated. 6. Send traffic through the example application using the updated resources and ensure traffic flows correctly. -### Restart Node without draining +#### Restart Node without draining 1. Repeat the above test but remove steps 4-5 which include draining and deleting the Node. diff --git a/tests/graceful-recovery/results/1.0.0/results.md b/tests/graceful-recovery/results/1.0.0/1.0.0.md similarity index 100% rename from tests/graceful-recovery/results/1.0.0/results.md rename to tests/graceful-recovery/results/1.0.0/1.0.0.md From 8e0c639a78aa6e9757143a81958e606e277a0f51 Mon Sep 17 00:00:00 2001 From: Benjamin Jee Date: Tue, 10 Oct 2023 16:25:19 -0700 Subject: [PATCH 15/16] Add new results and refactor results styling --- .../graceful-recovery/results/1.0.0/1.0.0.md | 104 +++++++++++++++++- 1 file changed, 99 insertions(+), 5 deletions(-) diff --git a/tests/graceful-recovery/results/1.0.0/1.0.0.md b/tests/graceful-recovery/results/1.0.0/1.0.0.md index 3ecf7dd856..2fe3791e65 100644 --- a/tests/graceful-recovery/results/1.0.0/1.0.0.md +++ b/tests/graceful-recovery/results/1.0.0/1.0.0.md @@ -1,5 +1,17 @@ # Results for v1.0.0 + +- [Results for v1.0.0](#results-for-v100) + - [Versions](#versions) + - [Tests](#tests) + - [Restart nginx-gateway container](#restart-nginx-gateway-container) + - [Restart NGINX container](#restart-nginx-container) + - [Restart Node with 
draining](#restart-node-with-draining) + - [Restart Node without draining](#restart-node-without-draining) + - [Future Improvements](#future-improvements) + + + ## Versions NGF version: @@ -30,19 +42,101 @@ GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/arm64"} ``` +## Tests ### Restart nginx-gateway container Passes test with no errors. ### Restart NGINX container -Passes test with no errors. +The NGF Pod was unable to recover after sending a SIGKILL signal to the NGINX master process. +The following appeared in the NGINX logs: + +```text +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use) 
+2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use) +2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms +2023/10/10 22:46:54 [emerg] 141#141: still could not bind() +``` + +Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108 + ### Restart Node with draining Passes test with no errors. ### Restart Node without draining -Does not work correctly the majority of times and errors after running `docker restart kind-control-plane`. -NGF Pod is not able to recover as the NGINX container logs show this error: -`bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)`. +The NGF Pod was unable to recover the majority of times after running `docker restart kind-control-plane`. 
-Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108 +The following appeared in the NGINX logs: + +```text +2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms +2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms +2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms +2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms +2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) +2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms +2023/10/10 22:57:05 [emerg] 140#140: still could not bind() +``` + +The following appeared in the NGF logs: + +```text +{"level":"info","ts":"2023-10-10T22:57:05Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"b3fbf98d906f60ce66d70d7a2373c4b12b7d5606","date":"2023-10-10T22:02:06Z"} +Error: failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused +Usage: + gateway static-mode [flags] + +Flags: + -c, --config string The name of the NginxGateway resource to be used for this controller's dynamic configuration. Lives in the same Namespace as the controller. (default "") + --gateway string The namespaced name of the Gateway resource to use. Must be of the form: NAMESPACE/NAME. 
If not specified, the control plane will process all Gateways for the configured GatewayClass. However, among them, it will choose the oldest resource by creation timestamp. If the timestamps are equal, it will choose the resource that appears first in alphabetical order by {namespace}/{name}. + --health-disable Disable running the health probe server. + --health-port int Set the port where the health probe server is exposed. Format: [1024 - 65535] (default 8081) + -h, --help help for static-mode + --leader-election-disable Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources. + --leader-election-lock-name string The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock") + --metrics-disable Disable exposing metrics in the Prometheus format. + --metrics-port int Set the port where the metrics are exposed. Format: [1024 - 65535] (default 9113) + --metrics-secure-serving Enable serving metrics via https. By default metrics are served via http. Please note that this endpoint will be secured with a self-signed certificate. + --update-gatewayclass-status Update the status of the GatewayClass resource. (default true) + +Global Flags: + --gateway-ctlr-name string The name of the Gateway controller. The controller name must be of the form: DOMAIN/PATH. The controller's domain is 'gateway.nginx.org' (default "") + --gatewayclass string The name of the GatewayClass resource. Every NGINX Gateway Fabric must have a unique corresponding GatewayClass resource. 
(default "") + +failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused +``` + +Note that the test occasionally passes, with the NGF Pod recovering gracefully. + +Related to this issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108 + +## Future Improvements + +- None From 95f81573dfd03eeac5d8910fa842ba73e0b32a72 Mon Sep 17 00:00:00 2001 From: Benjamin Jee Date: Tue, 10 Oct 2023 16:28:53 -0700 Subject: [PATCH 16/16] Add small fixes --- tests/graceful-recovery/graceful-recovery.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tests/graceful-recovery/graceful-recovery.md b/tests/graceful-recovery/graceful-recovery.md index aec04c588c..b99ad303d1 100644 --- a/tests/graceful-recovery/graceful-recovery.md +++ b/tests/graceful-recovery/graceful-recovery.md @@ -36,19 +36,19 @@ Ensure that NGF can recover gracefully from container failures without any user This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container. 5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md) to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service. -6. In a separate terminal track NGF logs by running +6. In a separate terminal, track NGF logs. ```console kubectl -n nginx-gateway logs -f deploy/nginx-gateway ``` -7. In a separate terminal track NGINX container logs by running +7. In a separate terminal, track NGINX container logs. ```console kubectl -n nginx-gateway logs -f -c nginx ``` -8. Exec into the NGINX container inside of the NGF pod by running +8. In a separate terminal, exec into the NGINX container inside the NGF Pod.
```console kubectl exec -it -n nginx-gateway --container nginx -- sh