Commit d501d59
Add Graceful Recovery Baseline Test (#1111)
Problem: We want a test that checks how well NGF recovers from container failures.

Solution: Added manual tests that cover restarting the nginx-gateway container, restarting the NGINX container, restarting a node after draining it, and restarting a node without draining it.
# Graceful recovery from restarts

This document describes how we test graceful recovery from restarts on NGF.

<!-- TOC -->
- [Graceful recovery from restarts](#graceful-recovery-from-restarts)
  - [Goal](#goal)
  - [Test Environment](#test-environment)
  - [Steps](#steps)
    - [Setup](#setup)
    - [Run the tests](#run-the-tests)
      - [Restart nginx-gateway container](#restart-nginx-gateway-container)
      - [Restart NGINX container](#restart-nginx-container)
      - [Restart Node with draining](#restart-node-with-draining)
      - [Restart Node without draining](#restart-node-without-draining)
<!-- TOC -->

## Goal

Ensure that NGF can recover gracefully from container failures without any user intervention.

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
  - Node: e2-medium (2 vCPU, 4GB memory)
- A Kind cluster

## Steps

### Setup
1. Set up a GKE cluster.
2. Clone the repo and change into the nginx-gateway-fabric directory.
3. Check out the latest tag (unless you are installing the edge version from the main branch).
4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
    This allows us to insert our ephemeral container as root, which enables us to restart the nginx-gateway container.
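    For example, on Linux the field can be flipped with `sed` (editing the manifest by hand works just as well):

    ```console
    sed -i 's/runAsNonRoot: true/runAsNonRoot: false/' deploy/manifests/nginx-gateway.yaml
    ```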
5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
    to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
6. In a separate terminal, track the NGF logs.

    ```console
    kubectl -n nginx-gateway logs -f deploy/nginx-gateway
    ```

7. In a separate terminal, track the NGINX container logs.

    ```console
    kubectl -n nginx-gateway logs -f <NGF_POD> -c nginx
    ```

8. In a separate terminal, exec into the NGINX container inside the NGF Pod.

    ```console
    kubectl exec -it -n nginx-gateway <NGF_POD> --container nginx -- sh
    ```
9. In a different terminal, deploy the
    [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.conf` to see
    if the configuration and version were correctly updated.
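    For example, from the shell opened in step 8:

    ```console
    cat /etc/nginx/conf.d/http.conf
    cat /etc/nginx/conf.d/config-version.conf
    ```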
11. Send traffic through the example application and ensure it is working correctly.
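    As a rough sketch, assuming the hostnames and paths from the https-termination example (`cafe.example.com`, `/coffee`, `/tea`) and that `GW_IP` holds the external IP of the LoadBalancer Service:

    ```console
    curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/coffee --insecure
    curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/tea --insecure
    ```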
### Run the tests

#### Restart nginx-gateway container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. Insert an ephemeral container into the NGF Pod.

    ```console
    kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
    ```
3. Kill the nginx-gateway process with a SIGKILL signal (the process command should start with `/usr/bin/gateway`).

    ```console
    kill -9 <nginx-gateway_PID>
    ```
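    To find the PID, run `ps` from the busybox ephemeral container (which shares the nginx-gateway container's process namespace thanks to `--target`) and look for the entry whose command starts with `/usr/bin/gateway`:

    ```console
    ps
    ```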
4. Check for errors in the NGF and NGINX container logs.
5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
6. Open up the NGF and NGINX container logs and check for errors.
7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
8. Send traffic through the example application and ensure it is working correctly.
9. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart NGINX container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. If the terminal inside the NGINX container is no longer running, exec back into the NGINX container.
3. Inside the NGINX container, kill the nginx master process with a SIGKILL signal
    (the process command should start with `nginx: master process`).

    ```console
    kill -9 <nginx-master_PID>
    ```
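    One way to find the PID, assuming the busybox-style `ps` and `grep` available in the Alpine-based NGINX image, is:

    ```console
    ps | grep "nginx: master"
    ```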
4. When the NGINX container is back up, ensure traffic flows through the example application correctly.
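    One way to confirm that the container restarted is to check the Pod's restart count:

    ```console
    kubectl -n nginx-gateway get pods
    ```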
5. Open up the NGINX container logs and check for errors.
6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
7. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart Node with draining

1. Switch over to a one-Node Kind cluster. You can run `make create-kind-cluster` from the main directory.
2. Run steps 4-11 of the [Setup](#setup) section above, using
    [this guide](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind.
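    Note that a Kind cluster has no LoadBalancer out of the box; one common option is to port-forward to the NGF Pod instead (the ports here are illustrative):

    ```console
    kubectl -n nginx-gateway port-forward <NGF_POD> 8080:80 8443:443
    ```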
3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
4. Drain the Node of its resources.

    ```console
    kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data
    ```

5. Delete the Node.

    ```console
    kubectl delete node kind-control-plane
    ```

6. Restart the Docker container.

    ```console
    docker restart kind-control-plane
    ```
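    Before moving on, wait for the Node to report Ready again, for example:

    ```console
    kubectl wait --for=condition=Ready node/kind-control-plane --timeout=120s
    ```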
7. Open up both NGF and NGINX container logs and check for errors.
8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
9. Send traffic through the example application and ensure it is working correctly.
10. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart Node without draining

1. Repeat the above test, but skip steps 4-5 (draining and deleting the Node).
# Results for v1.0.0

<!-- TOC -->
- [Results for v1.0.0](#results-for-v100)
  - [Versions](#versions)
  - [Tests](#tests)
    - [Restart nginx-gateway container](#restart-nginx-gateway-container)
    - [Restart NGINX container](#restart-nginx-container)
    - [Restart Node with draining](#restart-node-with-draining)
    - [Restart Node without draining](#restart-node-without-draining)
  - [Future Improvements](#future-improvements)
<!-- TOC -->

## Versions
NGF version:

```text
commit: 72b6c6ef8915c697626eeab88fdb6a3ce15b8da0
date: 2023-10-02T13:13:08Z
version: edge
```

with NGINX:

```text
nginx/1.25.2
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.49-linuxkit-pr
```

Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"28",
GitVersion:"v1.28.0",
GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
GoVersion:"go1.20.7", Compiler:"gc",
Platform:"linux/arm64"}
```
## Tests

### Restart nginx-gateway container

The test passed with no errors.
### Restart NGINX container

The NGF Pod was unable to recover after a SIGKILL signal was sent to the NGINX master process.
The following appeared in the NGINX logs:

```text
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: still could not bind()
```
Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
### Restart Node with draining

The test passed with no errors.

### Restart Node without draining

The NGF Pod was unable to recover most of the time after running `docker restart kind-control-plane`.

The following appeared in the NGINX logs:

```text
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: still could not bind()
```
The following appeared in the NGF logs:

```text
{"level":"info","ts":"2023-10-10T22:57:05Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"b3fbf98d906f60ce66d70d7a2373c4b12b7d5606","date":"2023-10-10T22:02:06Z"}
Error: failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
Usage:
gateway static-mode [flags]

Flags:
-c, --config string The name of the NginxGateway resource to be used for this controller's dynamic configuration. Lives in the same Namespace as the controller. (default "")
--gateway string The namespaced name of the Gateway resource to use. Must be of the form: NAMESPACE/NAME. If not specified, the control plane will process all Gateways for the configured GatewayClass. However, among them, it will choose the oldest resource by creation timestamp. If the timestamps are equal, it will choose the resource that appears first in alphabetical order by {namespace}/{name}.
--health-disable Disable running the health probe server.
--health-port int Set the port where the health probe server is exposed. Format: [1024 - 65535] (default 8081)
-h, --help help for static-mode
--leader-election-disable Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources.
--leader-election-lock-name string The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock")
--metrics-disable Disable exposing metrics in the Prometheus format.
--metrics-port int Set the port where the metrics are exposed. Format: [1024 - 65535] (default 9113)
--metrics-secure-serving Enable serving metrics via https. By default metrics are served via http. Please note that this endpoint will be secured with a self-signed certificate.
--update-gatewayclass-status Update the status of the GatewayClass resource. (default true)

Global Flags:
--gateway-ctlr-name string The name of the Gateway controller. The controller name must be of the form: DOMAIN/PATH. The controller's domain is 'gateway.nginx.org' (default "")
--gatewayclass string The name of the GatewayClass resource. Every NGINX Gateway Fabric must have a unique corresponding GatewayClass resource. (default "")

failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
```
Note that occasionally the test passes and the NGF Pod recovers gracefully.

Related to this issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108

## Future Improvements

- None
