Commit d501d59
Add Graceful Recovery Baseline Test (#1111)
Problem: We want a test that checks how well NGF recovers from container failures.

Solution: Added manual tests that cover restarting the nginx-gateway container, restarting the NGINX container, restarting a node after draining it, and restarting a node without draining it.
# Graceful recovery from restarts

This document describes how we test graceful recovery from restarts on NGF.

<!-- TOC -->
- [Graceful recovery from restarts](#graceful-recovery-from-restarts)
  - [Goal](#goal)
  - [Test Environment](#test-environment)
  - [Steps](#steps)
    - [Setup](#setup)
    - [Run the tests](#run-the-tests)
      - [Restart nginx-gateway container](#restart-nginx-gateway-container)
      - [Restart NGINX container](#restart-nginx-container)
      - [Restart Node with draining](#restart-node-with-draining)
      - [Restart Node without draining](#restart-node-without-draining)
<!-- TOC -->

## Goal

Ensure that NGF can recover gracefully from container failures without any user intervention.

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
  - Node: e2-medium (2 vCPU, 4GB memory)
- A Kind cluster

## Steps

### Setup
1. Set up a GKE cluster.
2. Clone the repo and change into the nginx-gateway-fabric directory.
3. Check out the latest tag (unless you are installing the edge version from the main branch).
4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
    This allows us to insert our ephemeral container as root, which enables us to restart the nginx-gateway container.
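    For example, on Linux the field can be flipped with `sed` (editing the manifest by hand works just as well):

    ```console
    sed -i 's/runAsNonRoot: true/runAsNonRoot: false/' deploy/manifests/nginx-gateway.yaml
    ```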
5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
    to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
6. In a separate terminal, track the NGF logs.

    ```console
    kubectl -n nginx-gateway logs -f deploy/nginx-gateway
    ```

7. In a separate terminal, track the NGINX container logs.

    ```console
    kubectl -n nginx-gateway logs -f <NGF_POD> -c nginx
    ```

8. In a separate terminal, exec into the NGINX container inside the NGF Pod.

    ```console
    kubectl exec -it -n nginx-gateway <NGF_POD> --container nginx -- sh
    ```
9. In a different terminal, deploy the
    [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.conf` to see
    if the configuration and version were correctly updated.
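    For example, from the shell opened in step 8:

    ```console
    cat /etc/nginx/conf.d/http.conf
    cat /etc/nginx/conf.d/config-version.conf
    ```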
11. Send traffic through the example application and ensure it is working correctly.
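    As a rough sketch, assuming the hostnames and paths from the https-termination example (`cafe.example.com`, `/coffee`, `/tea`) and that `GW_IP` holds the external IP of the LoadBalancer Service:

    ```console
    curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/coffee --insecure
    curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/tea --insecure
    ```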
### Run the tests

#### Restart nginx-gateway container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. Insert an ephemeral container into the NGF Pod.

    ```console
    kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
    ```
3. Kill the nginx-gateway process with a SIGKILL signal (the process command should start with `/usr/bin/gateway`).

    ```console
    kill -9 <nginx-gateway_PID>
    ```
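    To find the PID, run `ps` from the busybox ephemeral container (which shares the nginx-gateway container's process namespace thanks to `--target`) and look for the entry whose command starts with `/usr/bin/gateway`:

    ```console
    ps
    ```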
4. Check for errors in the NGF and NGINX container logs.
5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
6. Open up the NGF and NGINX container logs and check for errors.
7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
8. Send traffic through the example application and ensure it is working correctly.
9. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart NGINX container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. If the terminal inside the NGINX container is no longer running, exec back into the NGINX container.
3. Inside the NGINX container, kill the nginx master process with a SIGKILL signal
    (the process command should start with `nginx: master process`).

    ```console
    kill -9 <nginx-master_PID>
    ```
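    One way to find the PID, assuming the busybox-style `ps` and `grep` available in the Alpine-based NGINX image, is:

    ```console
    ps | grep "nginx: master"
    ```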
4. When the NGINX container is back up, ensure traffic flows through the example application correctly.
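    One way to confirm that the container restarted is to check the Pod's restart count:

    ```console
    kubectl -n nginx-gateway get pods
    ```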
5. Open up the NGINX container logs and check for errors.
6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
7. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart Node with draining

1. Switch over to a one-Node Kind cluster. You can run `make create-kind-cluster` from the main directory.
2. Run steps 4-11 of the [Setup](#setup) section above, using
    [this guide](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind.
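    Note that a Kind cluster has no LoadBalancer out of the box; one common option is to port-forward to the NGF Pod instead (the ports here are illustrative):

    ```console
    kubectl -n nginx-gateway port-forward <NGF_POD> 8080:80 8443:443
    ```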
3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
4. Drain the Node of its resources.

    ```console
    kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data
    ```

5. Delete the Node.

    ```console
    kubectl delete node kind-control-plane
    ```

6. Restart the Docker container.

    ```console
    docker restart kind-control-plane
    ```
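    Before moving on, wait for the Node to report Ready again, for example:

    ```console
    kubectl wait --for=condition=Ready node/kind-control-plane --timeout=120s
    ```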
7. Open up both NGF and NGINX container logs and check for errors.
8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
9. Send traffic through the example application and ensure it is working correctly.
10. Check that NGF can still process changes to resources.
    1. Delete the HTTPRoute resources.

        ```console
        kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
        ```

    2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
    4. Apply the HTTPRoute resources.

        ```console
        kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
        ```

    5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
    6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
#### Restart Node without draining

1. Repeat the above test, but skip steps 4-5 (draining and deleting the Node).
# Results for v1.0.0

<!-- TOC -->
- [Results for v1.0.0](#results-for-v100)
  - [Versions](#versions)
  - [Tests](#tests)
    - [Restart nginx-gateway container](#restart-nginx-gateway-container)
    - [Restart NGINX container](#restart-nginx-container)
    - [Restart Node with draining](#restart-node-with-draining)
    - [Restart Node without draining](#restart-node-without-draining)
  - [Future Improvements](#future-improvements)
<!-- TOC -->

## Versions
NGF version:

```text
commit: 72b6c6ef8915c697626eeab88fdb6a3ce15b8da0
date: 2023-10-02T13:13:08Z
version: edge
```

with NGINX:

```text
nginx/1.25.2
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.49-linuxkit-pr
```

Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"28",
GitVersion:"v1.28.0",
GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
GoVersion:"go1.20.7", Compiler:"gc",
Platform:"linux/arm64"}
```
## Tests

### Restart nginx-gateway container

The test passed with no errors.
### Restart NGINX container

The NGF Pod was unable to recover after a SIGKILL signal was sent to the NGINX master process.
The following appeared in the NGINX logs:

```text
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: still could not bind()
```
Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
### Restart Node with draining

The test passed with no errors.

### Restart Node without draining

The NGF Pod was unable to recover most of the time after running `docker restart kind-control-plane`.

The following appeared in the NGINX logs:

```text
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: still could not bind()
```
The following appeared in the NGF logs:

```text
{"level":"info","ts":"2023-10-10T22:57:05Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"b3fbf98d906f60ce66d70d7a2373c4b12b7d5606","date":"2023-10-10T22:02:06Z"}
Error: failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
Usage:
gateway static-mode [flags]

Flags:
-c, --config string The name of the NginxGateway resource to be used for this controller's dynamic configuration. Lives in the same Namespace as the controller. (default "")
--gateway string The namespaced name of the Gateway resource to use. Must be of the form: NAMESPACE/NAME. If not specified, the control plane will process all Gateways for the configured GatewayClass. However, among them, it will choose the oldest resource by creation timestamp. If the timestamps are equal, it will choose the resource that appears first in alphabetical order by {namespace}/{name}.
--health-disable Disable running the health probe server.
--health-port int Set the port where the health probe server is exposed. Format: [1024 - 65535] (default 8081)
-h, --help help for static-mode
--leader-election-disable Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources.
--leader-election-lock-name string The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock")
--metrics-disable Disable exposing metrics in the Prometheus format.
--metrics-port int Set the port where the metrics are exposed. Format: [1024 - 65535] (default 9113)
--metrics-secure-serving Enable serving metrics via https. By default metrics are served via http. Please note that this endpoint will be secured with a self-signed certificate.
--update-gatewayclass-status Update the status of the GatewayClass resource. (default true)

Global Flags:
--gateway-ctlr-name string The name of the Gateway controller. The controller name must be of the form: DOMAIN/PATH. The controller's domain is 'gateway.nginx.org' (default "")
--gatewayclass string The name of the GatewayClass resource. Every NGINX Gateway Fabric must have a unique corresponding GatewayClass resource. (default "")

failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
```
Note that occasionally the test passes and the NGF Pod recovers gracefully.

Related to this issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108

## Future Improvements

- None
