Commit 15717d3

v1.1.0 Longevity test (#1359)
1 parent c0d637f commit 15717d3

4 files changed: 194 additions and 1 deletion

tests/longevity/longevity.md

Lines changed: 3 additions & 1 deletion
@@ -65,7 +65,8 @@ Test duration - 4 days.
4. Create two CronJobs to re-rollout backends:
   1. Coffee - every minute for an hour every 6 hours
   2. Tea - every minute for an hour every 6 hours, 3 hours apart from coffee.
-5. Configure Prometheus on GKE to pick up NGF metrics.
+5. Configure Prometheus on GKE to pick up NGF metrics (NB: Ensure that the `app.kubernetes.io/name` label matches
+   your NGF deployment).

```shell
kubectl apply -f files
@@ -147,3 +148,4 @@ In case of errors, double check if you prepared the environment and launched the
## Results

- [1.0.0](results/1.0.0/1.0.0.md)
+- [1.1.0](results/1.1.0/1.1.0.md)
Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
# Results for v1.1.0

<!-- TOC -->

- [Results for v1.1.0](#results-for-v110)
  - [Versions](#versions)
  - [Traffic](#traffic)
    - [NGF](#ngf)
      - [Error Log](#error-log)
    - [NGINX](#nginx)
      - [Error Log](#error-log-1)
      - [Access Log](#access-log)
  - [Key Metrics](#key-metrics)
    - [Containers memory](#containers-memory)
    - [Containers CPU](#containers-cpu)
    - [NGINX metrics](#nginx-metrics)
    - [Reloads](#reloads)
  - [Existing Issues still relevant](#existing-issues-still-relevant)

<!-- TOC -->

## Versions

NGF version:

```text
commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
date: "2023-12-06T23:47:17Z"
version: "edge"
```

with NGINX:

```text
nginx/1.25.3
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.109+
```

Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5-gke.200", GitCommit:"f9aad8e51abb509136cb82b4a00cc3d77d3d70d9", GitTreeState:"clean", BuildDate:"2023-08-26T23:26:22Z", GoVersion:"go1.20.7 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
```

## Traffic

HTTP:

```text
wrk -t2 -c100 -d96h http://cafe.example.com/coffee

Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   182.30ms  146.48ms   2.00s    82.86%
    Req/Sec   306.26    204.19     2.18k    65.75%
  207104807 requests in 5760.00m, 72.17GB read
  Socket errors: connect 0, read 362418, write 218736, timeout 19693
Requests/sec:    599.26
Transfer/sec:    218.97KB
```

HTTPS:

```text
wrk -t2 -c100 -d96h https://cafe.example.com/tea

Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   172.15ms  118.59ms   2.00s    68.16%
    Req/Sec   305.26    203.43     2.33k    65.34%
  206387831 requests in 5760.00m, 70.81GB read
  Socket errors: connect 44059, read 356656, write 0, timeout 126
Requests/sec:    597.19
Transfer/sec:    214.84KB
```

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
Further investigation is out of scope of this test.
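
As a quick sanity check on the wrk summaries above, the sketch below recomputes the average request rate from the reported request counts and the 96-hour duration, and expresses the socket errors relative to the number of completed requests. All figures are copied from the wrk output; the `RUNS` structure and the `summarize` helper are illustrative names only, and a socket error does not necessarily correspond to exactly one failed request.

```python
# Figures copied from the wrk summaries above; all names here are illustrative.
DURATION_S = 96 * 60 * 60  # -d96h, which wrk reports as 5760 minutes

RUNS = {
    "http": {
        "requests": 207_104_807,
        "socket_errors": {"connect": 0, "read": 362_418, "write": 218_736, "timeout": 19_693},
    },
    "https": {
        "requests": 206_387_831,
        "socket_errors": {"connect": 44_059, "read": 356_656, "write": 0, "timeout": 126},
    },
}


def summarize(run: dict) -> tuple[float, float]:
    """Return (average requests per second, socket errors as a % of requests)."""
    rate = run["requests"] / DURATION_S
    errors = sum(run["socket_errors"].values())
    return rate, 100 * errors / run["requests"]


for name, run in RUNS.items():
    rate, error_pct = summarize(run)
    print(f"{name}: {rate:.2f} req/s, socket errors = {error_pct:.2f}% of requests")
```

The recomputed rates round to the `Requests/sec` values wrk printed (599.26 and 597.19), and the socket errors amount to roughly 0.29% (HTTP) and 0.19% (HTTPS) of the request counts.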

### NGF

#### Error Log

```text
resource.type="k8s_container"
resource.labels.cluster_name="ciara-1"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
severity=ERROR
SEARCH("error")
```

There were 36 error logs across 2 pod instances. They came in 2 almost identical batches: the first approximately
6 hours into the first day of the test, and the second 2.5 days later. Both related to leader election loss and the
subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).

Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
after the first error batch and the subsequent restart. Exactly why this pod was terminated is not currently clear,
but it looks to be a cluster event (perhaps an upgrade), as the coffee and tea pods were also terminated at that time.
Strangely, the second pod restarted 6 times in total, yet no errors other than those described above were observed
over the test period, and grepping the logs for start-up messages only produced the 2 known restarts relating to
the leader election loss, plus the initial start-up of both pods (4 in total).

```console
kubectl get pods -n nginx-gateway
NAME                                               READY   STATUS    RESTARTS      AGE
my-release-nginx-gateway-fabric-78d4b84447-4hss5   2/2     Running   6 (31h ago)   3d22h
```

### NGINX

#### Error Log

Errors:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
```

No entries found.

#### Access Log

Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```

No such responses.

## Key Metrics

### Containers memory

![memory.png](memory.png)

Memory usage dropped twice, which appears to correspond with the restarts seen above relating to leader election.
Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage of
about 9-10MiB, but with more stable usage across the duration of the test. However, there were no restarts observed
in the v1.0.0 test run. I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

No unexpected spikes or drops.

### NGINX metrics

In this test, NGINX metrics were not correctly exported, so no dashboards are available for these.

### Reloads

In this test, NGINX metrics were not correctly exported, so no dashboards are available for these.

Reload-related metrics at the end:

```text
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
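
The reload summary can be checked directly against the cumulative histogram buckets. This is a minimal sketch with the bucket counts, sum, and count copied from the metrics output above; the `share_at_or_below` helper is an illustrative name, and the mean is only an average over all reloads, not a percentile.

```python
# Cumulative bucket counts copied from the metrics above, keyed by upper bound in ms.
BUCKETS_MS = {500: 1647, 1000: 4043, 5000: 4409, 10000: 4409, 30000: 4409, float("inf"): 4409}
SUM_MS = 2.701667e06  # nginx_gateway_fabric_nginx_reloads_milliseconds_sum
COUNT = 4409          # nginx_gateway_fabric_nginx_reloads_milliseconds_count


def share_at_or_below(limit_ms: float) -> float:
    """Percentage of reloads that finished within limit_ms (the le bound is inclusive)."""
    return 100 * BUCKETS_MS[limit_ms] / COUNT


print(f"<= 500 ms:  {share_at_or_below(500):.1f}%")
print(f"<= 1000 ms: {share_at_or_below(1000):.1f}%")
print(f"<= 5000 ms: {share_at_or_below(5000):.1f}%")
print(f"mean reload time: {SUM_MS / COUNT:.0f} ms")
```

This gives about 37.4% of reloads within 500 ms, 91.7% within 1 second, and 100% within 5 seconds, with a mean of roughly 613 ms, which is consistent with the statement above.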

## Existing Issues still relevant

- NGF unnecessarily reloads NGINX when it reconciles
  Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101

tests/longevity/results/1.1.0/cpu.png
