
Commit e659685

Results from re-run of longevity test (#1401)
* Results from re-run of longevity test
1 parent 93d9cd2 commit e659685

6 files changed: +68, -51 lines

tests/longevity/results/1.1.0/1.1.0.md

Lines changed: 68 additions & 51 deletions
@@ -24,8 +24,8 @@
NGF version:

```text
-commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
-date: "2023-12-06T23:47:17Z"
+commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
+date: "2023-12-13T17:45:06Z"
version: "edge"
```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-    Latency   182.30ms  146.48ms   2.00s    82.86%
-    Req/Sec   306.26    204.19     2.18k    65.75%
-  207104807 requests in 5760.00m, 72.17GB read
-  Socket errors: connect 0, read 362418, write 218736, timeout 19693
-Requests/sec: 599.26
-Transfer/sec: 218.97KB
+    Latency   177.87ms  143.16ms   2.00s    82.78%
+    Req/Sec   314.21    211.96     2.32k    65.78%
+  212474192 requests in 5760.00m, 74.08GB read
+  Socket errors: connect 0, read 346218, write 508394, timeout 10574
+Requests/sec: 614.80
+Transfer/sec: 224.78KB
```

HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-    Latency   172.15ms  118.59ms   2.00s    68.16%
-    Req/Sec   305.26    203.43     2.33k    65.34%
-  206387831 requests in 5760.00m, 70.81GB read
-  Socket errors: connect 44059, read 356656, write 0, timeout 126
-Requests/sec: 597.19
-Transfer/sec: 214.84KB
+    Latency   168.63ms  118.44ms   2.00s    69.72%
+    Req/Sec   313.02    210.77     2.31k    65.50%
+  211667319 requests in 5760.00m, 72.65GB read
+  Socket errors: connect 80661, read 338673, write 0, timeout 147
+Requests/sec: 612.46
+Transfer/sec: 220.42KB
```

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
@@ -86,28 +86,31 @@ Further investigation is out of scope of this test.

```text
resource.type="k8s_container"
-resource.labels.cluster_name="ciara-1"
+resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
severity=ERROR
SEARCH("error")
```

-There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
-running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
-loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
-
-Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
-after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
-but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
-Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
-other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
-the leader election loss, plus initial start-up of both pods (4 in total).
-
-```console
-kubectl get pods -n nginx-gateway
-NAME                                               READY   STATUS    RESTARTS      AGE
-my-release-nginx-gateway-fabric-78d4b84447-4hss5   2/2     Running   6 (31h ago)   3d22h
+There were 53 error logs, and 6 restarts, across 3 pod instances. The error logs came in 3 almost identical batches,
+starting just over 24 hours after the initial deployment, and then each subsequent error batch just over 24 hours after
+the last. They were all related to leader election loss and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). There were also 2 termination events, both of which occurred approximately 5 minutes
+after a leader election loss and successful restart.
+
+Each error batch caused the pod to restart, but not terminate. The first pod was terminated about 5 minutes
+after the first error batch and subsequent restart occurred. A similar occurrence happened after the third error batch.
+There was no termination event after the second error batch. Exactly why these pods were terminated is not currently
+clear, but it looks to be a cluster event (perhaps an upgrade), as the coffee and tea pods were terminated at that time
+also. All the restarts happened at roughly the same time each day.
+
+```text
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
```
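As a sketch (not part of the original test queries), the start-up events listed above could be pulled with a Cloud Logging filter that mirrors the error query earlier, dropping the severity clause and searching for the start-up message instead; `SEARCH("Starting manager")` is an assumed search term based on the log lines shown:

```text
resource.type="k8s_container"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
SEARCH("Starting manager")
```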

### NGINX
@@ -118,7 +121,7 @@ Errors:

```text
resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +135,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```
@@ -145,47 +148,61 @@ No such responses.

![memory.png](memory.png)

-Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
-Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
-majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
-about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
-v1.0.0 test run. I don't think there is anything to investigate here.
+Memory usage dropped once, which appears to correspond with a restart event seen above relating to leader election.
+Memory usage did appear to creep up before this restart, and then remained consistent for the rest
+of the test. Memory usage was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see
+this trending upwards in future longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

-No unexpected spikes or drops.
+One drop in NGINX CPU that corresponds with a Pod termination event.

### NGINX metrics

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+![stub-status.png](stub-status.png)
+
+The drop of _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in NGINX logs.

### Reloads

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+Rate of reloads - successful and errors:
+
+![reloads.png](reloads.png)
+
+Reload spikes correspond to 1-hour periods of backend re-rollouts.
+However, small spikes correspond to periodic reconciliation of Secrets, which (incorrectly)
+triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+
+No reloads finished with an error.
+
+Reload time distribution with 50th, 95th and 99th percentiles and the threshold:
+
+![reload-time.png](reload-time.png)

+Note - 60s is the threshold for waiting for NGINX to be reloaded.
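As a sketch of how the reload panels above could be expressed (assuming they are driven by Prometheus-style queries over the NGF metrics listed below; the actual dashboard expressions are not included in this report), the reload rate and the 95th-percentile reload time would look roughly like:

```text
rate(nginx_gateway_fabric_nginx_reloads_total{class="nginx"}[5m])
histogram_quantile(0.95, sum by (le) (rate(nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx"}[5m])))
```

The 50th and 99th percentiles follow the same pattern with 0.50 and 0.99; a reload-error counter is not shown in the metrics below, so it is omitted here.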
Reload related metrics at the end:

```text
+# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
-nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
+nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
-nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
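For reference, these figures follow from the counters above: 6831 of the 7317 recorded reloads (~93%) fall within the le="1000" (1 second) bucket, and all 7317 fall within the le="5000" (5 second) bucket.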

## Existing Issues still relevant

-- NGF unnecessary reloads NGINX when it reconciles
-  Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+- NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101

tests/longevity/results/1.1.0/cpu.png and the other dashboard images referenced above: binary files changed (image previews omitted).
