
Commit 55f547d

Results from re-run of longevity test
1 parent 93d9cd2 commit 55f547d

File tree

6 files changed (+66, -50 lines)


tests/longevity/results/1.1.0/1.1.0.md

Lines changed: 66 additions & 50 deletions
@@ -24,8 +24,8 @@
 NGF version:

 ```text
-commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
-date: "2023-12-06T23:47:17Z"
+commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
+date: "2023-12-13T17:45:06Z"
 version: "edge"
 ```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
 Running 5760m test @ http://cafe.example.com/coffee
   2 threads and 100 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
-    Latency   182.30ms  146.48ms   2.00s    82.86%
-    Req/Sec   306.26    204.19     2.18k    65.75%
-  207104807 requests in 5760.00m, 72.17GB read
-  Socket errors: connect 0, read 362418, write 218736, timeout 19693
-Requests/sec:    599.26
-Transfer/sec:    218.97KB
+    Latency   177.87ms  143.16ms   2.00s    82.78%
+    Req/Sec   314.21    211.96     2.32k    65.78%
+  212474192 requests in 5760.00m, 74.08GB read
+  Socket errors: connect 0, read 346218, write 508394, timeout 10574
+Requests/sec:    614.80
+Transfer/sec:    224.78KB
 ```

 HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
 Running 5760m test @ https://cafe.example.com/tea
   2 threads and 100 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
-    Latency   172.15ms  118.59ms   2.00s    68.16%
-    Req/Sec   305.26    203.43     2.33k    65.34%
-  206387831 requests in 5760.00m, 70.81GB read
-  Socket errors: connect 44059, read 356656, write 0, timeout 126
-Requests/sec:    597.19
-Transfer/sec:    214.84KB
+    Latency   168.63ms  118.44ms   2.00s    69.72%
+    Req/Sec   313.02    210.77     2.31k    65.50%
+  211667319 requests in 5760.00m, 72.65GB read
+  Socket errors: connect 80661, read 338673, write 0, timeout 147
+Requests/sec:    612.46
+Transfer/sec:    220.42KB
 ```

 While there are socket errors in the output, there are no connection-related errors in NGINX logs.
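As a quick sanity check on the updated wrk output above (a sketch for this write-up, not part of the test tooling), the reported Requests/sec can be recomputed from the request totals and the 5760-minute duration:

```python
# Sanity check: recompute wrk's average request rate from the totals reported
# in the updated output above (5760-minute, i.e. 96-hour, run).
TEST_DURATION_S = 5760 * 60

runs = {
    "HTTP /coffee": 212_474_192,   # total requests from the HTTP run
    "HTTPS /tea": 211_667_319,     # total requests from the HTTPS run
}

for name, total_requests in runs.items():
    # Expect ~614.80 req/s for HTTP and ~612.46 req/s for HTTPS,
    # matching the Requests/sec lines above.
    print(f"{name}: {total_requests / TEST_DURATION_S:.2f} requests/sec")
```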
@@ -93,21 +93,23 @@ severity=ERROR
 SEARCH("error")
 ```

-There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
-running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
-loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
-
-Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
-after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
-but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
-Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
-other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
-the leader election loss, plus initial start-up of both pods (4 in total).
-
-```console
-kubectl get pods -n nginx-gateway
-NAME                                               READY   STATUS    RESTARTS      AGE
-my-release-nginx-gateway-fabric-78d4b84447-4hss5   2/2     Running   6 (31h ago)   3d22h
+There were 53 error logs across 3 pod instances. They came in 3 almost identical batches, starting just over 24 hours
+after the initial deployment, and then each subsequent error batch just over 24 hours after the last. They were all
+relating to leader election loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
+
+Each error batch caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
+after the first error batch and subsequent restart occurred. A similar occurrence happened after the third error batch.
+Exactly why these pods were terminated is not currently clear, but it looks to be a cluster event (perhaps an upgrade)
+as the coffee and tea pods were terminated at that time also. All the restarts happened at roughly the same time each
+day.
+
+```text
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
 ```

 ### NGINX
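To make the spacing of the restarts easier to see, the quoted "Starting manager" timestamps can be diffed directly; this is a small illustrative sketch, not part of the test tooling:

```python
# Illustrative sketch: parse the "Starting manager" log lines quoted above and
# print the gap between consecutive manager start-ups, showing the roughly
# 24-hour spacing between error batches.
import json
from datetime import datetime

log_lines = [
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"}',
]

starts = [
    datetime.strptime(json.loads(line)["ts"], "%Y-%m-%dT%H:%M:%SZ")
    for line in log_lines
]

for earlier, later in zip(starts, starts[1:]):
    print(f"{earlier.isoformat()} -> {later.isoformat()}: {later - earlier}")
```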
@@ -118,7 +120,7 @@ Errors:

 ```text
 resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="ciara-1" AND
 resource.labels.container_name="nginx" AND
 severity=ERROR AND
 SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +134,7 @@ Non-200 response codes in NGINX access logs:

 ```text
 resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="ciara-1" AND
 resource.labels.container_name="nginx"
 "GET" "HTTP/1.1" -"200"
 ```
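The two Cloud Logging filters above were run in the Logs Explorer. For completeness, here is a minimal sketch of running a simplified version of the error filter with the google-cloud-logging Python client; the client usage and the simplified filter are assumptions for illustration, not part of the test procedure:

```python
# Minimal sketch (assumption: the google-cloud-logging package is installed and
# application default credentials for the project are configured). Runs a
# simplified version of the NGINX error filter above and prints matches.
from google.cloud import logging

client = logging.Client()

FILTER = (
    'resource.type="k8s_container" AND '
    'resource.labels.cluster_name="ciara-1" AND '
    'resource.labels.container_name="nginx" AND '
    'severity=ERROR'
)

for entry in client.list_entries(filter_=FILTER):
    print(entry.timestamp, entry.payload)
```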
@@ -145,47 +147,61 @@ No such responses.

 ![memory.png](memory.png)

-Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
-Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
-majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
-about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
-v1.0.0 test run. I don't think there is anything to investigate here.
+Memory usage dropped once, which appears to correspond with a restart event seen above relating to leader election.
+Memory usage did seem to creep up before this restart, and then remained consistent for the rest
+of the test. Memory usage was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see
+this trending upwards in future longevity tests, I don't think there is anything to investigate here.

 ### Containers CPU

 ![cpu.png](cpu.png)

-No unexpected spikes or drops.
+One drop in NGINX CPU corresponds with a Pod termination event.

 ### NGINX metrics

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+![stub-status.png](stub-status.png)
+
+The drop in _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in NGINX logs.

 ### Reloads

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+Rate of reloads - successful and errored:
+
+![reloads.png](reloads.png)
+
+Reload spikes correspond to 1-hour periods of backend re-rollouts.
+However, smaller spikes correspond to periodic reconciliation of Secrets, which (incorrectly)
+triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+
+No reloads finished with an error.
+
+Reload time distribution with 50th, 95th and 99th percentiles and the threshold:
+
+![reload-time.png](reload-time.png)

+Note - 60s is the threshold for waiting for NGINX to be reloaded.
 Reload related metrics at the end:

 ```text
+# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
 # TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
-nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
+nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
 # HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
 # TYPE nginx_gateway_fabric_nginx_reloads_total counter
-nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
 ```

 All successful reloads took less than 5 seconds, with most (>90%) under 1 second.

 ## Existing Issues still relevant

-- NGF unnecessary reloads NGINX when it reconciles
-Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+- NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
 - Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101
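The ">90% under 1 second" statement above can be read straight off the cumulative histogram buckets; a small sketch of that arithmetic, for this write-up only:

```python
# Check of the reload-time claim using the updated histogram buckets above:
# cumulative bucket counts divided by the total give the share of reloads
# completing within each latency bound.
total = 7317  # nginx_gateway_fabric_nginx_reloads_milliseconds_count
buckets_ms = {500: 4626, 1000: 6831, 5000: 7317, 10000: 7317, 30000: 7317}

for le, cumulative in buckets_ms.items():
    print(f"<= {le} ms: {cumulative / total:.1%}")
# <= 1000 ms comes out at ~93.4% (the ">90% under 1 second" claim);
# <= 5000 ms is 100%, i.e. all reloads finished within 5 seconds.
```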

tests/longevity/results/1.1.0/cpu.png

