NGF version:

```text
- commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
- date: "2023-12-06T23:47:17Z"
+ commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
+ date: "2023-12-13T17:45:06Z"
version: "edge"
```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-     Latency   182.30ms  146.48ms   2.00s    82.86%
-     Req/Sec   306.26    204.19     2.18k    65.75%
-   207104807 requests in 5760.00m, 72.17GB read
-   Socket errors: connect 0, read 362418, write 218736, timeout 19693
- Requests/sec:    599.26
- Transfer/sec:    218.97KB
+     Latency   177.87ms  143.16ms   2.00s    82.78%
+     Req/Sec   314.21    211.96     2.32k    65.78%
+   212474192 requests in 5760.00m, 74.08GB read
+   Socket errors: connect 0, read 346218, write 508394, timeout 10574
+ Requests/sec:    614.80
+ Transfer/sec:    224.78KB
```

HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-     Latency   172.15ms  118.59ms   2.00s    68.16%
-     Req/Sec   305.26    203.43     2.33k    65.34%
-   206387831 requests in 5760.00m, 70.81GB read
-   Socket errors: connect 44059, read 356656, write 0, timeout 126
- Requests/sec:    597.19
- Transfer/sec:    214.84KB
+     Latency   168.63ms  118.44ms   2.00s    69.72%
+     Req/Sec   313.02    210.77     2.31k    65.50%
+   211667319 requests in 5760.00m, 72.65GB read
+   Socket errors: connect 80661, read 338673, write 0, timeout 147
+ Requests/sec:    612.46
+ Transfer/sec:    220.42KB
```

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
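
As a quick sanity check on the summaries above, the reported Requests/sec values can be recomputed from the request totals and the 96-hour duration (a small Python sketch; all figures are copied from the wrk output above):

```python
# Recompute wrk's Requests/sec from the totals reported above.
duration_s = 5760 * 60  # 5760-minute (96h) run

http_requests = 212474192   # HTTP run total requests
https_requests = 211667319  # HTTPS run total requests

print(round(http_requests / duration_s, 2))   # -> 614.8, matching wrk's 614.80
print(round(https_requests / duration_s, 2))  # -> 612.46
```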
@@ -93,21 +93,23 @@ severity=ERROR
SEARCH("error")
```

- There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
- running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
- loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
-
- Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
- after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
- but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
- Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
- other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
- the leader election loss, plus initial start-up of both pods (4 in total).
-
- ```console
- kubectl get pods -n nginx-gateway
- NAME                                               READY   STATUS    RESTARTS      AGE
- my-release-nginx-gateway-fabric-78d4b84447-4hss5   2/2     Running   6 (31h ago)   3d22h
+ There were 53 error logs across 3 pod instances. They came in 3 almost identical batches: the first started just over
+ 24 hours after the initial deployment, and each subsequent batch came just over 24 hours after the last. They all
+ related to leader election loss and the subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
+
+ Each error batch caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
+ after the first error batch and subsequent restart occurred. A similar occurrence happened after the third error batch.
+ Exactly why these pods were terminated is not currently clear, but it looks to be a cluster event (perhaps an upgrade)
+ as the coffee and tea pods were terminated at that time also. All the restarts happened at roughly the same time each
+ day.
+
+ ```text
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
```
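
The "Starting manager" timestamps quoted above can be diffed to confirm the roughly 24-hour cadence of the restarts (a small Python sketch over exactly those log lines):

```python
import json
from datetime import datetime

# The "Starting manager" log lines quoted above.
lines = [
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"}',
    '{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"}',
]
stamps = [datetime.strptime(json.loads(l)["ts"], "%Y-%m-%dT%H:%M:%SZ") for l in lines]
gaps_h = [round((b - a).total_seconds() / 3600, 2) for a, b in zip(stamps, stamps[1:])]
# The ~24h gaps are the daily error batches; the ~0.1h gaps are the pod replacements.
print(gaps_h)  # -> [23.97, 0.07, 24.0, 24.08, 0.09]
```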

### NGINX
@@ -118,7 +120,7 @@ Errors:

```text
resource.type=k8s_container AND
- resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+ resource.labels.cluster_name="ciara-1" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +134,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
- resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+ resource.labels.cluster_name="ciara-1" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```
@@ -145,47 +147,61 @@ No such responses.

![memory.png](memory.png)

- Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
- Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
- majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
- about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
- v1.0.0 test run. I don't think there is anything to investigate here.
+ Memory usage dropped once, which appears to correspond with a restart event seen above relating to leader election.
+ Memory did seem to creep up before this restart, after which usage remained consistent for the rest of the test.
+ Memory usage was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see it trending
+ upwards in future longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

- No unexpected spikes or drops.
+ One drop in NGINX CPU that corresponds with a Pod termination event.

### NGINX metrics

- In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+ ![stub-status.png](stub-status.png)
+
+ The drop in _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in the NGINX logs.

### Reloads

- In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+ Rate of reloads - successes and errors:
+
+ ![reloads.png](reloads.png)
+
+ The large reload spikes correspond to the 1-hour periods of backend re-rollouts. The smaller spikes correspond to the
+ periodic reconciliation of Secrets, which (incorrectly) triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+
+ No reloads finished with an error.
+
+ Reload time distribution with 50th, 95th and 99th percentiles and the threshold:
+
+ ![reload-time.png](reload-time.png)

+ Note: 60s is the threshold for waiting for NGINX to be reloaded.
Reload related metrics at the end:

```text
+ # HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
- nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
+ nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
- nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
+ nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
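
This can be verified directly from the histogram above (a small Python sketch using the bucket counts reported at the end of the test):

```python
# Cumulative histogram buckets from the metrics above: upper bound (ms) -> count.
buckets = {500: 4626, 1000: 6831, 5000: 7317, 10000: 7317, 30000: 7317}
total = 7317             # ..._count: total successful reloads
total_ms = 3.060948e6    # ..._sum: total reload time in ms

print(round(buckets[1000] / total * 100, 1))  # -> 93.4 (% of reloads under 1s)
print(buckets[5000] == total)                 # -> True (all reloads under 5s)
print(round(total_ms / total, 1))             # -> 418.3 (mean reload time, ms)
```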

## Existing Issues still relevant

- - NGF unnecessary reloads NGINX when it reconciles
- Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+ - NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101