NGF version:

```text
- commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
- date: "2023-12-06T23:47:17Z"
+ commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
+ date: "2023-12-13T17:45:06Z"
version: "edge"
```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-     Latency   182.30ms  146.48ms   2.00s    82.86%
-     Req/Sec   306.26    204.19     2.18k    65.75%
-   207104807 requests in 5760.00m, 72.17GB read
-   Socket errors: connect 0, read 362418, write 218736, timeout 19693
- Requests/sec:    599.26
- Transfer/sec:    218.97KB
+     Latency   177.87ms  143.16ms   2.00s    82.78%
+     Req/Sec   314.21    211.96     2.32k    65.78%
+   212474192 requests in 5760.00m, 74.08GB read
+   Socket errors: connect 0, read 346218, write 508394, timeout 10574
+ Requests/sec:    614.80
+ Transfer/sec:    224.78KB
```

HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
-     Latency   172.15ms  118.59ms   2.00s    68.16%
-     Req/Sec   305.26    203.43     2.33k    65.34%
-   206387831 requests in 5760.00m, 70.81GB read
-   Socket errors: connect 44059, read 356656, write 0, timeout 126
- Requests/sec:    597.19
- Transfer/sec:    214.84KB
+     Latency   168.63ms  118.44ms   2.00s    69.72%
+     Req/Sec   313.02    210.77     2.31k    65.50%
+   211667319 requests in 5760.00m, 72.65GB read
+   Socket errors: connect 80661, read 338673, write 0, timeout 147
+ Requests/sec:    612.46
+ Transfer/sec:    220.42KB
```
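As a quick consistency check of the totals above: 212474192 requests over 5760 minutes (345600s) works out to ~614.8 requests/sec for HTTP, and 211667319 requests over the same period to ~612.5 requests/sec for HTTPS, matching the reported Requests/sec values.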

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
@@ -86,28 +86,31 @@ Further investigation is out of scope of this test.

```text
resource.type="k8s_container"
- resource.labels.cluster_name="ciara-1"
+ resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
severity=ERROR
SEARCH("error")
```
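The per-pod error count can also be cross-checked with kubectl (a sketch; `<POD_NAME>` is a placeholder, and the `"level":"error"` field is assumed from the JSON log format shown further down):

```console
kubectl logs -n nginx-gateway <POD_NAME> -c nginx-gateway | grep -c '"level":"error"'
```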

- There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
- running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
- loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).
-
- Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
- after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
- but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
- Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
- other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
- the leader election loss, plus initial start-up of both pods (4 in total).
-
- ```console
- kubectl get pods -n nginx-gateway
- NAME                                               READY   STATUS    RESTARTS      AGE
- my-release-nginx-gateway-fabric-78d4b84447-4hss5   2/2     Running   6 (31h ago)   3d22h
+ There were 53 error logs and 6 restarts across 3 pod instances. The error logs came in 3 almost identical batches,
+ starting just over 24 hours after the initial deployment, with each subsequent error batch just over 24 hours after
+ the last. They all related to leader election loss and the subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). There were also 2 termination events, both of which occurred approximately 5 minutes
+ after a leader election loss and successful restart.
+
+ Each error batch caused the pod to restart, but not terminate. The first pod was terminated about 5 minutes
+ after the first error batch and subsequent restart occurred. A similar occurrence happened after the third error batch.
+ There was no termination event after the second error batch. Exactly why these pods were terminated is not currently
+ clear, but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were also terminated at that
+ time. All the restarts happened at roughly the same time each day.
+
+ ```text
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
+ {"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
```
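The exact search used to build this timeline isn't recorded in the report; a Cloud Logging query along the lines of the error query above should surface the same start-up entries:

```text
resource.type="k8s_container"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
SEARCH("Starting manager")
```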

### NGINX
@@ -118,7 +121,7 @@ Errors:

```text
resource.type=k8s_container AND
- resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+ resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +135,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
- resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+ resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```
@@ -145,47 +148,61 @@ No such responses.

![memory.png](memory.png)

- Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
- Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
- majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
- about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
- v1.0.0 test run. I don't think there is anything to investigate here.
+ Memory usage dropped once, which appears to correspond with a restart event seen above relating to leader election.
+ Memory usage did seem to creep up before this restart, and then remained consistent for the rest of the test. It was
+ slightly higher than what was observed in the v1.0.0 longevity test, but unless we see it trending upwards in future
+ longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

- No unexpected spikes or drops.
+ There is one drop in NGINX CPU that corresponds with a Pod termination event.

### NGINX metrics

- In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+ ![stub-status.png](stub-status.png)
+
+ The drop of _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in NGINX logs.

### Reloads

- In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+ Rate of reloads - successful and errors:
+
+ ![reloads.png](reloads.png)
+
+ Reload spikes correspond to 1-hour periods of backend re-rollouts.
+ However, the small spikes correspond to periodic reconciliation of Secrets, which (incorrectly)
+ triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+
+ No reloads finished with an error.
+
+ Reload time distribution with 50th, 95th and 99th percentiles and the threshold:
+
+ ![reload-time.png](reload-time.png)

+ Note - 60s is the threshold for waiting for NGINX to be reloaded.
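The dashboard queries themselves are not included in this report; panels like the two above can be reproduced from the NGF metrics listed below with Prometheus queries roughly along these lines (a sketch, assuming Prometheus is scraping NGF):

```text
rate(nginx_gateway_fabric_nginx_reloads_total[5m])
histogram_quantile(0.95, rate(nginx_gateway_fabric_nginx_reloads_milliseconds_bucket[5m]))
```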
Reload related metrics at the end:

```text
+ # HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
- nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
- nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
+ nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
+ nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
- nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
+ nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
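For reference, this follows directly from the histogram above: 6831 of 7317 reloads (~93%) completed within 1000ms, and the average reload time was 3.060948e+06ms / 7317 ≈ 418ms.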

## Existing Issues still relevant

- - NGF unnecessary reloads NGINX when it reconciles
- Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+ - NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101