diff --git a/tests/longevity/results/1.1.0/1.1.0.md b/tests/longevity/results/1.1.0/1.1.0.md index 7bd04a552d..0e86221645 100644 --- a/tests/longevity/results/1.1.0/1.1.0.md +++ b/tests/longevity/results/1.1.0/1.1.0.md @@ -24,8 +24,8 @@ NGF version: ```text -commit: "21a2507d3d25ac0428384dce2c042799ed28b856" -date: "2023-12-06T23:47:17Z" +commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a" +date: "2023-12-13T17:45:06Z" version: "edge" ``` @@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee Running 5760m test @ http://cafe.example.com/coffee 2 threads and 100 connections Thread Stats Avg Stdev Max +/- Stdev - Latency 182.30ms 146.48ms 2.00s 82.86% - Req/Sec 306.26 204.19 2.18k 65.75% - 207104807 requests in 5760.00m, 72.17GB read - Socket errors: connect 0, read 362418, write 218736, timeout 19693 -Requests/sec: 599.26 -Transfer/sec: 218.97KB + Latency 177.87ms 143.16ms 2.00s 82.78% + Req/Sec 314.21 211.96 2.32k 65.78% + 212474192 requests in 5760.00m, 74.08GB read + Socket errors: connect 0, read 346218, write 508394, timeout 10574 +Requests/sec: 614.80 +Transfer/sec: 224.78KB ``` HTTPS: @@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea Running 5760m test @ https://cafe.example.com/tea 2 threads and 100 connections Thread Stats Avg Stdev Max +/- Stdev - Latency 172.15ms 118.59ms 2.00s 68.16% - Req/Sec 305.26 203.43 2.33k 65.34% - 206387831 requests in 5760.00m, 70.81GB read - Socket errors: connect 44059, read 356656, write 0, timeout 126 -Requests/sec: 597.19 -Transfer/sec: 214.84KB + Latency 168.63ms 118.44ms 2.00s 69.72% + Req/Sec 313.02 210.77 2.31k 65.50% + 211667319 requests in 5760.00m, 72.65GB read + Socket errors: connect 80661, read 338673, write 0, timeout 147 +Requests/sec: 612.46 +Transfer/sec: 220.42KB ``` While there are socket errors in the output, there are no connection-related errors in NGINX logs. @@ -86,28 +86,31 @@ Further investigation is out of scope of this test. ```text resource.type="k8s_container" -resource.labels.cluster_name="ciara-1" +resource.labels.cluster_name="" resource.labels.namespace_name="nginx-gateway" resource.labels.container_name="nginx-gateway" severity=ERROR SEARCH("error") ``` -There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of -running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election -loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). - -Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes -after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear, -but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also. -Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period -other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to -the leader election loss, plus initial start-up of both pods (4 in total). - -```console -kubectl get pods -n nginx-gateway -NAME READY STATUS RESTARTS AGE -my-release-nginx-gateway-fabric-78d4b84447-4hss5 2/2 Running 6 (31h ago) 3d22h +There were 53 error logs, and 6 restarts, across 3 pod instances. 
The error logs came in 3 almost identical batches,
+starting just over 24 hours after the initial deployment, and then each subsequent error batch just over 24 hours after
+the last. They all related to leader election loss and the subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). There were also 2 termination events, both of which occurred approximately 5 minutes
+after a leader election loss and successful restart.
+
+Each error batch caused the pod to restart, but not terminate. The first pod was terminated about 5 minutes
+after the first error batch and subsequent restart occurred. A similar occurrence happened after the third error batch.
+There was no termination event after the second error batch. Exactly why these pods were terminated is not currently
+clear, but it appears to be a cluster event (perhaps an upgrade), as the coffee and tea pods were also terminated at
+that time. All the restarts happened roughly at the same time each day.
+
+```text
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
+{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
 ```

### NGINX

@@ -118,7 +121,7 @@ Errors:

```text
resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
```

@@ -132,7 +135,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
-resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
+resource.labels.cluster_name="" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```

@@ -145,47 +148,61 @@ No such responses.

![memory.png](memory.png)

-Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
-Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
-majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
-about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
-v1.0.0 test run. I don't think there is anything to investigate here.
+Memory usage dropped once, which appears to correspond with a restart event seen above relating to leader election.
+Memory did appear to creep up before this restart, and then usage remained consistent for the rest of the test. Usage
+was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see this trending upwards in
+future longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

-No unexpected spikes or drops.
+One drop in NGINX CPU usage, corresponding with a Pod termination event.

### NGINX metrics

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
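+The dashboards below are built from the Prometheus metrics that NGF now exposes. To spot-check the raw counters
+behind them by hand (for example, the reload counters quoted in the Reloads section), the metrics endpoint can be
+scraped directly; a minimal sketch, assuming the default metrics port (9113) and the `my-release` Helm release name
+used for this test:
+
+```console
+# Forward the NGF metrics port locally (9113 is assumed to be the default metrics port).
+kubectl -n nginx-gateway port-forward deploy/my-release-nginx-gateway-fabric 9113:9113
+# In a second shell, pull the NGF metrics.
+curl -s http://127.0.0.1:9113/metrics | grep nginx_gateway_fabric
+```
+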
+![stub-status.png](stub-status.png)
+
+The drop in _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in NGINX logs.

### Reloads

-In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
+Rate of reloads - successful and errors:
+
+![reloads.png](reloads.png)
+
+Reload spikes correspond to the 1-hour periods of backend re-rollouts.
+However, the small spikes correspond to periodic reconciliation of Secrets, which (incorrectly)
+triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+
+No reloads finished with an error.
+
+Reload time distribution with 50th, 95th and 99th percentiles and the threshold:
+
+![reload-time.png](reload-time.png)
+Note - 60s is the threshold for waiting for NGINX to be reloaded.

Reload related metrics at the end:

```text
+# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
-nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
-nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
+nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
+nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
-nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
+nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
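+
+As a sanity check on the percentile panels, the raw buckets above show 6831 of 7317 reloads (~93%) in the
+`le="1000"` bucket and all 7317 in the `le="5000"` bucket, with an average reload time of roughly
+3,060,948 ms / 7317 ≈ 418 ms. The exact dashboard queries are not recorded in this report, but the percentile lines
+in reload-time.png can be approximated with a standard `histogram_quantile` query over this histogram; a minimal
+sketch (the 5m rate window is an assumption, not taken from the dashboard):
+
+```text
+histogram_quantile(0.95,
+  sum by (le) (rate(nginx_gateway_fabric_nginx_reloads_milliseconds_bucket[5m]))
+)
+```
+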
## Existing Issues still relevant

-- NGF unnecessary reloads NGINX when it reconciles
-  Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
+- NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101
diff --git a/tests/longevity/results/1.1.0/cpu.png b/tests/longevity/results/1.1.0/cpu.png
index f0601359c2..406f75f65a 100644
Binary files a/tests/longevity/results/1.1.0/cpu.png and b/tests/longevity/results/1.1.0/cpu.png differ
diff --git a/tests/longevity/results/1.1.0/memory.png b/tests/longevity/results/1.1.0/memory.png
index 0433593ab7..12438d19df 100644
Binary files a/tests/longevity/results/1.1.0/memory.png and b/tests/longevity/results/1.1.0/memory.png differ
diff --git a/tests/longevity/results/1.1.0/reload-time.png b/tests/longevity/results/1.1.0/reload-time.png
new file mode 100644
index 0000000000..ba4612c713
Binary files /dev/null and b/tests/longevity/results/1.1.0/reload-time.png differ
diff --git a/tests/longevity/results/1.1.0/reloads.png b/tests/longevity/results/1.1.0/reloads.png
new file mode 100644
index 0000000000..caba98df3f
Binary files /dev/null and b/tests/longevity/results/1.1.0/reloads.png differ
diff --git a/tests/longevity/results/1.1.0/stub-status.png b/tests/longevity/results/1.1.0/stub-status.png
new file mode 100644
index 0000000000..bec563cbde
Binary files /dev/null and b/tests/longevity/results/1.1.0/stub-status.png differ