# Product Telemetry Performance

## Goals

- Ensure the product telemetry feature doesn't negatively affect NGINX Gateway Fabric performance: CPU and memory usage.

> Note: this is a one-off test not expected to be run regularly.

## Test Environment

### Base

NGINX Plus: false

GKE Cluster:

- Node count: 3
- k8s version: v1.27.8-gke.1067004
- vCPUs per node: 2
- RAM per node: 4022900Ki
- Max pods per node: 110
- Zone: us-central1-c
- Instance Type: e2-medium

> Note: the test environment was created by the longevity test.

The cluster also has an OTel collector installed, using the `values.yaml` reproduced in the appendix:

```text
helm install otel-collector open-telemetry/opentelemetry-collector -f values.yaml -n collector --create-namespace
```
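
For reference, a quick way to confirm the collector came up (standard kubectl; the `collector` namespace comes from the install command above):

```text
kubectl get pods -n collector
```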

The longevity test setup also generates HTTP and HTTPS traffic to NGF.

### Environments

- Clusters with NGF configured with the following telemetry periods:
  - 1 min, version 799ea762cf8942b1022d766d841cbbdaba7345d4
  - 1 hour, version 799ea762cf8942b1022d766d841cbbdaba7345d4
  - 1 day (3 clusters):
    - One, version 799ea762cf8942b1022d766d841cbbdaba7345d4
    - Two, version 799ea762cf8942b1022d766d841cbbdaba7345d4
    - Three, version 92c291021526e60cbed67d5f91e91fa64c1a2d71, with [Enabled Memory Profiling](#enabled-memory-profiling)
- Collect every 1 day but not report, version 799ea762cf8942b1022d766d841cbbdaba7345d4
- No telemetry enabled, version 92c291021526e60cbed67d5f91e91fa64c1a2d71, with [Enabled Memory Profiling](#enabled-memory-profiling)
- No telemetry features exist in the code, version 9c1d3e9d0746426741e62bf57d20de6c46540fd4, with [Enabled Memory Profiling](#enabled-memory-profiling)

## Steps

1. Prepare images.
2. Provision clusters.
3. Deploy an OTel collector.
4. Deploy NGF and kick off traffic generation using the longevity test setup (an illustrative install command is shown after this list).
5. Collect the results.
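
NGF itself is deployed by the longevity test setup. For illustration only, a standard Helm-based NGF install looks roughly like this (release name and namespace are assumptions, not the exact longevity test invocation):

```text
helm install ngf oci://ghcr.io/nginxinc/charts/nginx-gateway-fabric --create-namespace -n nginx-gateway
```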

## Results

Notes:

- For a description of the Go stats, see [Description of Go Stats](#description-of-go-stats).
- High CPU usage of NGINX corresponds to the generated HTTP traffic load.

### Cluster 1 min

Pod name: ngf-longevity-nginx-gateway-fabric-6796966485-dn42w

Graphs:

![cluster-1min-ngf-mem-usage.png](img/cluster-1min-ngf-mem-usage.png)

![cluster-1min-ngf-go-stats.png](img/cluster-1min-ngf-go-stats.png)

![cluster-1min-cpu-usage.png](img/cluster-1min-cpu-usage.png)

Errors in logs:

```text
ERROR 2024-03-13T01:54:30.481041447Z [resource.labels.containerName: nginx-gateway] error retrieving resource lock nginx-gateway/ngf-longevity-nginx-gateway-fabric-leader-election: Get "https://10.80.0.1:443/apis/coordination.k8s.io/v1/namespaces/nginx-gateway/leases/ngf-longevity-nginx-gateway-fabric-leader-election?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
ERROR 2024-03-12 21:54:30.481 failed to start control loop: leader election lost
ERROR 2024-03-13T01:54:43.936492804Z [resource.labels.containerName: nginx-gateway] failed to start control loop: error setting initial control plane configuration: NginxGateway nginx-gateway/ngf-longevity-config not found: failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.nginx.org/v1alpha1: Get "https://10.80.0.1:443/apis/gateway.nginx.org/v1alpha1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Note: NGF lost connectivity with the Kubernetes API server (for some GKE-related reason). After that, the NGF container was restarted (expected).

### Cluster 1 hour

Pod name: ngf-longevity-nginx-gateway-fabric-54c698bc-dwcgm

Graphs:

![cluster-1hour-ngf-mem-usage.png](img/cluster-1hour-ngf-mem-usage.png)

![cluster-1hour-ngf-go-stats.png](img/cluster-1hour-ngf-go-stats.png)

![cluster-1hour-cpu-usage.png](img/cluster-1hour-cpu-usage.png)

Errors in logs:

```text
ERROR 2024-03-13T11:06:12.010500300Z [resource.labels.containerName: nginx-gateway] error retrieving resource lock nginx-gateway/ngf-longevity-nginx-gateway-fabric-leader-election: Get "https://10.67.208.1:443/apis/coordination.k8s.io/v1/namespaces/nginx-gateway/leases/ngf-longevity-nginx-gateway-fabric-leader-election?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
ERROR 2024-03-13T11:06:12.153485820Z [resource.labels.containerName: nginx-gateway] failed to start control loop: leader election lost
ERROR 2024-03-13T15:55:38.956069598Z [resource.labels.containerName: nginx-gateway] failed to start control loop: error setting initial control plane configuration: NginxGateway nginx-gateway/ngf-longevity-config not found: failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.nginx.org/v1alpha1: Get "https://10.67.208.1:443/apis/gateway.nginx.org/v1alpha1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Note: NGF lost connectivity with the Kubernetes API server (for some GKE-related reason). After that, the NGF container was restarted (expected).

```text
INFO 2024-03-14T21:44:02.993719249Z [resource.labels.containerName: nginx-gateway] Trace[255962658]: "DeltaFIFO Pop Process" ID:gatewayclasses.gateway.networking.k8s.io,Depth:26,Reason:slow event handlers blocking the queue (14-Mar-2024 21:44:02.846) (total time: 143ms):
ERROR 2024-03-14T21:44:02.993747394Z [resource.labels.containerName: nginx-gateway] Trace[255962658]: [143.318633ms] [143.318633ms] END
```

This error comes from https://github.com/kubernetes/client-go/blob/1518fca9f06c6a73fc091535b8966c71704e657b/tools/cache/delta_fifo.go#L600 and is not related to the product telemetry feature.
Further investigation is needed to determine whether a fix is required -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1726

### Cluster 1 day One

Pod name: ngf-longevity-nginx-gateway-fabric-76754b6649-brjzm

Graphs:

![cluster-1day-one-ngf-mem-usage.png](img/cluster-1day-one-ngf-mem-usage.png)

![cluster-1day-one-ngf-go-stats.png](img/cluster-1day-one-ngf-go-stats.png)

![cluster-1day-one-cpu-usage.png](img/cluster-1day-one-cpu-usage.png)

Error in logs: NONE

### Cluster 1 day Two

Pod name: ngf-longevity-nginx-gateway-fabric-76754b6649-zcmxq

Graphs:

![cluster-1day-two-ngf-mem-usage.png](img/cluster-1day-two-ngf-mem-usage.png)

![cluster-1day-two-ngf-go-stats.png](img/cluster-1day-two-ngf-go-stats.png)

![cluster-1day-two-cpu-usage.png](img/cluster-1day-two-cpu-usage.png)

Error in logs: NONE

### Cluster 1 day Three with Enabled Memory Profiling

Pod name: ngf-longevity-nginx-gateway-fabric-79f664f755-wggvh

Graphs:

![cluster-1day-three-ngf-mem-usage.png](img/cluster-1day-three-ngf-mem-usage.png)

![cluster-1day-three-ngf-go-stats.png](img/cluster-1day-three-ngf-go-stats.png)

![cluster-1day-three-cpu-usage.png](img/cluster-1day-three-cpu-usage.png)

Error in logs: NONE

Memory profile ~1 day after start:

![cluster-1day-three-mem-profile.png](img/cluster-1day-three-mem-profile.png)

### Cluster Collect every 1 day but not report

Pod name: ngf-longevity-nginx-gateway-fabric-6f47684cdc-n7h87

Graphs:

![cluster-1day-collect-not-report-ngf-mem-usage.png](img/cluster-1day-collect-not-report-ngf-mem-usage.png)

![cluster-1day-collect-not-report-ngf-go-stats.png](img/cluster-1day-collect-not-report-ngf-go-stats.png)

![cluster-1day-collect-not-report-cpu-usage.png](img/cluster-1day-collect-not-report-cpu-usage.png)

Errors in logs:

```text
ERROR 2024-03-13T08:30:09.930929048Z [resource.labels.containerName: nginx-gateway] error retrieving resource lock nginx-gateway/ngf-longevity-nginx-gateway-fabric-leader-election: Get "https://10.48.16.1:443/apis/coordination.k8s.io/v1/namespaces/nginx-gateway/leases/ngf-longevity-nginx-gateway-fabric-leader-election?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR 2024-03-13T08:30:09.944459347Z [resource.labels.containerName: nginx-gateway] failed to start control loop: leader election lost
ERROR 2024-03-13T08:30:22.873021882Z [resource.labels.containerName: nginx-gateway] failed to start control loop: error setting initial control plane configuration: NginxGateway nginx-gateway/ngf-longevity-config not found: failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.nginx.org/v1alpha1: Get "https://10.48.16.1:443/apis/gateway.nginx.org/v1alpha1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
ERROR 2024-03-13T08:30:46.186351040Z [resource.labels.containerName: nginx-gateway] failed to start control loop: error setting initial control plane configuration: NginxGateway nginx-gateway/ngf-longevity-config not found: failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.nginx.org/v1alpha1: Get "https://10.48.16.1:443/apis/gateway.nginx.org/v1alpha1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Note: NGF lost connectivity with the Kubernetes API server (for some GKE-related reason). After that, the NGF container was restarted (expected).

### Cluster No telemetry enabled with Enabled Memory Profiling

Pod name: ngf-longevity-nginx-gateway-fabric-d5b9c8cfc-bnlgs

Graphs:

![cluster-tel-disabled-ngf-mem-usage.png](img/cluster-tel-disabled-ngf-mem-usage.png)

![cluster-tel-disabled-ngf-go-stats.png](img/cluster-tel-disabled-ngf-go-stats.png)

![cluster-tel-disabled-cpu-usage.png](img/cluster-tel-disabled-cpu-usage.png)

Error in logs: NONE

Memory profile ~1 day after start:

![cluster-tel-disabled-mem-profile.png](img/cluster-tel-disabled-mem-profile.png)

### Cluster No telemetry features exist in the code

Pod name: ngf-longevity-nginx-gateway-fabric-67c488fdbc-xbnf5

Graphs:

![cluster-no-tel-ngf-mem-usage.png](img/cluster-no-tel-ngf-mem-usage.png)

![cluster-no-tel-ngf-go-stats.png](img/cluster-no-tel-ngf-go-stats.png)

![cluster-no-tel-cpu-usage.png](img/cluster-no-tel-cpu-usage.png)

Error in logs: NONE

Memory profile ~1 day after start:

![cluster-no-tel-mem-profile.png](img/cluster-no-tel-mem-profile.png)

## Conclusion

- The product telemetry feature doesn't introduce any memory leaks.
- The product telemetry feature doesn't introduce any visible CPU usage spikes.
- NGF memory usage is slightly lower when the telemetry code is absent.
- There is an increase in evictable memory usage in NGF with telemetry enabled in 2 out of 3 similar clusters, but
  non-evictable memory usage stays constant; see the query sketch after this list. (Note: per the docs of the
  `container/memory/used_bytes` metric at https://cloud.google.com/monitoring/api/metrics_kubernetes, evictable
  memory is memory that can be easily reclaimed by the kernel, while non-evictable memory cannot.)
- The memory profile with telemetry enabled doesn't show any telemetry-related allocations.
- Captured a potential issue unrelated to product telemetry -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1726
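
For reference, a minimal Cloud Monitoring MQL sketch for isolating non-evictable container memory. This is an illustrative query based on the metric's documentation, not necessarily the exact one used to produce the graphs above:

```text
fetch k8s_container
| metric 'kubernetes.io/container/memory/used_bytes'
| filter metric.memory_type == 'non-evictable'
```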

## Appendix

### Enabled Memory Profiling

```diff
diff --git a/cmd/gateway/main.go b/cmd/gateway/main.go
index 8761e3f1..2b914395 100644
--- a/cmd/gateway/main.go
+++ b/cmd/gateway/main.go
@@ -2,6 +2,8 @@ package main
 
 import (
 	"fmt"
+	"net/http"
+	_ "net/http/pprof"
 	"os"
 )
 
@@ -20,6 +22,10 @@ var (
 )
 
 func main() {
+	go func() {
+		fmt.Println(http.ListenAndServe("localhost:6060", nil))
+	}()
+
 	rootCmd := createRootCommand()
 
 	rootCmd.AddCommand(
```

Getting a memory profile after port-forwarding into the NGF pod:

```text
go tool pprof -png http://localhost:6060/debug/pprof/heap > out-1.png
```
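
The port-forward itself is a standard kubectl command; for example (the pod name placeholder is illustrative, and the `nginx-gateway` namespace matches the logs above):

```text
kubectl -n nginx-gateway port-forward <ngf-pod-name> 6060:6060
```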

### Description of Go Stats

- heap bytes

```text
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
```

- sys bytes

```text
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
```

- heap inuse bytes

```text
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
```

- heap idle bytes

```text
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
```

- heap released bytes

```text
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
```

- stack inuse bytes

```text
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
```

See also https://pkg.go.dev/runtime#MemStats
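
These gauges map onto fields of Go's `runtime.MemStats`. A minimal standalone sketch (not part of NGF; for illustration only) that prints the corresponding fields:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Take a snapshot of the Go runtime's memory statistics.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Println("alloc bytes (allocated and still in use):", m.Alloc)
	fmt.Println("sys bytes (obtained from the OS):", m.Sys)
	fmt.Println("heap inuse bytes:", m.HeapInuse)
	fmt.Println("heap idle bytes (waiting to be used):", m.HeapIdle)
	fmt.Println("heap released bytes (returned to the OS):", m.HeapReleased)
	fmt.Println("stack inuse bytes:", m.StackInuse)
}
```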

### OTel Collector values.yaml

The `values.yaml` used to install the OTel collector above:

```yaml
mode: deployment
replicaCount: 1
config:
  exporters:
    debug:
      verbosity: detailed
    logging: {}
  extensions:
    health_check: {}
    memory_ballast:
      size_in_percentage: 40
  processors:
    batch: {}
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        exporters:
          - debug
        receivers:
          - otlp
```