Rate limiting on AKS starting version 0.17.0 #975

Closed
@ivanstanev

Description

Describe the bug

Hey, after upgrading the client library in our project https://github.com/snyk/kubernetes-monitor to version 0.17.0, we started getting reports from some of our AKS customers that the Kubernetes API server is issuing HTTP 429s and does not let the app recover.

Our customers provided logs from our app, and I can see that shortly after start-up (roughly 15 seconds to 2 minutes), AKS starts to heavily rate limit every informer connection. Even if we retry starting the informer, the rate limiting continues. This triggers the informer.on(ERROR) handlers. It seems to happen after the informer has been set up and a couple of UPDATE events have arrived in our app.
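To make "retry starting the informer" concrete, here is a minimal sketch of a restart loop with exponential backoff. It is illustrative only and not our app's actual code: `Restartable`, `backoffDelayMs`, and `restartWithBackoff` are hypothetical names, and the only assumption about the informer is that it exposes a `start()` method returning a promise.

```typescript
// Hypothetical shape: anything with an async start(), e.g. an informer object.
interface Restartable {
  start(): Promise<void>;
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Waits, then calls start(); on rejection waits longer and tries again.
// Resolves to true on the first successful start, false once attempts run out.
// `sleep` is injectable so the loop can be exercised without real timers.
async function restartWithBackoff(
  target: Restartable,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await sleep(backoffDelayMs(attempt));
    try {
      await target.start();
      return true;
    } catch (err) {
      // start() rejected (for example on an HTTP 429); retry after a longer delay.
      console.error(`restart attempt ${attempt + 1} failed`, err);
    }
  }
  return false;
}
```

A loop like this would typically be triggered from the informer's error handler. In our case the 429s keep coming regardless of how long we wait between attempts.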

Interestingly, this issue does not occur on version 0.15.1 of the client. Something must have changed between 0.15.1 and 0.17.0 that causes the API server to heavily rate limit the client. Affected customers are stuck for now on the release of our app that uses the 0.15.1 client and need help upgrading past it.

Do you have any idea what could be causing this? Can I provide any additional information to help narrow this down?

For context, here's all that I know:

  • Our app opens 8 namespaced informers to the Kubernetes API
  • The customer deploys the app in 3 namespaces; each deployment watches only its own namespace (so 24 connections in total)
  • Each namespace has around 50 workloads

EDIT:

We are starting to get more and more bug reports, not only on AKS. It appears to occur specifically when watching resources in a single namespace. Based on netstat output, there are thousands of connections to the Kubernetes API in the FIN_WAIT2 or CLOSE_WAIT state:

~ $ netstat -n | grep "172.20.0.1:443" | grep -e "FIN_WAIT2" -e "CLOSE_WAIT" | wc -l
2922
~ $ netstat -n | grep "172.20.0.1:443" | grep -e "FIN_WAIT2" -e "CLOSE_WAIT" | wc -l
2757
~ $ netstat -n | grep "172.20.0.1:443" | grep -e "FIN_WAIT2" -e "CLOSE_WAIT" | wc -l
3692
~ $ netstat -n | grep "172.20.0.1:443" | grep -e "FIN_WAIT2" -e "CLOSE_WAIT" | wc -l
3387
~ $ netstat -n | grep "172.20.0.1:443" | grep -e "FIN_WAIT2" -e "CLOSE_WAIT" | wc -l
4885
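
The commands above lump both states into a single number. To see FIN_WAIT2 and CLOSE_WAIT separately, the raw `netstat -n` output could be run through something like the sketch below. `countByState` is a hypothetical helper, and it assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```typescript
// Counts TCP connections to `remote` (e.g. "172.20.0.1:443") per state,
// given the text printed by `netstat -n`. Only the listed states are counted.
function countByState(
  netstatOutput: string,
  remote: string,
  states: string[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const s of states) {
    counts.set(s, 0);
  }
  for (const line of netstatOutput.split("\n")) {
    const cols = line.trim().split(/\s+/);
    // Assumed row layout: Proto Recv-Q Send-Q Local Foreign State
    const proto = cols[0] ?? "";
    const foreign = cols[4] ?? "";
    const state = cols[5] ?? "";
    if (proto.indexOf("tcp") !== 0) continue; // skips headers and non-TCP rows
    if (foreign === remote && counts.has(state)) {
      counts.set(state, (counts.get(state) ?? 0) + 1);
    }
  }
  return counts;
}
```

Calling `countByState(output, "172.20.0.1:443", ["FIN_WAIT2", "CLOSE_WAIT"])` returns one counter per state, which would show which side of the connection is failing to close.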

Client Version

0.17.0

Server Version

1.23.12

Environment (please complete the following information):

  • OS: Linux
  • Node.js: 16
  • Cloud runtime: AKS
