Do less Kubernetes API requests #2699
Conversation
Hello @alexsomesan, @jrhouston, and @arybolovlev, hope you're having a good week. I submitted this PR about a month ago and just wanted to gently follow up to make sure the notification came through and it landed on your radar. This PR addresses a significant performance issue by reducing redundant Kubernetes API requests, without changing any behaviour users rely on. I'd be grateful if you could take a look when you have a moment. Thank you for your consideration and all your work on this provider!
Hello @g7r Our whole team is away this week, taking part in a company event. I will have a look at your PR first thing next week, when I'm back to a regular schedule.
@alexsomesan, I still badly need this fix merged 🙏🏻🙏🏻🙏🏻 This little PR only fixes an erroneous behaviour under the hood without introducing anything new. It does, however, make life much easier for those with hundreds of Kubernetes resources in their Terraform stacks.
@g7r Apologies, I'm reviewing this today.
Force-pushed 421717f to 57c2fb9
I think the changes look technically correct. However, I'm finding it difficult to understand where the performance gains are coming from. See my spot comment about the caching mechanism, as an example.
Could you please elaborate a bit on how you came to this solution?
	err error
}

func (c *cache[T]) Get(f func() (T, error)) (T, error) {
I think this pattern looks better than the existing approach, but I fail to see the functional difference that using sync.Once() would bring. Code aesthetics aside, is there something that I am missing functionality-wise?
cache just encapsulates sync.Once with its result and a possible error. If you think it'd be better to inline the logic into RawProviderServer, I'm OK with that.
Or do you mean that sync.Once shouldn't be used at all? Without sync.Once we are going to need a custom mechanism for the same thing. I have no problem implementing an alternative using sync.RWMutex, but it is going to be functionally identical, albeit more verbose.
No, sync.Once should be good enough here. I just wasn't considering it from a concurrency perspective, which is where it's definitely the right choice compared to what we had before.
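For illustration, here is a minimal sketch of what such a sync.Once-backed cache could look like. Only the Get signature is taken from the diff above; the package name, field names, and comments are assumptions, not the PR's actual code.

```go
package provider

import "sync"

// cache memoizes the result of a single expensive call (for example an
// OpenAPI/CRD discovery request) so that concurrent goroutines trigger
// the underlying call only once.
type cache[T any] struct {
	once sync.Once
	v    T
	err  error
}

// Get runs f exactly once, even with concurrent callers; every caller
// receives the same cached value and error.
func (c *cache[T]) Get(f func() (T, error)) (T, error) {
	c.once.Do(func() {
		c.v, c.err = f()
	})
	return c.v, c.err
}
```

With a shape like this, all concurrent callers of the same cached lookup share a single Kubernetes API request instead of each issuing their own.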
Sure. Terraform may issue multiple concurrent requests to the provider when there are multiple Kubernetes resources in the stack. When the provider serves a request, it first tries to retrieve CRDs from the Kubernetes cluster. The provider has caches, but the implementation lacks goroutine-safety. So, when multiple concurrent requests come in, the provider checks whether it already has the CRDs cached; it doesn't, so it starts multiple concurrent requests to the Kubernetes API, because there are no synchronization primitives. The main issue with this behaviour is that these redundant requests consume the Kubernetes client's request budget: the Kubernetes client uses client-level throttling by default, and its request budget is quite limited. When the budget is exhausted, all Kubernetes API requests stall for a minute or so. We have stacks with a lot of Kubernetes resources, and when we run Terraform on them with higher parallelism, the budget is exhausted and the whole run stalls.
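For context on the throttling mentioned above: client-go's rest.Config exposes the client-side rate-limiter knobs. A hedged sketch of where they live follows; the host and the values are purely illustrative, not what the provider actually sets.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	// rest.Config carries the client-side rate-limiter settings used by
	// client-go's default token-bucket throttler. When many goroutines fire
	// redundant discovery/CRD requests, this budget is what gets exhausted,
	// after which every subsequent API call waits on the limiter.
	cfg := &rest.Config{
		Host:  "https://example.invalid", // placeholder API server
		QPS:   50,                        // illustrative value
		Burst: 100,                       // illustrative value
	}
	fmt.Printf("QPS=%v Burst=%v\n", cfg.QPS, cfg.Burst)
}
```

The exact defaults depend on the client-go version, but they are small enough that a flood of duplicate discovery requests can exhaust them quickly.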
Force-pushed 305ee34 to 6080b7a
@g7r Thanks a lot for the thorough explanation. That all makes good sense. I wasn't considering the perspective of concurrent goroutines when reading through your changes, but it does look great with that perspective in mind.
Looks good. Thanks for this work!
Description
The first reason the provider makes so many requests is that its caches aren't goroutine-safe. For example, Terraform invokes the provider concurrently, and every individual goroutine starts its own getOAPIv2Foundry() invocation. Every getOAPIv2Foundry() call issues its own Kubernetes API request, and these requests consume the Kubernetes client's Burst budget, eventually leading to a stall.
The second reason is that CRDs should also be cached: production Kubernetes clusters may have lots and lots of CRDs, and retrieving them all is not cheap. Furthermore, fetching them over and over consumes the Kubernetes client's Burst budget, leading to a major stall.
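To illustrate the first point, here is a hedged sketch (hypothetical names, not the provider's actual code) of how a check-then-fetch cache without mutual exclusion lets concurrent goroutines each fire their own expensive request:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// racyCache mimics a check-then-fetch cache with no mutual exclusion:
// concurrent callers can all observe an empty cache and each start their
// own expensive fetch, duplicating API requests and burning Burst budget.
type racyCache struct {
	value atomic.Pointer[string]
}

func (c *racyCache) get(fetch func() *string) *string {
	if v := c.value.Load(); v != nil {
		return v
	}
	v := fetch() // several goroutines may reach this line at the same time
	c.value.Store(v)
	return v
}

func main() {
	var fetches atomic.Int32
	c := &racyCache{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.get(func() *string {
				fetches.Add(1)                    // count duplicate "API requests"
				time.Sleep(10 * time.Millisecond) // simulate a slow API call
				s := "result"
				return &s
			})
		}()
	}
	wg.Wait()
	// Typically prints a number greater than 1: each duplicate fetch would
	// have been a real Kubernetes API request eating into the Burst budget.
	fmt.Println("fetches:", fetches.Load())
}
```

Guarding the fetch with sync.Once, as in the cache sketch earlier in the thread, collapses those duplicates into a single request.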