Do fewer Kubernetes API requests #2699


Merged: 2 commits into hashicorp:main from goroutine-safe-cache on Jun 3, 2025

Conversation


@g7r g7r commented Mar 8, 2025

Description

The first reason the provider makes so many requests is that its caches aren't goroutine-safe. For example, Terraform invokes the provider concurrently, and every individual goroutine starts its own `getOAPIv2Foundry()` invocation. Each `getOAPIv2Foundry()` call issues its own Kubernetes API request, and these requests consume the Kubernetes client's Burst budget, eventually leading to a stall.

The second reason is that CRDs should also be cached: production Kubernetes clusters may have a large number of CRDs, and fetching them all is not cheap. Furthermore, fetching them over and over consumes the Kubernetes client's Burst budget, leading to a major stall.

Release Note

Release note for CHANGELOG:

Make fewer Kubernetes API requests
Make `RawProviderServer` caches goroutine-safe

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

@g7r g7r requested a review from a team as a code owner March 8, 2025 17:33

hashicorp-cla-app bot commented Mar 8, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the size/L label Mar 8, 2025

g7r commented Apr 10, 2025

Hello @alexsomesan, @jrhouston, and @arybolovlev,

Hope you're having a good week.

I submitted this PR about a month ago and just wanted to gently follow up to make sure the notification came through and it landed on your radar.

This PR has two main benefits: it addresses a significant performance issue (reducing `terraform plan` time from ~3 minutes back down to ~15 seconds) and also helps make the provider's behavior more correct in this context.

I'd be grateful if you could take a look when you have a moment.

Thank you for your consideration and all your work on this provider!

@alexsomesan (Member) commented:

Hello @g7r
Thanks for taking the time to look into these performance issues.

Our whole team is away this week, taking part in a company event. I will have a look at your PR first thing next week, when I'm back to a regular schedule.


g7r commented May 21, 2025

@alexsomesan, I still badly need this fix merged 🙏🏻🙏🏻🙏🏻

This little PR only fixes an erroneous behaviour under the hood without introducing anything new. It does, however, make life much easier for those with hundreds of Kubernetes resources in their Terraform stacks.

@alexsomesan (Member) commented:

@g7r Apologies, I'm reviewing this today.

@alexsomesan alexsomesan force-pushed the goroutine-safe-cache branch from 421717f to 57c2fb9 Compare May 21, 2025 13:42
@github-actions github-actions bot added size/XL and removed size/L labels May 22, 2025

@alexsomesan alexsomesan left a comment


I think the changes look technically correct. However, I'm finding it difficult to understand where the performance gains are coming from. See my spot comment about the caching mechanism, as an example.

Could you please elaborate a bit on how you came to this solution?

err error
}

func (c *cache[T]) Get(f func() (T, error)) (T, error) {
@alexsomesan (Member) commented on this code:


I think this pattern looks better than the existing approach, but I fail to see the functional difference that using `sync.Once` would bring. Code aesthetics aside, is there something I am missing functionality-wise?

@g7r (Contributor, PR author) replied:


`cache` just encapsulates `sync.Once` with its result and a possible error. If you think it'd be better to inline the logic into `RawProviderServer`, I'm OK with that.

@g7r (Contributor, PR author) replied:


Or do you mean that `sync.Once` shouldn't be used at all? Without `sync.Once` we would need a custom mechanism for the same thing. I have no problem implementing an alternative using `sync.RWMutex`, but it would be functionally identical, albeit more verbose.

@alexsomesan (Member) replied:


No, `sync.Once` should be good enough here. I just wasn't considering it from a concurrency perspective, which is where it's definitely the right choice compared to what we had before.


g7r commented May 26, 2025

> Could you please elaborate a bit on how you came to this solution?

Sure. Terraform may issue multiple concurrent requests to the provider when there are multiple Kubernetes resources in the stack. When the provider serves a request, it first tries to retrieve CRDs from the Kubernetes cluster. The provider has caches, but the implementation lacks goroutine-safety.

So, when multiple concurrent requests come in, the provider checks whether it already has CRDs cached. It doesn't, so it starts multiple concurrent requests to the Kubernetes API, because there are no synchronization primitives. The main issue with this behaviour is that these redundant requests consume the Kubernetes client's request budget. The Kubernetes client uses client-side throttling by default, and its request budget is quite limited. When the budget is exhausted, all Kubernetes API requests stall for a minute or so.

We have stacks with a lot of Kubernetes resources. When we run Terraform on these stacks with a higher N in -parallelism=N, the Kubernetes request queue in the provider may take several minutes to drain through the Kubernetes client. This PR solves the problem by fixing goroutine safety in CRD requests: the provider now makes at most one request for the CRDs, and all requests except the first either wait for that first request to finish or take the cached result.
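The throttling effect described above can be sketched with a back-of-the-envelope estimate. The limiter values used here (QPS 5, Burst 10) are client-go's historical client-side defaults; treat them as assumptions rather than values read from this provider's configuration:

```go
package main

import "fmt"

// stallSeconds estimates how long a batch of API requests waits behind a
// client-side token-bucket limiter: the first `burst` requests pass
// immediately, and the remainder drain at `qps` requests per second.
// QPS = 5 and Burst = 10 are assumed client-go defaults.
func stallSeconds(requests int, qps float64, burst int) float64 {
	if requests <= burst {
		return 0
	}
	return float64(requests-burst) / qps
}

func main() {
	// 40 goroutines each redundantly re-fetching CRDs vs. one cached fetch:
	fmt.Printf("uncached: ~%.0fs of throttling\n", stallSeconds(40, 5, 10))
	fmt.Printf("cached:   ~%.0fs of throttling\n", stallSeconds(1, 5, 10))
}
```

With expensive CRD listings multiplied by `-parallelism=N`, this queueing compounds quickly, which matches the multi-minute plans reported above.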

g7r added 2 commits June 3, 2025 18:07
@alexsomesan alexsomesan force-pushed the goroutine-safe-cache branch from 305ee34 to 6080b7a Compare June 3, 2025 16:07
@alexsomesan (Member) commented:

@g7r Thanks a lot for the thorough explanation. That all makes good sense. I wasn't considering the perspective of concurrent goroutines when reading through your changes, but it does look great with that perspective in mind.


@alexsomesan alexsomesan left a comment


Looks good. Thanks for this work!


@alexsomesan merged commit d68711a into hashicorp:main on Jun 3, 2025.
90 checks passed.
3 participants