Do fewer Kubernetes API requests #2699


Merged: 2 commits into hashicorp:main from goroutine-safe-cache on Jun 3, 2025

Conversation


@g7r g7r commented Mar 8, 2025

Description

The first reason the provider makes so many requests is that its caches aren't goroutine-safe. For example, Terraform invokes the provider concurrently, and every individual goroutine starts its own `getOAPIv2Foundry()` invocation. Each `getOAPIv2Foundry()` call issues its own Kubernetes API request, and these requests consume the Kubernetes client's Burst budget, eventually leading to a stall.

The second reason is that CRDs should also be cached: production Kubernetes clusters may have a large number of CRDs, and fetching them all is not cheap. Furthermore, fetching them over and over consumes the Kubernetes client's Burst budget, leading to a major stall.

Release Note

Release note for CHANGELOG:

Make fewer Kubernetes API requests
Make `RawProviderServer` caches goroutine-safe

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

@g7r g7r requested a review from a team as a code owner March 8, 2025 17:33

hashicorp-cla-app bot commented Mar 8, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the size/L label Mar 8, 2025

g7r commented Apr 10, 2025

Hello @alexsomesan, @jrhouston, and @arybolovlev,

Hope you're having a good week.

I submitted this PR about a month ago and just wanted to gently follow up to make sure the notification came through and it landed on your radar.

This PR has two main benefits: it addresses a significant performance issue (reducing `terraform plan` time from ~3 minutes back down to ~15 seconds) and also helps make the provider's behavior more correct in this context.

I'd be grateful if you could take a look when you have a moment.

Thank you for your consideration and all your work on this provider!

@alexsomesan (Member) commented:

Hello @g7r
Thanks for taking the time to look into these performance issues.

Our whole team is away this week, taking part in a company event. I will have a look at your PR first thing next week, when I'm back to a regular schedule.


g7r commented May 21, 2025

@alexsomesan, I still badly need this fix merged 🙏🏻🙏🏻🙏🏻

This little PR only fixes an erroneous behaviour under the hood without introducing anything new. It does, however, make life much easier for those with hundreds of Kubernetes resources in their Terraform stacks.

@alexsomesan (Member) commented:

@g7r Apologies, I'm reviewing this today.

@alexsomesan alexsomesan force-pushed the goroutine-safe-cache branch from 421717f to 57c2fb9 Compare May 21, 2025 13:42
@github-actions github-actions bot added size/XL and removed size/L labels May 22, 2025

@alexsomesan alexsomesan left a comment


I think the changes look technically correct. However, I'm finding it difficult to understand where the performance gains are coming from. See my spot comment about the caching mechanism, as an example.

Could you please elaborate a bit on how you came to this solution?

err error
}

func (c *cache[T]) Get(f func() (T, error)) (T, error) {
@alexsomesan (Member) commented on this code:


I think this pattern looks better than the existing approach, but I fail to see the functional difference that using `sync.Once` would bring. Code aesthetics aside, is there something I am missing functionality-wise?

@g7r (Contributor, PR author) replied:


`cache` just encapsulates `sync.Once` with its result and a possible error. If you think it'd be better to inline the logic into `RawProviderServer`, I'm OK with that.

@g7r (Contributor, PR author) replied:


Or do you mean that `sync.Once` shouldn't be used at all? Without `sync.Once` we would need a custom mechanism for the same thing. I have no problem implementing an alternative using `sync.RWMutex`, but it would be functionally identical, albeit more verbose.

@alexsomesan (Member) replied:


No, `sync.Once` should be good enough here. I just wasn't considering it from a concurrency perspective, which is where it's definitely the right choice compared to what we had before.


g7r commented May 26, 2025

> Could you please elaborate a bit on how you came to this solution?

Sure. Terraform may issue multiple concurrent requests to the provider when there are multiple Kubernetes resources in the stack. When the provider serves a request, it first tries to retrieve CRDs from the Kubernetes cluster. The provider has caches, but the implementation lacks goroutine-safety.

So, when multiple concurrent requests come in, the provider checks whether it already has CRDs cached. It doesn't, so it starts multiple concurrent requests to the Kubernetes API, because there are no synchronization primitives. The main issue with this behaviour is that these redundant requests consume the Kubernetes client's request budget. The Kubernetes client uses client-side throttling by default, and its request budget is quite limited. When the budget is exhausted, all Kubernetes API requests stall for a minute or so.

We have stacks with a lot of Kubernetes resources. When we run Terraform on these stacks with a higher N in -parallelism=N, the Kubernetes request queue in the provider may take several minutes to drain through the Kubernetes client. This PR solves the problem by fixing goroutine safety in CRD requests: the provider now makes at most one request for the CRDs, and all requests except the first either wait for that first request to finish or take the cached result.
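The throttling effect described above can be sketched with a back-of-the-envelope estimate. The limiter values used here (QPS 5, Burst 10) are client-go's historical client-side defaults; treat them as assumptions rather than values read from this provider's configuration:

```go
package main

import "fmt"

// stallSeconds estimates how long a batch of API requests waits behind a
// client-side token-bucket limiter: the first `burst` requests pass
// immediately, and the remainder drain at `qps` requests per second.
// QPS = 5 and Burst = 10 are assumed client-go defaults.
func stallSeconds(requests int, qps float64, burst int) float64 {
	if requests <= burst {
		return 0
	}
	return float64(requests-burst) / qps
}

func main() {
	// 40 goroutines each redundantly re-fetching CRDs vs. one cached fetch:
	fmt.Printf("uncached: ~%.0fs of throttling\n", stallSeconds(40, 5, 10))
	fmt.Printf("cached:   ~%.0fs of throttling\n", stallSeconds(1, 5, 10))
}
```

With expensive CRD listings multiplied by `-parallelism=N`, this queueing compounds quickly, which matches the multi-minute plans reported above.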

g7r added 2 commits June 3, 2025 18:07
@alexsomesan alexsomesan force-pushed the goroutine-safe-cache branch from 305ee34 to 6080b7a Compare June 3, 2025 16:07
@alexsomesan (Member) commented:

@g7r Thanks a lot for the thorough explanation. That all makes good sense. I wasn't considering the perspective of concurrent goroutines when reading through your changes, but it does look great with that perspective in mind.


@alexsomesan alexsomesan left a comment


Looks good. Thanks for this work!


@alexsomesan merged commit d68711a into hashicorp:main on Jun 3, 2025.
90 checks passed.
3 participants