
📖 Add a design for supporting warm replicas. #3121


Open · wants to merge 2 commits into base: main

Conversation

godwinpang (Contributor):

This change describes the motivation and implementation details for supporting warm replicas in controller-runtime. I have floated this idea offline with @alvaroaleman to address some really slow sources we work with that take tens of minutes to serve the initial list. There is no open issue discussing it; let me know if one is preferred and I can open it.

Previously discussed in #2005 and #2600

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Feb 18, 2025.

k8s-ci-robot (Contributor):

Welcome @godwinpang!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: godwinpang
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Feb 18, 2025.

k8s-ci-robot (Contributor):

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Feb 18, 2025.

## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre-filled queue before it starts processing items.

Member:

As discussed offline, one way to avoid this is to use a metrics wrapper that suppresses them until leader election is won. But I'm not sure it's worth bothering.
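
For reference, a minimal sketch of that idea, assuming a wrapper around client-go's workqueue.MetricsProvider that swallows depth updates until a leadership flag flips (the wrapper types are hypothetical, and re-synchronizing the gauge on becoming leader is left out):

```go
import (
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// gatedDepthProvider delegates to a real workqueue.MetricsProvider but keeps
// the depth gauge at 0 until this replica has won leader election.
type gatedDepthProvider struct {
	workqueue.MetricsProvider
	isLeader *atomic.Bool
}

func (p gatedDepthProvider) NewDepthMetric(name string) workqueue.GaugeMetric {
	return gatedGauge{inner: p.MetricsProvider.NewDepthMetric(name), isLeader: p.isLeader}
}

type gatedGauge struct {
	inner    workqueue.GaugeMetric
	isLeader *atomic.Bool
}

// Inc and Dec are dropped while we are not the leader, so alerts on
// workqueue depth keep seeing 0 on standby replicas.
func (g gatedGauge) Inc() {
	if g.isLeader.Load() {
		g.inner.Inc()
	}
}

func (g gatedGauge) Dec() {
	if g.isLeader.Load() {
		g.inner.Dec()
	}
}
```

As discussed further below, the hard part is making the gauge match the real queue length once the flag flips.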

Member:

Does this actually break the metric? Sounds like the metric will just show the reality

It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.

Member:

> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.

Not sure I quite agree with that. The alerts work today; if we change the behavior here, we break them. To my knowledge there also isn't a metric that indicates whether a given replica is the leader, so I don't even see a good way to unbreak them.

Member:

Yeah, but the current definition of the metric is that it should show the length of the queue

Member:

> To my knowledge there also isn't a metric that indicates if a given replica is the leader

That could be solved, I hope :)
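
For illustration, a sketch of such a metric under current APIs, using the manager's Elected() channel and controller-runtime's metrics registry (the metric name is made up):

```go
import (
	"github.com/prometheus/client_golang/prometheus"

	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// leaderGauge reports 1 on the replica that holds the leader lease, 0 elsewhere.
var leaderGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "controller_runtime_is_leader", // hypothetical metric name
	Help: "1 if this replica has won leader election, 0 otherwise.",
})

func exposeLeaderMetric(mgr manager.Manager) {
	metrics.Registry.MustRegister(leaderGauge)
	go func() {
		// Elected() is closed once this manager wins (or skips) leader election.
		<-mgr.Elected()
		leaderGauge.Set(1)
	}()
}
```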

Member:

- logs via logState
- other workqueue metrics: adds_total, queue_duration_seconds
  - Although I guess we can also fake these. What would happen when the controller starts? I assume we would set the length metric immediately to its correct value, and similarly for adds_total and probably also queue_duration_seconds.

I also think folks have programmatic access to the queue (at least by instantiating the queue via controller.Options.NewQueue); see the sketch below.

So we don't know what kinds of things folks are doing with the queue, e.g. accessing queue length or taking items out of the queue even if the leader-election controllers are not running.
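
For example, a user can hold on to the queue like this (the capture itself is the user's own code; the NewQueue hook is the existing controller.Options field):

```go
import (
	"k8s.io/client-go/util/workqueue"

	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// capturedQueue stays observable (Len, Get, ...) to user code regardless of
// whether this replica ever becomes leader.
var capturedQueue workqueue.TypedRateLimitingInterface[reconcile.Request]

var opts = controller.Options{
	NewQueue: func(controllerName string, rl workqueue.TypedRateLimiter[reconcile.Request]) workqueue.TypedRateLimitingInterface[reconcile.Request] {
		q := workqueue.NewTypedRateLimitingQueueWithConfig(rl, workqueue.TypedRateLimitingQueueConfig[reconcile.Request]{
			Name: controllerName,
		})
		capturedQueue = q
		return q
	},
}
```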

godwinpang (Contributor, author):

> Then we maybe shouldn't store the items in the queue at this time, because that's observable behavior as well (not only through the metric), and not just make it look like the queue is empty through the metric.

That makes sense to me. I guess, reading between the lines, the suggestion is that non-leader runnables should not fill the real queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"? I am not too familiar with the details here, so I would appreciate some concrete nudges in the right direction.

Member:

> That makes sense to me. I guess, reading between the lines, the suggestion is that non-leader runnables should not fill the real queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"?

I don't know how that would be possible; we would need to implement a dummy queue to buffer.

Let's punt on this problem until we make this the default behavior; we don't have to solve it right away. As long as it's opt-in and there is a metric indicating whether a given replica is the leader, it won't break anyone by default, and people can adjust their alerts accordingly.
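
For the record, such a dummy buffering queue could look roughly like this (purely a hypothetical sketch of the punted idea, not something proposed for this PR):

```go
import (
	"sync"

	"k8s.io/client-go/util/workqueue"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// bufferingQueue holds Adds in a slice until Activate is called (e.g. on
// winning leader election), then flushes them into the real queue.
type bufferingQueue struct {
	workqueue.TypedRateLimitingInterface[reconcile.Request]

	mu      sync.Mutex
	active  bool
	pending []reconcile.Request
}

func (q *bufferingQueue) Add(item reconcile.Request) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if !q.active {
		q.pending = append(q.pending, item)
		return
	}
	q.TypedRateLimitingInterface.Add(item)
}

// Activate flushes buffered items and passes subsequent Adds straight through.
func (q *bufferingQueue) Activate() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.active = true
	for _, item := range q.pending {
		q.TypedRateLimitingInterface.Add(item)
	}
	q.pending = nil
}
```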

Member:

Sounds good!

Member:

Thought more about this. Probably the initial suggestion is fine. We should just make sure that the metrics popping up once the controller becomes leader make sense

## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre-filled queue before it starts processing items.
2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am

Member:

This will break conversion webhooks. I don't know if there is a good way to figure out if the binary contains a conversion webhook, but if in doubt we have to retain the current behavior

godwinpang (Contributor, author):

Can you explain why this breaks conversion webhooks? Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?

Member:

> Can you explain why this breaks conversion webhooks?

Because they need to be up to sync the cache, so blocking readyz until the cache is ready creates a deadlock.

> Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?

Because they are part of some controllers.

godwinpang (Contributor, author):

Got it. We can just put the burden on the user to register readyz checks as they want, then.

Member:

Potentially we can provide a util func like we did with mgr.GetWebhookServer().StartedChecker() to make it easier (can be done in a follow-up PR)

But I agree we can't add this by default, as it can lead to deadlocks for controllers that serve conversion webhooks.
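
A sketch of what such wiring could look like today; mgr.GetWebhookServer().StartedChecker() exists, while the cache-sync check is the hypothetical opt-in part for binaries that serve no conversion webhooks:

```go
import (
	"context"
	"errors"
	"net/http"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func setupReadyzChecks(ctx context.Context, mgr manager.Manager) error {
	// Existing helper: readyz fails until the webhook server is serving, so
	// conversion webhooks become reachable (they must be up for caches to sync).
	if err := mgr.AddReadyzCheck("webhook", mgr.GetWebhookServer().StartedChecker()); err != nil {
		return err
	}

	// Hypothetical opt-in: fail readyz until the shared cache has synced.
	// Safe only when the binary serves no conversion webhooks; otherwise
	// this is exactly the deadlock described above.
	var synced atomic.Bool
	go func() {
		if mgr.GetCache().WaitForCacheSync(ctx) {
			synced.Store(true)
		}
	}()
	return mgr.AddReadyzCheck("cache-sync", func(_ *http.Request) error {
		if !synced.Load() {
			return errors.New("caches not synced yet")
		}
		return nil
	})
}
```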

2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
not sure what the best way of implementing this is, because we would have to add a healthz check
that blocks on WaitForSync for all the sources started as part of the non-leader runnables.
3. An alternative way of implementing the above is to move the source starting / management code

Member:

That is kind of already the case: the source uses the cache, which is a runnable, to get the actual informer, and the cache is shared and started before anything else except conversion webhooks (as conversion webhooks might be needed to start it). The problem is just that the cache does not actually cache anything for resources for which no informer was requested, and that in turn only happens after the source is started, which happens post controller start and thus post leader election.
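
So, under the current behavior, a user can already force an informer to exist (and sync with the shared cache, ahead of leader election) by requesting it eagerly; a sketch, with Pod as an arbitrary example type:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// warmPodInformer asks the shared cache for a Pod informer up front, so the
// watch is started when the cache runnable starts, i.e. before this replica
// wins leader election, instead of lazily when the controller's source starts.
func warmPodInformer(ctx context.Context, mgr manager.Manager) error {
	_, err := mgr.GetCache().GetInformer(ctx, &corev1.Pod{})
	return err
}
```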


## Motivation
Controllers reconcile all objects during startup / leader election failover to account for changes
in the reconciliation logic. For certain sources, the time to serve the initial list can be

Member:

Just curious. Is it literally the list call that takes minutes or the subsequent reconciliation?

Does this change when the new ListWatch feature is used?

Member:

This is purely about the time it takes to start reconciling after a replica went down because of a rollout or an unforeseen event. Right now, that means we first acquire the leader lease, then sync all caches, then start reconciling. The goal of this doc is to do the second step before we even try to acquire the leader lease, as that takes the cache-sync time out of the transition time.
Agree the description could be a bit clearer.

sbueringer (Member), Feb 19, 2025:

Yeah that's fine. I was just curious about the list call :) (aka the cache sync)

downtime as even after leader election, the controller has to wait for the initial list to be served
before it can start reconciling.

## Proposal

Member:

What if there's a lot of churn and the controller doesn't become leader, maybe not at all or maybe only after a few days?

The queue length will increase while there is nothing that takes items out of the queue.

I know the queue doesn't require significant memory to store an item, but is there something we should be concerned about if the queue has e.g. millions of items (let's say we watch Pods and we don't become leader for a month)?

Member:

Worst case, it at some point gets OOM-killed, restarts, and does everything again. I don't think this is likely to become an actual issue.

Member:

Yeah definitely not a big problem. Maybe just good to know

Contributor:

Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.

In which case, when you perform the initial list, it adds every item to the queue, and then every object created after would be added to the queue, but the queue length would always be equal to the number of those objects in the cluster

We should double-check that this is actually how it works.

sbueringer (Member), May 12, 2025:

> Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.

Yes there is de-duplication. We are basically talking about a map :)

> queue length would always be equal to the number of those objects in the cluster

I don't think that deleted objects are removed from our queue. So if you have churn (i.e. any objects are deleted) the queue length won't be equal to the number of objects in the cluster.

But the queue only stores namespace + name. So it doesn't require a lot of memory per object.
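
A self-contained example of that de-duplication, using client-go's typed workqueue directly:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	q := workqueue.NewTyped[reconcile.Request]()
	req := reconcile.Request{NamespacedName: types.NamespacedName{Namespace: "default", Name: "foo"}}

	q.Add(req)
	q.Add(req) // de-duplicated: the key is already queued

	fmt.Println(q.Len()) // prints 1

	// Note: a delete event enqueues the same namespace/name key; nothing is
	// removed from the queue until a worker takes the item out with Get.
}
```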



sbueringer (Member):

Took another look. Definitely fine for now.

I would suggest we review/merge #3192 first and then finalize this PR afterwards.

JoelSpeed (Contributor) left a comment:

Do you have a POC or any implementation that would show what the difference is for an end user? Sources generally are informers, right? So, as someone implementing a controller, how would I start the informers in this case vs. what I would have normally done?

A worked example of the usage would probably be good to include in the design doc


```go
// when the manager is not the leader.
// Defaults to false, which means that the controller will wait for leader election to start
// before starting sources.
ShouldWarmupWithoutLeadership *bool
```

Contributor:

Why *bool over bool?

Member:

We usually like to use *bool so we can differentiate between the default value and whatever a user configured.

This is also necessary because we have two levels of configuration (manager and per-controller) and we have to be able to figure out if the per-controller config overrides the manager config; a sketch of that resolution follows.
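
A sketch of that resolution logic (a hypothetical helper; the real implementation and field names are being settled in #3192):

```go
// resolveWarmup returns the effective setting: the per-controller override
// wins, then the manager-level default, then the built-in default of false
// (i.e. wait for leader election before starting sources).
func resolveWarmup(managerDefault, controllerOverride *bool) bool {
	if controllerOverride != nil {
		return *controllerOverride
	}
	if managerDefault != nil {
		return *managerDefault
	}
	return false
}
```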

JoelSpeed (Contributor):

Has the impact on API servers been assessed for this proposal? IIUC, the slow-to-start sources are likely informers that are having to page through large volumes of resources, right? So there's already a higher-than-average load on the API server for these types of resources, and this EP is basically going to double the API server load? Am I following this correctly?

alvaroaleman (Member):

> Has the impact on API servers been assessed for this proposal? IIUC, the slow-to-start sources are likely informers that are having to page through large volumes of resources, right? So there's already a higher-than-average load on the API server for these types of resources, and this EP is basically going to double the API server load? Am I following this correctly?

It's an opt-in feature, and in the vast majority of cases the kube-apiserver is managed. The problem this solves is that failover is currently slow because the watches are only started after leader election is won.

sbueringer (Member), May 12, 2025:

Maybe additional apiserver load is worth mentioning in the godoc of the field that enables the feature?

> Do you have a POC or any implementation that would show what the difference is for an end user? Sources generally are informers, right? So, as someone implementing a controller, how would I start the informers in this case vs. what I would have normally done?

The implementation can be found here: #3192

You only have to set NeedWarmup on the manager or per-controller, and everything else works out of the box.

I would recommend reviewing the PR, as we are figuring out some details there at the moment (e.g. I raised the question of whether WarmupRunnable should be exported, xref: #3192 (comment); this would drastically reduce the API surface of this feature).
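
For a rough feel of the usage (field name per the discussion here and #3192; the final API may differ):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		WithOptions(controller.Options{
			// Hypothetical field per this design: start this controller's
			// sources (watches/caches) before leader election is won.
			NeedWarmup: ptr.To(true),
		}).
		Complete(r)
}
```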

JoelSpeed (Contributor):

> It's an opt-in feature, and in the vast majority of cases the kube-apiserver is managed. The problem this solves is that failover is currently slow because the watches are only started after leader election is won.

Opt-in for the operator author, or opt-in for the end user who deploys that operator to their cluster? I suspect we might want to recommend that folks expose an option to end users to be able to turn this on/off on a cluster-by-cluster basis.

I guess that would be exposing the NeedWarmup flag to the end user right?

alvaroaleman (Member), May 12, 2025:

> Opt-in for the operator author, or opt-in for the end user who deploys that operator to their cluster? I suspect we might want to recommend that folks expose an option to end users to be able to turn this on/off on a cluster-by-cluster basis.

That is up to the operator author to decide. I personally don't think the duplicate watches are an issue for the apiserver; if that can make it fall over, it's too small anyway. But generally speaking, making a call as to what to do or not to do by default, and whether an end-user option should be exposed, is up to the controller author; I don't believe controller-runtime is in a good position to make a recommendation there.

sbueringer (Member):

Usually we don't go as far as recommending command line flags in controller-runtime. That's maybe more kubebuilder territory.

Labels: cncf-cla: yes · needs-ok-to-test · size/L