Revising the worker pattern #879

Open
@fischerman

While trying to understand the problem described in #809, I've learned a lot about the internals of the Postgres operator. I've noticed that there is only a limited number of event types in the queues, all regarding the Postgres CRD. The workers process one event after another. However, processing an event might take quite some time. In the worst case, some timeout (10m by default) is triggered, leaving the worker blocked for a long time. Given that every cluster is mapped to a single worker, that worker can quickly become a bottleneck. Effectively, a single cluster affects all other clusters assigned to the same worker, including the creation of new ones. There are many reasons why pods might fail: resource quota exceeded, volume missing, no suitable node, ... In our setting we can't make sure that all clusters are running fine, but we have to make sure that the operator itself runs stably.

I was wondering if one of the authors can shed some light on the design decision behind the worker pattern and fixing clusters to workers.

I believe that the stability could be improved by either ...

  • ... removing the workers as the central component between clusters and having one goroutine per cluster. Goroutines are fairly cheap.
  • ... splitting the worker jobs into smaller steps and processing the next event whenever we have something to wait for. For example, when creating a cluster we wouldn't wait for pods to be ready. Instead, the pod informer inserts an event for every new pod (or pod change) and we continue from there. This is somewhat inspired by the JavaScript event loop.
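
As a rough illustration of the first option, the sketch below routes events to one goroutine per cluster. None of this exists in the operator: `Event`, `Dispatcher`, `Dispatch`, `Close`, and `demo` are hypothetical names, and the queue size of 16 is arbitrary. In the demo the "broken" cluster stays blocked until the "healthy" one has been handled, which would deadlock if both shared a single worker.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical stand-in for a queued cluster event.
type Event struct {
	Cluster string
	Kind    string
}

// Dispatcher owns one queue and one goroutine per cluster, so events of the
// same cluster stay ordered while different clusters never wait on each other.
type Dispatcher struct {
	mu     sync.Mutex
	queues map[string]chan Event
	wg     sync.WaitGroup
	handle func(Event)
}

func NewDispatcher(handle func(Event)) *Dispatcher {
	return &Dispatcher{queues: make(map[string]chan Event), handle: handle}
}

// Dispatch enqueues ev, starting the cluster's goroutine on first use.
func (d *Dispatcher) Dispatch(ev Event) {
	d.mu.Lock()
	q, ok := d.queues[ev.Cluster]
	if !ok {
		q = make(chan Event, 16) // arbitrary buffer size for this sketch
		d.queues[ev.Cluster] = q
		d.wg.Add(1)
		go func() {
			defer d.wg.Done()
			for e := range q {
				d.handle(e)
			}
		}()
	}
	d.mu.Unlock()
	q <- ev
}

// Close stops all cluster goroutines once their queues are drained.
// It must only be called after every Dispatch call has returned.
func (d *Dispatcher) Close() {
	d.mu.Lock()
	for _, q := range d.queues {
		close(q)
	}
	d.mu.Unlock()
	d.wg.Wait()
}

// demo returns the order in which the two clusters finish processing.
func demo() []string {
	var mu sync.Mutex
	var order []string
	healthyDone := make(chan struct{})
	d := NewDispatcher(func(ev Event) {
		if ev.Cluster == "broken" {
			<-healthyDone // simulates a long wait, e.g. for pods that never start
		}
		mu.Lock()
		order = append(order, ev.Cluster)
		mu.Unlock()
		if ev.Cluster == "healthy" {
			close(healthyDone)
		}
	})
	d.Dispatch(Event{Cluster: "broken", Kind: "ADD"})
	d.Dispatch(Event{Cluster: "healthy", Kind: "ADD"})
	d.Close()
	return order
}

func main() {
	fmt.Println(demo()) // prints "[healthy broken]"
}
```

One caveat of this sketch: if a stuck cluster's buffer fills up, `Dispatch` blocks on the send, so a real implementation would need an unbounded or rate-limited queue per cluster, plus cleanup of goroutines for deleted clusters.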

I realize that both of those changes are quite large; I just want to get the discussion started.

In the meantime, I was wondering whether it is a good workaround to increase the number of workers beyond the number of clusters, say above 100. Is that reasonable?
