Cluster fails to restart if the lock is held by a pod other than the first one in the statefulset #1978

Open
@christiaangoossens

Description

Please, answer some short questions which should help us to understand your problem / question better?

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? k0s
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug

Some general remarks when posting a bug report:

  • Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.
  • If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

The cluster fails to restart if pod-1 holds the lock while the statefulset is restarting. Pod-0 starts, recognises that pod-1 holds the lock, and waits for it. It eventually settles into a state where it knows it is not up to date and therefore keeps waiting forever.

Because the statefulset uses podManagementPolicy OrderedReady instead of Parallel, pod-1 is never started: the statefulset controller waits for pod-0 to become ready before creating the next pod. The cluster therefore stays broken.

In my case this happened when a node failed: there was a failover to pod-1, and then both pods died.
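
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not Kubernetes or Patroni code: `start_pods` and `is_ready` are hypothetical helpers, and the readiness rule (a pod is ready only if it holds the lock itself or the lock holder has been started) is a simplifying assumption based on the behaviour reported in this issue.

```python
def is_ready(pod, lock_holder, started):
    # Assumption: a pod becomes ready only if it holds the lock itself
    # or the pod holding the lock has already been started.
    return pod == lock_holder or lock_holder in started


def start_pods(replicas, lock_holder, policy):
    """Toy model: return the ordinals of the pods that get started."""
    started = []
    for ordinal in range(replicas):
        if policy == "OrderedReady":
            # Ordered startup: the next pod is created only once every
            # previously started pod is ready.
            if not all(is_ready(p, lock_holder, started) for p in started):
                break
        # With "Parallel", pods are created without waiting for readiness.
        started.append(ordinal)
    return started


# Lock held by pod-1: ordered startup stalls after pod-0.
print(start_pods(2, lock_holder=1, policy="OrderedReady"))  # [0]
# Same situation with parallel startup: both pods are created.
print(start_pods(2, lock_holder=1, policy="Parallel"))      # [0, 1]
# Lock held by pod-0: ordered startup works.
print(start_pods(2, lock_holder=0, policy="OrderedReady"))  # [0, 1]
```

In this model the stall only appears when the lock holder has a higher ordinal than a pod that must become ready first, which matches the failover-to-pod-1 scenario above.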
