
Can't failover due to replica lag #1476

Open
@davidkarlsen

Description

Please answer some short questions which will help us understand your problem/question better:

  • Which image of the operator are you using?
    1.6.2

  • Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s]
    OpenShift

  • Are you running Postgres Operator in production? [yes | no]
    no

  • Type of issue? [Bug report, question, feature request, etc.]
    question/bug

I am cordoning the node which hosts the master pod, and the operator reports:

time="2021-04-29T12:03:36Z" level=info msg="moving pods: node \"/alt-eos-g-c01oco03\" became unschedulable and does not have a ready label: map[]" pkg=controller
time="2021-04-29T12:03:36Z" level=info msg="starting process to migrate master pod \"adc-dev/adc-batchinator-db-1\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Found 1 running replica pods" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=info msg="check failed: pod \"adc-dev/adc-batchinator-db-0\" is already on a live node" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="switching over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="making POST http request: http://10.200.12.5:8008/failover" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="subscribing to pod \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=debug msg="unsubscribing from pod \"adc-dev/adc-batchinator-db-0\" events" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=error msg="could not move master pod \"adc-dev/adc-batchinator-db-1\": could not failover to pod \"adc-dev/adc-batchinator-db-0\": could not switch over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\": patroni returned 'Failover failed'" pkg=controller
time="2021-04-29T12:03:37Z" level=info msg="0/1 master pods have been moved out from the \"/alt-eos-g-c01oco03\" node" pkg=controller
time="2021-04-29T12:03:37Z" level=warning msg="failed to move master pods from the node \"alt-eos-g-c01oco03\": could not move master 1/1 pods from the \"/alt-eos-g-c01oco03\" node" pkg=controller

The failure seems to be caused by replication lag; the Patroni log on the current master shows:

2021-04-29 12:03:36,505 INFO: received failover request with leader=adc-batchinator-db-1 candidate=adc-batchinator-db-0 scheduled_at=None
2021-04-29 12:03:36,515 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,653 INFO: Lock owner: adc-batchinator-db-1; I am adc-batchinator-db-1
2021-04-29 12:03:36,711 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,801 INFO: Member adc-batchinator-db-0 exceeds maximum replication lag
2021-04-29 12:03:36,801 WARNING: manual failover: no healthy members found, failover is not possible
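Patroni rejects the candidate because its lag exceeds Patroni's `maximum_lag_on_failover` threshold (1048576 bytes, i.e. 1 MB, by default). The lag is the byte difference between the leader's WAL write position and the replica's replayed position. As a minimal sketch of how textual PostgreSQL LSNs map to the byte positions Patroni compares (the LSN strings below are made-up examples):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN like '1/30200000' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(leader_lsn: str, replica_lsn: str) -> int:
    """Byte lag of a replica behind the leader."""
    return lsn_to_bytes(leader_lsn) - lsn_to_bytes(replica_lsn)

# A replica 2 MiB behind would be rejected under the default
# maximum_lag_on_failover of 1048576 bytes.
MAX_LAG = 1048576
print(lag_bytes("1/30200000", "1/30000000") > MAX_LAG)  # → True
```

Note that in the log above the replica's `received_location` and `replayed_location` are equal, so the replica has replayed everything it has received; the lag Patroni complains about is between the leader's write position and what the replica has received.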

My manifest is:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  namespace: "adc-dev"
  name: "adc-batchinator-db"
spec:
  teamId: "adc"
  volume:
    storageClass: "openebs-local"
    size: "2Gi"
  numberOfInstances: 2
  users:
    batchinator:
    - superuser
    - createdb
    batchinator_user: []
  databases:
    # name: owner
    batchinator: batchinator
  postgresql:
    version: "13"
  patroni:
    pg_hba:
      - "local    all all trust"
      - "host     all all localhost trust"
      - "host     postgres all localhost ident"
      - "hostssl  replication standby all md5"
      - "hostssl  all all 0.0.0.0/0 md5"
      - "host     all all 0.0.0.0/0 md5"
      - "hostssl  all +pamrole all pam"
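If the 1 MB default turns out to be too strict for the environment, Patroni's `maximum_lag_on_failover` can be raised via the operator's cluster manifest under `spec.patroni`. A hedged sketch (assuming the operator version in use exposes this field; the value is in bytes):

```yaml
# Sketch: raising Patroni's failover lag threshold in the postgresql manifest.
spec:
  patroni:
    maximum_lag_on_failover: 33554432  # bytes (32 MiB); Patroni's default is 1048576
```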

What could cause the replication lag, and why is the replica not catching up? The database is basically idle.
Is there a metric one can track in order to raise alerts when the lag gets too large?
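The lag Patroni acts on can be derived from its REST API (port 8008, as seen in the operator log above): the leader's `xlog.location` minus the replica's `xlog.replayed_location`. A sketch using hypothetical `/patroni` response bodies whose field names match the log output above (the leader's position is an assumed value for illustration):

```python
import json

# Hypothetical responses from GET http://<pod-ip>:8008/patroni.
leader_body = '{"role": "master", "xlog": {"location": 5102370816}}'
replica_body = ('{"role": "replica", "xlog": {"received_location": 5100273664, '
                '"replayed_location": 5100273664}}')

leader = json.loads(leader_body)
replica = json.loads(replica_body)

# Byte lag as Patroni sees it: leader write position minus replica replay position.
lag = leader["xlog"]["location"] - replica["xlog"]["replayed_location"]
print(lag)            # 2097152 bytes in this made-up example
print(lag > 1048576)  # would exceed the default maximum_lag_on_failover
```

On the SQL side, `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;` on the primary gives the same number per standby, which is straightforward to export as an alerting metric.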

Labels: patroni (Issue more related to Patroni), postgres (Issue more related to PostgreSQL)