
Can't failover due to replica lag #1476

Open
@davidkarlsen

Description

Please answer some short questions which will help us understand your problem/question better:

  • Which image of the operator are you using?
    1.6.2

  • Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s]
    OpenShift

  • Are you running Postgres Operator in production? [yes | no]
    no

  • Type of issue? [Bug report, question, feature request, etc.]
    question/bug

I am cordoning the node which hosts the master pod, and the operator reports:

time="2021-04-29T12:03:36Z" level=info msg="moving pods: node \"/alt-eos-g-c01oco03\" became unschedulable and does not have a ready label: map[]" pkg=controller
time="2021-04-29T12:03:36Z" level=info msg="starting process to migrate master pod \"adc-dev/adc-batchinator-db-1\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Found 1 running replica pods" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=info msg="check failed: pod \"adc-dev/adc-batchinator-db-0\" is already on a live node" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="switching over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="making POST http request: http://10.200.12.5:8008/failover" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="subscribing to pod \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=debug msg="unsubscribing from pod \"adc-dev/adc-batchinator-db-0\" events" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=error msg="could not move master pod \"adc-dev/adc-batchinator-db-1\": could not failover to pod \"adc-dev/adc-batchinator-db-0\": could not switch over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\": patroni returned 'Failover failed'" pkg=controller
time="2021-04-29T12:03:37Z" level=info msg="0/1 master pods have been moved out from the \"/alt-eos-g-c01oco03\" node" pkg=controller
time="2021-04-29T12:03:37Z" level=warning msg="failed to move master pods from the node \"alt-eos-g-c01oco03\": could not move master 1/1 pods from the \"/alt-eos-g-c01oco03\" node" pkg=controller

The failure seems to be caused by replication lag; the Patroni log on the current master shows:

2021-04-29 12:03:36,505 INFO: received failover request with leader=adc-batchinator-db-1 candidate=adc-batchinator-db-0 scheduled_at=None
2021-04-29 12:03:36,515 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,653 INFO: Lock owner: adc-batchinator-db-1; I am adc-batchinator-db-1
2021-04-29 12:03:36,711 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,801 INFO: Member adc-batchinator-db-0 exceeds maximum replication lag
2021-04-29 12:03:36,801 WARNING: manual failover: no healthy members found, failover is not possible
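Patroni rejects the candidate because its lag exceeds Patroni's `maximum_lag_on_failover` threshold (1048576 bytes, i.e. 1 MB, by default). The lag is the byte difference between the leader's WAL write position and the replica's replayed position. As a minimal sketch of how textual PostgreSQL LSNs map to the byte positions Patroni compares (the LSN strings below are made-up examples):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN like '1/30200000' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(leader_lsn: str, replica_lsn: str) -> int:
    """Byte lag of a replica behind the leader."""
    return lsn_to_bytes(leader_lsn) - lsn_to_bytes(replica_lsn)

# A replica 2 MiB behind would be rejected under the default
# maximum_lag_on_failover of 1048576 bytes.
MAX_LAG = 1048576
print(lag_bytes("1/30200000", "1/30000000") > MAX_LAG)  # → True
```

Note that in the log above the replica's `received_location` and `replayed_location` are equal, so the replica has replayed everything it has received; the lag Patroni complains about is between the leader's write position and what the replica has received.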

My manifest is:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  namespace: "adc-dev"
  name: "adc-batchinator-db"
spec:
  teamId: "adc"
  volume:
    storageClass: "openebs-local"
    size: "2Gi"
  numberOfInstances: 2
  users:
    batchinator:
    - superuser
    - createdb
    batchinator_user: []
  databases:
    # name: owner
    batchinator: batchinator
  postgresql:
    version: "13"
  patroni:
    pg_hba:
      - "local    all all trust"
      - "host     all all localhost trust"
      - "host     postgres all localhost ident"
      - "hostssl  replication standby all md5"
      - "hostssl  all all 0.0.0.0/0 md5"
      - "host     all all 0.0.0.0/0 md5"
      - "hostssl  all +pamrole all pam"
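If the 1 MB default turns out to be too strict for the environment, Patroni's `maximum_lag_on_failover` can be raised via the operator's cluster manifest under `spec.patroni`. A hedged sketch (assuming the operator version in use exposes this field; the value is in bytes):

```yaml
# Sketch: raising Patroni's failover lag threshold in the postgresql manifest.
spec:
  patroni:
    maximum_lag_on_failover: 33554432  # bytes (32 MiB); Patroni's default is 1048576
```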

What could cause the replication lag, and why is the replica not catching up? The database is basically idle.
Is there a metric one can track in order to raise alerts when the lag gets too large?
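The lag Patroni acts on can be derived from its REST API (port 8008, as seen in the operator log above): the leader's `xlog.location` minus the replica's `xlog.replayed_location`. A sketch using hypothetical `/patroni` response bodies whose field names match the log output above (the leader's position is an assumed value for illustration):

```python
import json

# Hypothetical responses from GET http://<pod-ip>:8008/patroni.
leader_body = '{"role": "master", "xlog": {"location": 5102370816}}'
replica_body = ('{"role": "replica", "xlog": {"received_location": 5100273664, '
                '"replayed_location": 5100273664}}')

leader = json.loads(leader_body)
replica = json.loads(replica_body)

# Byte lag as Patroni sees it: leader write position minus replica replay position.
lag = leader["xlog"]["location"] - replica["xlog"]["replayed_location"]
print(lag)            # 2097152 bytes in this made-up example
print(lag > 1048576)  # would exceed the default maximum_lag_on_failover
```

On the SQL side, `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;` on the primary gives the same number per standby, which is straightforward to export as an alerting metric.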

Labels: patroni (Issue more related to Patroni), postgres (Issue more related to PostgreSQL)