Please, answer some short questions which should help us to understand your problem / question better?
- Which image of the operator are you using? 1.6.2
- Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s] openshift
- Are you running Postgres Operator in production? [yes | no] no
- Type of issue? [Bug report, question, feature request, etc.] question/bug
I am cordoning the node which hosts the master pod, and the operator reports:
time="2021-04-29T12:03:36Z" level=info msg="moving pods: node \"/alt-eos-g-c01oco03\" became unschedulable and does not have a ready label: map[]" pkg=controller
time="2021-04-29T12:03:36Z" level=info msg="starting process to migrate master pod \"adc-dev/adc-batchinator-db-1\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Found 1 running replica pods" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=info msg="check failed: pod \"adc-dev/adc-batchinator-db-0\" is already on a live node" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="switching over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="making POST http request: http://10.200.12.5:8008/failover" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="subscribing to pod \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=debug msg="unsubscribing from pod \"adc-dev/adc-batchinator-db-0\" events" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=error msg="could not move master pod \"adc-dev/adc-batchinator-db-1\": could not failover to pod \"adc-dev/adc-batchinator-db-0\": could not switch over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\": patroni returned 'Failover failed'" pkg=controller
time="2021-04-29T12:03:37Z" level=info msg="0/1 master pods have been moved out from the \"/alt-eos-g-c01oco03\" node" pkg=controller
time="2021-04-29T12:03:37Z" level=warning msg="failed to move master pods from the node \"alt-eos-g-c01oco03\": could not move master 1/1 pods from the \"/alt-eos-g-c01oco03\" node" pkg=controller
The reason for it failing seems to be replication lag:
2021-04-29 12:03:36,505 INFO: received failover request with leader=adc-batchinator-db-1 candidate=adc-batchinator-db-0 scheduled_at=None
2021-04-29 12:03:36,515 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,653 INFO: Lock owner: adc-batchinator-db-1; I am adc-batchinator-db-1
2021-04-29 12:03:36,711 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,801 INFO: Member adc-batchinator-db-0 exceeds maximum replication lag
2021-04-29 12:03:36,801 WARNING: manual failover: no healthy members found, failover is not possible
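To make the lag check concrete, here is a minimal sketch of how such a comparison could work. It is not Patroni's actual code. It assumes the candidate's position is the larger of `received_location` and `replayed_location`, that the leader position is the last one known to the cluster, and that the threshold is `maximum_lag_on_failover` with its documented default of 1048576 bytes. The leader position used in the example is hypothetical, since the logs above do not show it.

```python
import json

# Assumption: documented default of Patroni's maximum_lag_on_failover (1 MiB).
MAX_LAG_ON_FAILOVER = 1048576

# Trimmed copy of the replica's /patroni response from the log above.
REPLICA_STATUS = json.loads(
    '{"state": "running", "role": "replica", '
    '"xlog": {"received_location": 5100273664, '
    '"replayed_location": 5100273664, '
    '"replayed_timestamp": null, "paused": false}, '
    '"timeline": 205}'
)


def replica_position(status: dict) -> int:
    """Return the furthest WAL position the replica reports, in bytes."""
    xlog = status.get("xlog", {})
    received = xlog.get("received_location") or 0
    replayed = xlog.get("replayed_location") or 0
    return max(received, replayed)


def exceeds_max_lag(leader_position: int, status: dict,
                    max_lag: int = MAX_LAG_ON_FAILOVER) -> bool:
    """True if the replica is more than max_lag bytes behind the leader."""
    lag = leader_position - replica_position(status)
    return lag > max_lag


# Hypothetical leader position: 2 MiB ahead of the replica.
hypothetical_leader_position = 5100273664 + 2 * 1048576
lagging = exceeds_max_lag(hypothetical_leader_position, REPLICA_STATUS)
```

Under these assumptions, a replica that reports the same position as the leader would pass the check, so the message above suggests that the leader position Patroni compared against was well ahead of 5100273664.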
My manifest is:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  namespace: "adc-dev"
  name: "adc-batchinator-db"
spec:
  teamId: "adc"
  volume:
    storageClass: "openebs-local"
    size: "2Gi"
  numberOfInstances: 2
  users:
    batchinator:
    - superuser
    - createdb
    batchinator_user: []
  databases:
    # name: owner
    batchinator: batchinator
  postgresql:
    version: "13"
  patroni:
    pg_hba:
    - "local all all trust"
    - "host all all locahost trust"
    - "host postgres all localhost ident"
    - "hostssl replication standby all md5"
    - "hostssl all all 0.0.0.0/0 md5"
    - "host all all 0.0.0.0/0 md5"
    - "hostssl all +pamrole all pam"
What could cause the replication lag, and why is the replica not catching up? The database is basically idle.
Is there a metric one can track in order to raise alerts when the lag grows too large?
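One possible way to track this, as a rough sketch only: poll the `/patroni` endpoint of each member (the same endpoint that appears in the logs above) and compare WAL positions. The sketch assumes the leader reports its position under `xlog.location` and a replica under `xlog.received_location` / `xlog.replayed_location`. The function names are made up for illustration, and `fetch` is injectable so the logic can be exercised without a live cluster.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/patroni and decode the JSON body."""
    with urlopen(f"{base_url}/patroni", timeout=timeout) as resp:
        return json.load(resp)


def wal_position(status: dict) -> int:
    """Furthest WAL position in a member's status, in bytes.

    Assumption: a leader exposes "location", a replica exposes
    "received_location" and "replayed_location".
    """
    xlog = status.get("xlog", {})
    keys = ("location", "received_location", "replayed_location")
    values = [xlog.get(k) for k in keys]
    return max((v for v in values if v is not None), default=0)


def replication_lag_bytes(leader_url: str, replica_url: str,
                          fetch=fetch_status) -> int:
    """Byte lag of the replica behind the leader, never negative."""
    lag = wal_position(fetch(leader_url)) - wal_position(fetch(replica_url))
    return max(lag, 0)
```

With the pod addresses from the logs above, this would be called as `replication_lag_bytes("http://10.200.12.5:8008", "http://10.200.16.5:8008")`, and an alert could fire when the result stays above a chosen threshold.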