Description
Please answer some short questions to help us understand your problem / question better.
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.5.0-44-gab95eaa-dirty
- Where do you run it - cloud or metal? Kubernetes or OpenShift? OpenShift
- Are you running Postgres Operator in production? [yes | no] No
- Type of issue? [Bug report, question, feature request, etc.] Bug Report
Hi Team,
We have a replica pod that needs to be re-initialized, but pg_basebackup fails with an error.
We checked the pg_logs on the running master to find the reason, and in the csv logs we found:
ERROR:
2021-05-17 09:06:47.558 UTC,"standby","",577,"10.254.5.196:34590",60a23225.241,3,"sending backup ""pg_basebackup base backup""",2021-05-17 09:06:45 UTC,13/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.304584""",,,,,,,,,"pg_basebackup"
We confirmed on the master that the file pg_internal.init.304584 exists and is empty.
When we checked for such files (pg_internal.init.*), we found many of them under the $PGDATA/base and $PGDATA/global directories, all with a size of 0 bytes.
Note:
The numeric suffixes (such as 304584) don't match the PIDs inside the container.
We assume they are the PIDs as seen from outside the container (probably the real PIDs on the OpenShift nodes?).
To recover from this situation, we had to temporarily stop the replica, then find and delete the pg_internal.init.* files on the master.
Once the replica was restarted, it was able to resync. (We removed the replica temporarily because the backup process gets restarted every 5s, which is probably why new pg_internal.init.* files are constantly created.)
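For illustration, the find-and-delete step can be sketched roughly as below. This is a minimal Python sketch, not the exact commands we ran: the function names and the dry-run default are our own choices for the example. It only matches files with a suffix (pg_internal.init.<number>), only when they are 0 bytes, and leaves the regular pg_internal.init file alone.

```python
from pathlib import Path


def find_empty_relcache_leftovers(pgdata):
    """Return zero-byte pg_internal.init.<suffix> files under base/ and global/.

    The plain pg_internal.init file is never matched, because the glob
    pattern requires a dot and a suffix after "init".
    """
    found = []
    for subdir in ("base", "global"):
        root = Path(pgdata) / subdir
        if not root.is_dir():
            continue
        for path in root.rglob("pg_internal.init.*"):
            if path.is_file() and path.stat().st_size == 0:
                found.append(path)
    return sorted(found)


def remove_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report them."""
    for path in paths:
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            path.unlink()
```

A typical run would be `remove_files(find_empty_relcache_leftovers(os.environ["PGDATA"]))` on the master, first with the default dry run to review the list, then with `dry_run=False`.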
- I am not sure whether this is related to OCP at all or whether it happens in general. Can someone confirm whether, on K8S, the suffixes are the in-container PIDs?
- Can you share some thoughts on what the issue could be?
- BTW, for a non-production scenario, we noticed that a possible workaround could be to run pg_basebackup with "--no-verify-checksums". Is there a way to make Patroni pass "--no-verify-checksums" to pg_basebackup?
Note: we are using PostgreSQL version 12.2.