Description
Overview
I have full and incremental backup schedules defined for a Postgres cluster. I consistently see more than one job scheduled per invocation, which fills the namespace with pods in an Error
state; the duplicate jobs (the ones that didn't run first) fail with:
[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\n]
As far as I can tell, this does not prevent backups from running: at least one of the jobs succeeds, leaving the others in an Error state.
Environment
Please provide the following details:
- Platform:
Kubernetes
- Platform Version:
1.20.15
- PGO Image Tag:
ubi8-5.0.4-0
- Postgres Version:
13
- Storage:
iSCSI PVCs for pods, S3 for pgBackRest
Steps to Reproduce
REPRO
- Build a postgres cluster with the operator, use S3 as a backend for pgbackrest
- Schedule a full and incremental backup job schedule
- See the full backup jobs conflict with each other
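For reference, the schedules were defined on the repo in the PostgresCluster spec. A minimal sketch of the relevant portion, assuming PGO v5 conventions; the cluster name, bucket, and cron expressions here are illustrative, not the exact values from my environment:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: db
spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"          # weekly full backup
            incremental: "0 1 * * 1-6" # daily incremental backup
          s3:
            bucket: example-bucket
            endpoint: s3.amazonaws.com
            region: us-east-1
```

The operator turns each schedule entry into a Kubernetes CronJob that runs pgBackRest against the S3 repo.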
EXPECTED
- I would expect only one job per schedule invocation to be created at a time, until it finishes
ACTUAL
- More than one job is created, causing the errors shown above
Logs
time="2022-04-05T01:00:19Z" level=info msg="crunchy-pgbackrest starts"
time="2022-04-05T01:00:19Z" level=info msg="debug flag set to false"
time="2022-04-05T01:00:19Z" level=info msg="backrest backup command requested"
time="2022-04-05T01:00:19Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2022-04-05T01:00:19Z" level=info msg="output=[]"
time="2022-04-05T01:00:19Z" level=info msg="stderr=[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\n]"
time="2022-04-05T01:00:19Z" level=fatal msg="command terminated with exit code 50"
Additional Information
When inspecting the CronJob definition, there is a field ConcurrencyPolicy set to Allow. Would it not make more sense to set this value to Forbid (or at least have the option of doing so via the CRD), to prevent this type of scheduling conflict from happening?
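To illustrate the suggestion, here is a hedged sketch of what the operator-generated CronJob could look like with the policy changed; the name, image, and command are hypothetical stand-ins for whatever PGO actually generates:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-pgbackrest-repo1-full   # illustrative name
spec:
  schedule: "0 1 * * 0"
  # Forbid tells Kubernetes to skip a scheduled run while a previous
  # run is still active, instead of launching a concurrent Job that
  # would lose the pgBackRest lock race. The default is Allow.
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pgbackrest
              image: crunchy-pgbackrest:ubi8-5.0.4-0   # illustrative tag
              command: ["pgbackrest", "backup", "--stanza=db", "--repo=1", "--type=full"]
```

With Forbid, a still-running full backup would simply cause the next tick to be skipped rather than leaving an errored duplicate pod behind.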