Description
Please answer some short questions, which should help us to understand your problem / question better:
- Which image of the operator are you using? `registry.opensource.zalan.do/acid/spilo-13:2.0-p7`
- Where do you run it? Azure / AWS
- Are you running Postgres Operator in production? Yes
- Type of issue? Bug report
We are seeing Spilo pods occasionally end up in a deadlock where Patroni keeps restarting over and over, failing each time because Postgres is already running.
This issue is already known on Patroni's side: patroni/patroni#1733
However, @CyberDem0n closed it without resolution, believing the user was using Patroni incorrectly.
Unfortunately, it appears that normal postgres-operator operation can also trigger this bug.
We're not doing anything special for this to occur; it seems to just happen on its own, and it's not clear what the trigger is.
There's only one log entry prior to this occurring:
`Dec 14 17:05:17.903 aks-nodepool1-28461340-vmss000001 spilo-13 prod-celanese /run/service/patroni: finished with code=-1 signal=9`
This lines up with what @CyberDem0n mentions in the Patroni issue: `kill -9` on Patroni can trigger this bug. However, no one is manually running `kill -9` on our side; this appears to be something happening from within either Spilo or Kubernetes itself.
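For what it's worth, a SIGKILL inside a pod typically comes from either the kernel OOM killer or the container runtime (e.g. after a failed liveness probe). A sketch of how one might check, assuming a hypothetical pod name `spilo-0` in namespace `default` (adjust both to your deployment):

```shell
# Check whether Kubernetes recorded an OOM kill or restart reason for the
# container ("spilo-0" and "default" are placeholders, not real names).
kubectl describe pod spilo-0 -n default | grep -A 5 "Last State"

# Check recent pod-level events (probe failures, evictions, etc.).
kubectl get events -n default --field-selector involvedObject.name=spilo-0

# On the node itself, the kernel log shows OOM-killer activity, if any.
dmesg -T | grep -i "killed process"
```

If none of these show anything, the signal may originate from within the container (e.g. Spilo's own supervision), which would point back at the Spilo image rather than Kubernetes.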
As I mention in the other issue as well, simply running `pkill postgresql`
within the deadlocked pod is enough to fix the issue.
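For reference, that workaround can be applied without opening a shell in the pod, again assuming a hypothetical pod name `spilo-0` in namespace `default`:

```shell
# Kill the orphaned Postgres processes in the deadlocked pod so that
# Patroni's next start attempt succeeds. This is the workaround described
# above, not a fix; "spilo-0" and "default" are placeholders.
kubectl exec -n default spilo-0 -- pkill postgresql
```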
So here are the issues at play:
- What is causing this to happen in postgres-operator?
- Is this a problem with the postgres-operator, or with Kubernetes?
- How can one prevent this from happening?
- Can this be fixed on the postgres-operator side?