
SyncFailed cluster issue #1089

Open
@JazzShapka

Description

Hi all. Thanks for postgres-operator. It's amazing!

Background:

I ran into the following problem: the Postgres cluster went into SyncFailed status and the database did not respond for a while. I initiated a failover by deleting the leader pod, and after a while Postgres started responding again.

However, the SyncFailed status remains.

Next I checked the logs and saw a message about insufficient disk space (AWS EBS), which appears to be the cause of the problem.
leader log:

2020-08-05 14:59:18,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:18,459 INFO: no action.  i am the leader with the lock
2020-08-05 14:59:24.863 35 LOG {ticks: 0, maint: 0, retry: 0}
2020-08-05 14:59:28,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:28,458 INFO: no action.  i am the leader with the lock

replica log:

2020-08-05 14:45:24 UTC [25670]: [1-1] 5f2ac604.6446 0     FATAL:  could not write lock file "postmaster.pid": No space left on device
2020-08-05 14:45:24,599 INFO: postmaster pid=25670
/var/run/postgresql:5432 - no response
2020-08-05 14:45:24,610 WARNING: Postgresql is not running.
2020-08-05 14:45:24,611 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,614 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6846819719288311883
  Database cluster state: in crash recovery
  pg_control last modified: Tue Aug  4 07:26:19 2020
  Latest checkpoint location: 5/F5041740
  Latest checkpoint's REDO location: 5/F5041708
  Latest checkpoint's REDO WAL file: 0000000600000005000000F5
  Latest checkpoint's TimeLineID: 6
  Latest checkpoint's PrevTimeLineID: 6
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:46345
  Latest checkpoint's NextOID: 129481
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 46345
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Mon Aug  3 20:15:21 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 5229a15400d9c8af60e52ed563c09d5843e4b3b37100937f4ab129cd87794846

2020-08-05 14:45:24,614 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,631 INFO: Local timeline=None lsn=None
2020-08-05 14:45:24,631 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,632 INFO: starting as a secondary

df -h on replica pod:

root@stats-service-postgres-1:/home/postgres# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          80G  8.1G   72G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
tmpfs           7.9G     0  7.9G   0% /dev/shm
/dev/xvda1       80G  8.1G   72G  11% /etc/hosts
/dev/xvdbo      7.8G  7.8G     0 100% /home/postgres/pgdata
tmpfs           7.9G   12K  7.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           7.9G     0  7.9G   0% /proc/acpi
tmpfs           7.9G     0  7.9G   0% /proc/scsi
tmpfs           7.9G     0  7.9G   0% /sys/firmware

Next I tried to increase the volume size by changing the cluster manifest, but it did not help: the disks stayed the same size.
I see this in the postgres-operator pod log:
time="2020-08-05T13:56:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"prod\", Name:\"stats-service-postgres\", UID:\"d7057230-c085-11ea-9b25-02260b2a4d18\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"76480885\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volume \"vol-01bc1a7307c758de4\": could not modify persistent volume: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: ziGbQt...f9VjrGR\n\tstatus code: 403, request id: 6b6b...137"

Next I tried cloning the cluster from S3, but that also failed.
The new cluster status is CreateFailed.
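
For context, the clone section in the new cluster's manifest looked roughly like this (only the relevant part is shown; the uid comes from the operator log above, and the timestamp is illustrative rather than the exact value I used):

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: stats-service-db
  namespace: prod
spec:
  clone:
    cluster: stats-service-postgres
    uid: d7057230-c085-11ea-9b25-02260b2a4d18
    timestamp: "2020-08-05T12:00:00+00:00"    # giving a timestamp should make the operator restore from the S3 WAL archive instead of doing a basebackup from the running cluster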
new cluster leader log:

...
REVOKE
GRANT
GRANT
RESET
server signaled
2020-08-05 14:32:54,398 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-08-05 14:32:54,399 INFO: establishing a new patroni connection to the postgres cluster
2020-08-05 14:32:54,517 INFO: initialized a new cluster
2020-08-05 14:33:04,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:33:04,403 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
...
2020-08-05 14:49:44,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:44,399 INFO: no action.  i am the leader with the lock
2020-08-05 14:49:48.055 42 LOG {ticks: 0, maint: 0, retry: 0}
WARNING:  still waiting for all required WAL segments to be archived (960 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2020-08-05 14:49:54,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:54,398 INFO: no action.  i am the leader with the lock

new cluster replica log:

...
2020-08-05 15:15:37,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:37,456 INFO: bootstrap from leader 'stats-service-db-0' in progress
2020-08-05 15:15:47,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:47,456 INFO: bootstrap from leader 'stats-service-db-0' in progress

Questions:

  1. How can I restore the cluster (or clone it)?
  2. What should be done to prevent this problem for other clusters where pg_wal also takes up all the disk space? Do I need to set the BACKUP_NUM_TO_RETAIN env variable via pod_environment_configmap (I haven't created one yet)? A sketch of what I have in mind follows below.
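
If setting BACKUP_NUM_TO_RETAIN via pod_environment_configmap is the right direction, the ConfigMap I have in mind would look something like this (the name, namespace, and retention value are my assumptions; the operator's pod_environment_configmap option would have to point to it):

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config    # assumed name, must match the pod_environment_configmap operator setting
  namespace: prod
data:
  BACKUP_NUM_TO_RETAIN: "5"    # example retention value; keys in this ConfigMap become environment variables in the Postgres pods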

Thanks
