Description
Hi all. Thanks for postgres-operator. It's amazing!
Environment:
- postgres-operator v1.5 (upgraded from v1.4)
- AWS EBS (gp2, resize enabled)
- WAL-E archiving to AWS S3
- the cluster was created with Helm, following the instructions at https://postgres-operator.readthedocs.io/en/stable/quickstart/
- database size is no more than 100 MB
I faced the following problem:
The Postgres cluster went into SyncFailed status and the database did not respond for a while. I initiated a failover by deleting the leader pod, and after a while Postgres began responding again.
However, the SyncFailed status remains.
Next I checked the logs and saw a message about insufficient disk space on the AWS EBS volume, which appears to be the root cause.
leader log:
2020-08-05 14:59:18,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:18,459 INFO: no action. i am the leader with the lock
2020-08-05 14:59:24.863 35 LOG {ticks: 0, maint: 0, retry: 0}
2020-08-05 14:59:28,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:28,458 INFO: no action. i am the leader with the lock
replica log:
2020-08-05 14:45:24 UTC [25670]: [1-1] 5f2ac604.6446 0 FATAL: could not write lock file "postmaster.pid": No space left on device
2020-08-05 14:45:24,599 INFO: postmaster pid=25670
/var/run/postgresql:5432 - no response
2020-08-05 14:45:24,610 WARNING: Postgresql is not running.
2020-08-05 14:45:24,611 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,614 INFO: pg_controldata:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6846819719288311883
Database cluster state: in crash recovery
pg_control last modified: Tue Aug 4 07:26:19 2020
Latest checkpoint location: 5/F5041740
Latest checkpoint's REDO location: 5/F5041708
Latest checkpoint's REDO WAL file: 0000000600000005000000F5
Latest checkpoint's TimeLineID: 6
Latest checkpoint's PrevTimeLineID: 6
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:46345
Latest checkpoint's NextOID: 129481
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 46345
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Mon Aug 3 20:15:21 2020
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 1
Mock authentication nonce: 5229a15400d9c8af60e52ed563c09d5843e4b3b37100937f4ab129cd87794846
2020-08-05 14:45:24,614 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,631 INFO: Local timeline=None lsn=None
2020-08-05 14:45:24,631 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,632 INFO: starting as a secondary
df -h on the replica pod:
root@stats-service-postgres-1:/home/postgres# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 80G 8.1G 72G 11% /
tmpfs 64M 0 64M 0% /dev
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
tmpfs 7.9G 0 7.9G 0% /dev/shm
/dev/xvda1 80G 8.1G 72G 11% /etc/hosts
/dev/xvdbo 7.8G 7.8G 0 100% /home/postgres/pgdata
tmpfs 7.9G 12K 7.9G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 7.9G 0 7.9G 0% /proc/acpi
tmpfs 7.9G 0 7.9G 0% /proc/scsi
tmpfs 7.9G 0 7.9G 0% /sys/firmware
Next I tried to increase the volume size by changing the cluster manifest, but it did not help: the volume stayed the same size.
I see this in the postgres-operator pod log:
time="2020-08-05T13:56:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"prod\", Name:\"stats-service-postgres\", UID:\"d7057230-c085-11ea-9b25-02260b2a4d18\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"76480885\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volume \"vol-01bc1a7307c758de4\": could not modify persistent volume: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: ziGbQt...f9VjrGR\n\tstatus code: 403, request id: 6b6b...137"
Next I tried cloning the cluster from S3 (see the manifest sketch below), but that also failed.
new cluster status: CreateFailed.
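The clone section of the new cluster's manifest looked roughly like this sketch; the uid and timestamp below are placeholders, not my real values:

```yaml
# Sketch only: clone-from-S3 section of the new cluster's manifest.
# uid and timestamp are placeholders; setting a timestamp (with timezone) makes
# the operator restore from the S3/WAL-E backup rather than via pg_basebackup.
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: stats-service-db
  namespace: prod
spec:
  clone:
    cluster: stats-service-postgres           # source cluster name
    uid: "<uid of the source postgresql object>"
    timestamp: "2020-08-04T07:00:00+00:00"    # placeholder point in time
```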
new cluster leader log:
...
REVOKE
GRANT
GRANT
RESET
server signaled
2020-08-05 14:32:54,398 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-08-05 14:32:54,399 INFO: establishing a new patroni connection to the postgres cluster
2020-08-05 14:32:54,517 INFO: initialized a new cluster
2020-08-05 14:33:04,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:33:04,403 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
...
2020-08-05 14:49:44,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:44,399 INFO: no action. i am the leader with the lock
2020-08-05 14:49:48.055 42 LOG {ticks: 0, maint: 0, retry: 0}
WARNING: still waiting for all required WAL segments to be archived (960 seconds elapsed)
HINT: Check that your archive_command is executing properly. You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2020-08-05 14:49:54,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:54,398 INFO: no action. i am the leader with the lock
new cluster replica log:
...
2020-08-05 15:15:37,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:37,456 INFO: bootstrap from leader 'stats-service-db-0' in progress
2020-08-05 15:15:47,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:47,456 INFO: bootstrap from leader 'stats-service-db-0' in progress
Questions:
- How can I restore the cluster (or clone it)?
- What should be done to prevent this problem on other clusters where pg_wal also takes up all the disk space? Do I need to set the BACKUP_NUM_TO_RETAIN env variable via pod_environment_configmap (I haven't created one yet; see the sketch below)?
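In case it helps to be concrete, this is roughly what I imagine that ConfigMap would look like; the name is whatever pod_environment_configmap points to in the operator configuration, and the value is just an example. I'm also not sure this alone prevents a local pg_wal blow-up if archiving to S3 is failing:

```yaml
# Sketch only: ConfigMap referenced by the operator's pod_environment_configmap option.
# Its entries become environment variables in the Spilo pods; the value is an example.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config   # must match the operator's pod_environment_configmap setting
  namespace: prod
data:
  BACKUP_NUM_TO_RETAIN: "5"   # number of base backups (and their WAL) to keep in S3
```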
Thanks