
SyncFailed cluster issue #1089

Open
@JazzShapka

Description

Hi all. Thanks for postgres-operator. It's amazing!

Background:

I ran into the following problem: the Postgres cluster went into SyncFailed status and the database did not respond for a while. I initiated a failover by deleting the leader pod, and after a while Postgres started responding again.

However, the SyncFailed status remains.

Next I checked the logs and saw a message about insufficient disk space (AWS EBS), which appears to be the cause of the problem.
leader log:

2020-08-05 14:59:18,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:18,459 INFO: no action.  i am the leader with the lock
2020-08-05 14:59:24.863 35 LOG {ticks: 0, maint: 0, retry: 0}
2020-08-05 14:59:28,408 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-0
2020-08-05 14:59:28,458 INFO: no action.  i am the leader with the lock

replica log:

2020-08-05 14:45:24 UTC [25670]: [1-1] 5f2ac604.6446 0     FATAL:  could not write lock file "postmaster.pid": No space left on device
2020-08-05 14:45:24,599 INFO: postmaster pid=25670
/var/run/postgresql:5432 - no response
2020-08-05 14:45:24,610 WARNING: Postgresql is not running.
2020-08-05 14:45:24,611 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,614 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6846819719288311883
  Database cluster state: in crash recovery
  pg_control last modified: Tue Aug  4 07:26:19 2020
  Latest checkpoint location: 5/F5041740
  Latest checkpoint's REDO location: 5/F5041708
  Latest checkpoint's REDO WAL file: 0000000600000005000000F5
  Latest checkpoint's TimeLineID: 6
  Latest checkpoint's PrevTimeLineID: 6
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:46345
  Latest checkpoint's NextOID: 129481
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 46345
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Mon Aug  3 20:15:21 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 5229a15400d9c8af60e52ed563c09d5843e4b3b37100937f4ab129cd87794846

2020-08-05 14:45:24,614 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,631 INFO: Local timeline=None lsn=None
2020-08-05 14:45:24,631 INFO: Lock owner: stats-service-postgres-0; I am stats-service-postgres-1
2020-08-05 14:45:24,632 INFO: starting as a secondary

df -h on replica pod:

root@stats-service-postgres-1:/home/postgres# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          80G  8.1G   72G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
tmpfs           7.9G     0  7.9G   0% /dev/shm
/dev/xvda1       80G  8.1G   72G  11% /etc/hosts
/dev/xvdbo      7.8G  7.8G     0 100% /home/postgres/pgdata
tmpfs           7.9G   12K  7.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           7.9G     0  7.9G   0% /proc/acpi
tmpfs           7.9G     0  7.9G   0% /proc/scsi
tmpfs           7.9G     0  7.9G   0% /sys/firmware

Next I tried to increase the volume size by changing the cluster manifest, but it did not help: the disks stayed the same size.
I see this in the postgres-operator pod log:
time="2020-08-05T13:56:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"prod\", Name:\"stats-service-postgres\", UID:\"d7057230-c085-11ea-9b25-02260b2a4d18\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"76480885\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volume \"vol-01bc1a7307c758de4\": could not modify persistent volume: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: ziGbQt...f9VjrGR\n\tstatus code: 403, request id: 6b6b...137"

Next I tried cloning the cluster from S3, but that also failed.
The new cluster status is CreateFailed.
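
For context, the clone section in the new cluster's manifest looked roughly like this (only the relevant part is shown; the uid comes from the operator log above, and the timestamp is illustrative rather than the exact value I used):

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: stats-service-db
  namespace: prod
spec:
  clone:
    cluster: stats-service-postgres
    uid: d7057230-c085-11ea-9b25-02260b2a4d18
    timestamp: "2020-08-05T12:00:00+00:00"    # giving a timestamp should make the operator restore from the S3 WAL archive instead of doing a basebackup from the running cluster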
new cluster leader log:

...
REVOKE
GRANT
GRANT
RESET
server signaled
2020-08-05 14:32:54,398 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-08-05 14:32:54,399 INFO: establishing a new patroni connection to the postgres cluster
2020-08-05 14:32:54,517 INFO: initialized a new cluster
2020-08-05 14:33:04,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:33:04,403 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
...
2020-08-05 14:49:44,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:44,399 INFO: no action.  i am the leader with the lock
2020-08-05 14:49:48.055 42 LOG {ticks: 0, maint: 0, retry: 0}
WARNING:  still waiting for all required WAL segments to be archived (960 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2020-08-05 14:49:54,349 INFO: Lock owner: stats-service-db-0; I am stats-service-db-0
2020-08-05 14:49:54,398 INFO: no action.  i am the leader with the lock

new cluster replica log:

...
2020-08-05 15:15:37,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:37,456 INFO: bootstrap from leader 'stats-service-db-0' in progress
2020-08-05 15:15:47,456 INFO: Lock owner: stats-service-db-0; I am stats-service-db-1
2020-08-05 15:15:47,456 INFO: bootstrap from leader 'stats-service-db-0' in progress

Questions:

  1. How can I restore the cluster (or clone it)?
  2. What should be done to prevent this problem for other clusters where pg_wal also takes up all the disk space? Do I need to set the BACKUP_NUM_TO_RETAIN env variable via pod_environment_configmap (I haven't created one yet)? A sketch of what I have in mind follows below.
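
If setting BACKUP_NUM_TO_RETAIN via pod_environment_configmap is the right direction, the ConfigMap I have in mind would look something like this (the name, namespace, and retention value are my assumptions; the operator's pod_environment_configmap option would have to point to it):

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config    # assumed name, must match the pod_environment_configmap operator setting
  namespace: prod
data:
  BACKUP_NUM_TO_RETAIN: "5"    # example retention value; keys in this ConfigMap become environment variables in the Postgres pods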

Thanks
