
K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added. #1940

Open

wants to merge 25 commits into main

Conversation

nmarukovich
Contributor

@nmarukovich nmarukovich commented May 20, 2025

K8SPSMDB-1319

CHANGE DESCRIPTION

Problem:
A backup created while the cluster is unhealthy or still initializing is not handled gracefully: it either fails immediately or sits in the New state with no bound on how long it may wait before starting.

Cause:
The backup controller has no deadline after which a backup that never started is marked as failed, unlike the startingDeadlineSeconds behavior in PXC.

Solution:

  • We don't want the backup to fail immediately on an unhealthy cluster,
    so the startingDeadlineSeconds feature was added. It should work similarly to the one we have in PXC:

percona/percona-xtradb-cluster-operator@3a4b8f6

We introduce a new field on PerconaServerMongoDBBackup: spec.startingDeadlineSeconds. If this field is set and the backup does not start within the configured duration, the operator automatically fails the backup. The duration is checked by comparing the current time with the backup object's creation timestamp.
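
For illustration, a minimal sketch of that check is shown below; the helper name, the field's Go type, and the BackupStateNew constant are assumptions for the sketch rather than a quote of this PR's diff.

package backup

import (
	"time"

	"github.com/pkg/errors"

	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
)

// checkStartingDeadlineSketch fails a backup that is still in the New state
// once spec.startingDeadlineSeconds have passed since the object was created.
func checkStartingDeadlineSketch(cr *psmdbv1.PerconaServerMongoDBBackup) error {
	// No deadline configured, or the backup already started: nothing to do.
	if cr.Spec.StartingDeadlineSeconds == nil || cr.Status.State != psmdbv1.BackupStateNew {
		return nil
	}

	// Compare the current time with the backup object's creation timestamp.
	passed := time.Since(cr.CreationTimestamp.Time)
	deadline := time.Duration(*cr.Spec.StartingDeadlineSeconds) * time.Second
	if passed > deadline {
		return errors.New("starting deadline seconds exceeded")
	}
	return nil
}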

In PXC it works like this:
When we start a cluster and initiate a backup while the cluster is still in the initialization state, the backup fails immediately after the startingDeadlineSeconds timeout is reached.

=== Wed May 21 21:42:29 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                0s
----------------
=== Wed May 21 21:42:34 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                6s
----------------
=== Wed May 21 21:42:40 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                11s
----------------
=== Wed May 21 21:42:45 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                           Failed               17s

In PSMDB:
In main:
If we run a backup before the cluster object is created, the backup fails immediately with the error:
error: cluster not found
When we launch a cluster and trigger a backup during its initialization phase, the operator waits for the cluster to become READY before marking the backup as failed:

    status:
      error: 'failed to run backup: spec.unsafeFlags.backupIfUnhealthy must be true
      to run backup on cluster with status initializing'

Check the timestamps for the cluster and the backup:

=== Wed May 21 17:09:09 CEST 2025 ===
NAME              ENDPOINT                                                STATUS   AGE
my-cluster-name   my-cluster-name-mongos.main-1.svc.cluster.local:27017   ready    2m35s
=== Wed May 21 17:10:16 CEST 2025 ===
NAME      CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   aws-s3                                              3m28s
----------------
=== Wed May 21 17:10:21 CEST 2025 === 
NAME      CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   aws-s3                         error                3m34s

2025-05-21T17:10:18.789Z ERROR failed to make backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"main-1"}, "namespace": "main-1", "name": "backup1", "reconcileID": "4d40b3aa-726f-466b-91d6-fdc43219cbe2", "backup": "backup1", "error": "failed to run

In branch K8SPSMDB-1319:
When we start a cluster and trigger a backup during its initialization, it behaves like the main branch, failing the backup once the cluster is READY. This happens even with startingDeadlineSeconds set, as the backup fails if the timeout expires before the cluster is ready.

When we wait for the cluster to reach a READY state, then transition it back to the initialization state before triggering a backup, the startingDeadlineSeconds setting functions as expected.

With startingDeadlineSeconds: 30 set in the backup config:
  === Thu May 22 10:44:09 CEST 2025 ===
  NAME             CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
  backup4          my-cluster-name   aws-s3                         error                16h
  backup5          my-cluster-name   aws-s3                         error                16h
  backup6-cfg      my-cluster-name   aws-s3                                              28s
  backup6-mongos   my-cluster-name   aws-s3                         error                21m
  ----------------
  === Thu May 22 10:44:15 CEST 2025 ===
  NAME             CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
  backup4          my-cluster-name   aws-s3                         error                16h
  backup5          my-cluster-name   aws-s3                         error                16h
  backup6-cfg      my-cluster-name   aws-s3                         error                33s
  backup6-mongos   my-cluster-name   aws-s3                         error                21m
  ----------------
status:
  error: starting deadline seconds exceeded
  state: error

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label May 20, 2025
@github-actions github-actions bot added the tests label May 20, 2025
@nmarukovich nmarukovich marked this pull request as ready for review May 21, 2025 11:23
@nmarukovich nmarukovich changed the title K8SPSMDB-1319 K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy May 21, 2025
@nmarukovich nmarukovich changed the title K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added. May 21, 2025
errStartingDeadlineExceeded = errors.New("starting deadline seconds exceeded")
)

func (r *ReconcilePerconaServerMongoDBBackup) checkDeadlines(ctx context.Context, cluster *psmdbv1.PerconaServerMongoDB, cr *psmdbv1.PerconaServerMongoDBBackup) error {
Contributor

The receiver on this function is not utilized, do we have a reason to keep it?

@@ -181,7 +193,12 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req

err = cluster.CheckNSetDefaults(ctx, svr.Platform)
if err != nil {
return rr, errors.Wrapf(err, "set defaults for %s/%s", cluster.Namespace, cluster.Name)
err := errors.Wrapf(err, "wrong PSMDB options for %s/%s", cluster.Namespace, cluster.Name)
Contributor

We perform some kind of validation inside CheckNSetDefaults. Maybe we should adjust the message to invalid cr options used for...

@nmarukovich nmarukovich requested a review from gkech May 22, 2025 11:27
@gkech
Contributor

gkech commented May 26, 2025

we have conflicts that need to be resolved before reviewing again

Comment on lines 142 to 144
if cr.Status.State == psmdbv1.BackupStateError && status.State == "" {
return
}
Contributor

why is this needed?

Contributor Author

During implementation, I encountered an issue:
when the starting deadline was exceeded, we correctly set the backup to an error state. However, this state was later overwritten in the defer block with an empty status. As a result, no error was logged and the backup did not stop as expected after the deadline was missed.

2025-05-21T09:55:52.894Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"pr-33-30cr"}, "namespace": "pr-33-30cr", "name": "backup1", "reconcileID": "42ed262d-ef9b-4df9-a9cd-5776d148250a", "previous": "error", "current": ""}
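
For illustration, a minimal sketch of the snapshot/defer pattern being described (names simplified; this paraphrases the Reconcile flow rather than quoting it):

package backup

import (
	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
)

func reconcileSketch(cr *psmdbv1.PerconaServerMongoDBBackup) {
	// Snapshot of the status taken at the top of Reconcile (empty for a new backup).
	status := cr.Status

	defer func() {
		// The guard discussed above: if setFailedStatus already persisted an
		// error state, do not overwrite it with the empty snapshot.
		if cr.Status.State == psmdbv1.BackupStateError && status.State == "" {
			return
		}
		cr.Status = status // normally written back through the API client
	}()

	// ... reconcile logic that may set the error state and return early ...
}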

Contributor Author

So it looks like this:

=== Wed May 28 11:16:20 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                4m54s
backup3   my-cluster-name   azure-blob                                                    72s
----------------
=== Wed May 28 11:16:25 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                5m
backup3   my-cluster-name   azure-blob                               error                78s

2025-05-28T09:18:15.058Z        INFO    Backup didn't start in startingDeadlineSeconds, failing the backup      {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "startingDeadlineSeconds": 20, "passedSeconds": 187.058930347}
2025-05-28T09:18:15.115Z        INFO    Backup state changed    {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "previous": "error", "current": ""}

Contributor Author

@nmarukovich nmarukovich May 28, 2025

We got the correct error in the psmdb-backup object, but it was later changed to an empty status.

status:
  error: starting deadline seconds exceeded
  state: error

@hors hors added this to the v1.21.0 milestone May 27, 2025
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ nmarukovich
❌ Natalia Marukovich


Natalia Marukovich seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nmarukovich nmarukovich requested a review from egegunes May 28, 2025 13:15
gkech
gkech previously approved these changes May 29, 2025
Comment on lines 196 to 200
err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, err
Contributor

Suggested change
err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, err
return reconcile.Result{}, errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)

Comment on lines 180 to 185
if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, nil
}
Contributor

By using the global err it's possible to set an error in the defer block without using the setFailedStatus method

Suggested change
if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, nil
}
err = checkStartingDeadline(ctx, cluster, cr)
if err != nil {
return reconcile.Result{}, err
}

Contributor Author

@nmarukovich nmarukovich May 30, 2025

I believe it's better to return the Error status immediately and not wait for the defer. What do you think?

Contributor

Why? Setting the status in the defer or outside of it makes no difference in terms of timing.

In my opinion, we should follow a consistent approach in the code. We should either always set the failed status in the defer block, or set it right before the return statement, but we shouldn't mix both. Using inconsistent approaches makes the code harder to understand.

I would stick to the current approach and not use the setFailedStatus method at all. This would also eliminate the need for this check: #1940 (comment)

@egegunes
Contributor

@nmarukovich I wonder how this will work with our current starting deadline logic:

// pbmStartingDeadline is timeout after which continuous starting state is considered as error
pbmStartingDeadline = time.Duration(120) * time.Second
pbmStartingDeadlineErrMsg = "starting deadline exceeded"

The deadline you added is for backups in the New state. The logic above is for backups in the Starting state. It's OK for them to be separate, but the error messages are too similar and will confuse people.

@nmarukovich
Contributor Author

@nmarukovich I wonder how this will work with our current starting deadline logic:

// pbmStartingDeadline is timeout after which continuous starting state is considered as error
pbmStartingDeadline = time.Duration(120) * time.Second
pbmStartingDeadlineErrMsg = "starting deadline exceeded"

The deadline you added is for backups in the New state. The logic above is for backups in the Starting state. It's OK for them to be separate, but the error messages are too similar and will confuse people.

Yes, agreed. I will update this message. To avoid confusion, I believe we should use the same error message for the same logic in all operators. Why are you wondering how it works? Do you have any other comments?
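
To make the distinction concrete, here is a hedged side-by-side in code form; the PBM constants are the ones quoted above, and the comments paraphrase the behavior discussed in this thread.

package backup

import "time"

const (
	// Existing PBM deadline: applies to backups stuck in the Starting state
	// after PBM has been asked to run them.
	pbmStartingDeadline       = time.Duration(120) * time.Second
	pbmStartingDeadlineErrMsg = "starting deadline exceeded"
)

// spec.startingDeadlineSeconds (added by this PR) applies instead to backups
// still in the New state and is counted from the backup object's creation
// timestamp. The two deadlines are separate, but their error messages are
// currently very similar.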

@nmarukovich nmarukovich requested review from pooknull and gkech May 30, 2025 16:22
gkech
gkech previously approved these changes Jun 2, 2025
@@ -187,6 +183,7 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
status = cr.Status
Contributor

a comment to explain why this is needed would be nice

Contributor Author

@egegunes could you please check @pooknull's comment above? If you agree that it's better to handle the error state in the defer block, I'll go ahead and move it there.

Contributor

yeah I agree with @pooknull in general

@JNKPercona
Collaborator

Test name Status
arbiter passed
balancer passed
cross-site-sharded passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials-irsa passed
demand-backup-fs passed
demand-backup-incremental passed
demand-backup-incremental-sharded passed
demand-backup-physical-parallel passed
demand-backup-physical-aws passed
demand-backup-physical-azure passed
demand-backup-physical-gcp passed
demand-backup-physical-minio passed
demand-backup-physical-sharded-parallel passed
demand-backup-physical-sharded-aws passed
demand-backup-physical-sharded-azure passed
demand-backup-physical-sharded-gcp passed
demand-backup-physical-sharded-minio passed
demand-backup-sharded passed
expose-sharded passed
finalizer passed
ignore-labels-annotations passed
init-deploy passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
monitoring-pmm3 passed
multi-cluster-service passed
multi-storage passed
non-voting-and-hidden passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-physical passed
pitr-sharded passed
pitr-physical-backup-source passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 68 out of 68

commit: c1f2e14
image: perconalab/percona-server-mongodb-operator:PR-1940-c1f2e14e

@nmarukovich nmarukovich requested review from egegunes and gkech June 3, 2025 07:50
Labels: size/L 100-499 lines, tests

7 participants