
K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added. #1940

Open

wants to merge 25 commits into main

Conversation

nmarukovich
Contributor

@nmarukovich nmarukovich commented May 20, 2025

K8SPSMDB-1319

CHANGE DESCRIPTION

Problem:
A backup created while the cluster is unhealthy or still initializing is not handled gracefully: it either fails immediately or sits in the New state with no bound on how long it may wait before starting.

Cause:
The backup controller has no deadline after which a backup that never started is marked as failed, unlike the startingDeadlineSeconds behavior in PXC.

Solution:

  • We don't want the backup to fail immediately on an unhealthy cluster,
    so the startingDeadlineSeconds feature was added. It should work similarly to the one we have in PXC:

percona/percona-xtradb-cluster-operator@3a4b8f6

We introduce a new field on PerconaServerMongoDBBackup: spec.startingDeadlineSeconds. If this field is set and the backup does not start within the configured duration, the operator automatically fails the backup. The duration is checked by comparing the current time with the backup object's creation timestamp.
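
For illustration, a minimal sketch of that check is shown below; the helper name, the field's Go type, and the BackupStateNew constant are assumptions for the sketch rather than a quote of this PR's diff.

package backup

import (
	"time"

	"github.com/pkg/errors"

	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
)

// checkStartingDeadlineSketch fails a backup that is still in the New state
// once spec.startingDeadlineSeconds have passed since the object was created.
func checkStartingDeadlineSketch(cr *psmdbv1.PerconaServerMongoDBBackup) error {
	// No deadline configured, or the backup already started: nothing to do.
	if cr.Spec.StartingDeadlineSeconds == nil || cr.Status.State != psmdbv1.BackupStateNew {
		return nil
	}

	// Compare the current time with the backup object's creation timestamp.
	passed := time.Since(cr.CreationTimestamp.Time)
	deadline := time.Duration(*cr.Spec.StartingDeadlineSeconds) * time.Second
	if passed > deadline {
		return errors.New("starting deadline seconds exceeded")
	}
	return nil
}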

In PXC it works like this:
When we start a cluster and initiate a backup while the cluster is still in the initialization state, the backup fails immediately after the startingDeadlineSeconds timeout is reached.

=== Wed May 21 21:42:29 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                0s
----------------
=== Wed May 21 21:42:34 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                6s
----------------
=== Wed May 21 21:42:40 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                                                11s
----------------
=== Wed May 21 21:42:45 CEST 2025 ===
NAME      CLUSTER    STORAGE   DESTINATION   STATUS   COMPLETED   AGE
backup1   cluster1                           Failed               17s

In PSMDB:
In main:
If we run a backup before the cluster object is created, the backup fails immediately with the error:
error: cluster not found
When we launch a cluster and trigger a backup during its initialization phase, the operator waits for the cluster to become READY before marking the backup as failed:

    status:
      error: 'failed to run backup: spec.unsafeFlags.backupIfUnhealthy must be true
      to run backup on cluster with status initializing'

Check the timestamps for the cluster and the backup:

=== Wed May 21 17:09:09 CEST 2025 ===
NAME              ENDPOINT                                                STATUS   AGE
my-cluster-name   my-cluster-name-mongos.main-1.svc.cluster.local:27017   ready    2m35s
=== Wed May 21 17:10:16 CEST 2025 ===
NAME      CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   aws-s3                                              3m28s
----------------
=== Wed May 21 17:10:21 CEST 2025 === 
NAME      CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   aws-s3                         error                3m34s

2025-05-21T17:10:18.789Z ERROR failed to make backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"main-1"}, "namespace": "main-1", "name": "backup1", "reconcileID": "4d40b3aa-726f-466b-91d6-fdc43219cbe2", "backup": "backup1", "error": "failed to run

In branch K8SPSMDB-1319:
When we start a cluster and trigger a backup during its initialization, it behaves like the main branch, failing the backup once the cluster is READY. This happens even with startingDeadlineSeconds set, as the backup fails if the timeout expires before the cluster is ready.

When we wait for the cluster to reach a READY state, then transition it back to the initialization state before triggering a backup, the startingDeadlineSeconds setting functions as expected.

With startingDeadlineSeconds: 30 set in the backup config:
  === Thu May 22 10:44:09 CEST 2025 ===
  NAME             CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
  backup4          my-cluster-name   aws-s3                         error                16h
  backup5          my-cluster-name   aws-s3                         error                16h
  backup6-cfg      my-cluster-name   aws-s3                                              28s
  backup6-mongos   my-cluster-name   aws-s3                         error                21m
  ----------------
  === Thu May 22 10:44:15 CEST 2025 ===
  NAME             CLUSTER           STORAGE   DESTINATION   TYPE   STATUS   COMPLETED   AGE
  backup4          my-cluster-name   aws-s3                         error                16h
  backup5          my-cluster-name   aws-s3                         error                16h
  backup6-cfg      my-cluster-name   aws-s3                         error                33s
  backup6-mongos   my-cluster-name   aws-s3                         error                21m
  ----------------
status:
  error: starting deadline seconds exceeded
  state: error

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label May 20, 2025
@github-actions github-actions bot added the tests label May 20, 2025
@nmarukovich nmarukovich marked this pull request as ready for review May 21, 2025 11:23
@nmarukovich nmarukovich changed the title K8SPSMDB-1319 K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy May 21, 2025
@nmarukovich nmarukovich changed the title K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added. May 21, 2025
errStartingDeadlineExceeded = errors.New("starting deadline seconds exceeded")
)

func (r *ReconcilePerconaServerMongoDBBackup) checkDeadlines(ctx context.Context, cluster *psmdbv1.PerconaServerMongoDB, cr *psmdbv1.PerconaServerMongoDBBackup) error {
Contributor

The receiver on this function is not utilized, do we have a reason to keep it?

@@ -181,7 +193,12 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req

err = cluster.CheckNSetDefaults(ctx, svr.Platform)
if err != nil {
return rr, errors.Wrapf(err, "set defaults for %s/%s", cluster.Namespace, cluster.Name)
err := errors.Wrapf(err, "wrong PSMDB options for %s/%s", cluster.Namespace, cluster.Name)
Contributor

We perform some kind of validation inside CheckNSetDefaults. Maybe we should adjust the message to invalid cr options used for...

@nmarukovich nmarukovich requested a review from gkech May 22, 2025 11:27
@gkech
Contributor

gkech commented May 26, 2025

we have conflicts that need to be resolved before reviewing again

Comment on lines 142 to 144
if cr.Status.State == psmdbv1.BackupStateError && status.State == "" {
return
}
Contributor

why is this needed?

Contributor Author

During implementation, I encountered an issue:
when the starting deadline was exceeded, we correctly set the backup to an error state. However, this state was later overwritten in the defer block with an empty status. As a result, no error was logged and the backup did not stop as expected after the deadline was missed.

2025-05-21T09:55:52.894Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"pr-33-30cr"}, "namespace": "pr-33-30cr", "name": "backup1", "reconcileID": "42ed262d-ef9b-4df9-a9cd-5776d148250a", "previous": "error", "current": ""}
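
For illustration, a minimal sketch of the snapshot/defer pattern being described (names simplified; this paraphrases the Reconcile flow rather than quoting it):

package backup

import (
	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
)

func reconcileSketch(cr *psmdbv1.PerconaServerMongoDBBackup) {
	// Snapshot of the status taken at the top of Reconcile (empty for a new backup).
	status := cr.Status

	defer func() {
		// The guard discussed above: if setFailedStatus already persisted an
		// error state, do not overwrite it with the empty snapshot.
		if cr.Status.State == psmdbv1.BackupStateError && status.State == "" {
			return
		}
		cr.Status = status // normally written back through the API client
	}()

	// ... reconcile logic that may set the error state and return early ...
}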

Contributor Author

So it looks like this:

=== Wed May 28 11:16:20 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                4m54s
backup3   my-cluster-name   azure-blob                                                    72s
----------------
=== Wed May 28 11:16:25 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                5m
backup3   my-cluster-name   azure-blob                               error                78s

2025-05-28T09:18:15.058Z        INFO    Backup didn't start in startingDeadlineSeconds, failing the backup      {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "startingDeadlineSeconds": 20, "passedSeconds": 187.058930347}
2025-05-28T09:18:15.115Z        INFO    Backup state changed    {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "previous": "error", "current": ""}

Contributor Author

@nmarukovich nmarukovich May 28, 2025

We got the correct error in the psmdb-backup object, but it was later changed to an empty status.

status:
  error: starting deadline seconds exceeded
  state: error

@hors hors added this to the v1.21.0 milestone May 27, 2025
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ nmarukovich
❌ Natalia Marukovich


Natalia Marukovich seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nmarukovich nmarukovich requested a review from egegunes May 28, 2025 13:15
gkech
gkech previously approved these changes May 29, 2025
Comment on lines 196 to 200
err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, err
Contributor

Suggested change
err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, err
return reconcile.Result{}, errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name)

Comment on lines 180 to 185
if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, nil
}
Contributor

By using the global err it's possible to set an error in the defer block without using the setFailedStatus method

Suggested change
if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
return reconcile.Result{}, nil
}
err = checkStartingDeadline(ctx, cluster, cr)
if err != nil {
return reconcile.Result{}, err
}

Contributor Author

@nmarukovich nmarukovich May 30, 2025

I believe it's better to return the Error status immediately and not wait for the defer. What do you think?

Contributor

Why? Setting the status in the defer or outside of it makes no difference in terms of timing.

In my opinion, we should follow a consistent approach in the code. We should either always set the failed status in the defer block, or set it right before the return statement, but we shouldn't mix both. Using inconsistent approaches makes the code harder to understand.

I would stick to the current approach and not use the setFailedStatus method at all. This would also eliminate the need for this check: #1940 (comment)

@egegunes
Contributor

@nmarukovich I wonder how this will work with our current starting deadline logic:

// pbmStartingDeadline is timeout after which continuous starting state is considered as error
pbmStartingDeadline = time.Duration(120) * time.Second
pbmStartingDeadlineErrMsg = "starting deadline exceeded"

The deadline you added is for backups in the New state. The logic above is for backups in the Starting state. It's OK for them to be separate, but the error messages are too similar and will confuse people.

@nmarukovich
Contributor Author

@nmarukovich I wonder how this will work with our current starting deadline logic:

// pbmStartingDeadline is timeout after which continuous starting state is considered as error
pbmStartingDeadline = time.Duration(120) * time.Second
pbmStartingDeadlineErrMsg = "starting deadline exceeded"

The deadline you added is for backups in the New state. The logic above is for backups in the Starting state. It's OK for them to be separate, but the error messages are too similar and will confuse people.

Yes, agreed. I will update this message. To avoid confusion, I believe we should use the same error message for the same logic in all operators. Why are you wondering how it works? Do you have any other comments?
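
To make the distinction concrete, here is a hedged side-by-side in code form; the PBM constants are the ones quoted above, and the comments paraphrase the behavior discussed in this thread.

package backup

import "time"

const (
	// Existing PBM deadline: applies to backups stuck in the Starting state
	// after PBM has been asked to run them.
	pbmStartingDeadline       = time.Duration(120) * time.Second
	pbmStartingDeadlineErrMsg = "starting deadline exceeded"
)

// spec.startingDeadlineSeconds (added by this PR) applies instead to backups
// still in the New state and is counted from the backup object's creation
// timestamp. The two deadlines are separate, but their error messages are
// currently very similar.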

@nmarukovich nmarukovich requested review from pooknull and gkech May 30, 2025 16:22
gkech
gkech previously approved these changes Jun 2, 2025
@@ -187,6 +183,7 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req
if err := r.setFailedStatus(ctx, cr, err); err != nil {
return rr, errors.Wrap(err, "update status")
}
status = cr.Status
Contributor

a comment to explain why this is needed would be nice

Contributor Author

@egegunes could you please check @pooknull's comment above? If you agree that it's better to handle the error state in the defer block, I'll go ahead and move it there.

Contributor

yeah I agree with @pooknull in general

@JNKPercona
Collaborator

Test name Status
arbiter passed
balancer passed
cross-site-sharded passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials-irsa passed
demand-backup-fs passed
demand-backup-incremental passed
demand-backup-incremental-sharded passed
demand-backup-physical-parallel passed
demand-backup-physical-aws passed
demand-backup-physical-azure passed
demand-backup-physical-gcp passed
demand-backup-physical-minio passed
demand-backup-physical-sharded-parallel passed
demand-backup-physical-sharded-aws passed
demand-backup-physical-sharded-azure passed
demand-backup-physical-sharded-gcp passed
demand-backup-physical-sharded-minio passed
demand-backup-sharded passed
expose-sharded passed
finalizer passed
ignore-labels-annotations passed
init-deploy passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
monitoring-pmm3 passed
multi-cluster-service passed
multi-storage passed
non-voting-and-hidden passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-physical passed
pitr-sharded passed
pitr-physical-backup-source passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 68 out of 68

commit: c1f2e14
image: perconalab/percona-server-mongodb-operator:PR-1940-c1f2e14e

@nmarukovich nmarukovich requested review from egegunes and gkech June 3, 2025 07:50
Labels: size/L 100-499 lines, tests

7 participants