K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added. #1940
base: main
Conversation
…godb-operator into K8SPSMDB-1319
	errStartingDeadlineExceeded = errors.New("starting deadline seconds exceeded")
)

func (r *ReconcilePerconaServerMongoDBBackup) checkDeadlines(ctx context.Context, cluster *psmdbv1.PerconaServerMongoDB, cr *psmdbv1.PerconaServerMongoDBBackup) error {
The receiver on this function is not utilized; do we have a reason to keep it?
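For illustration, a minimal sketch of what dropping the unused receiver could look like, reusing the signature from the diff above (body elided):

```go
package perconaservermongodbbackup

import (
	"context"

	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
)

// Sketch only: same signature as in the diff, with the unused reconciler
// receiver removed so that the function's lack of dependencies is explicit.
func checkDeadlines(ctx context.Context, cluster *psmdbv1.PerconaServerMongoDB, cr *psmdbv1.PerconaServerMongoDBBackup) error {
	// ... deadline checks on the cluster state and the backup CR ...
	return nil
}
```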
@@ -181,7 +193,12 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req
 	err = cluster.CheckNSetDefaults(ctx, svr.Platform)
 	if err != nil {
-		return rr, errors.Wrapf(err, "set defaults for %s/%s", cluster.Namespace, cluster.Name)
+		err := errors.Wrapf(err, "wrong PSMDB options for %s/%s", cluster.Namespace, cluster.Name)
We perform some kind of validation inside CheckNSetDefaults. Maybe we should adjust the message to something like "invalid cr options used for ...".
…godb-operator into K8SPSMDB-1319
We have conflicts that need to be resolved before reviewing again.
	if cr.Status.State == psmdbv1.BackupStateError && status.State == "" {
		return
	}
why is this needed?
During implementation, I encountered an issue:
when the starting deadline was exceeded, we correctly set the backup to an error state. However, this state was later overwritten in the defer block with an empty status. As a result, no error was logged and the backup did not stop as expected after the deadline was missed.
2025-05-21T09:55:52.894Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"pr-33-30cr"}, "namespace": "pr-33-30cr", "name": "backup1", "reconcileID": "42ed262d-ef9b-4df9-a9cd-5776d148250a", "previous": "error", "current": ""}
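To make the interplay concrete, here is a simplified, self-contained sketch (not the operator's actual types) of how a status written in a defer can clobber an error state set earlier, and how the guard in the diff above prevents it:

```go
package main

import "fmt"

type state string

const (
	stateError state = "error"
	stateEmpty state = ""
)

type backupStatus struct{ State state }

// reconcile mimics the pattern discussed above: the CR already holds an error
// state, while the local status written back by the defer is still empty, so
// without a guard the defer would silently reset "error" to "".
func reconcile(crStatus *backupStatus) {
	status := backupStatus{} // stays empty because we bail out early

	defer func() {
		// Guard mirroring the check in the diff: keep an existing error state
		// instead of overwriting it with an empty status.
		if crStatus.State == stateError && status.State == stateEmpty {
			return
		}
		*crStatus = status
	}()

	// Deadline exceeded: the CR was already put into the error state elsewhere,
	// so we return before populating the local status.
}

func main() {
	cr := &backupStatus{State: stateError}
	reconcile(cr)
	fmt.Println(cr.State) // "error"; without the guard this would print ""
}
```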
So it looks like this:
=== Wed May 28 11:16:20 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                4m54s
backup3   my-cluster-name   azure-blob                                                    72s
----------------
=== Wed May 28 11:16:25 CEST 2025 ===
NAME      CLUSTER           STORAGE      DESTINATION   TYPE   SIZE   STATUS   COMPLETED   AGE
backup1   my-cluster-name   azure-blob                               error                19m
backup2   my-cluster-name   azure-blob                               error                5m
backup3   my-cluster-name   azure-blob                               error                78s
2025-05-28T09:18:15.058Z INFO Backup didn't start in startingDeadlineSeconds, failing the backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "startingDeadlineSeconds": 20, "passedSeconds": 187.058930347}
2025-05-28T09:18:15.115Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "previous": "error", "current": ""}
We got the correct error in the psmdb-backup object but later changed it to an empty status:
status:
  error: starting deadline seconds exceeded
  state: error
Natalia Marukovich seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
	err := errors.Wrapf(err, "invalid cr options used for %s/%s", cluster.Namespace, cluster.Name)
	if err := r.setFailedStatus(ctx, cr, err); err != nil {
		return rr, errors.Wrap(err, "update status")
	}
	return reconcile.Result{}, err
Suggested change:
-	err := errors.Wrapf(err, "invalid cr options used for %s/%s", cluster.Namespace, cluster.Name)
-	if err := r.setFailedStatus(ctx, cr, err); err != nil {
-		return rr, errors.Wrap(err, "update status")
-	}
-	return reconcile.Result{}, err
+	return reconcile.Result{}, errors.Wrapf(err, "invalid cr options used for %s/%s", cluster.Namespace, cluster.Name)
	if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
		if err := r.setFailedStatus(ctx, cr, err); err != nil {
			return rr, errors.Wrap(err, "update status")
		}
		return reconcile.Result{}, nil
	}
By using the global err, it's possible to set an error in the defer block without using the setFailedStatus method:
Suggested change:
-	if err := checkStartingDeadline(ctx, cluster, cr); err != nil {
-		if err := r.setFailedStatus(ctx, cr, err); err != nil {
-			return rr, errors.Wrap(err, "update status")
-		}
-		return reconcile.Result{}, nil
-	}
+	err = checkStartingDeadline(ctx, cluster, cr)
+	if err != nil {
+		return reconcile.Result{}, err
+	}
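For reference, a minimal, self-contained sketch (simplified types, not the operator's) of the pattern this suggestion relies on, where the outer err is picked up by the deferred status update:

```go
package main

import (
	"errors"
	"fmt"
)

type backupStatus struct {
	State string
	Error string
}

// reconcile shows the suggested pattern: early returns just set the shared
// err, and a single deferred block turns it into a failed status, so no
// setFailedStatus call is needed at each return site.
func reconcile(cr *backupStatus) (err error) {
	defer func() {
		if err != nil {
			cr.State = "error"
			cr.Error = err.Error()
		}
	}()

	if err = checkStartingDeadline(); err != nil {
		return err // the defer records the failure in cr
	}
	return nil
}

// checkStartingDeadline is a stand-in for the real check in the PR.
func checkStartingDeadline() error {
	return errors.New("starting deadline seconds exceeded")
}

func main() {
	cr := &backupStatus{}
	_ = reconcile(cr)
	fmt.Printf("%+v\n", *cr) // {State:error Error:starting deadline seconds exceeded}
}
```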
I believe it's better to return the Error status immediately and not wait for the defer. What do you think?
Why? Setting the status in the defer or outside of it makes no difference in terms of timing. In my opinion, we should follow a consistent approach in the code: either always set the failed status in the defer block, or set it right before the return statement, but we shouldn't mix both. Using inconsistent approaches makes the code harder to understand.
I would stick to the current approach and not use the setFailedStatus method at all. This would also eliminate the need for this check: #1940 (comment)
@nmarukovich I wonder how this will work with our current starting deadline logic: percona-server-mongodb-operator/pkg/controller/perconaservermongodbbackup/backup.go, lines 23 to 25 in 5670cbf.
The deadline you added is for backups in the New state; the logic above is for backups in the Starting state. It's OK that they are separate, but the error messages are too similar and will confuse people.
Yes, agreed. I will update this message. I believe that, to avoid confusion, we should use the same error message for the same logic in all operators. Why are you wondering how it works? Do you have any other comments?
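For illustration, here is a hypothetical way to keep the two deadline errors distinguishable; the names and wording below are assumptions, not taken from this PR:

```go
package perconaservermongodbbackup

import "errors"

// Hypothetical sketch: separate, clearly worded errors for the two deadlines.
var (
	// The backup never left the New state within spec.startingDeadlineSeconds.
	errStartingDeadlineExceeded = errors.New("backup did not start within startingDeadlineSeconds")

	// The backup reached the Starting state but PBM did not pick it up in time.
	errWaitingForStartTimeout = errors.New("backup stuck in starting state: timed out waiting for PBM to pick it up")
)
```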
@@ -187,6 +183,7 @@ func (r *ReconcilePerconaServerMongoDBBackup) Reconcile(ctx context.Context, req
 	if err := r.setFailedStatus(ctx, cr, err); err != nil {
 		return rr, errors.Wrap(err, "update status")
 	}
+	status = cr.Status
A comment to explain why this is needed would be nice.
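For example, the requested comment could look something like this (wording is an assumption, not from the PR):

```go
	if err := r.setFailedStatus(ctx, cr, err); err != nil {
		return rr, errors.Wrap(err, "update status")
	}
	// setFailedStatus already wrote the error into cr.Status; sync the local
	// status copy so the deferred status update does not later overwrite the
	// error state with stale (empty) data.
	status = cr.Status
```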
Yeah, I agree with @pooknull in general.
…godb-operator into K8SPSMDB-1319
commit: c1f2e14
CHANGE DESCRIPTION
Problem:
Scheduled backups can start while the database cluster is unhealthy, and a backup that never starts is not failed after any deadline.
Cause:
Short explanation of the root cause of the issue if applicable.
Solution:
Scheduled backups are no longer run if the DB is unhealthy, and the startingDeadlineSeconds feature was added. It should work similarly to the one we have in PXC:
percona/percona-xtradb-cluster-operator@3a4b8f6
We introduce a new field to PerconaServerMongoDBBackup: spec.startingDeadlineSeconds. If this field is set and the backup has not started within the configured duration, the operator will automatically fail the backup. The duration is checked by comparing the current time with the backup job's creation timestamp.
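Below is a simplified, self-contained sketch of this check; the helper name and parameters are assumptions for illustration, not the exact implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// checkStartingDeadline fails the backup if it has not started within the
// configured deadline, measured from the backup object's creation timestamp.
func checkStartingDeadline(created time.Time, startingDeadlineSeconds *int64, started bool) error {
	if startingDeadlineSeconds == nil || started {
		return nil // no deadline configured, or the backup is already running
	}
	passed := time.Since(created)
	if passed > time.Duration(*startingDeadlineSeconds)*time.Second {
		return errors.New("starting deadline seconds exceeded")
	}
	return nil
}

func main() {
	deadline := int64(20)
	created := time.Now().Add(-3 * time.Minute) // backup created 3 minutes ago, never started
	fmt.Println(checkStartingDeadline(created, &deadline, false)) // starting deadline seconds exceeded
}
```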
In PXC it works like this:
When we start a cluster and initiate a backup while the cluster is still in the initialization state, the backup fails immediately after the startingDeadlineSeconds timeout is reached.
In PSMDB:
In main:
If we run a backup before the cluster object is created, it fails immediately with the error:
error: cluster not found
When we launch a cluster and trigger a backup during its initialization phase, the operator waits for the cluster to reach the READY state before assigning a failed status to the backup.
Check the timestamps for the cluster and the backup:
2025-05-21T17:10:18.789Z ERROR failed to make backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"main-1"}, "namespace": "main-1", "name": "backup1", "reconcileID": "4d40b3aa-726f-466b-91d6-fdc43219cbe2", "backup": "backup1", "error": "failed to run
In branch K8SPSMDB-1319:
When we start a cluster and trigger a backup during its initialization, it behaves like the main branch, failing the backup once the cluster is READY. This happens even with startingDeadlineSeconds set, as the backup fails if the timeout expires before the cluster is ready.
When we wait for the cluster to reach a READY state, then transition it back to the initialization state before triggering a backup, the startingDeadlineSeconds setting functions as expected.
CHECKLIST
Jira
- Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Tests
- Are the OpenShift compare files (compare/*-oc.yml) updated if needed?
Config/Logging/Testability