Description
Motivation
Some issues related to exception handling have been reported over time:
- Spring State Machine Error Handling not working #548
- Exposing exception in StateMachine #601
- reactive error handling #997
- Not able to catch and rethrow spring state machine error #553
- Is there way to throw an Exception that occurred inside the Statemachine to the outside? #1055
- Make exceptions that prevent transitions available #970
- Error in Action doesn't trigger rollback #1076
- Exceptionhandling in guards #608
- Spring Security Exception and how to handle it. #340
- how to error handling in StateMachineInterceptorAdapter override method preStateChange #1090
What they have in common is that users want to know how they can handle exceptions from self-written components at the caller level of the statemachine. They often ask why exceptions get caught inside the machine but are not passed to the outside.
It seems there are 2 principles that try to explain why Spring Statemachine catches exceptions extensively:
While one could argue that number 2 is merely a requirement derived from number 1, "Run To Completion" actually refers only to how events should be processed. The "one by one" approach, as explained on the linked sites, is important because the machine needs to be in a well-defined and stable state before it can go on.
However, this does not take exceptions or errors into account. From my personal point of view I doubt these are even considered events (at least not the ones that occur unexpectedly). That is why I believe when it comes to exception handling we should move away from the very fundamental RTC paradigm and take a look at how users can operate a statemachine meaningfully.
Room For Improvement
In the following I will refer to exceptions in actions or guards but most of the points can be applied to other user-written components as well (e.g. listeners, interceptors).
The current (3.2.0) implementation has at least 2 shortcomings that force the implementer to take extra measures against exceptions:
- Exceptions are not always propagated to the caller starting the statemachine or sending an event.
  - As a consequence, callers must choose a container (e.g. extended state) to store an exception that happens during statemachine execution. This container must be accessible from outside the machine so that one can read and evaluate the exception after execution.
- Exceptions can make a statemachine impossible to reuse. E.g. when they occur in an action bound to a triggerless transition, the machine literally hangs in the transit state where the transition originates.
  - Users that wish to reuse the same statemachine instance between events must extend their exception handling by one of the following:
    - Add a looping event transition to the transit state, i.e. one that leads to the same state again. Once an exception occurs, the event has to be sent to continue on the former path, since triggerless transitions get executed only once a state is entered.
    - Create a machine backup, e.g. with the help of Spring `StateMachinePersister`. This can be used to restore the statemachine from a stable state. In technical terms this means the machine experiences a reset.
Exception Handling Test
To demonstrate these points I created a test based on the following statemachine:
- S1 .. S5 := states
- S1 := initial state
- S2 := choice
- S3 := event accepting state
- S4 := transit state (with state entry + behavior + exit action)
- S5 := end state
- E := event
- a := action
- g := guard
The action is always registered with an `errorAction` that writes the exception into a container, specifically the `ExtendedState`. The guard is registered using a wrapper around it that provides the same exception handling.
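The guard wrapper described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from the attached test project: `ErrorContainer` is a hypothetical stand-in for the `ExtendedState` variables, `java.util.function.Predicate` stands in for a guard, and the key name `"exception"` is made up. Whether the wrapper should also catch `Error` is left open here; the sketch catches `RuntimeException` only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for the variables of the statemachine's ExtendedState. */
final class ErrorContainer {
    static final String KEY = "exception"; // assumed key name, not a library constant
    private final Map<String, Object> variables = new HashMap<>();

    void store(Throwable t) {
        variables.put(KEY, t);
    }

    Throwable stored() {
        return (Throwable) variables.get(KEY);
    }
}

/** Hypothetical helper; a Predicate plays the role of a guard. */
final class SafeGuards {
    private SafeGuards() {
    }

    /**
     * Wraps a guard so that a RuntimeException is recorded in the container
     * and the guard evaluates to false instead of throwing.
     */
    static <C> Predicate<C> recording(Predicate<C> guard, ErrorContainer container) {
        return context -> {
            try {
                return guard.test(context);
            } catch (RuntimeException e) {
                container.store(e); // readable by the caller after execution
                return false;       // a failed guard blocks the transition
            }
        };
    }
}
```

After the machine has run, the caller would inspect `container.stored()` to find out whether a guard failed, which is exactly the extra step the first shortcoming above forces on implementers.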
The last column states whether the statemachine can be reused for transition execution. In case of a result state with only a triggerless transition or the end state the machine is stopped, reset and started again to see if its state changes. In case of S3 that accepts an event another event is sent with no exception to see if the transition is taken.
Test Project
Attached. Built with Java 17.0.6 and Gradle 8.1.1.
statemachine-exception-handling.zip
Note that the test cases were written so that they do not fail on the issues listed below. Instead, a comment was added to each assertion that proves an error.
Test Report
Item # | Path To Exception | Exception Origin | Exception Type | Exception Propagated? | Exception In Container? | Result State | SSM Reusable? |
---|---|---|---|---|---|---|---|
1 | start → S1 | initial action | RuntimeException | ✓ | ✓ | ✗ | ✗ |
2 | start → S1 | initial action | Error | ✓ | ✗ | ✗ | ✗ |
3 | start → S1 → S2 | S1 → S2 action | RuntimeException | ✓ | ✓ | S1 | ✗ |
4 | start → S1 → S2 | S1 → S2 action | Error | ✓ | ✗ | S1 | ✗ |
5 | start → S1 → S2 | S2 option 1 guard | RuntimeException | ✗ | ✓ | S5 | ✗ |
6 | start → S1 → S2 | S2 option 1 guard | Error | ✗ | ✓ | S5 | ✗ |
7 | start → S1 → S2 | S2 option 1 action | RuntimeException | ✓ | ✓ | S1 | ✗ |
8 | start → S1 → S2 | S2 option 1 action | Error | ✓ | ✗ | S1 | ✗ |
9 | start → S1 → S2 | S2 option guards + default option action | RuntimeException | only action exception | ✓ | S1 | ✗ |
10 | start → S1 → S2 | S2 option guards + default option action | Error | only action error | only guard errors | S1 | ✗ |
11 | start → S1 → S2 → S3 → S4 | S3 → S4 action | RuntimeException | ✗ | ✓ | S3 | ✓ |
12 | start → S1 → S2 → S3 → S4 | S3 → S4 action | Error | ✓ | ✗ | S3 | ✗ |
13 | start → S1 → S2 → S3 → S4 | S3 → S4 guard | RuntimeException | ✗ | ✓ | S3 | ✓ |
14 | start → S1 → S2 → S3 → S4 | S3 → S4 guard | Error | ✓ | ✓ | S3 | ✗ |
15 | start → S1 → S2 → S4 | S4 state entry + behavior + exit action | RuntimeException | ✗ | missing or extra exit action exception | S4 or S5 | ✗ |
16 | start → S1 → S2 → S4 | S4 state entry action | Error | ✓ | ✗ | S4 | ✗ |
17 | start → S1 → S2 → S4 | S4 state behavior action | Error | ✗ | ✗ | S5 | ✗ |
18 | start → S1 → S2 → S4 | S4 state exit action | Error | only sometimes | ✗ | S4 or S5 | ✗ |
Test Result Groups
Error Gets Propagated
Test cases in which an error gets propagated to the caller can be considered OK in my opinion. The machine cannot be reused, but since we experienced an error this is probably not what we want anyway. This applies to test items 2, 4, 8, 12, 14 and 16.
Error Not Propagated
This was demonstrated in case of type `Error` in choice option guards in test items 6 and 10. Note that in item 10 this allows an error from an action to slip through. The machine also continues transition execution and enters the end state. The same applies to item 17 where the error occurs in a state behavior action. The expected behavior here would be to terminate execution right away.
Statemachine Gets Caught In State After Exception
In case of type `Exception` we may want to reuse the machine, either to start it again or to re-send an event, because the cause of the exception might be temporary. This will not be possible if the result state of the statemachine does not allow for that. This was described earlier as point 2 under "Room For Improvement" and applies to test items 1, 3, 5, 7, 9 and 15.
Transition Execution Not Interrupted After Exception
Some test items demonstrate that the statemachine continues its transition logic even though an exception occurred. This applies to the choice option guards as seen in test items 5 and 6 as well as to the state actions in item 15. A more severe case is test item 17 where, despite an error in the S4 behavior action, the end state is entered.
Flaky Runs
Random erroneous behavior was experienced in test items 15 and 18 where the exit action from S4 fires. Sometimes the action is late, meaning that it has not been executed yet at the time of verification. You may modify the tests to make the thread wait for another second before mock verification to see that it does finally execute. At other times the same action executes twice. Possibly related to:
- Error in action causes double execution #384
- StateAction randomly not executed (race condition?) #493
The same happened in test items 11 and 13 when the event was sent a 2nd time (without exception).
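As an alternative to a fixed one-second sleep before verification, a test could wait on a latch that the action releases. The sketch below is not taken from the attached test project; `ActionProbe` is a hypothetical helper written against the JDK only, and it would have to be called from inside the real action. It also counts invocations, so a late execution and a double execution can be told apart.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical probe that records when and how often an action ran. */
final class ActionProbe implements Runnable {
    private final AtomicInteger executions = new AtomicInteger();
    private final CountDownLatch firstExecution = new CountDownLatch(1);

    @Override
    public void run() {
        executions.incrementAndGet();  // count first, then release waiters
        firstExecution.countDown();
    }

    /** Blocks until the action has run at least once or the timeout elapses. */
    boolean awaitFirstExecution(long timeoutMillis) {
        try {
            return firstExecution.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Number of executions seen so far; a value above 1 indicates a double execution. */
    int executionCount() {
        return executions.get();
    }
}
```

A test would call `awaitFirstExecution` with a generous timeout instead of sleeping. Note that a double execution arriving after the check of `executionCount()` would still go unnoticed, so this narrows the race rather than removing it.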
What's more, the exception propagated to the caller is not always what we would expect:
java.util.ConcurrentModificationException
at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1013)
at java.base/java.util.ArrayList$Itr.next(ArrayList.java:967)
at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:259)
[...100 more...]
at reactor.core.publisher.FluxGenerate$GenerateSubscription.next(FluxGenerate.java:178)
at org.springframework.statemachine.support.ReactiveStateMachineExecutor.lambda$handleTriggerlessTransitions$18(ReactiveStateMachineExecutor.java:349)
at reactor.core.publisher.FluxGenerate.lambda$new$1(FluxGenerate.java:58)
[...100 miles down the reactor...]
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.Mono.block(Mono.java:1706)
at org.springframework.statemachine.support.LifecycleObjectSupport.start(LifecycleObjectSupport.java:111)
This could be related to an issue that was meant to be fixed:
Improvement Proposals
P-1: Propagate Exceptions
Catch an exception, interrupt transition execution, and rethrow the exception. Let it exit the machine so that operators can try-catch.
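To make the intent of P-1 concrete, here is a minimal sketch using a toy machine. It is not the Spring Statemachine API: `PropagatingMachine` and `TransitionException` are hypothetical names, and a `Runnable` stands in for a transition action. The point is only the control flow: the action's exception interrupts the transition, the state stays unchanged, and the caller sees the exception.

```java
/** Hypothetical exception type carrying the interrupted transition. */
final class TransitionException extends RuntimeException {
    TransitionException(String source, String target, Throwable cause) {
        super("Transition " + source + " -> " + target + " failed", cause);
    }
}

/** Hypothetical toy machine; a Runnable plays the role of a transition action. */
final class PropagatingMachine {
    private String state;

    PropagatingMachine(String initialState) {
        this.state = initialState;
    }

    String state() {
        return state;
    }

    /**
     * Runs the action and enters the target state only if the action completes.
     * An exception interrupts the transition and is rethrown to the caller.
     */
    void transition(String target, Runnable action) {
        try {
            action.run();
        } catch (RuntimeException e) {
            // The target state is never entered; the machine stays where it was.
            throw new TransitionException(state, target, e);
        }
        state = target;
    }
}
```

With this behavior an operator could write an ordinary `try { machine.transition(...) } catch (TransitionException e) { ... }` instead of reading an exception out of a container afterwards.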
P-2: Recover The Statemachine
This could be achieved by a Back To Origin approach where the machine is reset to:
- the pre-start state if the machine was started
- the pre-event state if an event was sent
State in this context refers to any part of the statemachine including extended state, current error, etc.
This could make sense as a general feature but may be optional as well: `configurer.withRecovery()`. E.g. when a statemachine is persisted in a database (via parts of its `StateMachineContext`) and calling threads only query the machine to restore it, send a single event, evaluate success and then persist it again, one will not need statemachine recovery in case of an exception. In essence, some users want to reuse the machine for several events while others do not.
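The Back To Origin idea can be sketched with a toy machine that snapshots its state before an event and restores the snapshot if the action fails. This is an illustration under stated assumptions, not a proposal for the actual implementation: `RecoverableMachine` is a hypothetical class, a plain `Map` stands in for the extended state, and the snapshot is a shallow copy, so mutable values stored in the map would not be rolled back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical toy machine demonstrating reset to the pre-event state. */
final class RecoverableMachine {
    private String state;
    private Map<String, Object> extendedState = new HashMap<>();

    RecoverableMachine(String initialState) {
        this.state = initialState;
    }

    String state() {
        return state;
    }

    Map<String, Object> extendedState() {
        return extendedState;
    }

    /**
     * Runs the action and enters the target state. If the action throws,
     * state and extended state are restored to their pre-event values
     * and the exception is rethrown to the caller.
     */
    void sendEvent(String target, Consumer<Map<String, Object>> action) {
        String stateBefore = state;
        Map<String, Object> variablesBefore = new HashMap<>(extendedState); // shallow snapshot
        try {
            action.accept(extendedState);
            state = target;
        } catch (RuntimeException e) {
            state = stateBefore;             // back to origin
            extendedState = variablesBefore; // discard partial changes
            throw e;
        }
    }
}
```

After a failed event the machine accepts the same event again, which is the reuse scenario described for temporary exceptions.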
P-3: Keep Up Development
Based on the number of issues that have piled up and reasonable doubt that has been expressed:
I guess a lot of users would be happy to see progress on, but not limited to, this topic. One way to start would be to keep up communication with those involved in issues.