Improve Exception Handling

## Motivation

Some issues related to exception handling have been reported over time:
1. #548 
2. #601
3. #997 
4. #553 
5. #1055 
6. #970 
7. #1076 
8. #608 
9. #340 
10. #1090 

What they have in common is that users want to know how they can handle exceptions from self-written components at the caller level of the statemachine. They often ask why exceptions get caught inside the machine but are not passed to the outside.

It seems there are 2 principles that try to explain why Spring Statemachine catches exceptions extensively:
1. [Run To Completion](https://github.com/spring-projects/spring-statemachine/issues/1055#issuecomment-1157819623)
2. [Do Not Break The Machine](https://github.com/spring-projects/spring-statemachine/issues/183#issue-138674552) (see also #206 )

While one could argue that number 2 is merely a requirement derived from number 1, "Run To Completion" actually refers only to how events should be processed. The "one by one" approach explained on the linked pages is important because the machine needs to be in a well-defined and stable state before it can go on.

However, this does not take exceptions or errors into account. From my personal point of view I doubt these are even considered events (at least not the ones that occur unexpectedly). That is why I believe when it comes to exception handling we should move away from the very fundamental RTC paradigm and take a look at how users can operate a statemachine meaningfully.

## Room For Improvement

In the following I will refer to exceptions in actions or guards but most of the points can be applied to other user-written components as well (e.g. listeners, interceptors).

The current (3.2.0) implementation has at least 2 shortcomings that force the implementer to take extra measures against exceptions:
1. Exceptions are _not always_ propagated to the caller starting the statemachine or sending an event.
   * As a consequence, callers must choose a container (e.g. _extended state_) to store an exception that happens during statemachine execution. This container must be accessible from outside the machine so that one can read and evaluate the exception after execution.
2. Exceptions can make a statemachine _impossible to reuse_. E.g. when they occur in an action bound to a triggerless transition, the machine literally hangs in the transit state where the transition originates.
   * Users that wish to reuse the same statemachine instance between events must extend their exception handling by one of the following:
     * Add a looping event transition to the transit state, i.e. one that leads to the same state again. Once an exception occurs, the event has to be sent to continue on the former path, since triggerless transitions get executed only once a state is entered.
     * Create a machine backup, e.g. with the help of Spring `StateMachinePersister`. This can be used to restore the statemachine from a stable state. In technical terms this means the machine experiences a reset.

## Exception Handling Test

To demonstrate these points I created a test based on the following statemachine:


<img src="https://github.com/spring-projects/spring-statemachine/assets/55539338/bb6d1d38-0b75-40c7-b80b-34d4bf15d805" width=75% height=75%>


* _S1_ .. _S5_ := states
  * _S1_ := initial state
  * _S2_ := choice
  * _S3_ := event accepting state
  * _S4_ := transit state (with state entry + behavior + exit action)
  * _S5_ := end state
* _E_ := event
* _a_ := action
* _g_ := guard

The action is always registered with an `errorAction` that writes the exception into a container, specifically the `ExtendedState`. The guard is registered using a wrapper around it that provides the same exception handling.
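The container pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `ExceptionContainerSketch`, the key `lastException` and the use of a plain `Map` in place of the `ExtendedState` variables are all hypothetical stand-ins, not the Spring Statemachine API and not the code from the attached test project. The sketch catches `RuntimeException` only; what the real wrappers catch may differ.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in types: a Map plays the role of the extended state variables.
final class ExceptionContainerSketch {

    // Assumed key under which the last exception is stored.
    static final String LAST_EXCEPTION = "lastException";

    // Wraps an action: on failure the exception is stored in the container
    // instead of being rethrown (the role of an error action).
    static Consumer<Map<String, Object>> withErrorAction(Consumer<Map<String, Object>> action) {
        return variables -> {
            try {
                action.accept(variables);
            } catch (RuntimeException e) {
                variables.put(LAST_EXCEPTION, e);
            }
        };
    }

    // Wraps a guard: a failing guard stores the exception and evaluates to false.
    static Predicate<Map<String, Object>> guarded(Predicate<Map<String, Object>> guard) {
        return variables -> {
            try {
                return guard.test(variables);
            } catch (RuntimeException e) {
                variables.put(LAST_EXCEPTION, e);
                return false;
            }
        };
    }
}
```

After execution, the caller has to look into the container itself (here `variables.get(ExceptionContainerSketch.LAST_EXCEPTION)`), which is exactly the extra measure criticized in "Room For Improvement".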

The last column states whether the statemachine can be reused for transition execution. In case of a result state with only a triggerless transition or the end state the machine is stopped, reset and started again to see if its state changes. In case of _S3_ that accepts an event another event is sent with no exception to see if the transition is taken.

### Test Project

Attached. Built with Java 17.0.6 and Gradle 8.1.1.

[statemachine-exception-handling.zip](https://github.com/spring-projects/spring-statemachine/files/11711149/statemachine-exception-handling.zip)

Note that the test cases were written so that they do not fail on the issues listed below. Instead, a comment marks each assertion that demonstrates an error.

### Test Report

Item #|Path To Exception|Exception Origin|Exception Type|Exception Propagated?|Exception In Container?|Result State|SSM Reusable?
:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
1|start &rarr; _S1_|initial action|`RuntimeException`|&check;|&check;|&cross;|&cross;
2|start &rarr; _S1_|initial action|`Error`|&check;|&cross;|&cross;|&cross;
3|start &rarr; _S1_ &rarr; _S2_|_S1_ &rarr; _S2_ action|`RuntimeException`|&check;|&check;|_S1_|&cross;
4|start &rarr; _S1_ &rarr; _S2_|_S1_ &rarr; _S2_ action|`Error`|&check;|&cross;|_S1_|&cross;
5|start &rarr; _S1_ &rarr; _S2_|_S2_ option 1 guard|`RuntimeException`|&cross;|&check;|_S5_|&cross;
6|start &rarr; _S1_ &rarr; _S2_|_S2_ option 1 guard|`Error`|&cross;|&check;|_S5_|&cross;
7|start &rarr; _S1_ &rarr; _S2_|_S2_ option 1 action|`RuntimeException`|&check;|&check;|_S1_|&cross;
8|start &rarr; _S1_ &rarr; _S2_|_S2_ option 1 action|`Error`|&check;|&cross;|_S1_|&cross;
9|start &rarr; _S1_ &rarr; _S2_|_S2_ option guards + default option action|`RuntimeException`|only action exception|&check;|_S1_|&cross;
10|start &rarr; _S1_ &rarr; _S2_|_S2_ option guards + default option action|`Error`|only action error|only guard errors|_S1_|&cross;
11|start &rarr; _S1_ &rarr; _S2_ &rarr; _S3_ &rarr; _S4_|_S3_ &rarr; _S4_ action|`RuntimeException`|&cross;|&check;|_S3_|&check;
12|start &rarr; _S1_ &rarr; _S2_ &rarr; _S3_ &rarr; _S4_|_S3_ &rarr; _S4_ action|`Error`|&check;|&cross;|_S3_|&cross;
13|start &rarr; _S1_ &rarr; _S2_ &rarr; _S3_ &rarr; _S4_|_S3_ &rarr; _S4_ guard|`RuntimeException`|&cross;|&check;|_S3_|&check;
14|start &rarr; _S1_ &rarr; _S2_ &rarr; _S3_ &rarr; _S4_|_S3_ &rarr; _S4_ guard|`Error`|&check;|&check;|_S3_|&cross;
15|start &rarr; _S1_ &rarr; _S2_ &rarr; _S4_|_S4_ state entry + behavior + exit action|`RuntimeException`|&cross;|missing or extra exit action exception|_S4_ or _S5_|&cross;
16|start &rarr; _S1_ &rarr; _S2_ &rarr; _S4_|_S4_ state entry action|`Error`|&check;|&cross;|_S4_|&cross;
17|start &rarr; _S1_ &rarr; _S2_ &rarr; _S4_|_S4_ state behavior action|`Error`|&cross;|&cross;|_S5_|&cross;
18|start &rarr; _S1_ &rarr; _S2_ &rarr; _S4_|_S4_ state exit action|`Error`|only sometimes|&cross;|_S4_ or _S5_|&cross;

### Test Result Groups

#### Error Gets Propagated

Test cases in which an error gets propagated to the caller can be considered OK in my opinion. The machine cannot be reused, but since we experienced an error this is probably not what we want anyway. This applies to test items 2, 4, 8, 12, 14 and 16.

#### Error Not Propagated

This was demonstrated for type `Error` in choice option guards in test items 6 and 10. Note that in 10 this allows an error from an action to slip through. The machine also continues transition execution and enters the end state. The same applies to item 17, where the error occurs in a state behavior action. The expected behavior here would be to terminate execution right away.

#### Statemachine Gets Caught In State After Exception

In case of type `Exception` we may want to reuse the machine to start it again or re-send an event, because the cause of the exception might be temporary. This will not be possible if the result state of the statemachine does not allow for that. This was described earlier in "Room For Improvement", point 2, and applies to test items 1, 3, 5, 7, 9 and 15.

#### Transition Execution Not Interrupted After Exception

Some test items demonstrate that the statemachine continues its transition logic even though an exception occurred. This applies to the choice option guards as seen in test items 5 and 6 as well as to the state actions from 15. A more severe case is test item 17, where the end state is entered despite an error in the _S4_ behavior action.

#### Flaky Runs

Random erroneous behavior was experienced in test items 15 and 18 where the exit action from _S4_ fires. Sometimes the action is late, meaning that it has not been executed yet at the time of verification. You may modify the tests to make the thread wait for another second before mock verification to see that it does finally execute. At other times the same action executes twice. Possibly related to:
* #384 
* #493 

The same happened in test items 11 and 13 when the event was sent a 2nd time (without exception).

What's more, the exception propagated to the caller is not always what we would expect:
```
java.util.ConcurrentModificationException
	at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1013)
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:967)
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:259)
	
	[...100 more...]

	at reactor.core.publisher.FluxGenerate$GenerateSubscription.next(FluxGenerate.java:178)
	at org.springframework.statemachine.support.ReactiveStateMachineExecutor.lambda$handleTriggerlessTransitions$18(ReactiveStateMachineExecutor.java:349)
	at reactor.core.publisher.FluxGenerate.lambda$new$1(FluxGenerate.java:58)
	
	[...100 miles down the reactor...]

	at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
	at reactor.core.publisher.Mono.block(Mono.java:1706)
	at org.springframework.statemachine.support.LifecycleObjectSupport.start(LifecycleObjectSupport.java:111)
```

This could be related to an issue that was meant to be fixed:
* #736 

## Improvement Proposals

### P-1: Propagate Exceptions
Catch an exception, interrupt transition execution, and rethrow the exception. Let it exit the machine so that operators can try-catch.
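
A minimal sketch of what this proposal asks for, in plain Java. All names here (`PropagatingExecutorSketch`, `TransitionFailedException`, `execute`) are hypothetical and only illustrate the idea; this is not how the Spring Statemachine executor is implemented. The steps stand in for the actions and guards of a transition chain.

```java
import java.util.Map;

// Hypothetical sketch, not the Spring Statemachine executor.
final class PropagatingExecutorSketch {

    // Assumed wrapper type so that callers can catch a single exception type.
    static final class TransitionFailedException extends RuntimeException {
        TransitionFailedException(String step, Throwable cause) {
            super("Transition step failed: " + step, cause);
        }
    }

    // Executes the named steps in iteration order. The first failing step
    // interrupts execution; remaining steps are skipped and the exception
    // is rethrown to the caller with the original exception as cause.
    static void execute(Map<String, Runnable> steps) {
        for (Map.Entry<String, Runnable> step : steps.entrySet()) {
            try {
                step.getValue().run();
            } catch (RuntimeException e) {
                throw new TransitionFailedException(step.getKey(), e);
            }
        }
    }
}
```

With an insertion-ordered map such as `LinkedHashMap`, a caller can wrap `execute(steps)` in an ordinary try-catch and inspect `getCause()`, without reading a container afterwards.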

### P-2: Recover The Statemachine
This could be achieved by a _Back To Origin_ approach where the machine is reset to:
* the pre-start state if the machine was started
* the pre-event state if an event was sent

State in this context refers to any part of the statemachine including extended state, current error, etc.

This could make sense as a general feature but may be optional as well: `configurer.withRecovery()`. E.g. when a statemachine is persisted in a database (via parts of its `StateMachineContext`) and calling threads only query the machine to restore it, send a single event, evaluate success and then persist it again, one will not need statemachine recovery in case of an exception. In essence, there might be users who want to reuse the machine for several events and others who do not.
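
The _Back To Origin_ idea can be illustrated with a small plain-Java sketch. The class `RecoverableMachineSketch` and its methods are hypothetical and heavily simplified (a single state name plus a variable map, copied shallowly); they are not part of Spring Statemachine and say nothing about how recovery would have to be implemented there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified machine: one current state plus extended variables.
final class RecoverableMachineSketch {
    private String state;
    private final Map<String, Object> variables = new HashMap<>();

    RecoverableMachineSketch(String initialState) {
        this.state = initialState;
    }

    String state() { return state; }
    Map<String, Object> variables() { return variables; }
    void moveTo(String newState) { state = newState; }

    // Applies the transition logic for one event. On failure the pre-event
    // state and variables are restored, then the exception is rethrown.
    void sendEvent(Consumer<RecoverableMachineSketch> transition) {
        String stateBefore = state;
        Map<String, Object> variablesBefore = new HashMap<>(variables); // shallow snapshot
        try {
            transition.accept(this);
        } catch (RuntimeException e) {
            state = stateBefore;
            variables.clear();
            variables.putAll(variablesBefore);
            throw e;
        }
    }
}
```

Because the machine is back in its pre-event state after a failure, the same instance can accept the event again once the cause of the exception has gone away.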

### P-3: Keep Up Development
Based on the number of issues that have piled up and reasonable doubt that has been expressed:
* #1081 

I guess a lot of users would be happy to see progress on, but not limited to, this topic. One way to start would be to keep up communication with those involved in issues.

