From 6a9755593eafc5990171936c7e0ef03987adcdc6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Wed, 21 May 2025 14:21:29 +0200 Subject: [PATCH 01/17] blog: caching MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- .../blog/news/primary-cache-for-next-recon.md | 62 +++++++++++++++++++ .../source/informer/InformerWrapper.java | 1 + 2 files changed, 63 insertions(+) create mode 100644 docs/content/en/blog/news/primary-cache-for-next-recon.md diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md new file mode 100644 index 0000000000..f1a58ed879 --- /dev/null +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -0,0 +1,62 @@ +--- +title: Custom resource change guarantees for next reconciliation +date: 2025-05-21 +author: >- + [Attila Mészáros](https://github.com/csviri) +--- + +We recently released v5.1 of Java Operator SDK. One of the highlights of this release is related to a topic of so-called +[allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values +) in Kubernetes. + +To sum up the problem, for example, if we create a resource in our controller that has a generated identifier - +in other words the new resource cannot be addressed only by using the values from the `.spec` of the resource - +we have to store it, commonly in the `.status` of the custom resource. However, operator frameworks cache resources +using informers, so the update that you made to the status of the custom resource will just eventually get into +the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that you will +see the stale custom resource in the cache (in another word, the cache is eventually consistent). 
This is a problem +since you might not know at that point that the desired resources were already created, so it might happen that you try to +create it again. + +Java Operator SDK now out of the box provides a utility class [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) +if you use it, the framework guarantees that the next reconciliation will always receive the updated resource: + +```java + @Override + public UpdateControl reconcile( + StatusPatchCacheCustomResource resource, + Context context) { + + // omitted code + + var freshCopy = createFreshCopy(resource); + + freshCopy + .getStatus() + .setValue(statusWithAllocatedValue()); + + // using the utility instead of update control + var updated = + PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context); + return UpdateControl.noUpdate(); + } +``` + +This utility class will do the magic, but how does it work? Actually, there are multiple ways to solve this problem, +but in the end, we decided to provide only the mentioned approach. (If you want to dig deep see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). + +The trick is that we can cache the resource of our update in an additional cache on top of the informer's cache. +If we read the resource, we first check if the resource is in the overlay cache and only read it from the Informers cache +if not present there. If the informer receives an event with that resource, we always remove the resource from the overlay +cache. But this **works only** if the update is done **with optimistic locking**. +So if the update fails on conflict, we simply read the resource from the server, apply your changes and try to update again +with optimistic locking. + +So why optimistic locking? 
(A bit simplified explanation) Note that if we do not update the resource with optimistic locking, it can happen that +another party does an update on resource just before we do. The informer receives the event from another party's update, +if we compare resource versions with this resource and previously cached resource (we used to do our update), +that would be different, and in general there is no elegant way to determine in general if this new version that +informer receives an event for is from an update that happened before or after our update. +(Note that informers watch can lose connection and other edge cases) +If we do an update with optimistic locking it simplifies the situation, we can easily have strong guarantees. + diff --git a/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java b/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java index c07ffdbf46..dc788f22dc 100644 --- a/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java +++ b/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java @@ -79,6 +79,7 @@ public void start() throws OperatorException { if (!configurationService.stopOnInformerErrorDuringStartup()) { informer.exceptionHandler((b, t) -> !ExceptionHandler.isDeserializationException(t)); } + // change thread name for easier debugging final var thread = Thread.currentThread(); final var name = thread.getName(); From 77db6f50bc97a4e7388e05eb0332ca02541dd33d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Wed, 21 May 2025 16:59:05 +0200 Subject: [PATCH 02/17] docs: blogpost about primary caching MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- 
docs/content/en/blog/news/primary-cache-for-next-recon.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index f1a58ed879..fd1c4ce387 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -58,5 +58,10 @@ if we compare resource versions with this resource and previously cached resourc that would be different, and in general there is no elegant way to determine in general if this new version that informer receives an event for is from an update that happened before or after our update. (Note that informers watch can lose connection and other edge cases) -If we do an update with optimistic locking it simplifies the situation, we can easily have strong guarantees. + +If we do an update with optimistic locking it simplifies the situation, we can easily have strong guarantees. +Since we know if the update with optimistic locking is successful, we had the fresh resource in our cache. +Thus, the next event we receive will be the one that is results of our update or a newer one. +So if we cache the resource in the overlay cache we know that with the next event, we can remove it from there. 
+ From f59df34f137d792c5dd0f802f9b77a741f870f92 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Wed, 21 May 2025 17:07:24 +0200 Subject: [PATCH 03/17] wip MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index fd1c4ce387..f680d60125 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -29,8 +29,7 @@ if you use it, the framework guarantees that the next reconciliation will always // omitted code - var freshCopy = createFreshCopy(resource); - + var freshCopy = createFreshCopy(resource); // need fresh copy just because we use the SSA version of update freshCopy .getStatus() .setValue(statusWithAllocatedValue()); @@ -63,5 +62,5 @@ If we do an update with optimistic locking it simplifies the situation, we can e Since we know if the update with optimistic locking is successful, we had the fresh resource in our cache. Thus, the next event we receive will be the one that is results of our update or a newer one. So if we cache the resource in the overlay cache we know that with the next event, we can remove it from there. - - +If the update with optimistic locking fails, we can wait until the informer's cache is populated with next resource +version and retry. 
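Reviewer note on PATCH 03/17: the paragraph added here says that a failed update with optimistic locking is retried once a newer resource version is available. A minimal, dependency-free sketch of that loop is shown below. It is not the JOSDK implementation: `OptimisticUpdateSketch`, `FakeServer`, `Resource`, and `updateWithRetry` are hypothetical names, the numeric version counter is a simplification (real resource versions are opaque strings), and re-reading from the fake server stands in for waiting until the informer cache holds a newer version.

```java
import java.util.function.UnaryOperator;

/** Hypothetical model of an update with optimistic locking; not the JOSDK implementation. */
class OptimisticUpdateSketch {

  /** Minimal stand-in for a custom resource. */
  static final class Resource {
    final String resourceVersion;
    final String status;

    Resource(String resourceVersion, String status) {
      this.resourceVersion = resourceVersion;
      this.status = status;
    }
  }

  /** In-memory stand-in for the API server; it bumps a numeric version on every write. */
  static final class FakeServer {
    private Resource stored;

    FakeServer(Resource initial) {
      this.stored = initial;
    }

    Resource read() {
      return stored;
    }

    /** Accepts the write only if it is based on the stored version; returns null on conflict. */
    Resource update(Resource desired) {
      if (!stored.resourceVersion.equals(desired.resourceVersion)) {
        return null;
      }
      long next = Long.parseLong(stored.resourceVersion) + 1;
      stored = new Resource(Long.toString(next), desired.status);
      return stored;
    }
  }

  /** Applies the change to the latest known resource and retries on conflict. */
  static Resource updateWithRetry(
      FakeServer server, Resource base, UnaryOperator<Resource> change, int maxAttempts) {
    Resource current = base;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Resource response = server.update(change.apply(current));
      if (response != null) {
        return response; // success: the response carries our change
      }
      // Conflict: real code would wait for the informer cache to show a newer version.
      current = server.read();
    }
    throw new IllegalStateException("update still conflicting after " + maxAttempts + " attempts");
  }
}
```

The change is passed as a function so that it can be re-applied to whatever version is current after a conflict, which mirrors the "apply the changes again and retry" step in the post.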
From 6b0eff6b9e9e27eed2d2af6a9fcc5eca8c640e53 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Wed, 21 May 2025 17:49:34 +0200 Subject: [PATCH 04/17] wip MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- .../processing/event/source/informer/InformerWrapper.java | 1 - 1 file changed, 1 deletion(-) diff --git a/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java b/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java index dc788f22dc..c07ffdbf46 100644 --- a/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java +++ b/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/InformerWrapper.java @@ -79,7 +79,6 @@ public void start() throws OperatorException { if (!configurationService.stopOnInformerErrorDuringStartup()) { informer.exceptionHandler((b, t) -> !ExceptionHandler.isDeserializationException(t)); } - // change thread name for easier debugging final var thread = Thread.currentThread(); final var name = thread.getName(); From 703bfca122ee158891e286779143e18d832938c4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 08:42:31 +0200 Subject: [PATCH 05/17] Update docs/content/en/blog/news/primary-cache-for-next-recon.md Co-authored-by: Martin Stefanko --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index f680d60125..76b84a2072 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -60,7 +60,7 @@ informer 
receives an event for is from an update that happened before or after o If we do an update with optimistic locking it simplifies the situation, we can easily have strong guarantees. Since we know if the update with optimistic locking is successful, we had the fresh resource in our cache. -Thus, the next event we receive will be the one that is results of our update or a newer one. +Thus, the next event we receive will be the one that is the result of our update or a newer one. So if we cache the resource in the overlay cache we know that with the next event, we can remove it from there. If the update with optimistic locking fails, we can wait until the informer's cache is populated with next resource version and retry. From 6259ac0e722a21a53605dab7165daea12eb6131c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 08:52:04 +0200 Subject: [PATCH 06/17] Update primary-cache-for-next-recon.md --- .../blog/news/primary-cache-for-next-recon.md | 33 +++++++++---------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index 76b84a2072..b82449a335 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -10,13 +10,13 @@ We recently released v5.1 of Java Operator SDK. One of the highlights of this re ) in Kubernetes. To sum up the problem, for example, if we create a resource in our controller that has a generated identifier - -in other words the new resource cannot be addressed only by using the values from the `.spec` of the resource - +in other words, the new resource cannot be addressed only by using the values from the `.spec` of the resource - we have to store it, commonly in the `.status` of the custom resource. 
However, operator frameworks cache resources using informers, so the update that you made to the status of the custom resource will just eventually get into the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that you will see the stale custom resource in the cache (in another word, the cache is eventually consistent). This is a problem since you might not know at that point that the desired resources were already created, so it might happen that you try to -create it again. +create them again. Java Operator SDK now out of the box provides a utility class [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) if you use it, the framework guarantees that the next reconciliation will always receive the updated resource: @@ -41,26 +41,25 @@ if you use it, the framework guarantees that the next reconciliation will always } ``` -This utility class will do the magic, but how does it work? Actually, there are multiple ways to solve this problem, -but in the end, we decided to provide only the mentioned approach. (If you want to dig deep see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). +This utility class will do the magic, but how does it work? There are multiple ways to solve this problem, +but ultimately, we only provided the mentioned approach. (If you want to dig deep, see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). -The trick is that we can cache the resource of our update in an additional cache on top of the informer's cache. -If we read the resource, we first check if the resource is in the overlay cache and only read it from the Informers cache +The trick is to cache the resource of our update in an additional cache on top of the informer's cache. 
+If we read the resource, we first check if it is in the overlay cache and only read it from the Informers cache if not present there. If the informer receives an event with that resource, we always remove the resource from the overlay cache. But this **works only** if the update is done **with optimistic locking**. -So if the update fails on conflict, we simply read the resource from the server, apply your changes and try to update again -with optimistic locking. +So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, apply your changes, +and try to update again with optimistic locking. So why optimistic locking? (A bit simplified explanation) Note that if we do not update the resource with optimistic locking, it can happen that -another party does an update on resource just before we do. The informer receives the event from another party's update, -if we compare resource versions with this resource and previously cached resource (we used to do our update), -that would be different, and in general there is no elegant way to determine in general if this new version that -informer receives an event for is from an update that happened before or after our update. +another party does an update on the resource just before we do. The informer receives the event from another party's update, +if we would compare resource versions with this resource and the previously cached resource (response from our update), +that would be different, and in general there is no elegant way to determine if this new version that +informer receives an event from an update that happened before or after our update. (Note that informers watch can lose connection and other edge cases) -If we do an update with optimistic locking it simplifies the situation, we can easily have strong guarantees. -Since we know if the update with optimistic locking is successful, we had the fresh resource in our cache. 
+If we do an update with optimistic locking, it simplifies the situation, we can easily have strong guarantees. +Since we know if the update with optimistic locking is successful, we have the fresh resource in our cache. Thus, the next event we receive will be the one that is the result of our update or a newer one. -So if we cache the resource in the overlay cache we know that with the next event, we can remove it from there. -If the update with optimistic locking fails, we can wait until the informer's cache is populated with next resource -version and retry. +So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it from there. + From cbb3315547063f9245a5c962f9a8ea7ab65a1ac1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 09:23:01 +0200 Subject: [PATCH 07/17] mermaid and improvement MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- .../blog/news/primary-cache-for-next-recon.md | 50 ++++++++++++------- 1 file changed, 33 insertions(+), 17 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index b82449a335..867c8e3028 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -9,13 +9,13 @@ We recently released v5.1 of Java Operator SDK. One of the highlights of this re [allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values ) in Kubernetes. 
-To sum up the problem, for example, if we create a resource in our controller that has a generated identifier - -in other words, the new resource cannot be addressed only by using the values from the `.spec` of the resource - -we have to store it, commonly in the `.status` of the custom resource. However, operator frameworks cache resources +To sum up the problem, for example, if we create a resource in our controller that has a generated ID - +in other words, we cannot address the resource only by using the values from the `.spec` - +we have to store it, usually in the `.status` of the custom resource. However, operator frameworks cache resources using informers, so the update that you made to the status of the custom resource will just eventually get into -the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that you will -see the stale custom resource in the cache (in another word, the cache is eventually consistent). This is a problem -since you might not know at that point that the desired resources were already created, so it might happen that you try to +the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that we will +see the stale custom resource in the cache (in another word, the cache is eventually consistent) without the generated ID in the status. +This is a problem since we might not know at that point that the desired resources were already created, so it might happen that you try to create them again. Java Operator SDK now out of the box provides a utility class [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) @@ -41,15 +41,16 @@ if you use it, the framework guarantees that the next reconciliation will always } ``` -This utility class will do the magic, but how does it work? 
There are multiple ways to solve this problem, -but ultimately, we only provided the mentioned approach. (If you want to dig deep, see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). +This utility class will do the magic for you. But how does it work? +There are multiple ways to solve this problem, +but ultimately, we only provided the solution below. (If you want to dig deep in alternatives, see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). -The trick is to cache the resource of our update in an additional cache on top of the informer's cache. -If we read the resource, we first check if it is in the overlay cache and only read it from the Informers cache -if not present there. If the informer receives an event with that resource, we always remove the resource from the overlay -cache. But this **works only** if the update is done **with optimistic locking**. -So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, apply your changes, -and try to update again with optimistic locking. +The trick is to cache the resource from the response of our update in an additional cache on top of the informer's cache. +If we read the resource, we first check if it is in the overlay cache and read it from there if present, otherwise read it from the cache of the informer. +If the informer receives an event with that resource, we always remove the resource from the overlay +cache, since that is a more recent resource. But this **works only** if the update is done **with optimistic locking**. +So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, +and try to update again using the new resource (applied our changes again) with optimistic locking. So why optimistic locking? 
(A bit simplified explanation) Note that if we do not update the resource with optimistic locking, it can happen that another party does an update on the resource just before we do. The informer receives the event from another party's update, @@ -58,8 +59,23 @@ that would be different, and in general there is no elegant way to determine if informer receives an event from an update that happened before or after our update. (Note that informers watch can lose connection and other edge cases) -If we do an update with optimistic locking, it simplifies the situation, we can easily have strong guarantees. +```mermaid +flowchart TD + A["Update Resource with Lock"] --> B{"Is Successful"} + B -- Fails on conflict --> D["Poll the Informer until resource changes"] + D --> n4["Apply desired changes on the resource"] + B -- Yes --> n2["Still the original resource in informer cache"] + n2 -- Yes --> C["Cache the resource in overlay cache"] + n2 -- NO --> n3["We know there is already a fesh resource in Informer"] + n4 --> A + +``` + +If we do our update with optimistic locking, it simplifies the situation, we can easily have strong guarantees. Since we know if the update with optimistic locking is successful, we have the fresh resource in our cache. -Thus, the next event we receive will be the one that is the result of our update or a newer one. +Thus, the next event we receive will be the one that is the result of our update +(or a newer one if somebody did an update after, but that is fine since it will contain our allocated values). So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it from there. - +Note that we store the result in overlay cache only if at that time we still have the original resource in cache, +if the cache already updated. This means that we already received a new event after our update, +so we have a fresh resource in the informer cache. 
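Reviewer note on PATCH 07/17: the flowchart and the paragraph after it describe when the update response goes into the overlay cache and when it is evicted. The sketch below models only that bookkeeping in plain Java, under stated assumptions: `OverlayCacheSketch`, `Resource`, and all method names are hypothetical, both caches are simple maps, and nothing here is taken from the actual `PrimaryUpdateAndCacheUtils` code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, dependency-free model of the overlay cache; not the JOSDK implementation. */
class OverlayCacheSketch {

  /** Minimal stand-in for a custom resource. */
  static final class Resource {
    final String name;
    final String resourceVersion;
    final String status;

    Resource(String name, String resourceVersion, String status) {
      this.name = name;
      this.resourceVersion = resourceVersion;
      this.status = status;
    }
  }

  private final Map<String, Resource> informerCache = new HashMap<>();
  private final Map<String, Resource> overlayCache = new HashMap<>();

  /** Reads prefer the overlay cache and fall back to the informer cache (null if unknown). */
  Resource get(String name) {
    Resource overlaid = overlayCache.get(name);
    return overlaid != null ? overlaid : informerCache.get(name);
  }

  boolean isOverlaid(String name) {
    return overlayCache.containsKey(name);
  }

  /** An informer event is assumed to be at least as new as the overlay entry, so evict it. */
  void onInformerEvent(Resource fromEvent) {
    informerCache.put(fromEvent.name, fromEvent);
    overlayCache.remove(fromEvent.name);
  }

  /**
   * Called with the response of a successful update done with optimistic locking.
   * The response is overlaid only if the informer still holds the version we updated from.
   */
  void onSuccessfulUpdate(Resource original, Resource response) {
    Resource cached = informerCache.get(original.name);
    if (cached != null && cached.resourceVersion.equals(original.resourceVersion)) {
      overlayCache.put(response.name, response);
    }
    // Otherwise the informer already received a newer event, so no overlay entry is needed.
  }
}
```

Resource versions are only compared for equality here, never ordered, which is consistent with the post's point that the guarantee comes from the successful optimistic-locking update rather than from comparing versions.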
From 93a4a8d4ec228f4903114df25d2dbd78438d24a7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 09:24:35 +0200 Subject: [PATCH 08/17] date MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index 867c8e3028..1d2f66580f 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -1,6 +1,6 @@ --- title: Custom resource change guarantees for next reconciliation -date: 2025-05-21 +date: 2025-05-22 author: >- [Attila Mészáros](https://github.com/csviri) --- From 6bb447391ee3820ed958b3a2f5d9e7464e104201 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 09:31:24 +0200 Subject: [PATCH 09/17] title MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index 1d2f66580f..fb34e35fde 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -1,5 +1,5 @@ --- -title: Custom resource change guarantees for next reconciliation +title: How to guarantee allocated values for reconciliation date: 2025-05-22 author: >- [Attila Mészáros](https://github.com/csviri) From c9f92f0f27feb76e4c8c723ae90c01bb3500490e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 09:41:54 +0200 Subject: 
[PATCH 10/17] docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index fb34e35fde..ad66624d80 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -1,5 +1,5 @@ --- -title: How to guarantee allocated values for reconciliation +title: How to guarantee allocated values for next reconciliation date: 2025-05-22 author: >- [Attila Mészáros](https://github.com/csviri) From 7d2588e60ebb14baf9fea09c582d9626b227113e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 10:18:35 +0200 Subject: [PATCH 11/17] improve MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index ad66624d80..b47aad0a39 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -59,11 +59,12 @@ that would be different, and in general there is no elegant way to determine if informer receives an event from an update that happened before or after our update. 
(Note that informers watch can lose connection and other edge cases) + ```mermaid flowchart TD A["Update Resource with Lock"] --> B{"Is Successful"} - B -- Fails on conflict --> D["Poll the Informer until resource changes"] - D --> n4["Apply desired changes on the resource"] + B -- Fails on conflict --> D["Poll the Informer cache until resource changes"] + D --> n4["Apply desired changes on the resource again"] B -- Yes --> n2["Still the original resource in informer cache"] n2 -- Yes --> C["Cache the resource in overlay cache"] n2 -- NO --> n3["We know there is already a fesh resource in Informer"] @@ -71,6 +72,7 @@ flowchart TD ``` + If we do our update with optimistic locking, it simplifies the situation, we can easily have strong guarantees. Since we know if the update with optimistic locking is successful, we have the fresh resource in our cache. Thus, the next event we receive will be the one that is the result of our update From 38b0ea75401be28e082149f6cdf523759d3a8c5d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 12:39:13 +0200 Subject: [PATCH 12/17] wording MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index b47aad0a39..eb3971b40c 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -9,9 +9,9 @@ We recently released v5.1 of Java Operator SDK. One of the highlights of this re [allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values ) in Kubernetes. 
-To sum up the problem, for example, if we create a resource in our controller that has a generated ID - +To sum up the problem, let's say if we create a resource from our controller that has a generated ID - in other words, we cannot address the resource only by using the values from the `.spec` - -we have to store it, usually in the `.status` of the custom resource. However, operator frameworks cache resources +we have to store this ID, usually in the `.status` of the custom resource. However, operator frameworks cache resources using informers, so the update that you made to the status of the custom resource will just eventually get into the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that we will see the stale custom resource in the cache (in another word, the cache is eventually consistent) without the generated ID in the status. From a69f6f42be4587e269a23f070ed16ec09a2bdf3f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Thu, 22 May 2025 12:45:57 +0200 Subject: [PATCH 13/17] improve MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- .../blog/news/primary-cache-for-next-recon.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index eb3971b40c..db1e734656 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -7,11 +7,12 @@ author: >- We recently released v5.1 of Java Operator SDK. One of the highlights of this release is related to a topic of so-called [allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values -) in Kubernetes. +). 
-To sum up the problem, let's say if we create a resource from our controller that has a generated ID - -in other words, we cannot address the resource only by using the values from the `.spec` - -we have to store this ID, usually in the `.status` of the custom resource. However, operator frameworks cache resources +To describe the problem, let's say if we create a resource from our controller that has a generated ID +we have to store this ID, usually in the `.status` of the custom resource. +(In other words, we cannot address the resource only by using the values from the `.spec`.) +However, operator frameworks cache resources using informers, so the update that you made to the status of the custom resource will just eventually get into the cache of the informer. If meanwhile some other event triggers the reconciliation, it can happen that we will see the stale custom resource in the cache (in another word, the cache is eventually consistent) without the generated ID in the status. @@ -47,7 +48,7 @@ but ultimately, we only provided the solution below. (If you want to dig deep in The trick is to cache the resource from the response of our update in an additional cache on top of the informer's cache. If we read the resource, we first check if it is in the overlay cache and read it from there if present, otherwise read it from the cache of the informer. -If the informer receives an event with that resource, we always remove the resource from the overlay +If the informer receives an event with a fresh resource, we always remove the resource from the overlay cache, since that is a more recent resource. But this **works only** if the update is done **with optimistic locking**. So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, and try to update again using the new resource (applied our changes again) with optimistic locking. 
@@ -57,7 +58,7 @@ another party does an update on the resource just before we do. The informer rec if we would compare resource versions with this resource and the previously cached resource (response from our update), that would be different, and in general there is no elegant way to determine if this new version that informer receives an event from an update that happened before or after our update. -(Note that informers watch can lose connection and other edge cases) +(Note that informer's watch can lose connection and other edge cases) ```mermaid @@ -72,12 +73,11 @@ flowchart TD ``` - -If we do our update with optimistic locking, it simplifies the situation, we can easily have strong guarantees. -Since we know if the update with optimistic locking is successful, we have the fresh resource in our cache. +If we do our update with optimistic locking, it simplifies the situation, and we can have strong guarantees. +Since we know if the update with optimistic locking is successful, we have the up-to-date resource in our caches. Thus, the next event we receive will be the one that is the result of our update (or a newer one if somebody did an update after, but that is fine since it will contain our allocated values). So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it from there. -Note that we store the result in overlay cache only if at that time we still have the original resource in cache, -if the cache already updated. This means that we already received a new event after our update, -so we have a fresh resource in the informer cache. +Note that we store the result in overlay cache only if at that time we still have the original resource in cache. +If the cache already updated, that means that we already received a new event after our update, +so we have a fresh resource in the informer cache already. 
From 1ff9952bea244b56391be5f19d8b6c30ad2d66d2 Mon Sep 17 00:00:00 2001 From: Chris Laprun Date: Fri, 23 May 2025 12:09:23 +0200 Subject: [PATCH 14/17] docs: start improving wording Signed-off-by: Chris Laprun --- .../blog/news/primary-cache-for-next-recon.md | 94 +++++++++++-------- 1 file changed, 53 insertions(+), 41 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index db1e734656..375a6999ca 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -1,66 +1,77 @@ --- -title: How to guarantee allocated values for next reconciliation +title: How to guarantee allocated values for next reconciliation date: 2025-05-22 author: >- - [Attila Mészáros](https://github.com/csviri) + [Attila Mészáros](https://github.com/csviri) --- -We recently released v5.1 of Java Operator SDK. One of the highlights of this release is related to a topic of so-called +We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release is related to a topic of +so-called [allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values ). -To describe the problem, let's say if we create a resource from our controller that has a generated ID -we have to store this ID, usually in the `.status` of the custom resource. -(In other words, we cannot address the resource only by using the values from the `.spec`.) -However, operator frameworks cache resources -using informers, so the update that you made to the status of the custom resource will just eventually get into -the cache of the informer. 
If meanwhile some other event triggers the reconciliation, it can happen that we will -see the stale custom resource in the cache (in another word, the cache is eventually consistent) without the generated ID in the status. -This is a problem since we might not know at that point that the desired resources were already created, so it might happen that you try to -create them again. +To describe the problem, let's say that our controller needs to create a resource that has a generated identifier, i.e. +a resource whose identifier cannot be directly derived from the custom resource's desired state as specified in its +`spec` field. In order to record the fact that the resource was successfully created, and to avoid attempting to +recreate the resource in subsequent reconciliations, it is typical for this type of controller to store the +generated identifier in the custom resource's `status` field. -Java Operator SDK now out of the box provides a utility class [`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) -if you use it, the framework guarantees that the next reconciliation will always receive the updated resource: +The Java Operator SDK relies on the informers' cache to retrieve resources. These caches, however, are only guaranteed +to be eventually consistent. If some other event triggers a new reconciliation **before** the update made to our +resource status has had the chance to be propagated first to the cluster and then back to the informer cache, the +resource in the informer cache will **not** contain the latest +version as modified by the reconciler.
This would result in a new reconciliation where the generated identifier would be +missing from the resource status and, therefore, another attempt to create the resource by the reconciler, which is not +what we'd like. + +Java Operator SDK now provides a utility class [ +`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) +to handle this particular use case. Using that overlay cache, your reconciler is guaranteed to see the most up-to-date +version of the resource on the next reconciliation: ```java - @Override - public UpdateControl reconcile( - StatusPatchCacheCustomResource resource, - Context context) { - + +@Override +public UpdateControl reconcile( + StatusPatchCacheCustomResource resource, + Context context) { + // omitted code - + var freshCopy = createFreshCopy(resource); // need fresh copy just because we use the SSA version of update freshCopy - .getStatus() - .setValue(statusWithAllocatedValue()); + .getStatus() + .setValue(statusWithAllocatedValue()); // using the utility instead of update control var updated = - PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context); + PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context); return UpdateControl.noUpdate(); - } +} ``` -This utility class will do the magic for you. But how does it work? -There are multiple ways to solve this problem, -but ultimately, we only provided the solution below. (If you want to dig deep in alternatives, see this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). +How does `PrimaryUpdateAndCacheUtils` work? +There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. 
(If you +want to dig deep in alternatives, see +this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). -The trick is to cache the resource from the response of our update in an additional cache on top of the informer's cache. -If we read the resource, we first check if it is in the overlay cache and read it from there if present, otherwise read it from the cache of the informer. -If the informer receives an event with a fresh resource, we always remove the resource from the overlay +The trick is to cache the resource from the response of our update in an additional cache on top of the informer's +cache. If we read the resource, we first check if it is in the overlay cache and read it from there if present, +otherwise read it from the cache of the informer. If the informer receives an event with a fresh resource, we always +remove the resource from the overlay cache, since that is a more recent resource. But this **works only** if the update is done **with optimistic locking**. So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, and try to update again using the new resource (applied our changes again) with optimistic locking. -So why optimistic locking? (A bit simplified explanation) Note that if we do not update the resource with optimistic locking, it can happen that -another party does an update on the resource just before we do. The informer receives the event from another party's update, -if we would compare resource versions with this resource and the previously cached resource (response from our update), -that would be different, and in general there is no elegant way to determine if this new version that -informer receives an event from an update that happened before or after our update. +So why optimistic locking? 
(A bit simplified explanation) Note that if we do not update the resource with optimistic +locking, it can happen that +another party does an update on the resource just before we do. The informer receives the event from another party's +update, +if we would compare resource versions with this resource and the previously cached resource (response from our update), +that would be different, and in general there is no elegant way to determine if this new version that +informer receives an event from an update that happened before or after our update. (Note that informer's watch can lose connection and other edge cases) - ```mermaid flowchart TD A["Update Resource with Lock"] --> B{"Is Successful"} @@ -74,10 +85,11 @@ flowchart TD ``` If we do our update with optimistic locking, it simplifies the situation, and we can have strong guarantees. -Since we know if the update with optimistic locking is successful, we have the up-to-date resource in our caches. -Thus, the next event we receive will be the one that is the result of our update -(or a newer one if somebody did an update after, but that is fine since it will contain our allocated values). -So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it from there. +Since we know if the update with optimistic locking is successful, we have the up-to-date resource in our caches. +Thus, the next event we receive will be the one that is the result of our update +(or a newer one if somebody did an update after, but that is fine since it will contain our allocated values). +So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it +from there. Note that we store the result in overlay cache only if at that time we still have the original resource in cache. 
-If the cache already updated, that means that we already received a new event after our update, +If the cache already updated, that means that we already received a new event after our update, so we have a fresh resource in the informer cache already. From 7a2d569806a25da3e61cfa0fcf8c5c490960bce0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Fri, 23 May 2025 15:13:58 +0200 Subject: [PATCH 15/17] wording MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index 375a6999ca..e6987a9d9a 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -59,7 +59,7 @@ The trick is to cache the resource from the response of our update in an additio cache. If we read the resource, we first check if it is in the overlay cache and read it from there if present, otherwise read it from the cache of the informer. If the informer receives an event with a fresh resource, we always remove the resource from the overlay -cache, since that is a more recent resource. But this **works only** if the update is done **with optimistic locking**. +cache, since that is a more recent resource. But this **works only** if our request update is using **with optimistic locking**. So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, and try to update again using the new resource (applied our changes again) with optimistic locking. 
From 39bba6fd2803e67f5e9925062eab4cc116603709 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Attila=20M=C3=A9sz=C3=A1ros?= Date: Fri, 23 May 2025 17:38:00 +0200 Subject: [PATCH 16/17] comment improve MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Attila Mészáros --- docs/content/en/blog/news/primary-cache-for-next-recon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index e6987a9d9a..a8ffab1cd5 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -43,7 +43,7 @@ public UpdateControl reconcile( .getStatus() .setValue(statusWithAllocatedValue()); - // using the utility instead of update control + // using the utility instead of update control to patch the resource status var updated = PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context); return UpdateControl.noUpdate(); From c98d5721427a947d24345703eea43aed050670a7 Mon Sep 17 00:00:00 2001 From: Chris Laprun Date: Fri, 23 May 2025 20:00:04 +0200 Subject: [PATCH 17/17] docs: improve Signed-off-by: Chris Laprun --- .../blog/news/primary-cache-for-next-recon.md | 70 ++++++++++--------- 1 file changed, 36 insertions(+), 34 deletions(-) diff --git a/docs/content/en/blog/news/primary-cache-for-next-recon.md b/docs/content/en/blog/news/primary-cache-for-next-recon.md index a8ffab1cd5..8f22ddb7b6 100644 --- a/docs/content/en/blog/news/primary-cache-for-next-recon.md +++ b/docs/content/en/blog/news/primary-cache-for-next-recon.md @@ -51,45 +51,47 @@ public UpdateControl reconcile( ``` How does `PrimaryUpdateAndCacheUtils` work? -There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. 
(If you +There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. If you want to dig deep in alternatives, see -this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files)). - -The trick is to cache the resource from the response of our update in an additional cache on top of the informer's -cache. If we read the resource, we first check if it is in the overlay cache and read it from there if present, -otherwise read it from the cache of the informer. If the informer receives an event with a fresh resource, we always -remove the resource from the overlay -cache, since that is a more recent resource. But this **works only** if our request update is using **with optimistic locking**. -So if the update fails on conflict, we simply wait and poll the informer cache until there is a new resource version, -and try to update again using the new resource (applied our changes again) with optimistic locking. - -So why optimistic locking? (A bit simplified explanation) Note that if we do not update the resource with optimistic -locking, it can happen that -another party does an update on the resource just before we do. The informer receives the event from another party's -update, -if we would compare resource versions with this resource and the previously cached resource (response from our update), -that would be different, and in general there is no elegant way to determine if this new version that -informer receives an event from an update that happened before or after our update. -(Note that informer's watch can lose connection and other edge cases) +this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files). + +The trick is to intercept the resource that the reconciler updated and cache that version in an additional cache on top +of the informer's cache. 
Subsequently, if the reconciler needs to read the resource, the SDK will first check if it is +in the overlay cache and read it from there if present, otherwise read it from the informer's cache. If the informer +receives an event with a fresh resource, we always remove the resource from the overlay cache, since that is a more +recent resource. But this **works only** if the reconciler updates the resource **with optimistic locking**, which +is handled for you by `PrimaryUpdateAndCacheUtils` provided you pass it a "fresh" (i.e. a version of the resource that +only contains the fields you care about being updated) copy of the resource since Server-Side Apply will be used +underneath. If the update fails on conflict, because the resource has already been updated on the cluster before we got +the chance to get our update in, we simply wait and poll the informer cache until the new resource version from the +server appears in the informer's cache, +and then try to apply our updates to the resource again using the updated version from the server, again with optimistic +locking. + +So why is optimistic locking required? We hinted at it above, but the gist of it is that if another party updates the +resource before we get a chance to, we wouldn't be able to handle the resulting situation correctly in all +cases. The informer would receive that new event before our own update would get a chance to propagate. Without +optimistic locking, there wouldn't be a fail-proof way to determine which update should prevail (i.e. which occurred +first), in particular in the event of the informer losing the connection to the cluster or other edge cases (the joys of +distributed computing!). + +Optimistic locking simplifies the situation and provides us with stronger guarantees: if the update succeeds, then we +can be sure we have the proper resource version in our caches. The next event will contain our update in all cases.
+Because we know that, we can also be sure that we can evict the cached resource in the overlay cache whenever we receive +a new event. The overlay cache is only used if the SDK detects that the original resource (i.e. the one before we +applied our status update in the example above) is still in the informer's cache. + +The following diagram sums up the process: ```mermaid flowchart TD - A["Update Resource with Lock"] --> B{"Is Successful"} - B -- Fails on conflict --> D["Poll the Informer cache until resource changes"] - D --> n4["Apply desired changes on the resource again"] - B -- Yes --> n2["Still the original resource in informer cache"] + A["Update Resource with Lock"] --> B{"Successful?"} + B -- Fails on conflict --> D{"Resource updated from cluster?"} + D -- " No: poll until updated " --> D + D -- Yes --> n4["Apply desired changes on the resource again"] + B -- Yes --> n2{"Original resource still in informer cache?"} n2 -- Yes --> C["Cache the resource in overlay cache"] - n2 -- NO --> n3["We know there is already a fesh resource in Informer"] + n2 -- No --> n3["Informer cache already contains up-to-date version, do not use overlay cache"] n4 --> A ``` - -If we do our update with optimistic locking, it simplifies the situation, and we can have strong guarantees. -Since we know if the update with optimistic locking is successful, we have the up-to-date resource in our caches. -Thus, the next event we receive will be the one that is the result of our update -(or a newer one if somebody did an update after, but that is fine since it will contain our allocated values). -So if we cache the resource in the overlay cache from the response, we know that with the next event, we can remove it -from there. -Note that we store the result in overlay cache only if at that time we still have the original resource in cache. 
-If the cache already updated, that means that we already received a new event after our update, -so we have a fresh resource in the informer cache already.
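
Note on the overlay cache described in the final version of the post: the sketch below illustrates the idea in plain Java. It is not the JOSDK implementation. `OverlayCacheSketch`, `Resource`, and all method names are hypothetical, and the resource version is modeled as a `long` for brevity. It shows the three rules from the text: reads prefer the overlay, any informer event evicts the overlay entry, and the response of a successful optimistically locked update is only stored while the informer still holds the original resource.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in JOSDK.
final class OverlayCacheSketch {

  /** Simplified stand-in for a custom resource (assumption: version is a plain number). */
  public record Resource(String name, long version, String allocatedValue) {}

  private final Map<String, Resource> informerCache = new HashMap<>();
  private final Map<String, Resource> overlayCache = new HashMap<>();

  /** Reads check the overlay cache first and fall back to the informer cache. */
  public synchronized Optional<Resource> get(String name) {
    Resource overlaid = overlayCache.get(name);
    return Optional.ofNullable(overlaid != null ? overlaid : informerCache.get(name));
  }

  /** An informer event carries a resource at least as recent as our update, so evict. */
  public synchronized void onInformerEvent(Resource fresh) {
    informerCache.put(fresh.name(), fresh);
    overlayCache.remove(fresh.name());
  }

  /**
   * Called with the response of a successful update done with optimistic locking.
   * The response is stored only if the informer cache still holds the original resource;
   * otherwise the informer already has an up-to-date version and the overlay is skipped.
   */
  public synchronized void cacheUpdateResponse(Resource original, Resource updated) {
    Resource cached = informerCache.get(original.name());
    if (cached != null && cached.version() == original.version()) {
      overlayCache.put(updated.name(), updated);
    }
  }

  public static void main(String[] args) {
    OverlayCacheSketch cache = new OverlayCacheSketch();
    Resource v1 = new Resource("my-resource", 1, null);
    cache.onInformerEvent(v1);

    // Our status update succeeded; the informer has not seen it yet.
    Resource v2 = new Resource("my-resource", 2, "generated-id");
    cache.cacheUpdateResponse(v1, v2);
    System.out.println(cache.get("my-resource"));
  }
}
```

Because the update is done with optimistic locking, a successful response implies nobody else modified the resource in between, which is why evicting on the very next informer event is safe in this sketch.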