OWLS-88571 - Potential fixes for domain startup issues in large k8s cluster when watch events are not delivered. #2305

Merged: 6 commits, Apr 15, 2021

@ankedia commented Apr 8, 2021

Potential fixes for the domain startup issue in the GBU CNE large k8s cluster, where k8s does not reliably deliver watch events.

The first change addresses domain recreation. Currently, when a domain is deleted and recreated using the kubectl apply command, the introspection pod is not started if there are previously running fibers for the domain. In such cases the domain status is null, so this change starts a new fiber whenever the domain status is null.

The second change discards the API client and its associated HTTP client if the client gets into a bad state due to a ProtocolException (possibly also caused by a bug in the okhttp3/okio library), and creates a new client instance in its place.

I am creating this draft PR so these changes can be reviewed and then provided to the GBU CNE team for further testing while we discuss other approaches. We are also discussing a unit testing approach to diagnose this type of issue when watch events are missed and to verify fixes.
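The client-discarding change described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `RecyclingClientHolder`, its nested `Call` interface, and the type parameter `C` (standing in for the API client type) are hypothetical names. The only real API used is `java.net.ProtocolException` from the JDK.

```java
import java.net.ProtocolException;
import java.util.function.Supplier;

// Hypothetical sketch: holds one shared client and replaces it after a ProtocolException.
// None of these names come from the PR itself.
class RecyclingClientHolder<C> {

  /** A request made with the client; it may fail with any exception. */
  interface Call<T, R> {
    R apply(T client) throws Exception;
  }

  private final Supplier<C> factory; // assumed to build a fresh client, including its HTTP client
  private C client;

  RecyclingClientHolder(Supplier<C> factory) {
    this.factory = factory;
  }

  /** Returns the current client, creating one lazily if none exists. */
  synchronized C take() {
    if (client == null) {
      client = factory.get();
    }
    return client;
  }

  /** Drops the given client so that the next take() creates a new instance. */
  synchronized void discard(C bad) {
    if (client == bad) {
      client = null;
    }
  }

  /** Runs the call; if it fails with a ProtocolException, the client is discarded. */
  <R> R execute(Call<C, R> call) throws Exception {
    C current = take();
    try {
      return call.apply(current);
    } catch (Exception e) {
      if (hasProtocolException(e)) {
        discard(current);
      }
      throw e; // the caller still sees the failure and decides whether to retry
    }
  }

  /** Walks the cause chain (bounded, in case of cycles) looking for a ProtocolException. */
  static boolean hasProtocolException(Throwable t) {
    int depth = 0;
    for (Throwable c = t; c != null && depth < 16; c = c.getCause(), depth++) {
      if (c instanceof ProtocolException) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch the failed request is not retried automatically. The holder only guarantees that the next request gets a fresh client, which keeps the retry policy with the caller.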

@ankedia requested review from russgold and rjeberhard April 8, 2021 15:57
@rjeberhard marked this pull request as ready for review April 15, 2021 15:49
@rjeberhard
I plan to merge this, as we worked on it together and reviewed each other's work.

@rjeberhard merged commit 1a7ca60 into main Apr 15, 2021
@ankedia deleted the owls_88571 branch April 16, 2021 23:36