Skip to content

Fix: Improper read timeout handling. #621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

konrad-kaminski
Copy link

Fix for JAVA-3920.

@stIncMale
Copy link
Member

stIncMale commented Jan 14, 2021

@konrad-kaminski
I spent some time thinking whether there are other approaches, and how are they compared to yours. I could think of only two possible approaches (they both involve introducing something like a TimeoutHandle to represent either a scheduled timeout, or a submitted task to schedule a timeout):

  1. [The one proposed in this PR, let's call it "uncooperative"] When the event loop thread notices in handleReadResponse that a user thread submitted a task for scheduling the timeout, it ignores it, schedules its own timeout and remembers to cancel the user timeout handle when/if the read completes successfully.
  2. ["cooperative"] When the event loop thread notices in handleReadResponse that a user thread submitted a task for scheduling the timeout, it may help completing that task right away (schedule the timeout requested by the user thread) or may leave it as is, but it does not schedule its own timeout (it also does not schedule a timeout if there is one already scheduled).

Here are the differences I can see:

  1. [performance concern] With the cooperative approach there may not exist concurrently scheduled timeouts, while the uncooperative approach allows unlimited number of scheduled timeouts (each scheduled timeout is an exception-throwing ReadTimeoutTask scheduled in the event loop). For example, if a user thread requests 100 bytes, and they arrive one by one, then the event loop thread wraps the original handler 100 times in the TimeoutCancellingHandler, and in addition to that schedules a timeout 100 times (provided that the first timeout task does not fire while all this happens).
  2. [broken functionality] The uncooperative approach may override the additionalTimeout requested by the user thread via the method NettyStream.read(final int numBytes, final int additionalTimeout). This is because the event loop thread may additionally schedule a timeout with the delay ReadTimeoutHandler.readTimeout, which is shorter than the originally requested delay additionalTimeout + ReadTimeoutHandler.readTimeout. Such a timeout scheduled by the event loop thread fires earlier than the one scheduled by the user thread, and the read operation fails prematurely.

I will be working on a PR that uses the cooperative approach (likely without the part where an event loop thread helps completing the task submitted by a user thread, because this makes things even more nontrivial without resulting in any substantial benefits).

@konrad-kaminski this problem does not seem to be simple, so I may be incorrect somewhere. I want you to know that I am open to discussing it, and highly appreciate what you have already contributed.

P.S. While thinking about this PR, I realized that we don't need ReadTimeoutHandler at all. The logic in it is needed, obviously, but it does not need to be a channel handler. We do not override or call any of the ChannelInboundHandlermethods, everything we do in ReadTimeoutHandler is unrelated to the channel pipeline.

@konrad-kaminski
Copy link
Author

I changed my PR to reflect the changes I think can be done to address the things you mentioned. I reuse PendingReader since my assumption is that only a user thread can create one as a result of readAsync. And it is actually the only place where a timeout should be scheduled - all other operations in readAsync are non-blocking, so we can safely do it just before creating PendingReader.

I got rid of ReadTimeoutHandlerand TimeoutHandle as a simple Future seems to work fine. I may have missed something though, so feel free to point it out.

@stIncMale
Copy link
Member

stIncMale commented Jan 15, 2021

it is actually the only place where a timeout should be scheduled - all other operations in readAsync are non-blocking

I'd rather say that all other operations in readAsync are not IO operations, because entering a synchronized block is very much blocking. My best guess is that originally timeout scheduling was put before entering the synchnonized block specifically for this reason: as scheduling a timeout from within the syncrhnorized means that it may be further delayed because the event loop thread acquired the monitor and was descheduled. However, this is arguably a weak reason as it simply may (but not necessary) further reduce the accuracy of a timeout, which is naturally inaccurate anyway and even more so as it is scheduled in the event loop (the event loop scheduling is likely less accurate than those in JDK implementations1).

Anyway, I tried to preserve the code in NettyStream as much as possible, and refactored ReadTimeoutHandler so that it is actually a handler that reacts to Netty events. The result is Fix the read timeout implementation in NettyStream #635.


1 The section "15.3.3 Scheduling implementation internals" in the "Netty in action" book:

The actual implementation in Netty is based on the paper "Hashed and hierarchical timing
wheels: Data structures to efficiently implement timer facility" by George Varghese. This kind
of implementation only guarantees an approximated execution, which means that the
execution of the task may not be 100% exactly on time. This has proven to be a tolerable
limitation in practice and does not affect most applications at all. Its just something to
remember if you need to schedule tasks, because it may not be the perfect fit if you need
100% exact, on-time execution.

To better understand how this works, think of it this way:

  1. You schedule a task with a given delay.
  2. The task gets inserted into the Schedule-Task-Queue of the EventLoop.
  3. The EventLoop checks on every run if tasks need to get executed now.
  4. If theres a task, it will execute it right away and remove it from the queue.
  5. It waits again for the next run and starts over again with step 4.

Because of this implementation the scheduled execution may be not 100 % acurate. Which
means it may not be 100 % exact, this is fine for most use-cases given that it allows for almost
no overhead within Netty.

But what if you need more accurate execution? Its easy. Youll need to use another
implementation of ScheduledExecutorService thats not part of Netty. Just remember that
if you dont follow Nettys thread model protocol, youll need to synchronize the concurrent
access on your own. Do this only if you must.

The "100% exact" wording is weird because no scheduling in non-realtime systems is accurate, but the author clearly suggests that there are more accurate implementations.

@konrad-kaminski
Copy link
Author

I've taken a look at your PR and even made a PR for it. This actually addresses only one thing, namely it does not schedule a timeout when it's not needed.

As you stated the code in synchronized blocks is IO non-blocking and at the same time it doesn't do anything which would take a long time. Timeouts are provided in milliseconds - I doubt that the code in any of the synchronized (in readAsync and handleReadResponse) blocks will take more than 1 millisecond. Having that in mind and approaching it from engineering perspective (rather than a theoretical one) I think that scheduling a timeout after the synchronized block (instead of before it) is a better choice - this way we can do it only if we have to. I've no idea how hard it is for the executor to schedule and then remove a timeout, but I presume it's not an easy feat. By doing it only when we have to we take a little (or maybe a lot?) burden off the executor.

As a side note: I see that your style of code differs from the one that is prevalent in mongo-driver. Examples of the differences are e.g. usage of early returns (IMHO unnecessary e.g. in ReadTimeoutHandler.fireTimeoutException or inconsistent e.g. ReadTimeoutHandler.unsynchronizedCancel or NettyStream.adjustTimeout) or adding asserts (there is only a handful of them in the entire code). This may make the code better or worse (depending on personal preferences), but it certainly makes it inconsistent. Out of curiosity - will this be a general trend in your (as in mongo-driver team) code?

@stIncMale
Copy link
Member

Hi @konrad-kaminski. We closed PR#635, here is the details. You probably will be pleased to see that all the improvements you suggested are implemented in the new PR#636.

@stIncMale
Copy link
Member

I see that your style of code differs from the one that is prevalent in mongo-driver.
...
Out of curiosity - will this be a general trend in your (as in mongo-driver team) code?

Thank you for pointing this out. For now I am refraining from using the early return pattern and from using assert statements. The team will discuss the possibility of using those.

@stIncMale
Copy link
Member

Fixed in #636.

@stIncMale stIncMale closed this Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants