-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[URLSession](SR-4567) Fix for a race condition in URLSession #949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[URLSession](SR-4567) Fix for a race condition in URLSession #949
Conversation
7432349
to
78c2d09
Compare
78c2d09
to
f0a9fe4
Compare
f0a9fe4
to
9351942
Compare
@swift-ci please test |
@naithar Can you please review this patch? It works around a problem that occurs due to unique race condition in the Foundation-libcurl-libdispatch interplay. I can provide more information if needed. |
LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks like it will gate the register and unregister commands but it is a bit complex passing around a semaphore; I would worry about a deadlock on the serial queue, and the semaphore itself I am not fully certain is a safe assignment per the ownership. It might be worthwhile to run this through TSAN.
Unmanaged<_SocketSources>.fromOpaque(opaque).release() | ||
commandQueue.async { | ||
self.cancelInProgress?.wait() | ||
self.cancelInProgress = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the assignment here is perhaps not thread safe since the cancel handler gets a reference to it captured in the tearDown method; is it possible to instead have the semaphore last as long as the containing object context that is retained past the lifespan of the execution on the queue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nil assignment is guarded by the wait()
and only the cancel handler will signal()
- which means the nil assignment is executed only when the cancel handler no longer needs the semaphore. The assignment to nil
signifies that there is no cancel operation in progress which allows the register commands go through without waiting, until the next unregister. Increasing the scope of the semaphore will lead to other register
commands (may be on unrelated sockets) to wait
with perhaps nobody to signal
them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are three parts to the state of that semaphore; nil, not nil, and the signal. It doesn't look like a huge race potential but there is still something that strikes me as a potential failure here ( hard to tell without more understanding of exactly the involved queues/threads) This isn't a blocking objection - just a word of caution (and perhaps a suggestion to add a comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add a comment around this code.
if let opaque = socketSourcePtr { | ||
Unmanaged<_SocketSources>.fromOpaque(opaque).release() | ||
commandQueue.async { | ||
self.cancelInProgress?.wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how long could cancels take? this could back up this queue indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phausler for the review
In my experience, I haven't seen DispatchSource.cancel()
blocking for a long time. Moreover, the intention here is to block the command queue until a cancel()
goes through completely, of course assuming that cancel()
will not block. I have considered the fact that the cancel handler runs on the same queue as the event handler. The event handler will do two things:
- read/write from/to sockets that are ready - this will not block
- issue further register/unregister commands - this is async deferred to the command queue
So, in my understanding the event handler will not block the cancel handler - which means thewait()
here will not block indefinitely.
@dgrove-oss Can DispatchSource.cancel()
can block under any other circumstances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only work that falls out is just DispatchSource.cancel()
this is a non issue; it effectively is an atomic assign of a value.
Thanks @phausler - I have responded to your comments. I'll also work on running this through TSAN. |
any TSAN reports probably can be followed up in a following PR. |
This reverts commit 015ded4.
This race condition is a result of a unique sequence of events in the Foundation-libcurl-libdispatch interplay. Details of the issue can found here. This PR proposes a way to fix that race condition. This wouldn't have been possible without @dgrove-oss 's work on this issue, from the Dispatch angle, and his help on narrowing down on the actual problem at hand.
Short description of the issue
The URLSession implementation in swift-corelibs-foundation uses libcurl for HTTP support and libdispatch as the events library. There is a unique way in which Foundation, Dispatch and libcurl function and they interplay can lead to race conditions which could cause a hang or URLSessionTasks timing out with no response. I try to explain this problem in words here.
Whenever libcurl opens a connection for Foundation, it notifies the latter about the socket it created and the events Foundation should be expecting on that socket, via a callback. So, after making a POST call, for example, libcurl will tell Foundation that I have assigned socket with fd# X(say) and you should track this socket for the read event - because data must be available to read on it in the near future.
In response, Foundation registers this socket for the notified event with Dispatch. For example, in the above case, Foundation will tell Dispatch that when a read event occurs on fd# X, please run this handler. The handler typically does the socket reading, during which libcurl will again be invoked. Once the read is over and libcurl thinks there's nothing more to be done on socket with fd# X, it tells Foundation to unregister fd# X from libdispatch.
Foundation implements the registering and unregistering of fds with libdispatch using the DispatchSource API. On registering for a read, we create a DispatchReadSource and provide an event handler (which will do the socket read eventually). For unregistering, we would want to cancel the DispatchSource. However, the DispatchSource API clearly mentions that a cancel() operation is completed only when a cancel handler is run. See the documentation of cancel.
Note: we did not install a cancel handler before this PR.
Now, libcurl always tries to reuse connections. So, if one is making a series of API calls to the same endpoint, it is very likely that the same TCP level connection will be reused. There is a provision to have a new connection on every transfer, but that obviously leads to very poor performance.
Given this background, when an app is making API calls, back to back, to the same endpoint, we are going to be reusing the same socket (fd) again and again. Then, it is possible that Foundation get an unregister callback immediately followed by a register callback. The unregister callback leads to DispatchSource.cancel() which returns immediately. But the actual epoll_ctl with EPOLL_CTL_DEL happens asynchronously, on a different worker thread. If, after a DispatchSource.cancel() but before the EPOLL_CTL_DEL, libcurl asks us to register the same file descriptor, the EPOLL_CTL_ADD or EPOLL_CTL_MOD goes missing.
Hence, in the interplay of Foundation, libcurl and an epoll-based Dispatch, a unique race condition can lead to a failure to register sockets, effectively leading to a hang or a timeout.
Proposed solution
We've tried half a dozen solutions to fix the race conditions but many of them led to other race conditions. The best solution we're left with is to synchronize the register and unregister operations. An unregister operation is flagged as "completed" only when the related "cancel handler" runs. Also, the register/unregister code needs to be moved to a queue different from the MultiHandle workqueue to avoid a certain deadlock condition.