[URLSession](SR-4567) Fix for a race condition in URLSession #949

pushkarnk · 2017-04-12T18:26:57Z

This race condition is a result of a unique sequence of events in the Foundation-libcurl-libdispatch interplay. Details of the issue can found here. This PR proposes a way to fix that race condition. This wouldn't have been possible without @dgrove-oss 's work on this issue, from the Dispatch angle, and his help on narrowing down on the actual problem at hand.

Short description of the issue
The URLSession implementation in swift-corelibs-foundation uses libcurl for HTTP support and libdispatch as the events library. There is a unique way in which Foundation, Dispatch and libcurl function and they interplay can lead to race conditions which could cause a hang or URLSessionTasks timing out with no response. I try to explain this problem in words here.

Whenever libcurl opens a connection for Foundation, it notifies the latter about the socket it created and the events Foundation should be expecting on that socket, via a callback. So, after making a POST call, for example, libcurl will tell Foundation that I have assigned socket with fd# X(say) and you should track this socket for the read event - because data must be available to read on it in the near future.

In response, Foundation registers this socket for the notified event with Dispatch. For example, in the above case, Foundation will tell Dispatch that when a read event occurs on fd# X, please run this handler. The handler typically does the socket reading, during which libcurl will again be invoked. Once the read is over and libcurl thinks there's nothing more to be done on socket with fd# X, it tells Foundation to unregister fd# X from libdispatch.

Foundation implements the registering and unregistering of fds with libdispatch using the DispatchSource API. On registering for a read, we create a DispatchReadSource and provide an event handler (which will do the socket read eventually). For unregistering, we would want to cancel the DispatchSource. However, the DispatchSource API clearly mentions that a cancel() operation is completed only when a cancel handler is run. See the documentation of cancel.

Note: we did not install a cancel handler before this PR.

Now, libcurl always tries to reuse connections. So, if one is making a series of API calls to the same endpoint, it is very likely that the same TCP level connection will be reused. There is a provision to have a new connection on every transfer, but that obviously leads to very poor performance.

Given this background, when an app is making API calls, back to back, to the same endpoint, we are going to be reusing the same socket (fd) again and again. Then, it is possible that Foundation get an unregister callback immediately followed by a register callback. The unregister callback leads to DispatchSource.cancel() which returns immediately. But the actual epoll_ctl with EPOLL_CTL_DEL happens asynchronously, on a different worker thread. If, after a DispatchSource.cancel() but before the EPOLL_CTL_DEL, libcurl asks us to register the same file descriptor, the EPOLL_CTL_ADD or EPOLL_CTL_MOD goes missing.

Hence, in the interplay of Foundation, libcurl and an epoll-based Dispatch, a unique race condition can lead to a failure to register sockets, effectively leading to a hang or a timeout.

Proposed solution
We've tried half a dozen solutions to fix the race conditions but many of them led to other race conditions. The best solution we're left with is to synchronize the register and unregister operations. An unregister operation is flagged as "completed" only when the related "cancel handler" runs. Also, the register/unregister code needs to be moved to a queue different from the MultiHandle workqueue to avoid a certain deadlock condition.

alblue · 2017-04-13T10:34:48Z

@swift-ci please test

pushkarnk · 2017-04-18T09:42:16Z

@naithar Can you please review this patch? It works around a problem that occurs due to unique race condition in the Foundation-libcurl-libdispatch interplay. I can provide more information if needed.

naithar · 2017-04-18T10:46:21Z

LGTM.

phausler

Overall it looks like it will gate the register and unregister commands but it is a bit complex passing around a semaphore; I would worry about a deadlock on the serial queue, and the semaphore itself I am not fully certain is a safe assignment per the ownership. It might be worthwhile to run this through TSAN.

phausler · 2017-04-18T15:47:47Z

Foundation/NSURLSession/MultiHandle.swift

-                Unmanaged<_SocketSources>.fromOpaque(opaque).release()
+        commandQueue.async {
+            self.cancelInProgress?.wait()
+            self.cancelInProgress = nil


the assignment here is perhaps not thread safe since the cancel handler gets a reference to it captured in the tearDown method; is it possible to instead have the semaphore last as long as the containing object context that is retained past the lifespan of the execution on the queue?

The nil assignment is guarded by the wait() and only the cancel handler will signal() - which means the nil assignment is executed only when the cancel handler no longer needs the semaphore. The assignment to nil signifies that there is no cancel operation in progress which allows the register commands go through without waiting, until the next unregister. Increasing the scope of the semaphore will lead to other register commands (may be on unrelated sockets) to wait with perhaps nobody to signal them.

there are three parts to the state of that semaphore; nil, not nil, and the signal. It doesn't look like a huge race potential but there is still something that strikes me as a potential failure here ( hard to tell without more understanding of exactly the involved queues/threads) This isn't a blocking objection - just a word of caution (and perhaps a suggestion to add a comment).

Sure, I'll add a comment around this code.

phausler · 2017-04-18T15:53:33Z

Foundation/NSURLSession/MultiHandle.swift

-            if let opaque = socketSourcePtr {
-                Unmanaged<_SocketSources>.fromOpaque(opaque).release()
+        commandQueue.async {
+            self.cancelInProgress?.wait()


how long could cancels take? this could back up this queue indefinitely.

Thanks @phausler for the review

In my experience, I haven't seen DispatchSource.cancel() blocking for a long time. Moreover, the intention here is to block the command queue until a cancel() goes through completely, of course assuming that cancel() will not block. I have considered the fact that the cancel handler runs on the same queue as the event handler. The event handler will do two things:

read/write from/to sockets that are ready - this will not block

issue further register/unregister commands - this is async deferred to the command queue
So, in my understanding the event handler will not block the cancel handler - which means the wait() here will not block indefinitely.

@dgrove-oss Can DispatchSource.cancel() can block under any other circumstances?

If the only work that falls out is just DispatchSource.cancel() this is a non issue; it effectively is an atomic assign of a value.

pushkarnk · 2017-04-19T07:50:21Z

Thanks @phausler - I have responded to your comments. I'll also work on running this through TSAN.

phausler · 2017-04-19T19:19:53Z

any TSAN reports probably can be followed up in a following PR.

This reverts commit 015ded4.

pushkarnk force-pushed the urlsession-race-condition branch from 7432349 to 78c2d09 Compare April 12, 2017 18:30

pushkarnk changed the title ~~Fix for a race condition in URLSession~~ [URLSession](SR-4567) Fix for a race condition in URLSession Apr 12, 2017

pushkarnk force-pushed the urlsession-race-condition branch from 78c2d09 to f0a9fe4 Compare April 12, 2017 18:58

Fix for a race condition in URLSession

9351942

pushkarnk force-pushed the urlsession-race-condition branch from f0a9fe4 to 9351942 Compare April 12, 2017 19:02

phausler reviewed Apr 18, 2017

View reviewed changes

phausler merged commit 015ded4 into swiftlang:master Apr 19, 2017

pushkarnk mentioned this pull request Apr 21, 2017

Disable URLSession tests (SR-4655) #959

Merged

pushkarnk pushed a commit to pushkarnk/swift-corelibs-foundation that referenced this pull request Apr 26, 2017

Fix for a race condition in URLSession (swiftlang#949)

9b3df62

pushkarnk deleted the urlsession-race-condition branch May 4, 2017 09:51

weissi mentioned this pull request Aug 24, 2017

Revert "Fix for a race condition in URLSession (#949)" #1188

Merged

pushkarnk pushed a commit to pushkarnk/swift-corelibs-foundation that referenced this pull request Oct 12, 2017

Revert "Fix for a race condition in URLSession (swiftlang#949)"

e09b7f7

This reverts commit 015ded4.

pushkarnk mentioned this pull request Apr 24, 2017

[SR-4567] Race condition in URLSession causes a hang #3876

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[URLSession](SR-4567) Fix for a race condition in URLSession #949

[URLSession](SR-4567) Fix for a race condition in URLSession #949

Uh oh!

pushkarnk commented Apr 12, 2017 •

edited

Loading

Uh oh!

alblue commented Apr 13, 2017

Uh oh!

pushkarnk commented Apr 18, 2017

Uh oh!

naithar commented Apr 18, 2017

Uh oh!

phausler left a comment

Uh oh!

phausler Apr 18, 2017

Uh oh!

pushkarnk Apr 19, 2017 •

edited

Loading

Uh oh!

phausler Apr 19, 2017

Uh oh!

pushkarnk Apr 20, 2017

Uh oh!

phausler Apr 18, 2017

Uh oh!

pushkarnk Apr 19, 2017 •

edited

Loading

Uh oh!

phausler Apr 19, 2017

Uh oh!

pushkarnk commented Apr 19, 2017

Uh oh!

phausler commented Apr 19, 2017

Uh oh!

Uh oh!

[URLSession](SR-4567) Fix for a race condition in URLSession #949

[URLSession](SR-4567) Fix for a race condition in URLSession #949

Uh oh!

Conversation

pushkarnk commented Apr 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alblue commented Apr 13, 2017

Uh oh!

pushkarnk commented Apr 18, 2017

Uh oh!

naithar commented Apr 18, 2017

Uh oh!

phausler left a comment

Choose a reason for hiding this comment

Uh oh!

phausler Apr 18, 2017

Choose a reason for hiding this comment

Uh oh!

pushkarnk Apr 19, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phausler Apr 19, 2017

Choose a reason for hiding this comment

Uh oh!

pushkarnk Apr 20, 2017

Choose a reason for hiding this comment

Uh oh!

phausler Apr 18, 2017

Choose a reason for hiding this comment

Uh oh!

pushkarnk Apr 19, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phausler Apr 19, 2017

Choose a reason for hiding this comment

Uh oh!

pushkarnk commented Apr 19, 2017

Uh oh!

phausler commented Apr 19, 2017

Uh oh!

Uh oh!

pushkarnk commented Apr 12, 2017 •

edited

Loading

pushkarnk Apr 19, 2017 •

edited

Loading

pushkarnk Apr 19, 2017 •

edited

Loading