Add membarrier call in Android. #228


Closed

Conversation

gonzalolarralde (Contributor) opened this pull request:

The membarrier syscall is not available on Android / bionic. Android uses
cutils internally to execute the membarrier instruction, but it is not
publicly available in the NDK. This shim takes the implementation from
cutils as a reference.

https://android.googlesource.com/platform/system/core/+/48b911c573c92742aa80270b734811f722c67c37/include/cutils/atomic-arm.h


MadCoder commented Mar 7, 2017

No, sys_membarrier gives stronger guarantees than what you are doing. What you call membarrier is just C11's seq_cst barrier, and that is not enough for dispatch_once().

MadCoder closed this Mar 7, 2017
gonzalolarralde (Contributor, Author) commented:

@MadCoder got it, sorry about that. Would atomic_thread_fence be a valid replacement for sys_membarrier, then? From what I see, os_atomic_maximally_synchronizing_barrier used to be a call to atomic_thread_fence until it was replaced in a PR that imports the changes from Darwin's dispatch.

Was it replaced due to inaccuracy / inefficiency, or just for convenience?

Thanks for your time.


MadCoder commented Mar 8, 2017

atomic_thread_fence() is what I was referring to: it is a C11 barrier, which is not enough for dispatch_once. dispatch_once needs something that makes sure no other CPU can see stale values, and without stronger guarantees from your hardware that requires sys_membarrier(), which is an IPI on all cores.

Alternatively, platforms that don't have a sys_membarrier should remove the inline fastpath and only have a function call with an acquire barrier.


zayass commented Apr 5, 2017

Hi @MadCoder @gonzalolarralde

As far as I can see, libdispatch used atomic_thread_fence(memory_order_seq_cst) on all platforms except i386 and x86_64 before commit ff7dc5b, even in the inlined path.

Why is it not enough now?


MadCoder commented Apr 5, 2017

It was never correct. The barrier required here is one that ensures other cores can't see stale values for the store(s) this dispatch_once did, because the read side has no barrier whatsoever.

Most hardware has no instruction that can guarantee that; the only way is to IPI all the cores, which is what sys_membarrier does.

The alternative is to put an acquire barrier on the read side. However, that makes it more or less equivalent to a lock and defeats the point of the primitive.


zayass commented Apr 5, 2017

As far as I can see, the read side uses something like os_atomic_cmpxchg(..., acquire) everywhere except _dispatch_gate_wait_slow|_dispatch_gate_broadcast_slow. Is that not enough when we HAVE_FUTEX and _dispatch_gate_wait_slow|_dispatch_gate_broadcast_slow internally use _dispatch_futex_wait|_dispatch_futex_wake?


MadCoder commented Apr 5, 2017

You are incorrect: the read side is inlined from <dispatch/once.h> and has no barrier:

https://github.com/apple/swift-corelibs-libdispatch/blob/master/dispatch/once.h#L99

I think that given how widely dispatch_once is used, the IPI is not great, so the right fix is likely to:

  • remove the inlined codepaths if the compiler doesn't support C11 atomics
  • detect in once.h based on the architecture the required barrier, and use C11 atomics.

The challenge is that <stdatomic.h> is usually not compatible with C++, and it is important that this header work with C++, which makes the implementation tricky. But basically it would look more or less like this:

In <dispatch/base.h>

#if defined(__cplusplus) && __has_header(<atomic>)
#include <atomic>
#define DISPATCH_HAS_C11_ATOMICS 1
#define DISPATCH_ATOMIC(type)  std::atomic<type>
#define DISPATCH_ATOMIC_STD(name) std::name
#elif __has_header(<stdatomic.h>)
#include <stdatomic.h>
#define DISPATCH_HAS_C11_ATOMICS 1
#define DISPATCH_ATOMIC(type)  _Atomic(type)
#define DISPATCH_ATOMIC_STD(name) name
#else
#define DISPATCH_HAS_C11_ATOMICS 0
#endif

And then in once.h:

#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define DISPATCH_ONCE_READ_BARRIER memory_order_relaxed
#else
#define DISPATCH_ONCE_READ_BARRIER memory_order_acquire
#endif

and make dispatch_once_f this way:

#if DISPATCH_HAS_C11_ATOMICS
DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL1 DISPATCH_NONNULL3
DISPATCH_NOTHROW
DISPATCH_SWIFT3_UNAVAILABLE("Use lazily initialized globals instead")
void
_dispatch_once_f(dispatch_once_t *predicate, void *_Nullable context,
		dispatch_function_t function)
{
	DISPATCH_ATOMIC(dispatch_once_t) *atomic_pred =
			(DISPATCH_ATOMIC(dispatch_once_t) *)predicate;
	dispatch_once_t v = DISPATCH_ATOMIC_STD(atomic_load_explicit)(atomic_pred,
			DISPATCH_ATOMIC_STD(DISPATCH_ONCE_READ_BARRIER));
	if (DISPATCH_EXPECT(v == ~0l, 1)) return;
	dispatch_once_f(predicate, context, function);
	DISPATCH_COMPILER_CAN_ASSUME(*predicate == ~0l);
}
#undef dispatch_once_f
#define dispatch_once_f _dispatch_once_f
#endif // DISPATCH_HAS_C11_ATOMICS

the same should be done for dispatch_once.

However, I'm not keen on injecting <atomic> or <stdatomic.h> bluntly into the namespace, so I need to discuss with @das what we will do here. Given that this isn't required for the architectures Swift currently supports, I don't think we will make that change immediately.


MadCoder commented Apr 5, 2017

And to be clear, the reason I'm not keen on injecting <atomic> or <stdatomic.h> is exactly that they aren't compatible with each other, and doing so will likely cause harm to pre-existing codebases. I'm even less of a fan of trying to implement that acquire load with compiler builtins, as it would mean detecting compilers in headers (yuck).

We may decide that, if there's no way to do the fully relaxed loads, we always go through the function call and be done with it (in dispatch we have all we need to write that acquire load). We'd then have to make the code of dispatch_once_f in once.c slightly better, so that _dispatch_once_gate_tryenter can shortcut the case when the value is already ~0ul. This is a patch I would be willing to take.


zayass commented Apr 5, 2017

OK, thanks, much clearer now. Maybe it is possible to just move the done check to the beginning of dispatch_once_f in once.c? Something like:

#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define DISPATCH_ONCE_READ_BARRIER relaxed
#else
#define DISPATCH_ONCE_READ_BARRIER acquire
#endif

DISPATCH_NOINLINE
void
dispatch_once_f(dispatch_once_t *val, void *ctxt, dispatch_function_t func)
{
	dispatch_once_t v = os_atomic_load(val, DISPATCH_ONCE_READ_BARRIER);
	if (likely(v == ~0l)) {
		return;
	}
	// ...

3 participants