sips/pending/_posts/2013-05-31-improved-lazy-val-initialization.md
26 additions & 13 deletions
@@ -317,10 +317,10 @@ See the evaluation section for more information.
 ### Version 5 - retry in case of failure ###
 
 The current Scala semantics require retrying the computation in case of failure.
-The versions 4 presented above provides good performance characteristics in benchmarks, but it may leave threads waiting for failed initialization forever, leaking threads. Consider this example:
+Version 4, presented above, provides good performance characteristics in benchmarks, but it may leave threads waiting forever for a failed initialization, leaking threads. Consider this example:
 
     class LazyCell {
-      private counter = -1
+      private var counter = -1
       lazy val value = {
         counter = counter + 1
         if(counter < 42)
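The hunk above truncates the `LazyCell` example. A self-contained sketch of the retry behavior it illustrates — under the required semantics, a failed initializer is re-run on the next access instead of caching the failure (the thrown exception and the returned value are assumptions, not the SIP's original code):

```scala
class LazyCell {
  private var counter = -1
  lazy val value = {
    counter = counter + 1
    // fail the first 42 evaluations; retry semantics mean each failed
    // access re-runs the initializer rather than poisoning the lazy val
    if (counter < 42) throw new RuntimeException(s"attempt $counter failed")
    counter
  }
}
```

A caller that retries past the failures eventually observes `value == 42`, after which the value is cached and no further evaluation happens.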
@@ -400,7 +400,7 @@ Note that the current lazy val initialization implementation is robust against t
     }
 
 In the current implementation the monitor is held throughout the lazy val initialization and released once the initialization block completes.
-All the versions proposed above release the monitor during the initialization block execution and re-acquire it back, so the code above could stall indefinitely.
+However, all the versions proposed above release the monitor during the execution of the initialization block and then re-acquire it, so the code above could stall indefinitely.
 
 Additionally, usage of `this` as the synchronization point may disallow concurrent initialization of different lazy vals in the same object. Consider the example below:
@@ -415,7 +415,7 @@ Additionally, usage of `this` as synchronization point may disallow concurrent i
       }
     }
 
-Though two `slow` and `fast` vals are independent, in the current implementation they both synchronize on `this` for the entire duration of computation. This leads to situation when `fast` is required to wail for `slow` to be computed.
+Though the two vals `slow` and `fast` are independent, in the current implementation they both synchronize on `this` for the entire duration of the computation. This leads to a situation where `fast` is required to wait for `slow` to be computed.
 
 In the versions presented above, a single call to `bad` may lead to the monitor on `this` being held forever, causing calls to `slow` and `fast` to also block forever.
 
 Note that these interactions are very surprising to users, as they leak the internal limitations of the lazy val implementation.
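The `slow`/`fast` interaction described above can be sketched as follows (the field names come from the original; the bodies are assumptions used only to make the timing visible):

```scala
class TwoLazies {
  lazy val slow = {      // under V1, holds the monitor on `this` while computing
    Thread.sleep(100)    // stand-in for an expensive computation
    "slow"
  }
  lazy val fast = "fast" // independent, yet must wait for `this` while `slow` runs
}
```

With the current (V1) encoding, a thread reading `fast` while another thread is inside `slow`'s initializer blocks on the monitor of `this`, even though the two fields are unrelated.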
@@ -429,7 +429,7 @@ To overcome those limitations we propose a version that does not synchroni
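The version referenced in this hunk header — synchronizing on an entry of a global array of monitors instead of on `this` — can be sketched like this. This is an illustration of the idea only, not the helper class from the proposal; the object name and method are hypothetical, while the `8 * processorCount * processorCount` sizing comes from the SIP's evaluation section:

```scala
object LazyMonitors {
  // monitor-array size taken from the SIP's evaluation section
  private val processorCount = Runtime.getRuntime.availableProcessors()
  private val size = 8 * processorCount * processorCount
  private val monitors: Array[AnyRef] = Array.fill(size)(new AnyRef)

  // deterministically map an instance to one of the global monitors via its
  // identity hash, so initialization never locks the instance itself
  def monitorFor(instance: AnyRef): AnyRef =
    monitors((System.identityHashCode(instance) & Int.MaxValue) % size)
}
```

Initialization code would then use `LazyMonitors.monitorFor(obj).synchronized { ... }`, leaving the monitor in `obj`'s own header untouched.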
@@ -518,12 +518,12 @@ Note that this class is extracted from other place in standard library that uses
 - it requires usage of `identityHashCode`, which is stored for every object inside the object header.
 - as global arrays are used to store monitors, seemingly unrelated things may create contention. This is addressed in detail in the evaluation section.
 
-Both not expanding monitors and usage of `idetityHashCode` interact with each other, as both of them operate on the
+Both the absence of monitor expansion and the usage of `identityHashCode` interact with each other, as both of them operate on the
 object header. [12] presents the complete graph of transitions between the possible states of the object header.
 What can be seen from this transition graph is that in the contended case versions V2-V5 promote the object into the worst case, the `heavyweight monitor` state, while the new scheme only disables biasing.
 Note that under the schemes presented here, V2-V5, this change only happens in the presence of contention and happens per-object.
 
-### Non-thread safe lazy vals ###
+### Non-threadsafe lazy vals ###
 While the new versions introduce speedups in the contended case, they generate complex bytecode, and this may make the new scheme less appropriate for lazy vals that are not used in a concurrent setting. In order to perfectly fit this use case we propose to introduce an encoding for single-threaded lazy vals that is very simple and very efficient:
 
     final class LazyCell {
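The hunk cuts off the single-threaded encoding. A hand-written sketch of what such an encoding might look like (field names, the `Int` payload, and the `<RHS>` value are assumptions; the real compiler output differs):

```scala
final class LazyCellST {
  private var bitmap_0: Boolean = false // initialization flag
  private var value_0: Int = _          // storage for the cached value

  // fast path is a single flag check: no synchronization, no volatile
  // accesses; safe only when all accesses happen on a single thread
  def value: Int = {
    if (!bitmap_0) {
      value_0 = 42                      // <RHS> computation (assumed)
      bitmap_0 = true
    }
    value_0
  }
}
```

Because there is no happens-before edge between writer and readers, another thread may observe `bitmap_0 == true` with a stale `value_0`, which is exactly why this encoding is reserved for the non-concurrent case.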
@@ -539,7 +539,7 @@ While the new versions introduce speedups in contended case, they do generate co
 
 This version is faster than all the other versions in benchmarks, but it does not correctly handle safe publication in the case of multiple threads. It can still be used in applications that utilize multiple threads if some other means of safe publication is used instead.
 
-### Local lazy vals ###
+### Elegant Local lazy vals ###
 Aside from lazy vals that are fields of objects, Scala supports local lazy vals, defined inside methods:
 
     def method = {
@@ -614,13 +614,15 @@ The scheme introduces new helper classes to standard library: such as `dotty.run
       method$s(holder)
     }
 
-This solves the problem with deadlocks introduced by using java8 labmdas.[14]
+This solves the problem with deadlocks introduced by using Java 8 lambdas. [14]
 
 
 ### Language change ###
-To address the fact that we now have both thread-safe and single-threaded lazy vals,
+To address the fact that we now have both threadsafe and single-threaded lazy vals,
 we propose to bring lazy vals in sync with normal vals with regard to the usage of the `@volatile` annotation.
 
+In order to simplify migration, Scalafix, the migration tool that will be used to migrate between versions of Scala, including Dotty, supports a `VolatileLazyVal` rewrite that adds `@volatile` to all lazy vals present in the codebase.
+
 
 ## Evaluation ##
 
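The `method$s(holder)` rewrite shown in the hunk above can be illustrated with a simplified, self-contained holder. The real scheme uses specialized runtime holder classes such as `dotty.runtime.LazyInt`; the class below is an assumed stand-in for the shape of the transformation, not the actual runtime class:

```scala
// simplified stand-in for the specialized holder classes
final class LazyIntHolder {
  @volatile var initialized: Boolean = false
  var value: Int = _
}

object Example {
  // the local lazy val's body, lifted into a method taking the holder
  private def method$s(holder: LazyIntHolder): Int =
    holder.synchronized {          // lock the holder, not the enclosing object
      if (!holder.initialized) {
        holder.value = 42          // <RHS> computation (assumed)
        holder.initialized = true
      }
      holder.value
    }

  def method: Int = {
    val holder = new LazyIntHolder // one allocation per method invocation
    // every access goes through the holder; since locking is confined to
    // it, the enclosing-instance deadlocks described above cannot occur [14]
    method$s(holder)
  }
}
```

Confining synchronization to a fresh per-invocation holder is what breaks the interaction with locks held on the enclosing object, including the Java 8 lambda deadlocks.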
@@ -643,23 +645,30 @@ For the uncontended case, we measure the cost of creating N objects and initiali
 
 For the contended case, we measure the cost of initializing the lazy fields of N objects, previously created and stored in an array, by 4 different threads that linearly try to read the lazy field of an object before proceeding to the next one. The goal of this test is to assess the effect of entering the synchronized block and notifying the waiting threads - since the slow path is slower, the threads that "lag" behind should quickly reach the first object with an uninitialized lazy val, causing contention.
 
-The current lazy val implementation (V1) seems to incur initialization costs that are at least 6 times greater compared to referencing a regular val. The handwritten implementation produces identical bytecode, with the difference that the calls are virtual instead of just querying the field value, probably the reason due to which it is up to 50% slower. The 2 synchronized blocks design with an eager notify (V2) is 3-4 times slower than the current lazy val implementation - just adding the `notifyAll` call changes things considerably. The 4 state/2 synchronized blocks approach (V3) is only 33-50% slower than the current lazy val implementation (V1). The CAS-based approach where `AtomicInteger`s are extended is as fast as the current lazy val initialization (V1), but when generalized and replaced with `AtomicReferenceFieldUpdater`s as discussed before, it is almost 50% slower than the current implementation V1. The final version, V6 uses `Unsafe` to bring back performance and is as around twice fast as fast as current lazy val initialization(V1) while maintaining correct semantics.
+The current lazy val implementation (V1) seems to incur initialization costs that are at least 6 times greater than referencing a regular val. The handwritten implementation produces identical bytecode, with the difference that the calls are virtual instead of just querying the field value, which is probably why it is up to 50% slower. The 2 synchronized blocks design with an eager notify (V2) is 3-4 times slower than the current lazy val implementation - just adding the `notifyAll` call changes things considerably. The 4 state/2 synchronized blocks approach (V3) is only 33-50% slower than the current lazy val implementation (V1). The CAS-based approach where `AtomicInteger`s are extended is as fast as the current lazy val initialization (V1), but when generalized and replaced with `AtomicReferenceFieldUpdater`s as discussed before, it is almost 50% slower than the current implementation (V1). The final version, V6, uses `Unsafe` to bring back performance and is around twice as fast as the current lazy val initialization (V1) while maintaining correct semantics.
 
 The CAS-based approaches V4, V5, V6 appear to have the best performance here, being twice as fast as the current lazy val initialization implementation (V1).
 
 The proposed solution (V6) is 50% faster than the current lazy val implementation in the common use case. This comes at the price of synchronizing on a global array of monitors, which may create contention between seemingly unrelated objects. The more monitors are created, the lower the probability of such contention. There is also a positive effect, though: reusing global objects for synchronization allows the monitors on the instances containing lazy vals not to be expanded, saving on non-local memory allocation. The current implementation uses `8 * processorCount * processorCount` monitors, and the benchmarks and a by-hand study with "VTune Amplifier XE" demonstrate that the positive effect dominates, introducing a 2% speedup [13]. It's worth mentioning that this is not a typical use case reflecting a practical application, but rather a synthetic borderline case designed to perform the worst-case comparison and demonstrate the cache contention.
 
 The local lazy vals implementation is around 6x faster than the current version, as it eliminates the need for boxing and reduces the number of allocations from 2 down to 1.
 
-The concrete microbenchmark code is available as a GitHub repo \[[6][6]\].
+The concrete microbenchmark code is available as a GitHub repo \[[6][6]\]. It additionally benchmarks many other implementations that are not covered in the text of this SIP; in particular, it tests versions based on `MethodHandle`s and runtime code generation, as well as versions that use additional spinning before synchronizing on the monitor.
 
-## Code size ##
+### Code size ###
 The versions presented in V2-V6 have much more complex implementations, and this shows up in the bytecode size. In the worst-case scenario, when the `<RHS>` value is a constant, the current scheme (V1) creates an initializer method that has a size of 34 bytes, while Dotty creates a version that is 184 bytes long. Local optimizations present in the Dotty linker [14] are able to reduce this size down to 160 bytes, but this is still substantially more than the current version.
 
 On the other hand, the single-threaded version does not need a separate initializer method and is around half the size of the current scheme (V1).
 
 The proposed local lazy val transformation scheme also creates less bytecode, introducing 34 bytes instead of 42 bytes, mostly due to a reduction in constant table size.
 
+## Current status ##
+Version V6 is implemented and used in Dotty, together with a language change that makes lazy vals thread-unsafe if the `@volatile` annotation is not specified.
+The Dotty implementation internally uses the `@static` annotation proposed in \[[16][16]\].
+
+Both Dotty and the released Scala 2.12 already implement "Elegant Local lazy vals". This was incorporated into the 2.12 release before this SIP was considered, as it fixed a bug that blocked the release \[[14][14]\].
+
+
 ## Acknowledgements ##
 
 We would like to thank Peter Levart and the other members of the concurrency-interest mailing list for their suggestions, as well as the members of the scala-internals mailing list for the useful discussions and their input.
@@ -680,6 +689,8 @@ We would like to thank Peter Levart and the other members of the concurrency-int
 12. [Synchronization, HotSpot internals wiki, April 2008][12]
 13. [Lazy Vals in Dotty, cache contention discussion, February 2014][13]
 14. [SI-9824 SI-9814 proper locking scope for lazy vals in lambdas, April 2016][14]
+15. [Introducing Scalafix: a migration tool for Scalac to Dotty, October 2016][15]