Improve Classpath Implementation #416

Open
@retronym

Background

The Scala compiler uses its Classpath abstraction to present the compilation classpath and sourcepath to the compiler. Entries in these paths are typically directories, JARs, or, more recently, the jrt:// virtual filesystem exposed by Java 9.

Scala 2.12 switched to a refactored implementation of the classpath (known as the "flat classpath"). This work was primarily motivated by reducing the memory footprint of the classpath representation. Improved compilation speed from caching JAR listings is a secondary benefit.

An example of an environment where this matters is a modularized build loaded into Scala IDE. There may be hundreds of instances of Global, each with a classpath numbering in the hundreds of elements, that point at some smaller set of JARs.

Two main changes were made:

  1. Nested packages no longer give rise to nested objects in the representation. Instead, each classpath element is represented by a single object that can look up an FQN (fully qualified name).
  2. The representation of the classpath is by default shared across multiple instances of the compiler (either a subsequent compilation or concurrent compilation in the same JVM).

Like any new implementation, there have been some teething problems.

  • An API point used by SBT to backtrack from a Symbol to the corresponding classpath entry became O(N) with poor constant factors. This was fixed in scala/scala#5956.
  • scala/bug#10295 The static caching of the classpath representation does not have any invalidation based on, e.g., last modified timestamps of the JARs, nor does it ever close files once opened. In practice, this affects people using the exportJars := true mode of SBT. A workaround is to use -YdisableFlatCpCaching.

Furthermore, the cost of navigating through the classpath to a .class and converting it to a Symbol is a non-negligible part of some compilation use cases (in particular, very small compile runs), so finding ways to make this faster or avoid parts of it via reliable caching is appealing.

Goals

New Globals don't see stale JAR listings

Improve the current caching mechanism so that if a JAR is overwritten, a newly constructed Global uses the listing from the new JAR, rather than the stale cached info.

If aggregates are cached, these need to be invalidated when a component is invalidated.
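One way to avoid stale listings, sketched here in Java against plain JDK APIs, is to store a cheap stamp (last-modified time and size) with each cached listing and re-list when the stamp differs. StampedListingCache, Stamp, and the lister function are hypothetical names for illustration; this is not the compiler's actual cache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: a listing cache that re-lists a JAR when its stamp changes.
final class StampedListingCache {
    // Cheap fingerprint of a file: last-modified time plus size.
    static final class Stamp {
        final long lastModifiedMillis;
        final long size;

        Stamp(long lastModifiedMillis, long size) {
            this.lastModifiedMillis = lastModifiedMillis;
            this.size = size;
        }

        static Stamp of(Path file) throws IOException {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            return new Stamp(attrs.lastModifiedTime().toMillis(), attrs.size());
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Stamp)) return false;
            Stamp s = (Stamp) o;
            return lastModifiedMillis == s.lastModifiedMillis && size == s.size;
        }

        @Override public int hashCode() {
            return Long.hashCode(lastModifiedMillis) * 31 + Long.hashCode(size);
        }
    }

    private static final class Cached {
        final Stamp stamp;
        final List<String> listing;
        Cached(Stamp stamp, List<String> listing) { this.stamp = stamp; this.listing = listing; }
    }

    private final Map<Path, Cached> cache = new ConcurrentHashMap<>();
    private final Function<Path, List<String>> lister; // assumed: reads the JAR's entry names

    StampedListingCache(Function<Path, List<String>> lister) { this.lister = lister; }

    // Returns the cached listing if the stamp is unchanged, otherwise re-lists.
    List<String> listing(Path jar) throws IOException {
        Path key = jar.toAbsolutePath().normalize();
        Stamp current = Stamp.of(key);
        Cached entry = cache.compute(key, (k, old) ->
            (old != null && old.stamp.equals(current)) ? old : new Cached(current, lister.apply(k)));
        return entry.listing;
    }
}
```

A cached aggregate could record the stamps of its components in the same way and be rebuilt when any of them differs. Coarse timestamp granularity on some filesystems can hide a quick overwrite of the same size, so this check is a heuristic rather than a guarantee.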

Thread Safety of shared instances

Code review and adversarial testing should be used to flush out any thread safety problems in Classpath, as it is shared by instances of Global.

JARs are not locked when there are no active compilations

There are two approaches here:

  • Make default the approach currently hidden behind -Dscala.classpath.closeZip, which opens and closes zips around the extraction of each class file. While this hurts performance in a microbenchmark, it might be acceptable in the context of the compiler.
  • Extend and hook into the lifecycle of Global and/or Run (perhaps involving SBT) to implement reference counting for cached classpath elements that correspond to open ZipFile-s, and close them when the ref-count drops to 0. I've prototyped this in retronym:ticket/10295.
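The reference-counting option could look roughly like the following Java sketch. It is not the prototype in retronym:ticket/10295; RefCountedZips and its methods are hypothetical names, and only java.util.zip.ZipFile is a real API here. Each acquire is assumed to be paired with exactly one release, for example at the start and end of a Run.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipFile;

// Hypothetical sketch: share one open ZipFile per JAR and close it
// when the last user releases it, so idle JARs are not kept locked.
final class RefCountedZips {
    private static final class Entry {
        final ZipFile zip;
        int refs;
        Entry(ZipFile zip) { this.zip = zip; }
    }

    private final Map<Path, Entry> open = new HashMap<>();

    // Opens the JAR on first use, then hands out the shared instance.
    synchronized ZipFile acquire(Path jar) throws IOException {
        Path key = jar.toAbsolutePath().normalize();
        Entry e = open.get(key);
        if (e == null) {
            e = new Entry(new ZipFile(key.toFile()));
            open.put(key, e);
        }
        e.refs++;
        return e.zip;
    }

    // Closes the underlying file once the count drops to 0.
    synchronized void release(Path jar) throws IOException {
        Path key = jar.toAbsolutePath().normalize();
        Entry e = open.get(key);
        if (e == null) throw new IllegalStateException("not acquired: " + key);
        if (--e.refs == 0) {
            open.remove(key);
            e.zip.close();
        }
    }

    // Number of JARs currently held open.
    synchronized int openCount() { return open.size(); }
}
```

The coarse synchronized methods keep the sketch simple; a real implementation would need to decide who owns the acquire/release pairing and what happens to callers that still hold a ZipFile after the final release.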

Performant Implementation

Create benchmarks and tests to measure:

  • Footprint of classpath representation should be minimized. Measure effect of sharing enabled/disabled.
  • Lookup performance wrt length of classpath should be better than O(n); aggregates should index package => entry.
  • Avoid gratuitous allocations on lookups (e.g. by caching entries)
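To illustrate the package => entry index mentioned above, here is a minimal Java sketch. PackageIndexedClasspath and Entry are hypothetical stand-ins, not compiler classes. The index is built once, so a lookup only visits the entries that actually contain the package instead of scanning the whole classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an aggregate classpath indexed by package name.
final class PackageIndexedClasspath {
    // Stand-in for one classpath entry: package name -> simple class names.
    static final class Entry {
        final String name;
        final Map<String, Set<String>> classesByPackage;
        Entry(String name, Map<String, Set<String>> classesByPackage) {
            this.name = name;
            this.classesByPackage = classesByPackage;
        }
    }

    // package name -> entries containing that package, in classpath order.
    private final Map<String, List<Entry>> entriesByPackage = new HashMap<>();

    PackageIndexedClasspath(List<Entry> entriesInClasspathOrder) {
        for (Entry e : entriesInClasspathOrder) {
            for (String pkg : e.classesByPackage.keySet()) {
                entriesByPackage.computeIfAbsent(pkg, k -> new ArrayList<>()).add(e);
            }
        }
    }

    // Returns the first entry (in classpath order) that defines the class, or null.
    // Returning null rather than an Optional avoids an allocation per lookup.
    Entry findClass(String fqn) {
        int dot = fqn.lastIndexOf('.');
        String pkg = dot < 0 ? "" : fqn.substring(0, dot);
        String simpleName = fqn.substring(dot + 1);
        for (Entry e : entriesByPackage.getOrDefault(pkg, Collections.emptyList())) {
            if (e.classesByPackage.get(pkg).contains(simpleName)) return e;
        }
        return null;
    }
}
```

The cost of a lookup then depends on how many entries share the package rather than on the total classpath length. The substring calls still allocate, so a benchmark would be needed to judge whether that matters in practice.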

New Run-s respect overwritten JAR / .class files

While we're in the neighborhood of cache invalidation, we should set a stretch target of making multi-run compilation more sure of foot when treading over the shifting sands of overwritten JARs and .class files.

The naive approach would be to wipe out the symbol table on any change to the underlying files that had been witnessed during its construction.

A more nuanced approach would be able to prune only symbols that had changed, or had depended (perhaps transitively) on something that has changed.
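The transitive part of the nuanced approach can be sketched as a walk over reverse dependency edges. The Java below is only an illustration with hypothetical names (DependencyPruner, recordDependency, toInvalidate); it models symbols as strings and assumes the dependency edges have been recorded somewhere, which is the hard part in a real compiler.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: compute which symbols to prune after a change.
final class DependencyPruner {
    // dependents.get(x) = symbols whose signatures refer to x (reverse edges).
    private final Map<String, Set<String>> dependents = new HashMap<>();

    // Records that `from` depends on `on`.
    void recordDependency(String from, String on) {
        dependents.computeIfAbsent(on, k -> new HashSet<>()).add(from);
    }

    // Returns the changed symbols plus everything that depends on them,
    // directly or transitively.
    Set<String> toInvalidate(Set<String> changed) {
        Set<String> result = new HashSet<>(changed);
        Deque<String> work = new ArrayDeque<>(changed);
        while (!work.isEmpty()) {
            String next = work.pop();
            for (String d : dependents.getOrDefault(next, Collections.emptySet())) {
                if (result.add(d)) work.push(d); // visit each symbol once
            }
        }
        return result;
    }
}
```

Symbols that nothing in the changed set reaches through these edges are left untouched and could be reused.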

As a concrete example, in a multi-module SBT build:

  • scalac -cp static-lib.jar -d a/target/classes a/src/...
  • scalac -cp static-lib.jar:a/target/classes -d b/target/classes b/src/...
  • scalac -cp static-lib.jar -d a/target/classes a/src/... (overwrites some classfiles)
  • scalac -cp static-lib.jar:a/target/classes -d b/target/classes b/src/...

In the final step, we could reuse the Symbol-s representing static-lib.jar and rt.jar, and reset _root_.a back to its initial state. This should be possible because no signature in those JARs refers to _root_.a.*

This has been tried in the past without success, so we must assume it isn't trivial to get right. As such, this is a stretch target.

Non Goals

Mutation of JAR or .class files during compilation currently has undefined behaviour; we don't plan to improve that.

The highest possible performance of the classpath in isolation. Performance tuning of the classpath should be motivated by scenarios in which we measure that it imposes an unreasonable overhead to a real world compilation use case.
