Remove provided optional configs #1452


Merged
jbaiera merged 8 commits from remove-provided-optional-configs into elastic:master on Apr 22, 2020

Conversation

jbaiera (Member) commented Mar 24, 2020

This PR removes usages of the PropDeps plugin from Gradle. As a result, provided and optional configurations have been removed. All previously referenced dependencies in these configurations have been updated to be listed under either api, implementation, or compileOnly.
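For illustration, the mapping in a module's build.gradle looks roughly like the following (a sketch only; the coordinates and version properties shown are placeholders rather than the exact declarations in this PR):

    dependencies {
        // Previously 'provided' and exposed through public signatures: now 'api',
        // so it ends up on the consumer's compile classpath.
        api "org.apache.hadoop:hadoop-client:${hadoopVersion}"

        // Previously 'provided' or 'optional' but only used internally: 'implementation'
        // surfaces as runtime scope in the published POM.
        implementation "commons-httpclient:commons-httpclient:3.1"

        // Needed only to compile this module; never published to consumers.
        compileOnly "org.apache.spark:spark-sql_2.11:${sparkVersion}"
    }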

mark-vieira (Contributor) left a comment

Have we tested how this works for consumers? I see we've converted a lot of provided dependencies to api, which is very different. This will now result in a POM where these things are marked <scope>compile</scope> instead of <scope>provided</scope>, which has considerably different semantics.

@@ -356,17 +345,6 @@ class BuildPlugin implements Plugin<Project> {
dep.scope == 'test' || dep.artifactId == 'elasticsearch-hadoop-mr'
}

// Mark the optional dependencies to actually be optional
generatedPom.dependencies.findAll { it.scope == 'optional' }.each {
mark-vieira (Contributor) commented

How are we handling this now? The generated POMs are going to be different now. Dependencies that were previously marked as optional are now going to be brought in by consumers by default. Is this fine or expected behavior?
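(For reference, the removed block presumably continued along these lines; this is a reconstruction based on the visible findAll line and the preserved comment, not the verbatim code:)

    generatedPom.dependencies.findAll { it.scope == 'optional' }.each {
        // Presumed intent: turn the placeholder 'optional' scope into a real Maven
        // scope and flag the dependency as <optional>true</optional> in the POM.
        it.scope = 'compile'
        it.optional = true
    }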

jbaiera (Member, Author) replied

We used optional primarily in two places:

First, in the Spark projects, where if you were only using RDDs you didn't necessarily need to pull in the SQL libraries. That said, both core Spark and Spark SQL were listed as provided, and thus both would always be on the classpath when launching via spark-submit. In terms of a strict build footprint, this may affect those who are strictly avoiding SQL, though a decent amount of Spark development is done in live environments/notebooks/platforms. I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.

Second, in the project-wide pom, where we have the code for every integration. The expectation here, I assume, was that every integration was optional within the project-wide artifact, and thus each integration's dependency was marked as optional. This meant that you could pull in the root project's artifact for Spark and not need to pull in Pig dependencies. This might be the biggest issue to address in removing optional-scope dependencies, but to work around it, we publish integration-specific artifacts that isolate each supported integration for users who only want support for one of them.

In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed. I believe that it is a nice getting started point where you can put in a simple dependency name to get support for any integration, but beyond getting started, most users are much better off using integration specific artifacts for their projects.
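For example, a consumer who only needs one integration can depend on the integration-specific artifact instead of the project-wide one; in Gradle that might look like this (a sketch; the version and Scala/Spark suffix are placeholders):

    dependencies {
        // Project-wide artifact: a convenient starting point, but it now drags in
        // dependencies for every integration.
        // implementation "org.elasticsearch:elasticsearch-hadoop:x.y.z"

        // Integration-specific artifact: only the Spark support and its dependencies.
        implementation "org.elasticsearch:elasticsearch-spark-20_2.11:x.y.z"
    }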

mark-vieira (Contributor) replied

I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.

Right. Folks bringing this in will now need to explicitly exclude this dependency rather than explicitly include it. If you don't feel this is going to be terribly disruptive then 👍.

the project-wide pom

What is the "project-wide pom" and what specifically is its purpose? Would a BOM be a more appropriate solution if you want folks to be able to import a simple POM that defines things like common dependency versions?

In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed.

FWIW, Gradle has support specifically for publishing BOMs. We can still provide this kind of thing, but in a more appropriate form than a POM that declares nothing but optional dependencies. This is better modeled by the <dependencyManagement> block of a POM.

https://docs.gradle.org/current/userguide/java_platform_plugin.html
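A BOM published with that plugin would be declared roughly like this (a sketch, assuming the java-platform and maven-publish plugins and placeholder coordinates):

    plugins {
        id 'java-platform'
        id 'maven-publish'
    }

    dependencies {
        constraints {
            // Version recommendations only; nothing is added to the consumer's
            // classpath unless they declare the dependency themselves.
            api "org.elasticsearch:elasticsearch-hadoop-mr:${project.version}"
            api "org.elasticsearch:elasticsearch-spark-20_2.11:${project.version}"
        }
    }

    publishing {
        publications {
            bom(MavenPublication) {
                // Publishes the constraints as a <dependencyManagement> block.
                from components.javaPlatform
            }
        }
    }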

jbaiera (Member, Author) commented Mar 24, 2020

The conversion of 'provided' dependencies to 'api' is based on (my understanding of) what Gradle expects an 'api' dependency to be. In this case, I understood it as "If you want to compile code that is going to use the public API for this, you will need this library as well since it is exposed through the API."

For MR, this means you need to have Hadoop on the compile classpath because you'll be interacting with input and output format classes that extend, expect, and return Hadoop classes.

For Spark, this includes the Spark frameworks since interacting with ES-Hadoop will require you to pass your Spark Contexts/RDDs/Dataframes to the library.

Hive and Pig are where things get fuzzy. You don't write Java for Hive or Pig; instead, you write a query language that gets parsed. For this reason, I took the approach of "If you were to build a library for those integrations that depended on ES-Hadoop, which things would need to be API dependencies?"

For that, I landed on the Hive and Pig dependencies that we use as inputs and outputs from what might reasonably be considered the API of the project (the input and output formats), though even that may be somewhat dodgy. I'm open to feedback on it for sure.

mark-vieira (Contributor) commented

I understood it as "If you want to compile code that is going to use the public API for this, you will need this library as well since it is exposed through the API."

This is a correct understanding. The difference between api and provided is how the consumer's runtime classpath is affected. An api dependency is effectively <scope>compile</scope>, which means it will be on both the compile and runtime classpaths of the downstream consumer. With provided, the dependency is only on the compile classpath, and it's expected that the runtime environment will already have it available somehow.

For that, I landed on the Hive and Pig dependencies that we use as inputs and outputs from what might reasonably be considered the API of the project (the input and output formats), though even that may be somewhat dodgy.

If I'm not writing Java code that uses an API, then this isn't an API dependency. These dependencies are "inputs", but they are runtime inputs. That is, if the dependency changes, I don't need to recompile anything, but runtime behavior might change. At least, that's how I understand what you're describing. In any case, the implementation vs provided distinction remains. Using provided means we aren't telling consumers to put this on their runtime classpath; now we are. Maybe that's ok?
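In Gradle terms, the consumer-facing difference between the configurations looks roughly like this (a sketch; the coordinates are placeholders):

    dependencies {
        // api -> <scope>compile</scope> in the POM: on the consumer's compile
        // and runtime classpaths.
        api "org.apache.hadoop:hadoop-client:${hadoopVersion}"

        // implementation -> <scope>runtime</scope> in the POM: runtime classpath only.
        implementation "commons-codec:commons-codec:1.11"

        // compileOnly -> not published at all: the closest out-of-the-box analogue
        // to 'provided', except consumers never see it in the POM.
        compileOnly "org.apache.spark:spark-core_2.11:${sparkVersion}"
    }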

jbaiera (Member, Author) commented Apr 7, 2020

I've made some changes to this: most of the dependencies are now declared as implementation instead of api. The remaining api dependencies are the Hadoop client in the mr project, the core Spark library in the spark projects, and the Storm API in the storm project. For the uberjar/uberpom we release, these have all been converted to implementation instead of api, which translates them to runtime scope in the generated POM. This should keep the dependencies pulled into downstream projects for compilation to a minimum, mostly dependencies that downstream projects will already be depending on for compilation.
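For instance, the uberjar/uberpom project's declarations would be shaped roughly like this (a sketch, assuming a hypothetical ':dist' project path; the coordinates are illustrative):

    project(':dist') {
        dependencies {
            // Declared as implementation rather than api, so the published POM
            // lists these with runtime scope and keeps the downstream compile
            // classpath minimal.
            implementation "org.apache.hadoop:hadoop-client:${hadoopVersion}"
            implementation "org.apache.spark:spark-core_2.11:${sparkVersion}"
        }
    }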

jbaiera requested a review from mark-vieira on April 7, 2020
mark-vieira (Contributor) left a comment
Few comments, otherwise LGTM.

Also fix a bug where test jars could be added to the dist zip.
jbaiera merged commit 66354d4 into elastic:master on Apr 22, 2020
jbaiera deleted the remove-provided-optional-configs branch on April 22, 2020
jbaiera added a commit that referenced this pull request on May 13, 2020:
All previously referenced dependencies in these configurations have been updated to be listed under either api, implementation, or compileOnly.
jbaiera (Member, Author) commented Jul 14, 2020

Relates #1423

koertkuipers commented Sep 17, 2020

I don't understand anything about Gradle scopes/configurations, so I might be a little ignorant about what is happening here, but I know the standard is to have Spark dependencies be provided in POMs, and I would expect the same here. E.g., if you build a project that runs on Spark, you build an assembly jar (or equivalent) that should not include Spark itself (it can be a real nasty source of bugs if you do include Spark).
So the POMs that result from this change simply seem incorrect to me, and I will work around it using exclusion rules. It's no big deal, just a little confusing.
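(As an aside, the workaround would look something like the following in Gradle; a sketch of the analogous exclusion only, since koertkuipers is using sbt exclusion rules, and the coordinates here are placeholders:)

    dependencies {
        implementation("org.elasticsearch:elasticsearch-spark-20_2.11:x.y.z") {
            // Keep Spark out of the assembly jar; the cluster supplies it at runtime.
            exclude group: "org.apache.spark"
        }
    }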

jbaiera (Member, Author) commented Sep 18, 2020

@koertkuipers This change was made with the expectation that most Spark deployments use spark-submit arguments to provide the jars for job execution, rather than repackaging all dependencies into a single distributable. We went down this path with the thought that if users were creating a single distributable, they'd be taking a more direct approach to their build process and thus shouldn't have too much of a problem excluding Spark as a runtime dependency of the ES-Spark library. Happy to discuss the issue further though if this is causing usability concerns for you.

koertkuipers replied
Hi @jbaiera,
It is not clear to me how using spark-submit arguments to provide the jars for execution makes it OK for Spark itself to be a compile/runtime dependency.
I do not use this method of adding jars, but I can think of two possible scenarios:

  1. When adding jars with spark-submit, (transitive) dependencies are also included. In this case it is not a good idea to pull in Spark itself as a (transitive) dependency; the Spark jars are added to the classpath by spark-submit already.
  2. When adding jars with spark-submit, (transitive) dependencies are not included. In this case having Spark itself as a dependency doesn't do any good either.

What is the scenario where having Spark as a compile-time/runtime dependency is useful?

jbaiera (Member, Author) commented Oct 7, 2020

What is the scenario where having Spark as a compile-time/runtime dependency is useful?

Going off what I remember from April, the driving force for this change was twofold: ES-Hadoop was being updated to better comply with Gradle's project modeling, and the plugin that provided this functionality is a third-party plugin that we have little to no control over. A lot of changes were happening in order to support cross-compilation of Scala/Spark artifacts, and getting the community plugin that added the provided and optional scopes to work with the changes we were making was becoming a maintenance nightmare.

At the time it seemed logical that a project depending on ES-Hadoop/Spark would already have defined Spark as a provided dependency, and thus having Spark as a compile time dependency of the library (which would end up transitive) would be relatively innocuous to the end-user project development environment.

As for the topic of deployments, my expectation at the time was that most deployments make use of the spark-submit jar arguments to specify their dependencies. This was informed mostly by my experiences with the community over the years: using the spark-submit jar argument was the more common approach. Users who bundled their application and libraries together would just need to exclude Spark from that artifact. A small extra step.

With this line of thinking, we removed the plugin, set the scopes to the most reasonable ones Gradle supports out of the box, and moved on with our refactoring. Several months after this change, though, I'm a bit more skeptical that the spark-submit jar model of deployment is as ubiquitous as previously thought.

I'll open an issue for returning the dependencies to provided. I'm not sure how reasonable it would be to also return the previously optional dependencies, which at the time were the Spark SQL libraries, as well as every integration library in the uberjar. Those would most likely come back as provided if we went forward with that change.
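One possible way to restore provided scope without the third-party plugin would be to rewrite the relevant scopes in the generated POM, for example (a sketch only, assuming the maven-publish plugin is applied; the group filter is a placeholder, and this is not what the build currently does):

    publishing.publications.withType(MavenPublication).configureEach {
        pom.withXml {
            asNode().dependencies.dependency
                .findAll { it.groupId.text() == 'org.apache.spark' }
                .each { dep ->
                    // Assumes the generated POM writes an explicit <scope> element.
                    dep.scope[0].setValue('provided')
                }
        }
    }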

jbaiera (Member, Author) commented Oct 7, 2020

@koertkuipers I've opened an issue for this #1541

koertkuipers replied
@jbaiera Thank you for that!

You are right that it is just a few extra excludes. I just did this yesterday. I started by making elasticsearch-spark simply an intransitive dependency in sbt, but then realized that such an approach cannot be expressed correctly in the POM file my project generates, so I ended up listing all the elasticsearch-spark dependencies explicitly for exclusion. No big deal, but a little fragile.

I prefer to keep Spark and Hadoop out of the compile/runtime scopes completely at all times, since they are a bit of a transitive kitchen sink, which makes it hard to understand what the actual dependencies are that will end up in the assembly/fat jar.
