Remove provided optional configs #1452
Conversation
Have we tested how this works for consumers? I see we've converted a lot of `provided` dependencies to `api`, which is very different. This is now going to result in a POM where these things are marked `<scope>compile</scope>` instead of `<scope>provided</scope>`, which has considerably different semantics.
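For reference, a rough sketch of how Gradle's built-in `java-library` configurations typically map to Maven scopes when the `java` component is published with `maven-publish` defaults. The plugin IDs are real, but the coordinates and versions below are illustrative, not taken from this build:

```groovy
plugins {
    id 'java-library'
    id 'maven-publish'
}

repositories { mavenCentral() }

dependencies {
    // api            -> <scope>compile</scope> in the published POM (on consumers' compile classpath)
    api 'org.apache.hadoop:hadoop-client:2.7.7'

    // implementation -> <scope>runtime</scope> in the published POM (runtime classpath only)
    implementation 'commons-codec:commons-codec:1.11'

    // compileOnly    -> omitted from the published POM entirely; the closest built-in
    //                   stand-in for Maven's 'provided' scope
    compileOnly 'org.apache.storm:storm-core:1.2.2'
}
```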
@@ -356,17 +345,6 @@ class BuildPlugin implements Plugin<Project> {
        dep.scope == 'test' || dep.artifactId == 'elasticsearch-hadoop-mr'
    }

    // Mark the optional dependencies to actually be optional
    generatedPom.dependencies.findAll { it.scope == 'optional' }.each {
How are we handling this now? The generated POMs are going to be different: dependencies that were previously marked as optional will now be brought in by consumers by default. Is this fine or expected behavior?
We used `optional` primarily in two places:
First, in the Spark projects, where if you were only using RDDs you didn't necessarily need to pull in the SQL libraries. That said, both core Spark and Spark SQL were listed as provided, and thus would always both be on the classpath when launching via spark-submit. In terms of a strict build footprint, this may affect those who are strictly avoiding SQL, though a decent amount of Spark development is done in live environments/notebooks/platforms. I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.
Second, in the project-wide pom, where we have the code for every integration. The expectation here, I assume, was that every integration was optional within the project-wide artifact, and thus each integration's dependency was marked as optional. This meant that you could pull in the root project's artifact for Spark and not need to pull in Pig dependencies. This might be the biggest issue to address for the removal of optional-scope dependencies, but to work around it, we publish integration-specific artifacts which provide isolation based on the supported integration for users who only want support for one integration.
In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed. I believe that it is a nice getting started point where you can put in a simple dependency name to get support for any integration, but beyond getting started, most users are much better off using integration specific artifacts for their projects.
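To illustrate the point about integration-specific artifacts, a hedged sketch of what a consumer's build might declare; the version is illustrative and the coordinates are only assumed to match the published artifact names:

```groovy
repositories { mavenCentral() }

dependencies {
    // Integration-specific artifact: isolation per integration, pulls in only the Spark support
    implementation 'org.elasticsearch:elasticsearch-spark-20_2.11:7.3.0'

    // The project-wide artifact, by contrast, bundles every integration (MR, Hive, Pig, Spark, Storm):
    // implementation 'org.elasticsearch:elasticsearch-hadoop:7.3.0'
}
```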
> I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.
Right. Folks bringing this in will now need to explicitly exclude this dependency rather than explicitly include it. If you don't feel this is going to be terribly disruptive then 👍.
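For example, a minimal Gradle sketch of what that explicit exclusion could look like for a consumer who wants to keep Spark SQL off their classpath; coordinates and version are illustrative:

```groovy
dependencies {
    implementation('org.elasticsearch:elasticsearch-spark-20_2.11:7.3.0') {
        // Keep the (now transitive, compile-scoped) Spark SQL module out of this build
        exclude group: 'org.apache.spark', module: 'spark-sql_2.11'
    }
}
```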
> the project-wide pom
What is the "project-wide pom" and what specifically is its purpose? Would a BOM be a more appropriate solution if you want folks to be able to import a simple POM that defines things like common dependency versions?
> In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed.
FWIW, Gradle has support specifically for publishing BOMs. We can still provide this kind of thing, but in a more appropriate form than a POM that declares nothing but optional dependencies. This is better modeled by the `<dependencyManagement>` block of a POM.
https://docs.gradle.org/current/userguide/java_platform_plugin.html
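A minimal sketch of what such a Gradle-published BOM could look like, assuming a hypothetical `elasticsearch-hadoop-bom` subproject; the artifact coordinates and versions are illustrative:

```groovy
// build.gradle of a hypothetical 'elasticsearch-hadoop-bom' subproject
plugins {
    id 'java-platform'
}

dependencies {
    constraints {
        // Version constraints only: consumers importing this platform get aligned versions,
        // but nothing is added to their dependency graph unless they declare it themselves.
        api 'org.elasticsearch:elasticsearch-hadoop-mr:7.3.0'
        api 'org.elasticsearch:elasticsearch-spark-20_2.11:7.3.0'
        api 'org.elasticsearch:elasticsearch-hadoop-hive:7.3.0'
    }
}
```

A consumer could then import it with `implementation platform('org.elasticsearch:elasticsearch-hadoop-bom:7.3.0')` and omit versions on the individual artifacts they declare.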
The conversion of `provided` dependencies to `api` is based on (my understanding of) what Gradle expects an `api` dependency to be. In this case, I understood it as "If you want to compile code that is going to use the public API for this, you will need this library as well, since it is exposed through the API."
For MR, this means you need Hadoop on the compile classpath because you'll be interacting with input and output format classes that extend, expect, and return Hadoop classes. For Spark, this includes the Spark frameworks, since interacting with ES-Hadoop requires you to pass your Spark Contexts/RDDs/DataFrames to the library.
Hive and Pig are where these things get fuzzy. You don't write Java for Hive or Pig; instead you write a parsed query language. For this reason, I took the approach of "If you were to build a library for those integrations that depended on ES-Hadoop, which things would need to be API dependencies?" For that, I landed on the Hive and Pig dependencies that we use as inputs and outputs from what might reasonably be considered the API of the project (the input and output formats), though even that may be considered somewhat dodgy. I'm open to feedback on it for sure.
This is a correct understanding. The difference between `api` and `implementation` is exactly this: if I'm not writing Java code that uses an API, then this isn't an API dependency. These dependencies are "inputs", but they are runtime inputs. That is, if the dependency changes, I don't need to recompile anything, but runtime behavior might change. At least that's how I understand what you're describing.
For the uberjar artifact, the integrations will all be marked as runtime only.
I've made some changes to this - converted the specifications for most of the dependencies to `implementation` instead of `api`. The remaining `api` dependencies are for the Hadoop client in the mr project, the core Spark library in the spark projects, and the Storm API in the storm project. For the uberjar/uberpom we release, these have all been converted to `implementation` instead of `api`, which translates them to runtime in the POM that is created. This should keep the dependencies pulled into downstream projects for compilation to a minimum, mostly dependencies that downstream projects will already be depending on for compilation.
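As a concrete illustration of that split, a hedged sketch written as root-build-script cross-configuration; the project paths (`:mr`, `:dist`), coordinates, and versions are illustrative, and `java-library` is assumed to be applied to both projects:

```groovy
project(':mr') {
    dependencies {
        // Hadoop types appear in the public signatures (input/output formats),
        // so consumers need them on their compile classpath too
        api 'org.apache.hadoop:hadoop-client:2.7.7'
    }
}

project(':dist') {
    dependencies {
        // In the uberjar/uberpom, the same library is demoted to implementation,
        // which surfaces as <scope>runtime</scope> in the generated POM
        implementation 'org.apache.hadoop:hadoop-client:2.7.7'
    }
}
```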
Few comments, otherwise LGTM.
Also fix a bug where test jars could be added to the dist zip.
All previously referenced dependencies in these configurations have been updated to be listed under either `api`, `implementation`, or `compileOnly`.
Relates #1423
I don't understand anything about Gradle scopes/configurations, so I might be a little ignorant about what is happening here... but I know the standard is to have Spark dependencies be provided in POMs, and I would expect such. E.g., if you build a project that runs on Spark, you build an assembly jar (or equivalent) that should not include Spark itself (it can be a real nasty source of bugs if you do include Spark).
@koertkuipers This change was made with the expectation that most Spark deployments use spark-submit arguments to provide the jars for execution of a job rather than repackaging all dependencies into a single distributable. We went down this path with the thought that if users were creating a single distributable, they'd be taking a more direct approach to their build process and thus shouldn't have too much of a problem excluding Spark as a runtime dependency from the ES-Spark library. Happy to discuss the issue further though if this is causing usability concerns for you.
Hi @jbaiera, what is the scenario where having Spark as a compile-time/runtime dependency is useful?
Going off what I remember from April, the driving force for this change was two-fold: ES-Hadoop was being updated to better comply with Gradle's project modeling, and the plugin that provided this functionality is a third-party plugin that we have little to no control over. A lot of changes were happening in order to support cross compilation of Scala/Spark artifacts, and keeping the community plugin that added the `provided` and `optional` configurations working alongside those changes was a concern.
At the time it seemed logical that a project depending on ES-Hadoop/Spark would already have defined Spark as a provided dependency, and thus having Spark as a compile-time dependency of the library (which would end up transitive) would be relatively innocuous to the end-user project development environment.
As for the topic of deployments, my expectation at the time was that most deployments make use of the spark-submit jar arguments to specify their dependencies. This was informed mostly by my experiences with the community over the years: using the spark-submit jar argument was the more common approach. Users who bundled their application and libraries together would just need to exclude Spark from that artifact, a small extra step.
With this line of thinking, we removed the plugin, set the scopes to what would be the most reasonable scopes Gradle could support out of the box, and moved on with our refactoring. Several months after this change though, I'm a bit more skeptical that the spark-submit jar model of deployment is as ubiquitous as previously thought. I'll open an issue for returning the dependencies to `provided` scope.
@koertkuipers I've opened an issue for this: #1541
@jbaiera Thank you for that! You are right that it is just a few extra excludes; I just did this yesterday. I started with making elasticsearch-spark simply an intransitive dependency in sbt, but then realized that such an approach cannot be expressed correctly in the POM file my project generates. So I ended up listing all the elasticsearch-spark dependencies explicitly for exclusion. No big deal, but a little fragile. I prefer to keep Spark and Hadoop out of the compile/runtime scope completely at all times, since they are a bit of a transitive kitchen sink, which makes it hard to understand what the actual dependencies are that will end up in the assembly/fat jar.
This PR removes usages of the PropDeps plugin from Gradle. As a result, the `provided` and `optional` configurations have been removed. All previously referenced dependencies in these configurations have been updated to be listed under either `api`, `implementation`, or `compileOnly`.