Remove provided optional configs #1452


Merged
jbaiera merged 8 commits from remove-provided-optional-configs into elastic:master on Apr 22, 2020

Conversation

jbaiera (Member) commented Mar 24, 2020

This PR removes usages of the PropDeps plugin from Gradle. As a result, provided and optional configurations have been removed. All previously referenced dependencies in these configurations have been updated to be listed under either api, implementation, or compileOnly.
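For illustration, the mapping in a module's build.gradle looks roughly like the following (a sketch only; the coordinates and version properties shown are placeholders rather than the exact declarations in this PR):

    dependencies {
        // Previously 'provided' and exposed through public signatures: now 'api',
        // so it ends up on the consumer's compile classpath.
        api "org.apache.hadoop:hadoop-client:${hadoopVersion}"

        // Previously 'provided' or 'optional' but only used internally: 'implementation'
        // surfaces as runtime scope in the published POM.
        implementation "commons-httpclient:commons-httpclient:3.1"

        // Needed only to compile this module; never published to consumers.
        compileOnly "org.apache.spark:spark-sql_2.11:${sparkVersion}"
    }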

mark-vieira (Contributor) left a comment

Have we tested how this works for consumers? I see we've converted a lot of provided dependencies to api, which is very different. This will now result in a POM where these things are marked <scope>compile</scope> instead of <scope>provided</scope>, which has considerably different semantics.

@@ -356,17 +345,6 @@ class BuildPlugin implements Plugin<Project> {
dep.scope == 'test' || dep.artifactId == 'elasticsearch-hadoop-mr'
}

// Mark the optional dependencies to actually be optional
generatedPom.dependencies.findAll { it.scope == 'optional' }.each {
mark-vieira (Contributor) commented

How are we handling this now? The generated POMs are going to be different now. Dependencies that were previously marked as optional are now going to be brought in by consumers by default. Is this fine or expected behavior?
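(For reference, the removed block presumably continued along these lines; this is a reconstruction based on the visible findAll line and the preserved comment, not the verbatim code:)

    generatedPom.dependencies.findAll { it.scope == 'optional' }.each {
        // Presumed intent: turn the placeholder 'optional' scope into a real Maven
        // scope and flag the dependency as <optional>true</optional> in the POM.
        it.scope = 'compile'
        it.optional = true
    }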

jbaiera (Member, Author) replied

We used optional primarily in two places:

First, in the Spark projects, where if you were only using RDDs you didn't necessarily need to pull in the SQL libraries. That said, both core Spark and Spark SQL were listed as provided, and thus both would always be on the classpath when launching via spark-submit. In terms of a strict build footprint, this may affect those who are strictly avoiding SQL, though a decent amount of Spark development is done in live environments/notebooks/platforms. I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.

Second, in the project-wide pom, where we have the code for every integration. The expectation here, I assume, was that every integration was optional within the project-wide artifact, and thus each integration's dependency was marked as optional. This meant that you could pull in the root project's artifact for Spark and not need to pull in Pig dependencies. This might be the biggest issue to address in removing optional-scope dependencies, but to work around it, we publish integration-specific artifacts that isolate each supported integration for users who only want support for one of them.

In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed. I believe that it is a nice getting started point where you can put in a simple dependency name to get support for any integration, but beyond getting started, most users are much better off using integration specific artifacts for their projects.
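For example, a consumer who only needs one integration can depend on the integration-specific artifact instead of the project-wide one; in Gradle that might look like this (a sketch; the version and Scala/Spark suffix are placeholders):

    dependencies {
        // Project-wide artifact: a convenient starting point, but it now drags in
        // dependencies for every integration.
        // implementation "org.elasticsearch:elasticsearch-hadoop:x.y.z"

        // Integration-specific artifact: only the Spark support and its dependencies.
        implementation "org.elasticsearch:elasticsearch-spark-20_2.11:x.y.z"
    }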

mark-vieira (Contributor) replied

I don't expect this to be a widespread problem since those who are actively trying to avoid SQL should be able to ignore this dependency with their own build tools.

Right. Folks bringing this in will now need to explicitly exclude this dependency rather than explicitly include it. If you don't feel this is going to be terribly disruptive then 👍.

the project-wide pom

What is the "project-wide pom" and what specifically is its purpose? Would a BOM be a more appropriate solution if you want folks to be able to import a simple POM that defines things like common dependency versions?

In fact, I'm willing to have a discussion on whether or not the project-wide artifact is even needed.

FWIW, Gradle has support specifically for publishing BOMs. We can still provide this kind of thing, but in a more appropriate form than a POM that declares nothing but optional dependencies. This is better modeled by the <dependencyManagement> block of a POM.

https://docs.gradle.org/current/userguide/java_platform_plugin.html
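A BOM published with that plugin would be declared roughly like this (a sketch, assuming the java-platform and maven-publish plugins and placeholder coordinates):

    plugins {
        id 'java-platform'
        id 'maven-publish'
    }

    dependencies {
        constraints {
            // Version recommendations only; nothing is added to the consumer's
            // classpath unless they declare the dependency themselves.
            api "org.elasticsearch:elasticsearch-hadoop-mr:${project.version}"
            api "org.elasticsearch:elasticsearch-spark-20_2.11:${project.version}"
        }
    }

    publishing {
        publications {
            bom(MavenPublication) {
                // Publishes the constraints as a <dependencyManagement> block.
                from components.javaPlatform
            }
        }
    }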

jbaiera (Member, Author) commented Mar 24, 2020

The conversion of 'provided' dependencies to 'api' is based on (my understanding of) what Gradle expects an 'api' dependency to be. In this case, I understood it as "If you want to compile code that is going to use the public API for this, you will need this library as well since it is exposed through the API."

For MR, this means you need to have Hadoop on the compile classpath because you'll be interacting with input and output format classes that extend, expect, and return Hadoop classes.

For Spark, this includes the Spark frameworks since interacting with ES-Hadoop will require you to pass your Spark Contexts/RDDs/Dataframes to the library.

Hive and Pig are where things get fuzzy. You don't write Java for Hive or Pig; instead, you write a query language that gets parsed. For this reason, I took the approach of "If you were to build a library for those integrations that depended on ES-Hadoop, which things would need to be API dependencies?"

For that, I landed on the Hive and Pig dependencies that we use as inputs and outputs from what might reasonably be considered the API of the project (the input and output formats), though even that may be somewhat dodgy. I'm open to feedback on it for sure.

mark-vieira (Contributor) commented

I understood it as "If you want to compile code that is going to use the public API for this, you will need this library as well since it is exposed through the API."

This is a correct understanding. The difference between api and provided is how the consumer's runtime classpath is affected. An api dependency is effectively <scope>compile</scope>, which means it will be on both the compile and runtime classpaths of the downstream consumer. With provided, the dependency is only on the compile classpath, and it's expected that the runtime environment will already have it available somehow.

For that, I landed on the Hive and Pig dependencies that we use as inputs and outputs from what might reasonably be considered the API of the project (the input and output formats), though even that may be somewhat dodgy.

If I'm not writing Java code that uses an API, then this isn't an API dependency. These dependencies are "inputs", but they are runtime inputs. That is, if the dependency changes, I don't need to recompile anything, but runtime behavior might change. At least, that's how I understand what you're describing. In any case, the implementation vs provided distinction remains. Using provided means we aren't telling consumers to put this on their runtime classpath; now we are. Maybe that's ok?
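In Gradle terms, the consumer-facing difference between the configurations looks roughly like this (a sketch; the coordinates are placeholders):

    dependencies {
        // api -> <scope>compile</scope> in the POM: on the consumer's compile
        // and runtime classpaths.
        api "org.apache.hadoop:hadoop-client:${hadoopVersion}"

        // implementation -> <scope>runtime</scope> in the POM: runtime classpath only.
        implementation "commons-codec:commons-codec:1.11"

        // compileOnly -> not published at all: the closest out-of-the-box analogue
        // to 'provided', except consumers never see it in the POM.
        compileOnly "org.apache.spark:spark-core_2.11:${sparkVersion}"
    }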

jbaiera (Member, Author) commented Apr 7, 2020

I've made some changes to this: most of the dependencies are now declared as implementation instead of api. The remaining api dependencies are the Hadoop client in the mr project, the core Spark library in the spark projects, and the Storm API in the storm project. For the uberjar/uberpom we release, these have all been converted to implementation instead of api, which translates them to runtime scope in the generated POM. This should keep the dependencies pulled into downstream projects for compilation to a minimum, mostly dependencies that downstream projects will already be depending on for compilation.
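For instance, the uberjar/uberpom project's declarations would be shaped roughly like this (a sketch, assuming a hypothetical ':dist' project path; the coordinates are illustrative):

    project(':dist') {
        dependencies {
            // Declared as implementation rather than api, so the published POM
            // lists these with runtime scope and keeps the downstream compile
            // classpath minimal.
            implementation "org.apache.hadoop:hadoop-client:${hadoopVersion}"
            implementation "org.apache.spark:spark-core_2.11:${sparkVersion}"
        }
    }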

jbaiera requested a review from mark-vieira on April 7, 2020
mark-vieira (Contributor) left a comment
Few comments, otherwise LGTM.

Also fix a bug where test jars could be added to the dist zip.
jbaiera merged commit 66354d4 into elastic:master on Apr 22, 2020
jbaiera deleted the remove-provided-optional-configs branch on April 22, 2020
jbaiera added a commit that referenced this pull request on May 13, 2020:
All previously referenced dependencies in these configurations have been updated to be listed under either api, implementation, or compileOnly.
jbaiera (Member, Author) commented Jul 14, 2020

Relates #1423

koertkuipers commented Sep 17, 2020

I don't understand anything about Gradle scopes/configurations, so I might be a little ignorant about what is happening here, but I know the standard is to have Spark dependencies be provided in POMs, and I would expect the same here. E.g., if you build a project that runs on Spark, you build an assembly jar (or equivalent) that should not include Spark itself (it can be a real nasty source of bugs if you do include Spark).
So the POMs that result from this change simply seem incorrect to me, and I will work around it using exclusion rules. It's no big deal, just a little confusing.
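(As an aside, the workaround would look something like the following in Gradle; a sketch of the analogous exclusion only, since koertkuipers is using sbt exclusion rules, and the coordinates here are placeholders:)

    dependencies {
        implementation("org.elasticsearch:elasticsearch-spark-20_2.11:x.y.z") {
            // Keep Spark out of the assembly jar; the cluster supplies it at runtime.
            exclude group: "org.apache.spark"
        }
    }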

jbaiera (Member, Author) commented Sep 18, 2020

@koertkuipers This change was made with the expectation that most Spark deployments use spark-submit arguments to provide the jars for job execution, rather than repackaging all dependencies into a single distributable. We went down this path with the thought that if users were creating a single distributable, they'd be taking a more direct approach to their build process and thus shouldn't have too much of a problem excluding Spark as a runtime dependency of the ES-Spark library. Happy to discuss the issue further though if this is causing usability concerns for you.

koertkuipers replied
Hi @jbaiera,
It is not clear to me how using spark-submit arguments to provide the jars for execution makes it OK for Spark itself to be a compile/runtime dependency.
I do not use this method of adding jars, but I can think of two possible scenarios:

  1. When adding jars with spark-submit, (transitive) dependencies are also included. In this case it is not a good idea to pull in Spark itself as a (transitive) dependency; the Spark jars are added to the classpath by spark-submit already.
  2. When adding jars with spark-submit, (transitive) dependencies are not included. In this case having Spark itself as a dependency doesn't do any good either.

What is the scenario where having Spark as a compile-time/runtime dependency is useful?

jbaiera (Member, Author) commented Oct 7, 2020

What is the scenario where having Spark as a compile-time/runtime dependency is useful?

Going off what I remember from April, the driving force for this change was twofold: ES-Hadoop was being updated to better comply with Gradle's project modeling, and the plugin that provided this functionality is a third-party plugin that we have little to no control over. A lot of changes were happening in order to support cross-compilation of Scala/Spark artifacts, and getting the community plugin that added the provided and optional scopes to work with the changes we were making was becoming a maintenance nightmare.

At the time it seemed logical that a project depending on ES-Hadoop/Spark would already have defined Spark as a provided dependency, and thus having Spark as a compile time dependency of the library (which would end up transitive) would be relatively innocuous to the end-user project development environment.

As for the topic of deployments, my expectation at the time was that most deployments make use of the spark-submit jar arguments to specify their dependencies. This was informed mostly by my experiences with the community over the years: using the spark-submit jar argument was the more common approach. Users who bundled their application and libraries together would just need to exclude Spark from that artifact. A small extra step.

With this line of thinking, we removed the plugin, set the scopes to the most reasonable ones Gradle supports out of the box, and moved on with our refactoring. Several months after this change, though, I'm a bit more skeptical that the spark-submit jar model of deployment is as ubiquitous as previously thought.

I'll open an issue for returning the dependencies to provided. I'm not sure how reasonable it would be to also return the previously optional dependencies, which at the time were the Spark SQL libraries, as well as every integration library in the uberjar. Those would most likely come back as provided if we went forward with that change.
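One possible way to restore provided scope without the third-party plugin would be to rewrite the relevant scopes in the generated POM, for example (a sketch only, assuming the maven-publish plugin is applied; the group filter is a placeholder, and this is not what the build currently does):

    publishing.publications.withType(MavenPublication).configureEach {
        pom.withXml {
            asNode().dependencies.dependency
                .findAll { it.groupId.text() == 'org.apache.spark' }
                .each { dep ->
                    // Assumes the generated POM writes an explicit <scope> element.
                    dep.scope[0].setValue('provided')
                }
        }
    }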

jbaiera (Member, Author) commented Oct 7, 2020

@koertkuipers I've opened an issue for this #1541

koertkuipers replied
@jbaiera Thank you for that!

You are right that it is just a few extra excludes. I just did this yesterday. I started by making elasticsearch-spark simply an intransitive dependency in sbt, but then realized that such an approach cannot be expressed correctly in the POM file my project generates, so I ended up listing all the elasticsearch-spark dependencies explicitly for exclusion. No big deal, but a little fragile.

I prefer to keep Spark and Hadoop out of the compile/runtime scopes completely at all times, since they are a bit of a transitive kitchen sink, which makes it hard to understand what the actual dependencies are that will end up in the assembly/fat jar.
