
Adding support for update output mode to structured streaming #1839

Merged

Conversation

masseyke
Member

This commit adds support for "update" as the output mode for Spark Structured Streaming writes to Elasticsearch.
Closes #1123
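
For anyone skimming, here is a minimal sketch of how the new mode is used from the Spark side (the streaming Dataset, index name, and id column below are illustrative, not taken from this PR):

// Illustrative only: assumes a streaming Dataset `people` with a `name` column to use as the document id.
val query = people.writeStream
  .outputMode("update")                              // the mode added by this PR
  .option("checkpointLocation", "/save/location")
  .option("es.mapping.id", "name")                   // required so documents can be upserted by id
  .format("es")
  .start("people")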

@masseyke
Member Author

I'm new to Structured Streaming, but I think this does what we'd expect: if you set the output mode to "update", documents are inserted if they didn't exist before, and only documents that have changed are updated. That matches my understanding of the expected behavior from https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts. In addition to the integration test that's part of this PR, I ran a manual test (using the Docker image at https://github.com/masseyke/es-spark-docker):

import org.apache.spark.sql.SparkSession
case class Person(id: String, name: String, surname: String, age: Int)
// Stream CSV lines from HDFS and parse each line into a Person
val people = spark.readStream.textFile("/test/*").map(_.split(",")).map(p => Person(p(0), p(1).trim, p(2).trim, p(3).trim.toInt))
// Write to the "people" index in append mode, using the name column as the document id
val appendStream = people.writeStream.outputMode("append").option("checkpointLocation", "/save/location").option("es.mapping.id", "name").format("es").start("people")

Then I put a 4-row, 4-column CSV file in HDFS at /test/ and made sure those four rows showed up as documents in Elasticsearch. Back in the spark shell:

appendStream.stop

Then I added a second CSV file to /test/ in HDFS that had the same rows as the first file, except that one of them had an updated value. Back in the spark shell:

// Restart the write in "update" mode; only changed documents should be written
val updateStream = people.writeStream.outputMode("update").option("checkpointLocation", "/save/location").option("es.mapping.id", "name").format("es").start("people")

Then, with a query to Elasticsearch that returns document versions (GET people/_search?version=true), I verified that only that one document had been updated.
As @jbaiera mentioned in #1123, I'm intentionally failing fast if "es.mapping.id" is not provided, since we can't really do updates otherwise. I'm also intentionally failing if "es.write.operation" is set to something other than "upsert", since based on the Spark documentation upsert is what we want here. If it is unset, I set it to "upsert".
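
To make that concrete, here is a rough sketch of what those settings checks amount to (an illustration of the behavior described above, not the connector's actual code; the method name and error messages are hypothetical):

// Hypothetical sketch of the validation described above, not the actual connector code.
def validateUpdateModeSettings(settings: Map[String, String]): Map[String, String] = {
  if (!settings.contains("es.mapping.id")) {
    throw new IllegalArgumentException("The update output mode requires es.mapping.id to be set")
  }
  settings.get("es.write.operation") match {
    case Some(op) if op != "upsert" =>
      throw new IllegalArgumentException(s"The update output mode only supports upsert, but es.write.operation was $op")
    case Some(_) => settings                                        // already upsert
    case None    => settings + ("es.write.operation" -> "upsert")   // default to upsert
  }
}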

masseyke marked this pull request as ready for review January 14, 2022 22:00

@jbaiera (Member) left a comment

LGTM!

/**
 * Waits until all inputs are processed on the streaming query, but leaves the query open with the listener still in place, expecting
 * another batch of inputs.
 */
def waitForPartialCompletion(): Unit = {

Nice!
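
For readers who haven't seen the test utilities, here is a rough sketch of how a helper like this could be built on Spark's StreamingQueryListener: it waits for a micro-batch that reports no new input rows, then re-arms itself, leaving the query running and the listener registered for the next batch. This is an assumption about the approach, not the code in this PR:

import java.util.concurrent.{CountDownLatch, TimeUnit}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical sketch, not the helper in this PR.
class PartialCompletionListener extends StreamingQueryListener {
  @volatile private var latch = new CountDownLatch(1)

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // A progress update with no new input rows suggests the current inputs have been drained.
    if (event.progress.numInputRows == 0) latch.countDown()
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = latch.countDown()

  // Block until the current inputs are processed, then re-arm for the next batch;
  // the streaming query itself is left running and the listener stays registered.
  def waitForPartialCompletion(timeoutSeconds: Long = 60): Unit = {
    latch.await(timeoutSeconds, TimeUnit.SECONDS)
    latch = new CountDownLatch(1)
  }
}

// Usage: register once with spark.streams.addListener(listener), write a batch of input files,
// then call listener.waitForPartialCompletion() before asserting on the documents in Elasticsearch.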

masseyke merged commit 3ef547d into elastic:master Jan 20, 2022
masseyke deleted the feature/structured-streaming-update branch January 20, 2022 14:12
masseyke added a commit that referenced this pull request Jan 24, 2022
…1878)

AbstractMROldApiSaveTest.testUpdateWithoutId broke when #1839 was merged because we are now failing earlier if
upsert is used but es.mapping.id is not set. The exception is now the same as the one you get when update is not
configured to use es.mapping.id.
Relates #1839 #69

Successfully merging this pull request may close these issues.

Support UPDATE output mode for Spark Structured Streaming