
Spark connector's implementation of "explode" does not work on nested fields #2051

Open

@ThibSCH

Description

Hi everyone,

What kind of issue is this?

  • Bug report
  • Feature Request

Issue description

We use Spark to manipulate an array of distinct objects in an Elasticsearch index.
The Elasticsearch field is mapped as:

"array_field": {
        "type": "nested",
        "properties": {
          "property1": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "property2": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "property3": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "property4": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "property5": {
            "type": "date"
          }
        }
      }
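
For context, a document in this index would look like the following (the field values here are made-up examples):

{
  "array_field": [
    {
      "property1": "foo",
      "property2": "bar",
      "property3": "baz",
      "property4": "qux",
      "property5": "2022-07-01"
    }
  ]
}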

When we use the Spark explode function on a dataset created by reading from Elasticsearch, the connector generates the following query:

"query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "exists": {
            "field": "array_field"
          }
        }
      ]
    }
  } 

The "exists" part in the query is generated to differentiate calls of explode and explode_outer because explode drops nulls elements whereas explode_outer keeps them.
But since the field is a nested, the query never gets any match because it is not a nested query therefore the dataset is always empty.
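
For comparison, wrapping the exists filter in a nested query does match such documents; here is a sketch of what a working filter could look like (this is not the connector's current output):

"query": {
  "bool": {
    "must": [
      {
        "match_all": {}
      }
    ],
    "filter": [
      {
        "nested": {
          "path": "array_field",
          "query": {
            "exists": {
              "field": "array_field"
            }
          }
        }
      }
    ]
  }
}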

Steps to reproduce

  1. Create an index with a field mapped as nested
  2. Index a document with a populated nested field
  3. Read the index from Spark into a dataset
  4. Call Spark explode(field) on the nested field of the dataset
  5. The dataset is empty because the generated query does not match any document (a minimal repro sketch follows)
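
A minimal Scala sketch of the reproduction; the host and index name are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical host and index name, for illustration only.
val spark = SparkSession.builder()
  .appName("explode-nested-repro")
  .config("es.nodes", "localhost:9200")
  .getOrCreate()

// Declare array_field as an array so Spark reads it as ArrayType.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "array_field")
  .load("my_index")

// explode() causes the connector to push down the `exists` filter
// shown above, which never matches because the field is nested.
df.select(explode(df("array_field"))).show()  // empty result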

Version Info

OS: Linux
JVM: 1.8
Hadoop/Spark: Spark 3.3.0
ES-Hadoop: elasticsearch-spark-30_2.12:8.2.2
