Description
Hi everyone,
What kind of issue is this?
- Bug report.
- Feature Request
Issue description
We use Spark to manipulate an array of distinct objects stored in an Elasticsearch index.
The field is mapped in the Elasticsearch index as follows:
"array_field": {
"type": "nested",
"properties": {
"property1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"property2": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"property3": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"property4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"property5": {
"type": "date"
}
}
}
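For illustration, a document indexed with this mapping looks roughly like the following (the property values below are made-up placeholders, not taken from our data):

```json
{
  "array_field": [
    {
      "property1": "some text",
      "property2": "some text",
      "property3": "some text",
      "property4": "some text",
      "property5": "2022-07-01"
    },
    {
      "property1": "other text",
      "property2": "other text",
      "property3": "other text",
      "property4": "other text",
      "property5": "2022-07-02"
    }
  ]
}
```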
When we use the Spark explode function on a dataset created by reading from Elasticsearch, the connector generates the following query:
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": [
{
"exists": {
"field": "array_field"
}
}
]
}
}
The "exists" part in the query is generated to differentiate calls of explode and explode_outer because explode drops nulls elements whereas explode_outer keeps them.
But since the field is nested, this query never matches any document: the exists filter is not wrapped in a nested query, so the dataset is always empty.
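For comparison, a filter that actually matches documents with a populated nested field would presumably need to wrap the exists inside a nested query, along these lines (this is only a sketch of the expected shape, not something the connector currently produces):

```json
"query": {
  "bool": {
    "must": [
      {
        "match_all": {}
      }
    ],
    "filter": [
      {
        "nested": {
          "path": "array_field",
          "query": {
            "exists": {
              "field": "array_field"
            }
          }
        }
      }
    ]
  }
}
```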
Steps to reproduce
- Create an index with a nested mapped field
- Put a document with a valued nested field
- Read the index from Spark into a dataset
- Call Spark explode(field) on the nested field of the dataset (a minimal sketch is given after this list)
- The dataset is empty because the generated query does not match any document
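Here is a minimal Scala sketch of the reproduction, assuming a local cluster and an index named my_index that uses the mapping above (the index name, host, and SparkSession setup are placeholders, not taken from our actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder()
  .appName("es-nested-explode-repro")
  .getOrCreate()

// Read the index through the ES-Hadoop Spark SQL data source.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  // Make sure the nested field is read back as an array of structs.
  .option("es.read.field.as.array.include", "array_field")
  .load("my_index")

// explode() adds an IsNotNull filter on array_field, which the connector
// pushes down as the plain "exists" query shown above; since the field is
// nested, that query matches nothing.
val exploded = df.select(explode(col("array_field")).as("element"))

exploded.show() // empty result, even though the index contains documents
```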
Version Info
OS: Linux
JVM: 1.8
Hadoop/Spark: Spark 3.3.0
ES-Hadoop: elasticsearch-spark-30_2.12:8.2.2