[docs] Clean up cross-repo links #2350

Merged

merged 1 commit on Mar 3, 2025

2 changes: 1 addition & 1 deletion docs/reference/apache-hive-integration.md
@@ -258,4 +258,4 @@ While {{es}} understands Hive types up to version 2.0, it is backwards compatibl
::::


It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `string` or an `array`.
It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `string` or an `array`.
18 changes: 9 additions & 9 deletions docs/reference/apache-spark-support.md
@@ -267,7 +267,7 @@ saveToEs(javaRDD, "my-collection-{media_type}/doc"); <1>

#### Handling document metadata [spark-write-meta]

{{es}} allows each document to have its own [metadata](elasticsearch://docs/reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from the document they belong to. Furthermore, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). In other words, for `RDD`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source.
{{es}} allows each document to have its own [metadata](elasticsearch://reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from the document they belong to. Furthermore, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). In other words, for `RDD`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source.

The metadata is described through the `Metadata` Java [enum](http://docs.oracle.com/javase/tutorial/java/javaOO/enum.md) within the `org.elasticsearch.spark.rdd` package, which identifies its type - `id`, `ttl`, `version`, etc. Thus an `RDD`'s keys can be a `Map` containing the `Metadata` for each document and its associated values. If the `RDD` key is not of type `Map`, elasticsearch-hadoop will consider the object as representing the document id and use it accordingly. This sounds more complicated than it is, so let us see some examples.
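
A minimal sketch of this pattern, assuming the implicit `saveToEsWithMeta` enrichment from `org.elasticsearch.spark._`; the resource name, document ids and {{es}} address are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                  // adds saveToEsWithMeta to pair RDDs
import org.elasticsearch.spark.rdd.Metadata._     // ID, TTL, VERSION, ...

val conf = new SparkConf().setAppName("es-meta-sketch")
  .set("es.nodes", "localhost:9200")              // assumed ES address
val sc = new SparkContext(conf)

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// the key of each tuple carries the metadata, the value is the document source
val airports = sc.makeRDD(Seq(
  (Map(ID -> 1), otp),
  (Map(ID -> 2, VERSION -> "23"), sfo)
))

airports.saveToEsWithMeta("airports/2015")        // hypothetical index/type
```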

@@ -514,7 +514,7 @@ When dealing with multi-value/array fields, please see [this](/reference/mapping
::::


elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:
elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:

| Scala type | {{es}} type |
| --- | --- |
@@ -552,7 +552,7 @@ in addition, the following *implied* conversion applies for Java types:

The conversion is done as a *best* effort; built-in Java and Scala types are guaranteed to be properly converted, however there are no guarantees for user types, whether in Java or Scala. As mentioned in the tables above, when a `case` class is encountered in Scala or a `JavaBean` in Java, the converters will try to `unwrap` its content and save it as an `object`. Note this works only for top-level user objects - if the user object has other user objects nested inside, the conversion is likely to fail since the converter does not perform nested `unwrapping`. This is done on purpose since the converter has to *serialize* and *deserialize* the data and user types introduce ambiguity due to data loss; this can be addressed through some type of mapping, however that takes the project way too close to the realm of ORMs and arguably introduces too much complexity for little to no gain; thanks to the processing functionality in Spark and the pluggability in elasticsearch-hadoop one can easily transform objects into other types, if needed, with minimal effort and maximum control.
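
For instance, a minimal sketch of the top-level `unwrap` behaviour described above; the case class and resource name are hypothetical, and nested user types would not be unwrapped:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                  // adds saveToEs to RDDs

// a flat, top-level case class: its fields are unwrapped into an ES object
case class Airport(iata: String, name: String, runways: Int)

val sc = new SparkContext(new SparkConf().setAppName("es-caseclass-sketch"))

val airports = sc.makeRDD(Seq(
  Airport("OTP", "Otopeni", 2),
  Airport("SFO", "San Francisco", 4)
))

airports.saveToEs("airports/2015")                // hypothetical index/type
```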

It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `String` or a `Traversable`.
It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `String` or a `Traversable`.


## Spark Streaming support [spark-streaming]
@@ -854,7 +854,7 @@ jssc.start();

#### Handling document metadata [spark-streaming-write-meta]

{{es}} allows each document to have its own [metadata](elasticsearch://docs/reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from the document they belong to. Furthermore, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs).
{{es}} allows each document to have its own [metadata](elasticsearch://reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from the document they belong to. Furthermore, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs).

This is no different in Spark Streaming. For `DStream`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source.
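
A minimal sketch, assuming the implicit `saveToEsWithMeta` enrichment from `org.elasticsearch.spark.streaming._`; the queue-based stream and resource name are only illustrative:

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark.streaming._        // adds saveToEsWithMeta to DStreams
import org.elasticsearch.spark.rdd.Metadata._     // ID, TTL, VERSION, ...

val sc  = new SparkContext(new SparkConf().setAppName("es-stream-meta-sketch"))
val ssc = new StreamingContext(sc, Seconds(1))

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// each batch is a pair RDD: the key holds the metadata, the value the document
val rddQueue = mutable.Queue(sc.makeRDD(Seq(
  (Map(ID -> 1), otp),
  (Map(ID -> 2, VERSION -> "23"), sfo)
)))

ssc.queueStream(rddQueue).saveToEsWithMeta("airports/2015")  // hypothetical resource

ssc.start()
ssc.awaitTermination()
```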

@@ -1064,7 +1064,7 @@ in addition, the following *implied* conversion applies for Java types:
| `Map` | `object` |
| *Java Bean* | `object` (see `Map`) |

It is worth re-mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `String` or a `Traversable`.
It is worth re-mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `String` or a `Traversable`.


## Spark SQL support [spark-sql]
@@ -1374,7 +1374,7 @@ The connector translates the query into:
}
```

Furthermore, the pushdown filters can work on `analyzed` terms (the default) or can be configured to be *strict* and provide `exact` matches (working only on `not-analyzed` fields). Unless one manually specifies the mapping, it is highly recommended to leave the defaults as they are. This and other topics are discussed at length in the {{es}} [Reference Documentation](elasticsearch://docs/reference/query-languages/query-dsl-term-query.md).
Furthermore, the pushdown filters can work on `analyzed` terms (the default) or can be configured to be *strict* and provide `exact` matches (working only on `not-analyzed` fields). Unless one manually specifies the mapping, it is highly recommended to leave the defaults as they are. This and other topics are discussed at length in the {{es}} [Reference Documentation](elasticsearch://reference/query-languages/query-dsl-term-query.md).

Note that `double.filtering`, available since elasticsearch-hadoop 2.2 for Spark 1.6 or higher, allows filters that are already pushed down to {{es}} to be processed/evaluated by Spark as well (the default) or not. Turning this feature off, especially when dealing with large data sizes, speeds things up. However, one should pay attention to the semantics, as turning this off might return different results (depending on how the data is indexed, `analyzed` vs `not_analyzed`). In general, when turning *strict* on, one can disable `double.filtering` as well.
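
A minimal sketch of wiring these options in when reading through the Spark SQL data source; the index name is hypothetical and the option values simply mirror the behaviour described above:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???                  // an existing SQLContext

val people = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")                     // translate Spark SQL filters into ES queries (default)
  .option("strict", "true")                       // exact matches only, for not-analyzed fields
  .option("double.filtering", "false")            // skip re-evaluating pushed-down filters in Spark
  .load("spark/people")                           // hypothetical index/type

people.filter("lastname = 'Smith'").show()
```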

@@ -1458,7 +1458,7 @@ val smiths = sqlContext.esDF("spark/people","?q=Smith") <1>

In some cases, especially when the index in {{es}} contains a lot of fields, it is desirable to create a `DataFrame` that contains only a *subset* of them. While one can modify the `DataFrame` (by working on its backing `RDD`) through the official Spark API or through dedicated queries, elasticsearch-hadoop allows the user to specify what fields to include and exclude from {{es}} when creating the `DataFrame`.

Through the `es.read.field.include` and `es.read.field.exclude` properties, one can indicate what fields to include or exclude from the index mapping. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://docs/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified, meaning all properties/fields are included and no properties/fields are excluded. Note that these properties can include leading and trailing wildcards. Including part of a hierarchy of fields without a trailing wildcard does not imply that the entire hierarchy is included. However, in most cases it does not make sense to include only part of a hierarchy, so a trailing wildcard should be included.
Through the `es.read.field.include` and `es.read.field.exclude` properties, one can indicate what fields to include or exclude from the index mapping. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified, meaning all properties/fields are included and no properties/fields are excluded. Note that these properties can include leading and trailing wildcards. Including part of a hierarchy of fields without a trailing wildcard does not imply that the entire hierarchy is included. However, in most cases it does not make sense to include only part of a hierarchy, so a trailing wildcard should be included.
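
A minimal sketch of passing both properties when creating the `DataFrame`, assuming the `esDF(resource, config)` overload provided by `org.elasticsearch.spark.sql._`; the index, field names and patterns are hypothetical:

```scala
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._              // adds esDF to SQLContext

val sqlContext: SQLContext = ???                  // an existing SQLContext

val people = sqlContext.esDF("spark/people", Map(
  "es.read.field.include" -> "*name, address.*",  // keep name fields and the whole address hierarchy
  "es.read.field.exclude" -> "*.raw"              // drop any .raw sub-fields
))

people.printSchema()
```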

For example:

@@ -1510,7 +1510,7 @@ When dealing with multi-value/array fields, please see [this](/reference/mapping
::::


elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:
elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:

While Spark SQL [`DataType`s](https://spark.apache.org/docs/latest/sql-programming-guide.md#data-types) have an equivalent in both Scala and Java and thus the [RDD](#spark-type-conversion) conversion can apply, there are slightly different semantics - in particular with the `java.sql` types due to the way Spark SQL handles them:

@@ -1716,7 +1716,7 @@ If automatic index creation is used, please review [this](/reference/mapping-typ
::::


elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) as shown in the table below:
elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) as shown in the table below:

While Spark SQL [`DataType`s](https://spark.apache.org/docs/latest/sql-programming-guide.md#data-types) have an equivalent in both Scala and Java and thus the [RDD](#spark-type-conversion) conversion can apply, there are slightly different semantics - in particular with the `java.sql` types due to the way Spark SQL handles them:

2 changes: 1 addition & 1 deletion docs/reference/cloudrestricted-environments.md
@@ -25,7 +25,7 @@ As the nodes are not accessible from outside, fix the problem by assigning them

## Use a dedicated series of proxies/gateways [_use_a_dedicated_series_of_proxiesgateways]

If exposing the cluster is not an option, one can choose to use a proxy or a VPN so that the Hadoop cluster can transparently access the {{es}} cluster in a different network. By using an indirection layer, the two networks can *transparently* communicate with each other. Do note that usually this means the two networks need to know how to properly route the IPs from one to the other. If the proxy/VPN solution does not handle this automatically, {{es}} might help through its [network settings](elasticsearch://docs/reference/elasticsearch/configuration-reference/networking-settings.md), in particular `network.host` and `network.publish_host`, which control what IP the nodes bind to and, in particular, *publish* or *advertise* to their clients. This allows a certain publicly-accessible IP to be broadcast to the clients to allow access to a node, even if the node itself does not run on that IP.
If exposing the cluster is not an option, one can choose to use a proxy or a VPN so that the Hadoop cluster can transparently access the {{es}} cluster in a different network. By using an indirection layer, the two networks can *transparently* communicate with each other. Do note that usually this means the two networks need to know how to properly route the IPs from one to the other. If the proxy/VPN solution does not handle this automatically, {{es}} might help through its [network settings](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md), in particular `network.host` and `network.publish_host`, which control what IP the nodes bind to and, in particular, *publish* or *advertise* to their clients. This allows a certain publicly-accessible IP to be broadcast to the clients to allow access to a node, even if the node itself does not run on that IP.


## Configure the connector to run in WAN mode [_configure_the_connector_to_run_in_wan_mode]