From f98ed41fcc7e24356797f7c48b4136a8b90f74e3 Mon Sep 17 00:00:00 2001 From: Colleen McGinnis Date: Fri, 28 Feb 2025 12:32:09 -0600 Subject: [PATCH] clean up cross-repo links --- docs/reference/apache-hive-integration.md | 2 +- docs/reference/apache-spark-support.md | 18 +++++++++--------- docs/reference/cloudrestricted-environments.md | 2 +- docs/reference/configuration.md | 12 ++++++------ docs/reference/index.md | 2 +- docs/reference/kerberos.md | 2 +- docs/reference/mapping-types.md | 10 +++++----- docs/reference/mapreduce-integration.md | 4 ++-- 8 files changed, 26 insertions(+), 26 deletions(-) diff --git a/docs/reference/apache-hive-integration.md b/docs/reference/apache-hive-integration.md index 36686a93f..9f65007f8 100644 --- a/docs/reference/apache-hive-integration.md +++ b/docs/reference/apache-hive-integration.md @@ -258,4 +258,4 @@ While {{es}} understands Hive types up to version 2.0, it is backwards compatibl :::: -It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `string` or an `array`. +It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `string` or an `array`. diff --git a/docs/reference/apache-spark-support.md b/docs/reference/apache-spark-support.md index a2b5ddaeb..603cbe186 100644 --- a/docs/reference/apache-spark-support.md +++ b/docs/reference/apache-spark-support.md @@ -267,7 +267,7 @@ saveToEs(javaRDD, "my-collection-{media_type}/doc"); <1> #### Handling document metadata [spark-write-meta] -{{es}} allows each document to have its own [metadata](elasticsearch://docs/reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from their belonging document. Further more, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). In other words, for `RDD`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source. +{{es}} allows each document to have its own [metadata](elasticsearch://reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from their belonging document. Further more, one can even include/exclude what parts of the data are sent back to {{es}}. 
In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). In other words, for `RDD`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source. The metadata is described through the `Metadata` Java [enum](http://docs.oracle.com/javase/tutorial/java/javaOO/enum.md) within `org.elasticsearch.spark.rdd` package which identifies its type - `id`, `ttl`, `version`, etc…​ Thus an `RDD` keys can be a `Map` containing the `Metadata` for each document and its associated values. If `RDD` key is not of type `Map`, elasticsearch-hadoop will consider the object as representing the document id and use it accordingly. This sounds more complicated than it is, so let us see some examples. @@ -514,7 +514,7 @@ When dealing with multi-value/array fields, please see [this](/reference/mapping :::: -elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below: +elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below: | Scala type | {{es}} type | | --- | --- | @@ -552,7 +552,7 @@ in addition, the following *implied* conversion applies for Java types: The conversion is done as a *best* effort; built-in Java and Scala types are guaranteed to be properly converted, however there are no guarantees for user types whether in Java or Scala. As mentioned in the tables above, when a `case` class is encountered in Scala or `JavaBean` in Java, the converters will try to `unwrap` its content and save it as an `object`. Note this works only for top-level user objects - if the user object has other user objects nested in, the conversion is likely to fail since the converter does not perform nested `unwrapping`. This is done on purpose since the converter has to *serialize* and *deserialize* the data and user types introduce ambiguity due to data loss; this can be addressed through some type of mapping however that takes the project way too close to the realm of ORMs and arguably introduces too much complexity for little to no gain; thanks to the processing functionality in Spark and the plugability in elasticsearch-hadoop one can easily transform objects into other types, if needed with minimal effort and maximum control. -It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `String` or a `Traversable`. +It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. 
For example, based on its storage a `geo_point` might be returned as a `String` or a `Traversable`. ## Spark Streaming support [spark-streaming] @@ -854,7 +854,7 @@ jssc.start(); #### Handling document metadata [spark-streaming-write-meta] -{{es}} allows each document to have its own [metadata](elasticsearch://docs/reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from their belonging document. Further more, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). +{{es}} allows each document to have its own [metadata](elasticsearch://reference/elasticsearch/mapping-reference/document-metadata-fields.md). As explained above, through the various [mapping](/reference/configuration.md#cfg-mapping) options one can customize these parameters so that their values are extracted from their belonging document. Further more, one can even include/exclude what parts of the data are sent back to {{es}}. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied *outside* the document itself through the use of [*pair* `RDD`s](http://spark.apache.org/docs/latest/programming-guide.md#working-with-key-value-pairs). This is no different in Spark Streaming. For `DStreams`s containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source. @@ -1064,7 +1064,7 @@ in addition, the following *implied* conversion applies for Java types: | `Map` | `object` | | *Java Bean* | `object` (see `Map`) | -It is worth re-mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `String` or a `Traversable`. +It is worth re-mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `String` or a `Traversable`. ## Spark SQL support [spark-sql] @@ -1374,7 +1374,7 @@ The connector translates the query into: } ``` -Further more, the pushdown filters can work on `analyzed` terms (the default) or can be configured to be *strict* and provide `exact` matches (work only on `not-analyzed` fields). Unless one manually specifies the mapping, it is highly recommended to leave the defaults as they are. This and other topics are discussed at length in the {{es}} [Reference Documentation](elasticsearch://docs/reference/query-languages/query-dsl-term-query.md). 
+Furthermore, the pushdown filters can work on `analyzed` terms (the default) or can be configured to be *strict* and provide `exact` matches (work only on `not-analyzed` fields). Unless one manually specifies the mapping, it is highly recommended to leave the defaults as they are. This and other topics are discussed at length in the {{es}} [Reference Documentation](elasticsearch://reference/query-languages/query-dsl-term-query.md).

 Note that `double.filtering`, available since elasticsearch-hadoop 2.2 for Spark 1.6 or higher, allows filters that are already pushed down to {{es}} to be processed/evaluated by Spark as well (default) or not. Turning this feature off, especially when dealing with large data sizes speed things up. However one should pay attention to the semantics as turning this off, might return different results (depending on how the data is indexed, `analyzed` vs `not_analyzed`). In general, when turning *strict* on, one can disable `double.filtering` as well.

@@ -1458,7 +1458,7 @@ val smiths = sqlContext.esDF("spark/people","?q=Smith") <1>

 In some cases, especially when the index in {{es}} contains a lot of fields, it is desireable to create a `DataFrame` that contains only a *subset* of them. While one can modify the `DataFrame` (by working on its backing `RDD`) through the official Spark API or through dedicated queries, elasticsearch-hadoop allows the user to specify what fields to include and exclude from {{es}} when creating the `DataFrame`.

-Through `es.read.field.include` and `es.read.field.exclude` properties, one can indicate what fields to include or exclude from the index mapping. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://docs/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included and no properties/fields are excluded. Note that these properties can include leading and trailing wildcards. Including part of a hierarchy of fields without a trailing wildcard does not imply that the entire hierarcy is included. However in most cases it does not make sense to include only part of a hierarchy, so a trailing wildcard should be included.
+Through `es.read.field.include` and `es.read.field.exclude` properties, one can indicate what fields to include or exclude from the index mapping. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included and no properties/fields are excluded. Note that these properties can include leading and trailing wildcards. Including part of a hierarchy of fields without a trailing wildcard does not imply that the entire hierarchy is included. However, in most cases it does not make sense to include only part of a hierarchy, so a trailing wildcard should be included.

 For example:

@@ -1510,7 +1510,7 @@ When dealing with multi-value/array fields, please see [this](/reference/mapping

 ::::


-elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:
+elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:

 While Spark SQL [`DataType`s](https://spark.apache.org/docs/latest/sql-programming-guide.md#data-types) have an equivalent in both Scala and Java and thus the [RDD](#spark-type-conversion) conversion can apply, there are slightly different semantics - in particular with the `java.sql` types due to the way Spark SQL handles them:

@@ -1716,7 +1716,7 @@ If automatic index creation is used, please review [this](/reference/mapping-typ

 ::::


-elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) as shown in the table below:
+elasticsearch-hadoop automatically converts Spark built-in types to {{es}} [types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) as shown in the table below:

 While Spark SQL [`DataType`s](https://spark.apache.org/docs/latest/sql-programming-guide.md#data-types) have an equivalent in both Scala and Java and thus the [RDD](#spark-type-conversion) conversion can apply, there are slightly different semantics - in particular with the `java.sql` types due to the way Spark SQL handles them:

diff --git a/docs/reference/cloudrestricted-environments.md b/docs/reference/cloudrestricted-environments.md
index 147d6482b..ca96c2373 100644
--- a/docs/reference/cloudrestricted-environments.md
+++ b/docs/reference/cloudrestricted-environments.md
@@ -25,7 +25,7 @@ As the nodes are not accessible from outside, fix the problem by assigning them

 ## Use a dedicated series of proxies/gateways [_use_a_dedicated_series_of_proxiesgateways]

-If exposing the cluster is not an option, one can chose to use a proxy or a VPN so that the Hadoop cluster can transparently access the {{es}} cluster in a different network. By using an indirection layer, the two networks can *transparently* communicate with each other. Do note that usually this means the two networks would know how to properly route the IPs from one to the other. If the proxy/VPN solution does not handle this automatically, {{es}} might help through its [network settings](elasticsearch://docs/reference/elasticsearch/configuration-reference/networking-settings.md) in particular `network.host` and `network.publish_host` which control what IP the nodes bind and in particular *publish* or *advertise* to their clients. This allows a certain publicly-accessible IP to be broadcasted to the clients to allow access to a node, even if the node itself does not run on that IP.
+If exposing the cluster is not an option, one can choose to use a proxy or a VPN so that the Hadoop cluster can transparently access the {{es}} cluster in a different network. By using an indirection layer, the two networks can *transparently* communicate with each other. Do note that usually this means the two networks would know how to properly route the IPs from one to the other.
If the proxy/VPN solution does not handle this automatically, {{es}} might help through its [network settings](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md) in particular `network.host` and `network.publish_host` which control what IP the nodes bind and in particular *publish* or *advertise* to their clients. This allows a certain publicly-accessible IP to be broadcasted to the clients to allow access to a node, even if the node itself does not run on that IP. ## Configure the connector to run in WAN mode [_configure_the_connector_to_run_in_wan_mode] diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md index c1ca77792..f19cd9eb0 100644 --- a/docs/reference/configuration.md +++ b/docs/reference/configuration.md @@ -239,7 +239,7 @@ Added in 2.1. `es.mapping.include` (default none) -: Field/property to be included in the document sent to {{es}}. Useful for *extracting* the needed data from entities. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://docs/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included. +: Field/property to be included in the document sent to {{es}}. Useful for *extracting* the needed data from entities. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included. ::::{important} The `es.mapping.include` feature is ignored when `es.input.json` is specified. In order to prevent the connector from indexing data that is implicitly excluded, any jobs with these property conflicts will refuse to execute! @@ -252,7 +252,7 @@ Added in 2.1. `es.mapping.exclude` (default none) -: Field/property to be excluded in the document sent to {{es}}. Useful for *eliminating* unneeded data from entities. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://docs/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning no properties/fields are excluded. +: Field/property to be excluded in the document sent to {{es}}. Useful for *eliminating* unneeded data from entities. The syntax is similar to that of {{es}} [include/exclude](elasticsearch://reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified meaning no properties/fields are excluded. ::::{important} The `es.mapping.exclude` feature is ignored when `es.input.json` is specified. In order to prevent the connector from indexing data that is explicitly excluded, any jobs with these property conflicts will refuse to execute! @@ -320,7 +320,7 @@ Added in 2.2. `es.read.field.as.array.include` (default empty) -: Fields/properties that should be considered as arrays/lists. Since {{es}} can map one or multiple values to a field, elasticsearch-hadoop cannot determine from the mapping whether to treat a field on a document as a single value or an array. 
When encountering multiple values, elasticsearch-hadoop automatically reads the field into the appropriate array/list type for an integration, but in strict mapping scenarios (like Spark SQL) this may lead to problems (an array is encountered when Spark’s Catalyst engine expects a single value). The syntax for this setting is similar to that of {{es}} [include/exclude](elasticsearch://docs/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified, meaning no fields/properties are treated as arrays. +: Fields/properties that should be considered as arrays/lists. Since {{es}} can map one or multiple values to a field, elasticsearch-hadoop cannot determine from the mapping whether to treat a field on a document as a single value or an array. When encountering multiple values, elasticsearch-hadoop automatically reads the field into the appropriate array/list type for an integration, but in strict mapping scenarios (like Spark SQL) this may lead to problems (an array is encountered when Spark’s Catalyst engine expects a single value). The syntax for this setting is similar to that of {{es}} [include/exclude](elasticsearch://reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). Multiple values can be specified by using a comma. By default, no value is specified, meaning no fields/properties are treated as arrays. ::::{note} Not all fields need to specify `es.read.field.as.array` to be treated as an array. Fields of type `nested` are always treated as an array of objects and should not be marked under `es.read.field.as.array.include`. @@ -520,7 +520,7 @@ Added in 5.0.0. : Whether to discover the nodes within the {{es}} cluster or only to use the ones given in `es.nodes` for metadata queries. Note that this setting only applies during start-up; afterwards when reading and writing, elasticsearch-hadoop uses the target index shards (and their hosting nodes) unless `es.nodes.client.only` is enabled. `es.nodes.client.only` (default false) -: Whether to use {{es}} [client nodes](elasticsearch://docs/reference/elasticsearch/configuration-reference/node-settings.md) (or *load-balancers*). When enabled, elasticsearch-hadoop will route *all* its requests (after nodes discovery, if enabled) through the *client* nodes within the cluster. Note this typically significantly reduces the node parallelism and thus it is disabled by default. Enabling it also disables `es.nodes.data.only` (since a client node is a non-data node). +: Whether to use {{es}} [client nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md) (or *load-balancers*). When enabled, elasticsearch-hadoop will route *all* its requests (after nodes discovery, if enabled) through the *client* nodes within the cluster. Note this typically significantly reduces the node parallelism and thus it is disabled by default. Enabling it also disables `es.nodes.data.only` (since a client node is a non-data node). ::::{note} Added in 2.1.2. @@ -528,7 +528,7 @@ Added in 2.1.2. `es.nodes.data.only` (default true) -: Whether to use {{es}} [data nodes](elasticsearch://docs/reference/elasticsearch/configuration-reference/node-settings.md) only. When enabled, elasticsearch-hadoop will route *all* its requests (after nodes discovery, if enabled) through the *data* nodes within the cluster. 
The purpose of this configuration setting is to avoid overwhelming non-data nodes as these tend to be "smaller" nodes. This is enabled by default. +: Whether to use {{es}} [data nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md) only. When enabled, elasticsearch-hadoop will route *all* its requests (after nodes discovery, if enabled) through the *data* nodes within the cluster. The purpose of this configuration setting is to avoid overwhelming non-data nodes as these tend to be "smaller" nodes. This is enabled by default. ::::{note} Added in 5.0.0. @@ -536,7 +536,7 @@ Added in 5.0.0. `es.nodes.ingest.only` (default false) -: Whether to use {{es}} [ingest nodes](elasticsearch://docs/reference/elasticsearch/configuration-reference/node-settings.md) only. When enabled, elasticsearch-hadoop will route *all* of its requests (after nodes discovery, if enabled) through the *ingest* nodes within the cluster. The purpose of this configuration setting is to avoid incurring the cost of forwarding data meant for a pipeline from non-ingest nodes; Really only useful when writing data to an Ingest Pipeline (see `es.ingest.pipeline` above). +: Whether to use {{es}} [ingest nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md) only. When enabled, elasticsearch-hadoop will route *all* of its requests (after nodes discovery, if enabled) through the *ingest* nodes within the cluster. The purpose of this configuration setting is to avoid incurring the cost of forwarding data meant for a pipeline from non-ingest nodes; Really only useful when writing data to an Ingest Pipeline (see `es.ingest.pipeline` above). ::::{note} Added in 2.2. diff --git a/docs/reference/index.md b/docs/reference/index.md index 3f484760f..f2f17dbba 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -8,7 +8,7 @@ mapped_pages: # {{esh-full}} {{esh-full}} is an umbrella project consisting of two similar, yet independent sub-projects: `elasticsearch-hadoop` and `repository-hdfs`. -This documentation pertains to `elasticsearch-hadoop`. For information about `repository-hdfs` and using HDFS as a back-end repository for doing snapshot or restore from or to {{es}}, go to [Hadoop HDFS repository plugin](elasticsearch://docs/reference/elasticsearch-plugins/repository-hdfs.md). +This documentation pertains to `elasticsearch-hadoop`. For information about `repository-hdfs` and using HDFS as a back-end repository for doing snapshot or restore from or to {{es}}, go to [Hadoop HDFS repository plugin](elasticsearch://reference/elasticsearch-plugins/repository-hdfs.md). {{esh-full}} is an [open-source](./license.md), stand-alone, self-contained, small library that allows Hadoop jobs (whether using Map/Reduce or libraries built upon it such as Hive or new upcoming libraries like Apache Spark ) to *interact* with {{es}}. One can think of it as a *connector* that allows data to flow *bi-directionaly* so that applications can leverage transparently the {{es}} engine capabilities to significantly enrich their capabilities and increase the performance. 

diff --git a/docs/reference/kerberos.md b/docs/reference/kerberos.md
index 1decbcdc6..fc30789d5 100644
--- a/docs/reference/kerberos.md
+++ b/docs/reference/kerberos.md
@@ -24,7 +24,7 @@ This documentation assumes that you have already provisioned a Hadoop cluster wi

 Before starting, you will need to ensure that principals for your users are provisioned in your Kerberos deployment, as well as service principals for each {{es}} node. To enable Kerberos authentication on {{es}}, it must be [configured with a Kerberos realm](docs-content://deploy-manage/users-roles/cluster-or-deployment-auth/kerberos.md). It is recommended that you familiarize yourself with how to configure {{es}} Kerberos realms so that you can make appropriate adjustments to fit your deployment. You can find more information on how they work in the [Elastic Stack documentation](docs-content://deploy-manage/users-roles/cluster-or-deployment-auth/kerberos.md).

-Additionally, you will need to [ configure the API Key Realm](elasticsearch://docs/reference/elasticsearch/configuration-reference/security-settings.md) in {{es}}. Hadoop and other distributed data processing frameworks only authenticate with Kerberos in the process that launches a job. Once a job has been launched, the worker processes are often cut off from the original Kerberos credentials and need some other form of authentication. Hadoop services often provide mechanisms for obtaining *Delegation Tokens* during job submission. These tokens are then distributed to worker processes which use the tokens to authenticate on behalf of the user running the job. Elasticsearch for Apache Hadoop obtains API Keys in order to provide tokens for worker processes to authenticate with.
+Additionally, you will need to [configure the API Key Realm](elasticsearch://reference/elasticsearch/configuration-reference/security-settings.md) in {{es}}. Hadoop and other distributed data processing frameworks only authenticate with Kerberos in the process that launches a job. Once a job has been launched, the worker processes are often cut off from the original Kerberos credentials and need some other form of authentication. Hadoop services often provide mechanisms for obtaining *Delegation Tokens* during job submission. These tokens are then distributed to worker processes which use the tokens to authenticate on behalf of the user running the job. Elasticsearch for Apache Hadoop obtains API Keys in order to provide tokens for worker processes to authenticate with.


 ### Connector Settings [kerberos-settings-eshadoop]

diff --git a/docs/reference/mapping-types.md b/docs/reference/mapping-types.md
index d78e8d2c8..d147fe183 100644
--- a/docs/reference/mapping-types.md
+++ b/docs/reference/mapping-types.md
@@ -10,12 +10,12 @@ As explained in the previous sections, elasticsearch-hadoop integrates closely w

 ## Converting data to {{es}} [_converting_data_to_es]

-By design, elasticsearch-hadoop provides no data transformation or mapping layer itself simply because there is no need for them: Hadoop is designed to do ETL and some libraries (like Pig and Hive) provide type information themselves. Furthermore, {{es}} has rich support for mapping out of the box including automatic detection, dynamic/schema-less mapping, templates and full manual control. Need to split strings into token, do data validation or eliminate unneeded data? There are plenty of ways to do that in Hadoop before reading/writing data from/to {{es}}. Need control over how data is stored in {{es}}? Use {{es}} APIs to define the [mapping](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-mapping), to update [settings](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-settings) or add generic [meta-data](elasticsearch://docs/reference/elasticsearch/mapping-reference/document-metadata-fields.md).
+By design, elasticsearch-hadoop provides no data transformation or mapping layer itself simply because there is no need for them: Hadoop is designed to do ETL and some libraries (like Pig and Hive) provide type information themselves. Furthermore, {{es}} has rich support for mapping out of the box including automatic detection, dynamic/schema-less mapping, templates and full manual control. Need to split strings into tokens, do data validation or eliminate unneeded data? There are plenty of ways to do that in Hadoop before reading/writing data from/to {{es}}. Need control over how data is stored in {{es}}? Use {{es}} APIs to define the [mapping](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-mapping), to update [settings](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-settings) or add generic [meta-data](elasticsearch://reference/elasticsearch/mapping-reference/document-metadata-fields.md).


 ## Time/Date mapping [mapping-date]

-When it comes to handling dates, {{es}} always uses the [ISO 8601](http://en.wikipedia.org/wiki/ISO_8601) format for date/time. This is the default date format of {{es}} - if a custom one is needed, please **add** it to the default option rather then just replacing it. See the [date format](elasticsearch://docs/reference/elasticsearch/mapping-reference/mapping-date-format.md) section in {{es}} reference documentation for more information. Note that when reading data, if the date is not in ISO8601 format, by default elasticsearch-hadoop will likely not understand it as it does not replicate the elaborate date parsing in {{es}}. In these cases one can simply disable the date conversion and pass the raw information as a `long` or `String`, through the `es.mapping.date.rich` [property](/reference/configuration.md#cfg-field-info).
+When it comes to handling dates, {{es}} always uses the [ISO 8601](http://en.wikipedia.org/wiki/ISO_8601) format for date/time. This is the default date format of {{es}} - if a custom one is needed, please **add** it to the default option rather than just replacing it. See the [date format](elasticsearch://reference/elasticsearch/mapping-reference/mapping-date-format.md) section in {{es}} reference documentation for more information. Note that when reading data, if the date is not in ISO8601 format, by default elasticsearch-hadoop will likely not understand it as it does not replicate the elaborate date parsing in {{es}}. In these cases one can simply disable the date conversion and pass the raw information as a `long` or `String`, through the `es.mapping.date.rich` [property](/reference/configuration.md#cfg-field-info).

 As a side note, elasticsearch-hadoop tries to detect whether dedicated date parsing libraries (in particular Joda, used also by {{es}}) are available at runtime and if so, will use them. If not, it will default to parsing using JDK classes which are not as rich. Going forward especially with the advent of JDK 8, elasticsearch-hadoop will try to migrate to `javax.time` library to have the same behaviour regardless of the classpath available at runtime.

@@ -27,7 +27,7 @@ It is important to note that JSON objects (delimited by `{}` and typically assoc

 ## Geo types [mapping-geo]

-For geolocation {{es}} provides two dedicated types, namely [`geo_point`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) and [`geo_shape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md) which do not have a direct equivalent in any of the libraries elasticsearch-hadoop integrates with. Further more, {{es}} accepts multiple formats for each type (as there are different ways to represent data), in fact there are 4 different representations for `geo_point` and 9 for `geo_shape`. To go around this, the connector breaks down the geo types into primitives depending on the actual format used for their respective types.
+For geolocation {{es}} provides two dedicated types, namely [`geo_point`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) and [`geo_shape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md) which do not have a direct equivalent in any of the libraries elasticsearch-hadoop integrates with. Furthermore, {{es}} accepts multiple formats for each type (as there are different ways to represent data), in fact there are 4 different representations for `geo_point` and 9 for `geo_shape`. To work around this, the connector breaks down the geo types into primitives depending on the actual format used for their respective types.

 For strongly-typed libraries (like SparkSQL `DataFrame`s), the format needs to be known before hand and thus, elasticsearch-hadoop will *sample* the data asking elasticsearch-hadoop for one random document that is representative of the mapping, parse it and based on the values found, identify the format used and create the necessary schema. This happens automatically at start-up without any user interference. As always, the user data must all be using the *same* format (a requirement from SparkSQL) otherwise reading a different format will trigger an exception.

@@ -44,7 +44,7 @@ Note that typically handling of these types poses no issues for the user whether

 By default, {{es}} provides [automatic index and mapping](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-create) when data is added under an index that has not been created before. In other words, data can be added into {{es}} without the index and the mappings being defined a priori. This is quite convenient since {{es}} automatically adapts to the data being fed to it - moreover, if certain entries have extra fields, {{es}} schema-less nature allows them to be indexed without any issues.

 $$$auto-mapping-type-loss$$$
-It is important to remember that automatic mapping uses the payload values to identify the [field types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md), using the **first document** that adds each field. elasticsearch-hadoop communicates with {{es}} through JSON which does not provide any type information, rather only the field names and their values. One can think of it as *type erasure* or information loss; for example JSON does not differentiate integer numeric types - `byte`, `short`, `int`, `long` are all placed in the same `long` *bucket*. this can have unexpected side-effects since the type information is *guessed* such as:
+It is important to remember that automatic mapping uses the payload values to identify the [field types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md), using the **first document** that adds each field. elasticsearch-hadoop communicates with {{es}} through JSON which does not provide any type information, rather only the field names and their values. One can think of it as *type erasure* or information loss; for example JSON does not differentiate integer numeric types - `byte`, `short`, `int`, `long` are all placed in the same `long` *bucket*. This can have unexpected side-effects since the type information is *guessed* such as:


 #### numbers mapped only as `long`/`double` [_numbers_mapped_only_as_longdouble]
@@ -135,5 +135,5 @@ In most cases, [templates](docs-content://manage-data/data-store/templates.md) a

 ## Limitations [limitations]

-{{es}} allows field names to contain dots (*.*). But {{esh}} does not support them, and fails when reading or writing fields with dots. Refer to {{es}} [Dot Expander Processor](elasticsearch://docs/reference/ingestion-tools/enrich-processor/dot-expand-processor.md) for tooling to assist replacing dots in field names.
+{{es}} allows field names to contain dots (*.*). But {{esh}} does not support them, and fails when reading or writing fields with dots. Refer to {{es}} [Dot Expander Processor](elasticsearch://reference/ingestion-tools/enrich-processor/dot-expand-processor.md) for tooling to assist replacing dots in field names.

diff --git a/docs/reference/mapreduce-integration.md b/docs/reference/mapreduce-integration.md
index 1698e37b3..62362feee 100644
--- a/docs/reference/mapreduce-integration.md
+++ b/docs/reference/mapreduce-integration.md
@@ -362,7 +362,7 @@ If automatic index creation is used, please review [this](/reference/mapping-typ

 ::::


-elasticsearch-hadoop automatically converts Hadoop built-in `Writable` types to {{es}} [field types](elasticsearch://docs/reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:
+elasticsearch-hadoop automatically converts Hadoop built-in `Writable` types to {{es}} [field types](elasticsearch://reference/elasticsearch/mapping-reference/field-data-types.md) (and back) as shown in the table below:

 | `Writable` | {{es}} type |
 | --- | --- |
@@ -383,5 +383,5 @@ elasticsearch-hadoop automatically converts Hadoop built-in `Writable` types to
 | `AbstractMapWritable` | `map` |
 | `ShortWritable` | `short` |

-It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `Text` (basically a `String`) or an `ArrayWritable`.
+It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md) are supported by converting their structure into the primitives available in the table above. For example, based on its storage a `geo_point` might be returned as a `Text` (basically a `String`) or an `ArrayWritable`.
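
Every hunk above applies the same mechanical substitution: the link prefix `elasticsearch://docs/reference/` becomes `elasticsearch://reference/`. For readers who need to apply the same sweep to other files, a minimal sketch is shown below; it is illustrative only and not part of this patch, and it assumes Python 3 run from the repository root with the docs under `docs/reference`:

```python
#!/usr/bin/env python3
"""Illustrative sketch: rewrite the cross-repo link prefix in reference docs."""

from pathlib import Path

# Old and new prefixes, taken from the -/+ lines in the diff above.
OLD_PREFIX = "elasticsearch://docs/reference/"
NEW_PREFIX = "elasticsearch://reference/"


def rewrite_links(docs_dir: str = "docs/reference") -> None:
    """Replace the outdated prefix in every Markdown file under docs_dir."""
    for path in Path(docs_dir).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        if OLD_PREFIX in text:
            path.write_text(text.replace(OLD_PREFIX, NEW_PREFIX), encoding="utf-8")
            print(f"updated {path}")


if __name__ == "__main__":
    rewrite_links()
```

Because only the exact `elasticsearch://docs/reference/` prefix is matched, `docs-content://` links and relative `/reference/...` links are left untouched, which mirrors what the hunks in this patch change.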