|
---
mapped_pages:
  - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
---

# Apache Hive integration [hive]

Hive abstracts Hadoop by exposing it through a SQL-like language called HiveQL, so that users can define and manipulate data just as they would with SQL. In Hive, data sets are [defined](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations) through *tables* (that expose type information) into which data can be [loaded](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations), then [selected and transformed](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SQLOperations) through built-in operators or custom/user-defined functions ([UDFs](https://cwiki.apache.org/confluence/display/Hive/OperatorsAndFunctions)).


## Installation [_installation_2]

Make the elasticsearch-hadoop jar available in the Hive classpath. Depending on your setup, there are various [ways](https://cwiki.apache.org/confluence/display/Hive/HivePlugins#HivePlugins-DeployingjarsforUserDefinedFunctionsandUserDefinedSerDes) to achieve that. Use the [ADD](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources) command to add files, jars (what we want here) or archives to the classpath:

```
ADD JAR /path/elasticsearch-hadoop.jar;
```

::::{note}
The command expects a proper URI that can be resolved either on the local file-system or remotely. Typically it's best to place the jar on a distributed file-system (like HDFS or Amazon S3) and reference it from there, since the script might be executed on various machines.
::::


::::{important}
When using JDBC/ODBC drivers, the `ADD JAR` command is not available and will be ignored. Thus it is recommended to make the jar available to the Hive global classpath, as indicated below.
::::


As an alternative, one can use the command-line:

```bash
$ bin/hive --auxpath=/path/elasticsearch-hadoop.jar
```

or specify the `hive.aux.jars.path` property through the command-line to register additional jars (it accepts a URI as well):

```bash
$ bin/hive -hiveconf hive.aux.jars.path=/path/elasticsearch-hadoop.jar
```

or, if the `hive-site.xml` configuration can be modified, one can register additional jars through the `hive.aux.jars.path` option (which also accepts a URI):

```xml
<property>
  <name>hive.aux.jars.path</name>
  <value>/path/elasticsearch-hadoop.jar</value>
  <description>A comma separated list (with no spaces) of the jar files</description>
</property>
```


## Configuration [hive-configuration]

When using Hive, one can use `TBLPROPERTIES` to specify the [configuration](/reference/configuration.md) properties (as an alternative to the Hadoop `Configuration` object) when declaring the external table backed by {{es}}:

```sql
CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false'); <1>
```

1. elasticsearch-hadoop setting



## Mapping [hive-alias]

By default, elasticsearch-hadoop uses the Hive table schema to map the data in {{es}}, using both the field names and types in the process. There are cases, however, when the names in Hive cannot be used with {{es}} (for example, a field name can contain characters accepted by {{es}} but not by Hive). For such cases, one can use the `es.mapping.names` setting, which accepts a comma-separated list of mapped names in the following format: `Hive field name`:`Elasticsearch field name`

To wit:

```sql
CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.names' = 'date:@timestamp, url:url_123'); <1>
```

1. Hive column `date` mapped in {{es}} to `@timestamp`; Hive column `url` mapped in {{es}} to `url_123`


::::{tip}
Hive is case **insensitive** while {{es}} is not. This loss of information can create invalid queries (as the column in Hive might not match the one in {{es}}). To avoid this, elasticsearch-hadoop always converts Hive column names to lower-case. That said, it is recommended to use the default Hive style, reserving upper-case names for Hive commands and avoiding mixed-case names.
::::


::::{tip}
Hive treats missing values through a special value, `NULL`, as described [here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-HandlingofNULLValues). This means that when running an incorrect query (with incorrect or non-existing field names), the Hive tables will be populated with `NULL` instead of an exception being thrown. Make sure to validate your data and keep a close eye on your schema, since updates will otherwise go unnoticed due to this lenient behavior.
::::
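As a mental model, the name resolution described above can be sketched outside Hive. The following Python snippet is an illustration only (not connector code); it assumes the lower-casing behavior from the tip above and the `hive:es` pair format of `es.mapping.names`:

```python
# Hypothetical sketch (not connector code): resolve Hive column names to
# Elasticsearch field names per 'es.mapping.names', after lower-casing
# the Hive side (Hive is case-insensitive).
def resolve_field_names(hive_columns, es_mapping_names=""):
    # Parse "hiveName:esName, hiveName2:esName2" into a lookup table.
    overrides = {}
    for pair in filter(None, (p.strip() for p in es_mapping_names.split(","))):
        hive_name, es_name = pair.split(":", 1)
        overrides[hive_name.strip().lower()] = es_name.strip()
    # Hive column names are lower-cased; overridden names are substituted.
    return [overrides.get(col.lower(), col.lower()) for col in hive_columns]

print(resolve_field_names(["Date", "URL", "name"],
                          "date:@timestamp, url:url_123"))
# ['@timestamp', 'url_123', 'name']
```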



## Writing data to {{es}} [_writing_data_to_es_2]

With elasticsearch-hadoop, {{es}} becomes just an external [table](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) into which data can be loaded or from which it can be read:

```sql
CREATE EXTERNAL TABLE artists (
  id BIGINT,
  name STRING,
  links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'<1>
TBLPROPERTIES('es.resource' = 'radio/artists'); <2>

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
  SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
  FROM source s;
```

1. {{es}} Hive `StorageHandler`
2. {{es}} resource (index and type) associated with the given storage


For cases where the id (or other metadata fields like `ttl` or `timestamp`) of the document needs to be specified, one can do so by setting the appropriate [mapping](/reference/configuration.md#cfg-mapping), namely `es.mapping.id`. Following the previous example, to indicate to {{es}} to use the field `id` as the document id, update the `table` properties:

```sql
CREATE EXTERNAL TABLE artists (
  id BIGINT,
  ...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.mapping.id' = 'id'...);
```


### Writing existing JSON to {{es}} [writing-json-hive]

For cases where the job input data is already in JSON, elasticsearch-hadoop allows direct indexing *without* applying any transformation; the data is taken as is and sent directly to {{es}}. In such cases, one needs to indicate the JSON input by setting the `es.input.json` parameter. elasticsearch-hadoop then expects the output table to contain only one field, whose content is used as the JSON document. That is, the library will recognize specific *textual* types (such as `string` or `binary`) or simply call `toString`.

| `Hive type` | Comment |
| --- | --- |
| `binary` | use this when the JSON data is represented as a `byte[]` or similar |
| `string` | use this if the JSON data is represented as a `String` |
| `varchar` | use this as an alternative to Hive `string` |
| *anything else* | make sure the `toString()` returns the desired JSON document |

::::{important}
Make sure the data is properly encoded, in `UTF-8`. The field content is considered the final form of the document sent to {{es}}.
::::


```sql
CREATE EXTERNAL TABLE json (data STRING) <1>
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '...',
              'es.input.json' = 'yes'); <2>
...
```

1. The table declaration contains only one field, of type `STRING`
2. Indicates to elasticsearch-hadoop that the table content is in JSON format
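To illustrate what the single `STRING` field should contain, here is a small hypothetical Python sketch (not part of elasticsearch-hadoop) that renders a record as a compact, UTF-8 encoded JSON document, i.e. the final form sent to {{es}}:

```python
import json

# Hypothetical sketch: produce the content of the single STRING field of
# the 'json' table above. Each record becomes one compact, UTF-8 encoded
# JSON document.
def to_document(record):
    # Compact separators keep the document small; ensure_ascii=False
    # preserves non-ASCII characters as proper UTF-8 on encoding.
    return json.dumps(record, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

doc = to_document({"name": "Björk", "year": "1993"})
print(doc.decode("utf-8"))  # {"name":"Björk","year":"1993"}
```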
|
### Writing to dynamic/multi-resources [_writing_to_dynamicmulti_resources]

One can index the data to a different resource, depending on the *row* being read, by using patterns. Coming back to the aforementioned [media example](/reference/configuration.md#cfg-multi-writes), one could configure it as follows:

```sql
CREATE EXTERNAL TABLE media (
  name STRING,
  type STRING,<1>
  year STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection-{type}/doc'); <2>
```

1. Table field used by the resource pattern. Any of the declared fields can be used.
2. Resource pattern using field `type`


For each *row* about to be written, elasticsearch-hadoop will extract the `type` field and use its value to determine the target resource.

The functionality is also available when dealing with raw JSON - in this case, the value will be extracted from the JSON document itself. Assuming the JSON source contains documents with the following structure:

```js
{
    "media_type":"music",<1>
    "title":"Surfing With The Alien",
    "year":"1987"
}
```

1. field within the JSON document that will be used by the pattern


the table declaration can be as follows:

```sql
CREATE EXTERNAL TABLE json (data STRING) <1>
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection-{media_type}/doc', <2>
              'es.input.json' = 'yes');
```

1. Schema declaration for the table. Since JSON input is used, the schema is simply a holder for the raw data
2. Resource pattern relying on fields *within* the JSON document and *not* on the table schema
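As an illustration of how such a pattern resolves per document, the following hypothetical Python sketch (not connector code) substitutes the `{media_type}` placeholder with the value found in the JSON document itself:

```python
import json
import re

# Hypothetical sketch: resolve a multi-resource pattern such as
# 'my-collection-{media_type}/doc' against one JSON document, the way a
# per-row target index would be chosen.
def resolve_resource(pattern, document):
    fields = json.loads(document)
    # Replace every {field} placeholder with that field's value.
    return re.sub(r"\{(\w+)\}", lambda m: str(fields[m.group(1)]), pattern)

doc = '{"media_type":"music","title":"Surfing With The Alien","year":"1987"}'
print(resolve_resource("my-collection-{media_type}/doc", doc))
# my-collection-music/doc
```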


## Reading data from {{es}} [_reading_data_from_es]

Reading from {{es}} is strikingly similar:

```sql
CREATE EXTERNAL TABLE artists (
  id BIGINT,
  name STRING,
  links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'<1>
TBLPROPERTIES('es.resource' = 'radio/artists', <2>
              'es.query' = '?q=me*'); <3>

-- stream data from Elasticsearch
SELECT * FROM artists;
```

1. same {{es}} Hive `StorageHandler`
2. {{es}} resource
3. {{es}} query



## Type conversion [hive-type-conversion]

::::{important}
If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
::::


Hive provides various [types](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) for defining data and internally uses different implementations depending on the target environment (from JDK native types to binary-optimized ones). {{es}} integrates with all of them, including the SerDe2 [lazy](http://hive.apache.org/javadocs/r1.0.1/api/index.md?org/apache/hadoop/hive/serde2/lazy/package-summary.md) and [lazy binary](http://hive.apache.org/javadocs/r1.0.1/api/index.md?org/apache/hadoop/hive/serde2/lazybinary/package-summary.md) implementations:

| Hive type | {{es}} type |
| --- | --- |
| `void` | `null` |
| `boolean` | `boolean` |
| `tinyint` | `byte` |
| `smallint` | `short` |
| `int` | `int` |
| `bigint` | `long` |
| `double` | `double` |
| `float` | `float` |
| `string` | `string` |
| `binary` | `binary` |
| `timestamp` | `date` |
| `struct` | `map` |
| `map` | `map` |
| `array` | `array` |
| `union` | not supported (yet) |
| `decimal` | `string` |
| `date` | `date` |
| `varchar` | `string` |
| `char` | `string` |

::::{note}
While {{es}} understands Hive types up to version 2.0, it is backwards compatible with Hive 1.0.
::::


It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://docs/reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, depending on its storage, a `geo_point` might be returned as a `string` or an `array`.
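As an illustration, the following hypothetical Python sketch (not part of the connector) normalizes those two shapes into one tuple, assuming the usual {{es}} geo-point conventions of a `"lat,lon"` string and a GeoJSON-style `[lon, lat]` array:

```python
# Hypothetical sketch: normalize the two primitive shapes a geo_point may
# take when read back -- a "lat,lon" string or a [lon, lat] array -- into
# a single (lat, lon) tuple. Note the reversed coordinate order in the
# array form, which follows GeoJSON conventions.
def normalize_geo_point(value):
    if isinstance(value, str):
        lat, lon = (float(part) for part in value.split(","))
        return (lat, lon)
    if isinstance(value, (list, tuple)):
        lon, lat = value
        return (float(lat), float(lon))
    raise TypeError("unsupported geo_point representation")

print(normalize_geo_point("41.12,-71.34"))   # (41.12, -71.34)
print(normalize_geo_point([-71.34, 41.12]))  # (41.12, -71.34)
```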