Commit f0e8e1a

[9.0][DOCS] Migrate docs from AsciiDoc to Markdown (elastic#2363)
* [docs] Migrate docs from AsciiDoc to Markdown (elastic#2349)
* clean up cross-repo links (elastic#2350)

Co-authored-by: lcawl <lcawley@elastic.co>
Co-authored-by: Colleen McGinnis <colleen.j.mcginnis@gmail.com>
Co-authored-by: Colleen McGinnis <colleen.mcginnis@elastic.co>
1 parent b8f914a commit f0e8e1a

126 files changed: +6201 additions, -7192 deletions

docs/docset.yml: 488 additions & 0 deletions (large diffs are not rendered by default)

Lines changed: 261 additions & 0 deletions

@@ -0,0 +1,261 @@
---
mapped_pages:
  - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
---

# Apache Hive integration [hive]

Hive abstracts Hadoop behind an SQL-like language called HiveQL, so that users can define and manipulate data just as they would with SQL. In Hive, data sets are [defined](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations) through *tables* (which expose type information) into which data can be [loaded](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations), then [selected and transformed](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SQLOperations) through built-in operators or custom/user-defined functions ([UDFs](https://cwiki.apache.org/confluence/display/Hive/OperatorsAndFunctions)).
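
For readers new to HiveQL, here is a minimal sketch of those three operations (the table and file names are hypothetical):

```sql
-- define a table (DDL)
CREATE TABLE songs (id BIGINT, title STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- load data into it (DML)
LOAD DATA LOCAL INPATH '/tmp/songs.tsv' INTO TABLE songs;
-- select and transform (SQL operations)
SELECT title FROM songs WHERE id > 100;
```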


## Installation [_installation_2]

Make the elasticsearch-hadoop jar available in the Hive classpath. Depending on your setup, there are various [ways](https://cwiki.apache.org/confluence/display/Hive/HivePlugins#HivePlugins-DeployingjarsforUserDefinedFunctionsandUserDefinedSerDes) to achieve that. Use the [ADD](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources) command to add files, jars (what we want here) or archives to the classpath:

```sql
ADD JAR /path/elasticsearch-hadoop.jar;
```

::::{note}
The command expects a proper URI that can be found either on the local file-system or remotely. Typically it’s best to place the jar on a distributed file-system (like HDFS or Amazon S3) and reference it from there, since the script might be executed on various machines.
::::
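
For example, referencing the jar from a hypothetical HDFS location:

```sql
-- hypothetical HDFS path; adjust to where the jar actually lives
ADD JAR hdfs:///libs/elasticsearch-hadoop.jar;
```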

::::{important}
When using JDBC/ODBC drivers, the `ADD JAR` command is not available and will be ignored. Thus it is recommended to make the jar available to the Hive global classpath, as indicated below.
::::

As an alternative, one can use the command-line:

```bash
$ bin/hive --auxpath=/path/elasticsearch-hadoop.jar
```

or use the `hive.aux.jars.path` property, specified through the command-line, to register additional jars (it accepts a URI as well):

```bash
$ bin/hive -hiveconf hive.aux.jars.path=/path/elasticsearch-hadoop.jar
```

or, if the `hive-site.xml` configuration can be modified, one can register additional jars through the `hive.aux.jars.path` option (which accepts a URI as well):

```xml
<property>
  <name>hive.aux.jars.path</name>
  <value>/path/elasticsearch-hadoop.jar</value>
  <description>A comma separated list (with no spaces) of the jar files</description>
</property>
```


## Configuration [hive-configuration]

When using Hive, one can use `TBLPROPERTIES` to specify [configuration](/reference/configuration.md) properties (as an alternative to the Hadoop `Configuration` object) when declaring the external table backed by {{es}}:

```sql
CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false'); <1>
```

1. elasticsearch-hadoop setting


## Mapping [hive-alias]

By default, elasticsearch-hadoop uses the Hive table schema to map the data in {{es}}, using both the field names and types in the process. There are cases, however, when the names in Hive cannot be used with {{es}} (the field name can contain characters accepted by {{es}} but not by Hive). For such cases, one can use the `es.mapping.names` setting, which accepts a comma-separated list of mapped names in the following format: `Hive field name`:`Elasticsearch field name`

To wit:

```sql
CREATE EXTERNAL TABLE artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.names' = 'date:@timestamp, url:url_123'); <1>
```

1. Hive column `date` mapped in {{es}} to `@timestamp`; Hive column `url` mapped in {{es}} to `url_123`

::::{tip}
Hive is case **insensitive** while {{es}} is not. The loss of information can create invalid queries (as the column in Hive might not match the one in {{es}}). To avoid this, elasticsearch-hadoop always converts Hive column names to lower-case. That said, it is recommended to use the default Hive style, use upper-case names only for Hive commands, and avoid mixed-case names.
::::
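
As a small illustration of the lower-casing behavior (a hypothetical table; the upper-case Hive column is written in lower-case):

```sql
-- the Hive column 'NAME' is written to Elasticsearch as field 'name'
CREATE EXTERNAL TABLE tracks (NAME STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/tracks');
```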

::::{tip}
Hive represents missing values through the special value `NULL`, as indicated [here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-HandlingofNULLValues). This means that when running an incorrect query (with incorrect or non-existing field names), the Hive tables will be populated with `NULL` instead of throwing an exception. Make sure to validate your data and keep a close eye on your schema, since updates will otherwise go unnoticed due to this lenient behavior.
::::
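
For example, one quick sanity check after a query is to count the rows where a mapped column came back `NULL` (using the `artists` table from the examples in this document):

```sql
-- a non-zero count may indicate a misspelled or non-existing field name
SELECT COUNT(*) FROM artists WHERE name IS NULL;
```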


## Writing data to {{es}} [_writing_data_to_es_2]

With elasticsearch-hadoop, {{es}} becomes just an external [table](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) into which data can be loaded and from which it can be read:

```sql
CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' <1>
TBLPROPERTIES('es.resource' = 'radio/artists'); <2>

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;
```

1. {{es}} Hive `StorageHandler`
2. {{es}} resource (index and type) associated with the given storage

For cases where the id (or other metadata fields like `ttl` or `timestamp`) of the document needs to be specified, one can do so by setting the appropriate [mapping](/reference/configuration.md#cfg-mapping), namely `es.mapping.id`. Following the previous example, to indicate to {{es}} to use the field `id` as the document id, update the `table` properties:

```sql
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    ...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.mapping.id' = 'id'...);
```


### Writing existing JSON to {{es}} [writing-json-hive]

For cases where the job input data is already in JSON, elasticsearch-hadoop allows direct indexing *without* applying any transformation; the data is taken as is and sent directly to {{es}}. In such cases, one needs to indicate the JSON input by setting the `es.input.json` parameter. As such, elasticsearch-hadoop expects the output table to contain only one field, whose content is used as the JSON document. That is, the library will recognize specific *textual* types (such as `string` or `binary`) or simply call `toString`.

| Hive type | Comment |
| --- | --- |
| `binary` | use this when the JSON data is represented as a `byte[]` or similar |
| `string` | use this if the JSON data is represented as a `String` |
| `varchar` | use this as an alternative to Hive `string` |
| *anything else* | make sure the `toString()` returns the desired JSON document |

::::{important}
Make sure the data is properly encoded, in `UTF-8`. The field content is considered the final form of the document sent to {{es}}.
::::

```sql
CREATE EXTERNAL TABLE json (data STRING) <1>
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '...',
              'es.input.json' = 'yes'); <2>
...
```

1. The table declaration contains only one field, of type `STRING`
2. Indicates to elasticsearch-hadoop that the table content is in JSON format


### Writing to dynamic/multi-resources [_writing_to_dynamicmulti_resources]

One can index the data to a different resource, depending on the *row* being read, by using patterns. Coming back to the aforementioned [media example](/reference/configuration.md#cfg-multi-writes), one could configure it as follows:

```sql
CREATE EXTERNAL TABLE media (
    name STRING,
    type STRING, <1>
    year STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection-{type}/doc'); <2>
```

1. Table field used by the resource pattern. Any of the declared fields can be used.
2. Resource pattern using field `type`

For each *row* about to be written, elasticsearch-hadoop will extract the `type` field and use its value to determine the target resource.
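
For instance, a write like the following (assuming a hypothetical `sources` table with matching columns) routes each row according to its `type` value:

```sql
-- rows with type 'music' land in my-collection-music/doc,
-- rows with type 'movie' in my-collection-movie/doc, and so on
INSERT OVERWRITE TABLE media
    SELECT s.name, s.type, s.year FROM sources s;
```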

The functionality is also available when dealing with raw JSON; in this case, the value will be extracted from the JSON document itself. Assuming the JSON source contains documents with the following structure:

```js
{
    "media_type":"music", <1>
    "title":"Surfing With The Alien",
    "year":"1987"
}
```

1. field within the JSON document that will be used by the pattern

the table declaration can be as follows:

```sql
CREATE EXTERNAL TABLE json (data STRING) <1>
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'my-collection-{media_type}/doc', <2>
              'es.input.json' = 'yes');
```

1. Schema declaration for the table. Since JSON input is used, the schema is simply a holder for the raw data
2. Resource pattern relying on fields *within* the JSON document and *not* on the table schema


## Reading data from {{es}} [_reading_data_from_es]

Reading from {{es}} is strikingly similar:

```sql
CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' <1>
TBLPROPERTIES('es.resource' = 'radio/artists', <2>
              'es.query' = '?q=me*'); <3>

-- stream data from Elasticsearch
SELECT * FROM artists;
```

1. same {{es}} Hive `StorageHandler`
2. {{es}} resource
3. {{es}} query


## Type conversion [hive-type-conversion]

::::{important}
If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
::::

Hive provides various [types](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) for defining data and internally uses different implementations depending on the target environment (from JDK native types to binary-optimized ones). {{es}} integrates with all of them, including the SerDe2 [lazy](http://hive.apache.org/javadocs/r1.0.1/api/index.html?org/apache/hadoop/hive/serde2/lazy/package-summary.html) and [lazy binary](http://hive.apache.org/javadocs/r1.0.1/api/index.html?org/apache/hadoop/hive/serde2/lazybinary/package-summary.html) ones:

| Hive type | {{es}} type |
| --- | --- |
| `void` | `null` |
| `boolean` | `boolean` |
| `tinyint` | `byte` |
| `smallint` | `short` |
| `int` | `int` |
| `bigint` | `long` |
| `double` | `double` |
| `float` | `float` |
| `string` | `string` |
| `binary` | `binary` |
| `timestamp` | `date` |
| `struct` | `map` |
| `map` | `map` |
| `array` | `array` |
| `union` | not supported (yet) |
| `decimal` | `string` |
| `date` | `date` |
| `varchar` | `string` |
| `char` | `string` |

::::{note}
While {{es}} understands Hive types up to version 2.0, it is backwards compatible with Hive 1.0.
::::

It is worth mentioning that rich data types available only in {{es}}, such as [`GeoPoint`](elasticsearch://reference/elasticsearch/mapping-reference/geo-point.md) or [`GeoShape`](elasticsearch://reference/elasticsearch/mapping-reference/geo-shape.md), are supported by converting their structure into the primitives available in the table above. For example, based on its storage, a `geo_point` might be returned as a `string` or an `array`.
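
For instance, a hypothetical table reading documents whose `location` field is a `geo_point` could declare the field through one of those primitives:

```sql
-- 'location' is a geo_point in Elasticsearch; depending on how it was
-- stored, declare it as an array of doubles (lon/lat) or as a string
CREATE EXTERNAL TABLE points (name STRING, location ARRAY<DOUBLE>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'geo/points');
```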
