@@ -15,10 +15,7 @@ Install this demo on an existing Kubernetes cluster:
$ stackablectl demo install hbase-hdfs-load-cycling-data
----
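
While the stacklets start up, you can check on their progress.
This is a suggested extra step rather than part of the demo instructions; it uses the same command that appears again further down this page:

[source,console]
----
$ stackablectl stacklet list
----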
- [WARNING]
- ====
- This demo should not be run alongside other demos.
- ====
+ WARNING: This demo should not be run alongside other demos.
[#system-requirements]
== System requirements
@@ -35,11 +32,11 @@ This demo will
* Install the required Stackable operators.
* Spin up the following data products:
- ** *Hbase :* An open source distributed, scalable, big data store. This demo uses it to store the
+ ** *HBase:* An open source distributed, scalable, big data store. This demo uses it to store the
{kaggle}[cyclist dataset] and enable access.
- ** *HDFS:* A distributed file system used to intermediately store the dataset before importing it into Hbase
+ ** *HDFS:* A distributed file system used as intermediate storage for the dataset before importing it into HBase.
* Use {distcp}[distcp] to copy a {kaggle}[cyclist dataset] from an S3 bucket into HDFS.
- * Create HFiles, a File format for hbase consisting of sorted key/value pairs. Both keys and values are byte arrays.
+ * Create HFiles, a file format for HBase consisting of sorted key/value pairs. Both keys and values are byte arrays.
* Load HFiles into an existing table via the `ImportTsv` utility, which will load data in `TSV` or `CSV` format into
HBase.
* Query data via the `hbase` shell, which is an interactive shell to execute commands on the created table
@@ -87,10 +84,11 @@ This demo will run two jobs to automatically load data.
=== distcp-cycling-data
- {distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying. It uses MapReduce to effect its
- distribution, error handling, recovery, and reporting. It expands a list of files and directories into input to map
- tasks, each of which will copy a partition of the files specified in the source list. Therefore, the first Job uses
- DistCp to copy data from a S3 bucket into HDFS. Below, you'll see parts from the logs.
+ {distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying.
+ It uses MapReduce to effect its distribution, error handling, recovery, and reporting.
+ It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
+ Therefore, the first Job uses DistCp to copy data from an S3 bucket into HDFS.
+ Below, you'll see parts from the logs.
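
Before the log excerpts, here is a rough sketch of what such a DistCp copy could look like if invoked by hand.
It is for illustration only: the S3 source prefix is taken from the logs below, while the HDFS target directory `/data/raw` is an assumption based on the HDFS browser screenshots further down this page.

[source,console]
----
# Sketch only: the demo job performs this copy for you.
$ hadoop distcp \
    s3a://public-backup-nyc-tlc/cycling-tripdata/ \
    /data/raw
----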
[source]
----
@@ -111,11 +109,12 @@ Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.g
The second Job consists of 2 steps.
- First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see {importtsv}[ImportTsv Docs]) to create a table and
- Hfiles. Hfile is an Hbase dedicated file format which is performance optimized for hbase. It stores meta-information
- about the data and thus increases the performance of hbase. When connecting to the hbase master, opening a hbase shell
- and executing `list`, you will see the created table. However, it'll contain 0 rows at this point. You can connect to
- the shell via:
+ First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see {importtsv}[ImportTsv Docs]) to create a table and HFiles.
+ HFile is a dedicated, performance-optimized file format for HBase.
+ It stores meta-information about the data and thus increases the performance of HBase.
+ When connecting to the HBase master, opening an HBase shell and executing `list`, you will see the created table.
+ However, it'll contain 0 rows at this point.
+ You can connect to the shell via:
[source,console]
----
@@ -163,7 +162,7 @@ Took 13.4666 seconds
== Inspecting the Table
- You can now use the table and the data. You can use all available hbase shell commands.
+ You can now use the table and its data with all of the available HBase shell commands.
[source,sql]
----
@@ -191,15 +190,15 @@ COLUMN FAMILIES DESCRIPTION
{NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
----
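
Beyond `describe`, any other HBase shell command works here.
As a small sketch, assuming the table created by the job is named `cycling-tripdata` (the name may differ in your installation), you could peek at a few rows of the `started_at` column family shown above:

[source,sql]
----
scan 'cycling-tripdata', { COLUMNS => 'started_at', LIMIT => 5 }
----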
- == Accessing the Hbase web interface
+ == Accessing the HBase web interface
[TIP]
====
Run `stackablectl stacklet list` to get the address of the _ui-http_ endpoint.
If the UI is unavailable, do a port-forward `kubectl port-forward hbase-master-default-0 16010`.
====
- The Hbase web UI will give you information on the status and metrics of your Hbase cluster. See below for the start page.
+ The HBase web UI will give you information on the status and metrics of your HBase cluster. See below for the start page.
image::hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[]
@@ -209,7 +208,7 @@ image::hbase-hdfs-load-cycling-data/hbase-table-ui.png[]
== Accessing the HDFS web interface
- You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of the namenodes.
+ You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of the namenodes.
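
If that link is not reachable from your machine, a port-forward analogous to the HBase one above might help.
This is a sketch: the pod name assumes an HDFS stacklet named `hdfs` with the `default` role group, and 9870 is the default namenode HTTP port.

[source,console]
----
$ kubectl port-forward hdfs-namenode-default-0 9870
----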
Below you will see the overview of your HDFS cluster.
@@ -223,7 +222,8 @@ You can also browse the file system by clicking on the `Utilities` tab and selec
image::hbase-hdfs-load-cycling-data/hdfs-data.png[]
- Navigate in the file system to the folder `data` and then the `raw` folder. Here you can find the raw data from the distcp job.
+ Navigate in the file system to the folder `data` and then the `raw` folder.
+ Here you can find the raw data from the distcp job.
image::hbase-hdfs-load-cycling-data/hdfs-data-raw.png[]
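
If you prefer the command line, the same listing could also be done from inside a namenode pod.
This is a sketch: the pod name and the `/data/raw` path are assumptions based on the screenshots above.

[source,console]
----
$ kubectl exec -it hdfs-namenode-default-0 -- hdfs dfs -ls /data/raw
----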