For the past four years or so, the term “Big Data” has been thrown around loosely at marketing and tech conferences, in publications, in blog articles, and everywhere in between. The buzzword has since been defined and classified, but one distributed storage and processing ecosystem might as well be synonymous with it: Apache Hadoop.

Hadoop is composed of a wide array of packages and tools that can bulk ingest and process data across distributed clusters of commodity hardware and/or container technologies. So it comes as no surprise that organizations have been combining Hadoop, for deeper analytics and “actionable insights”, with Elasticsearch, for robust log and performance-metric analysis.

In this tutorial we will use the Elastic Hadoop connector to integrate Elasticsearch with a Hadoop cluster and introduce how external tables in Hive work with Elasticsearch mappings and bulk-loaded docs.

Example Elasticsearch + Hadoop Flow

[Image: Example Elasticsearch + Hadoop flow diagram]

Spin Up a Qbox Elasticsearch Cluster

Setting up and configuring Elasticsearch can be quite the undertaking, and Hadoop is an even greater one. For this reason, we prefer to spin up a Qbox cluster to handle the Elasticsearch heavy lifting. You can get started with a Qbox cluster here. For this tutorial we will be using Elasticsearch major version 5.
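Once the cluster is up, it is worth confirming that it is reachable and running the expected major version before wiring Hadoop to it. A quick sanity check against the cluster endpoint (replace <hostname> with your Qbox endpoint) should return a JSON body whose version.number starts with 5:

curl -XGET http://<hostname>:9200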


Download and Install a Hadoop Distribution

There are plenty of Hadoop distributions to choose from and a plethora of ways to deploy them (Ansible, Whirr, etc.). For the sake of simplicity, we will use the Cloudera distribution with Cloudera Manager (version 5), running on Ubuntu 16.04.

Add the CDH5 Repo (Ubuntu 16.04)

$ sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list' -O /etc/apt/sources.list.d/cloudera.list
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update

Install Cloudera Manager

$ sudo apt-get -y install cloudera-manager-daemons cloudera-manager-server cloudera-manager-agent

Start the Agents on All Hosts in the Cluster

$ sudo service cloudera-scm-agent start

Start the Cloudera Manager Server Host

$ sudo service cloudera-scm-server start
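If the Admin Console does not come up right away, tailing the server log (the default Cloudera Manager log location) usually reveals why:

$ sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log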

Log into Cloudera Manager Admin Console

In a web browser, go to “http://<serverhost>:7180” and use admin as both the username and password. Once logged in, accept the Terms & Conditions and select the Basic Edition license type.

When you get to the Cluster Setup, be sure to select the Core Hadoop services:

[Image: Cloudera Manager cluster setup showing the Core Hadoop service selection]

Follow the Cloudera wizard’s on-screen instructions. For more granular configuration and tuning settings, refer to the official Cloudera documentation.

Download the ES Hadoop Connector

Go ahead and download the ES Hadoop Connector so that the Qbox ES cluster can communicate with your Hadoop cluster.

$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.5.1.zip && unzip /tmp/elasticsearch-hadoop-5.5.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp/elasticsearch-hadoop-5.5.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp
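To confirm the connector landed in HDFS where Hive jobs can pick it up:

$ hdfs dfs -ls /tmp/elasticsearch-hadoop-5.5.1.jar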

Register the JAR

1. Register the JAR through the hive.aux.jars.path parameter of hive-site.xml (on CDH this typically lives in /etc/hive/conf):

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/elasticsearch-hadoop.jar</value>
</property>

2. Restart the hiveserver2 service from the Cloudera Manager.
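Alternatively, if you only need the connector for a quick test, Hive can register it for the current session instead of globally (the path below assumes the copy we made to /tmp earlier):

hive> ADD JAR /tmp/elasticsearch-hadoop-5.5.1.jar;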

Create an Elasticsearch Index

curl -v -XPUT http://<hostname>:9200/hive_test
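A successful request returns {"acknowledged":true}; you can also verify the index exists by listing all indices:

curl -XGET "http://<hostname>:9200/_cat/indices?v"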

Create the Index Mappings

curl -XPUT "http://<hostname>:9200/hive_test/_mapping/dog" -d '
{
   "dog": {
      "properties": {
         "breed": {
            "type": "keyword"
         },
         "sex": {
            "type": "keyword"
         }
      }
   }
}'

Note: Elasticsearch 5.x deprecates the string type in favor of text and keyword; keyword is the right fit for exact values like these.
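To double-check that the mapping was applied as intended:

curl -XGET "http://<hostname>:9200/hive_test/_mapping?pretty"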

Add Docs to the Index

curl -v -XPOST http://<hostname>:9200/_bulk -d '
{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "shibainu", "sex": "male" }
{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "frenchie", "sex": "female" }
{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "corgi", "sex": "male" }
'

Note: the bulk API requires the request body to end with a newline, hence the closing quote on its own line.

From the Hive> prompt:

1. Create a Hive Database

CREATE DATABASE es_testdb;
USE es_testdb;

2. Create an external Hive table so data can be pulled in from Elasticsearch.

CREATE EXTERNAL TABLE es_testtbl(
  breed STRING,
  sex STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'hive_test/dog',
  'es.nodes' = '<hostname>:9200',
  'es.index.auto.create' = 'true',
  'es.mapping.names' = 'breed:breed, sex:sex');

Note: Replace ‘<hostname>’ with the hostname or IP of the Elasticsearch cluster’s master-eligible node.
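The same external table also works in the write direction: rows inserted through Hive are indexed into Elasticsearch by the storage handler. As a minimal sketch, assuming a regular Hive table named source_docs with matching breed and sex columns (hypothetical, for illustration):

INSERT OVERWRITE TABLE es_testtbl
SELECT s.breed, s.sex
FROM source_docs s;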

3. Query ES From Hive

SELECT * FROM es_testtbl;
shibainu    male
frenchie    female
corgi    male
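Standard HiveQL predicates work against the external table as well; the filtering below happens on the Hive side after the rows are read from Elasticsearch:

SELECT breed FROM es_testtbl WHERE sex = 'male';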

Summary

In this tutorial we used the Elastic Hadoop connector to integrate Elasticsearch with a Hadoop cluster and introduced how external tables in Hive work with Elasticsearch mappings and bulk-loaded docs. Beyond these experiments, offloading your ES indices into Hive can serve as a first step toward an additional disaster-recovery strategy, thanks to the highly distributed and replicated nature of HDFS and Hadoop.

In the next tutorial, we will go a little deeper and learn how to import data sets from existing delimiter-separated values files into Elasticsearch and Hive for deeper analysis.


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision a replicated cluster.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
