
How to Offload Elasticsearch Indices to Hive (Hadoop)
Posted by Chuong Nguyen, December 13, 2017

For the past four years or so, the term “Big Data” has been loosely thrown around marketing and tech conferences, publications, blog articles, and everywhere in between. The buzzword has since been defined and classified, but one particular distributed storage and processing ecosystem might as well be synonymous with it: Apache Hadoop.
Hadoop is composed of a very wide array of packages and tools that can bulk ingest and process data with the power of distributed clusters of commodity hardware and/or container technologies. So it comes as no surprise that organizations have been combining Hadoop’s deeper analytics and “actionable insights” with Elasticsearch’s robust log and performance metric analysis.
In this tutorial, we will use the Elastic Hadoop connector to integrate Elasticsearch with a Hadoop cluster and show how external tables in Hive work with Elasticsearch mappings and bulk-loaded docs.
Example Elasticsearch + Hadoop Flow

Spin Up a Qbox Elasticsearch Cluster
Setting up and configuring Elasticsearch can be quite the undertaking; Hadoop is an even greater one. For this reason, we prefer to spin up a Qbox cluster to handle the Elasticsearch heavy lifting. You can get started with a Qbox Cluster here. For this tutorial we will be using Elasticsearch major version 5.
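Once the cluster is provisioned, it’s worth confirming you can reach it before wiring anything else up. Here is a minimal sanity check, assuming <hostname> stands in for the endpoint Qbox gives you (if your cluster has HTTP basic auth enabled, add credentials with curl’s -u flag); it should return the cluster name and a 5.x version number:
# quick connectivity check against the Elasticsearch HTTP endpoint
curl http://<hostname>:9200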
Download and Install a Hadoop Distribution
There are plenty of Hadoop distributions to choose from, and a plethora of ways to deploy them (Ansible, Whirr, etc.). For the sake of simplicity, however, we prefer the Cloudera distribution with its Cloudera Manager (version 5), running on Ubuntu 16.04.
Add the CDH5 Repo (Ubuntu 16.04)
$ sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list' \
    -O /etc/apt/sources.list.d/cloudera.list
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update
Install Cloudera Manager
$ sudo apt-get -y install cloudera-manager-daemons cloudera-manager-server cloudera-manager-agent
Start the Agents on All Hosts in the Cluster
$ sudo service cloudera-scm-agent start
Start the Cloudera Manager Server Host
$ sudo service cloudera-scm-server start
Log into Cloudera Manager Admin Console
In a web browser, go to “http://<serverhost>:7180” and use admin as both the username and password. Once logged in, accept the Terms & Conditions and select the Basic Edition license type.
When you get to the Cluster Setup, be sure to select the Core Hadoop services:

Follow the Cloudera wizard’s on-screen instructions. For more granular configuration and tuning settings, please refer to the official Cloudera documentation.
Download the ES Hadoop Connector
Go ahead and download the ES Hadoop Connector so that the Qbox ES cluster can communicate with your Hadoop cluster.
$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.5.1.zip && unzip /tmp/elasticsearch-hadoop-5.5.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp/elasticsearch-hadoop-5.5.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp
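To confirm the connector JAR actually landed in HDFS where Hive can find it, a quick listing works as a sanity check:
# list the JAR we just copied into HDFS
$ hdfs dfs -ls /tmp/elasticsearch-hadoop-5.5.1.jar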
Register the JAR
- Register the JAR through the hive.aux.jars.path parameter of hive-site.xml in /usr/lib/hive/lib:
<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/elasticsearch-hadoop.jar</value>
</property>
- Restart the hiveserver2 service from the Cloudera Manager.
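If your Hive installation is package-based rather than managed through Cloudera Manager, a command-line restart is an alternative; this is a sketch assuming the service name used by the CDH packages:
# restart HiveServer2 so it picks up the newly registered JAR
$ sudo service hive-server2 restart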
Create an Elasticsearch Index
Now, let’s create an ES index that will be offloaded to Hive (Hadoop):
curl -v -XPUT http://<hostname>:9200/hive_test
Create the Index Mappings
We also need to create an index mapping for our test “dog” type with several properties:
curl -XPUT "http://<hostname>:9200/hive_test/_mapping" -d { "dog": { "properties": { "breed": { "type": "string" }, "sex": { "type": "string" } } } }'
Add Docs to the Index
Next, let’s add some docs to our index using bulk indexing operation:
curl -v -XPOST http://<hostname>:9200/_bulk -d '{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "shibainu", "sex": "male" }
{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "frenchie", "sex": "female" }
{ "create": { "_index": "hive_test", "_type": "dog" } }
{ "breed": "corgi", "sex": "male" }
'
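As a quick check that the bulk request indexed all three docs, a count query should report 3 (allow a moment for the index to refresh before the docs become visible):
# count the docs in the hive_test index
curl -XGET "http://<hostname>:9200/hive_test/_count?pretty"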
We’ll need to set up Hive to work with our Elasticsearch cluster. To do this, run the following from the hive> prompt:
1. Create a Hive Database:
CREATE DATABASE es_testdb;
USE es_testdb;
2. Create an external Hive table so data can be pulled in from Elasticsearch:
CREATE EXTERNAL TABLE es_testtbl(
  breed STRING,
  sex STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'hive_test/dog',
  'es.nodes' = '<hostname>:9200',
  'es.index.auto.create' = 'TRUE',
  'es.mapping.names' = 'breed:breed, sex:sex');
Note: Replace ‘<hostname>’ with the hostname or IP of the Elasticsearch cluster’s master-eligible node.
3. Query ES From Hive
SELECT * FROM es_testtbl;
shibainu    male
frenchie    female
corgi       male
That’s it! Now you can use the Hive SQL engine to access data from your Elasticsearch index and process it with the powerful data analytics features of Hive and Hadoop.
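As a small taste of that, here is a sketch of the kind of aggregation Hive can run over the ES-backed table, using only the database, table, and columns created above and wrapped in hive -e so it can be run straight from the shell. With the three sample docs, it should return one female and two males:
# group the sample docs by sex and count them
$ hive -e "USE es_testdb; SELECT sex, COUNT(*) AS num_dogs FROM es_testtbl GROUP BY sex;"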
Summary
In this tutorial, we utilized the Elastic Hadoop connector to integrate Elasticsearch with a Hadoop cluster and introduced how external tables in Hive work with Elasticsearch mappings and bulk-loaded docs. Beyond these experiments, offloading your ES indices into Hive can also serve as a first step toward an additional disaster-recovery strategy, thanks to the highly distributed and replicated nature of HDFS and Hadoop.
In the next tutorial, we will go a little deeper and learn how to import data sets from existing delimiter-separated values files into Elasticsearch and Hive for deeper analysis.
Other Articles
- How to Lock Down Elasticsearch, Kibana, and Logstash and Maintain Security
- How to Secure Your Elasticsearch with Your Own Authentication Plugin
- How to Index NMAP Port Scan Results into Elasticsearch
- How to Import from CSV into Elasticsearch via Logstash and Sincedb
- Introduction to the Logstash Translate Filter
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon or Microsoft Azure data centers. And you can now provision a replicated cluster.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.