In the previous tutorial, we learned how to set up a Qbox cluster with the ES-Hadoop connector to interface with Hive, Hadoop’s data warehouse component, and run SQL queries on top of Elasticsearch. Offloading and manipulating ES indices with Hive opens up a multitude of possibilities for high-performing, deeper analysis across large data sets.

In this tutorial we take it a step further: we use Logstash to import an existing data set, a CSV file, into Elasticsearch, and then run batch analytics on it from Hadoop’s powerful ecosystem.

Spin Up a Qbox Elasticsearch Cluster

Setting up and configuring Elasticsearch can be quite an undertaking, and Hadoop is an even greater one. For this reason, we prefer to spin up a Qbox cluster to handle the Elasticsearch heavy lifting. You can get started with a Qbox cluster here. For this tutorial we will be using Elasticsearch major version 5, and we assume you already have an ELK setup in place as well.

Download and Install a Hadoop Distribution

There are plenty of Hadoop distributions to choose from, and a plethora of ways to deploy them (Ansible, Whirr, etc.). For the sake of simplicity, we use the Cloudera distribution and its Cloudera Manager (version 5) on Ubuntu 16.04.


Add the CDH5 Repo (Ubuntu 16.04)

$ sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list' -O /etc/apt/sources.list.d/cloudera.list
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update

Install Cloudera Manager

$ sudo apt-get -y install cloudera-manager-daemons cloudera-manager-server cloudera-manager-agent

Start the Agents on All Hosts in the Cluster

$ sudo service cloudera-scm-agent start

Start the Cloudera Manager Server Host

$ sudo service cloudera-scm-server start
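
If the admin console in the next step is not reachable right away, the server can take a minute or two to initialize. You can follow its log while it starts (the path below is the Cloudera Manager default and may differ on your system):

$ sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log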

Log into Cloudera Manager Admin Console

In a web browser, go to http://<serverhost>:7180 and use admin as both the username and password. Once logged in, accept the Terms & Conditions and select the Basic Edition license type.

When you get to the Cluster Setup, be sure to select the Core Hadoop services:

[Screenshot: Cloudera Manager cluster setup with the Core Hadoop services selected]

Follow the Cloudera wizard's on-screen instructions. For more granular configuration and tuning settings, please refer to the official Cloudera documentation.

Download the ES Hadoop Connector

Go ahead and download the ES Hadoop Connector so that the Qbox ES cluster can communicate with your Hadoop cluster.

$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.5.1.zip && unzip /tmp/elasticsearch-hadoop-5.5.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp/elasticsearch-hadoop-5.5.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp
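
As a quick sanity check, confirm that the JAR is now available in HDFS before moving on:

$ hdfs dfs -ls /tmp/elasticsearch-hadoop-5.5.1.jar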

Register the JAR

1. Register the JAR through the hive.aux.jars.path parameter of hive-site.xml in /usr/lib/hive/lib:

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/elasticsearch-hadoop.jar</value>
</property>

2. Restart the hiveserver2 service from Cloudera Manager.
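
Alternatively, for a one-off session you can register the connector directly at the hive> prompt instead of editing hive-site.xml (using the local path from the download step above):

hive> ADD JAR /tmp/elasticsearch-hadoop-5.5.1.jar;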

Grab a Sample Dataset

We’ll need some data to play with, so in this example we will use a sample data set of historic Wal-Mart store openings. Download the CSV to the host where Logstash will run:

$ wget http://users.econ.umn.edu/~holmes/data/WalMart/store_openings.csv

Determine the Columns We Want to Work With

Run the following command to show the first lines of the CSV file, noting the column names in the first row:

$ head store_openings.csv
"storenum","OPENDATE","date_super","conversion","st","county","STREETADDR","STRCITY","STRSTATE","ZIPCODE","type_store"
1,7/1/1962 ,3/1/1997 ,1,"05","007","2110 WEST WALNUT","Rogers","AR","72756","Supercenter"
2,8/1/1964 ,3/1/1996 ,1,"05","009","1417 HWY 62/65 N","Harrison","AR","72601","Supercenter"
3,4/12/1988 ,3/1/2000 ,1,"13","011","30983 HWY 441 SOUTH","Commerce","GA","30529","Supercenter"
4,8/1/1965 ,3/1/2002 ,1,"05","007","2901 HWY 412 EAST","Siloam Springs","AR","72761","Supercenter"
5,5/1/1972 ,3/1/1994 ,1,"05","045","1155 HWY 65 NORTH","Conway","AR","72032","Supercenter"
7,10/1/1967 ,,,"05","119","3801 CAMP ROBINSON RD.","North Little Rock","AR","72118","Wal-Mart"
8,10/1/1967 ,3/1/1993 ,1,"05","029","1621 NORTH BUSINESS 9","Morrilton","AR","72110","Supercenter"
9,3/1/1968 ,3/1/2000 ,1,"29","143","1303 SOUTH MAIN","Sikeston","MO","63801","Supercenter"
10,7/1/1968 ,3/1/1998 ,1,"40","021","2020 SOUTH MUSKOGEE","Tahlequah","OK","74464","Supercenter"

Create the ES Index and Mappings

Create the index, then apply the mapping for the mappings document type. Note that Elasticsearch 5.x uses the keyword (or text) type in place of the older string type:

curl -v -XPUT "http://<hostname>:9200/storeopenings"
curl -XPUT "http://<hostname>:9200/storeopenings/_mapping/mappings" -H 'Content-Type: application/json' -d '
{
   "mappings": {
      "properties": {
         "storenum": {
            "type": "integer"
         },
         "OPENDATE": {
            "type": "date",
            "format": "strict_date_optional_time"
         },
         "STREETADR": {
            "type": "keyword"
         },
         "STRCITY": {
            "type": "keyword"
         },
         "STRSTATE": {
            "type": "keyword"
         },
         "ZIPCODE": {
            "type": "integer"
         }
      }
   }
}'
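
To confirm the index and mapping took effect, you can retrieve the mapping back from the cluster:

curl -XGET "http://<hostname>:9200/storeopenings/_mapping?pretty"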

Import Dataset to ES with Logstash

To import the CSV into our Elasticsearch cluster, we will use Logstash. Save the following pipeline configuration to a file (for example, store_openings.conf):

input {
    file {
        # The file input requires an absolute path; point it at wherever you downloaded the CSV.
        path => ["/path/to/store_openings.csv"]
        start_position => "beginning"
        # Re-read the file from the start if the pipeline is re-run.
        sincedb_path => "/dev/null"
    }
}
filter {
    # Name every column in the file so the values line up with the correct fields.
    csv {
        columns => [
            "storenum",
            "OPENDATE",
            "date_super",
            "conversion",
            "st",
            "county",
            "STREETADR",
            "STRCITY",
            "STRSTATE",
            "ZIPCODE",
            "type_store"
        ]
    }
    # Skip the header row.
    if [storenum] == "storenum" {
        drop { }
    }
    # Trim the trailing space in the raw dates, then parse them so they satisfy the date mapping.
    mutate {
        strip => ["OPENDATE"]
    }
    date {
        match => ["OPENDATE", "M/d/yyyy"]
        target => "OPENDATE"
    }
    # Drop the columns we are not indexing.
    mutate {
        remove_field => ["date_super", "conversion", "st", "county", "type_store", "message"]
    }
}
output {
    stdout { codec => rubydebug }
    elasticsearch {
        action => "index"
        hosts => ["<hostname>:9200"]
        index => "storeopenings"
        document_type => "mappings"
        workers => 1
    }
}
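
With the configuration saved (we called it store_openings.conf above), run Logstash against it and then spot-check that documents made it into the index. The Logstash binary path below assumes a standard package install; adjust it for your setup:

$ /usr/share/logstash/bin/logstash -f store_openings.conf
$ curl -XGET "http://<hostname>:9200/storeopenings/_count?pretty"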

From the Hive> Prompt:


1. Create a Hive Database.

CREATE DATABASE es_store_db;
USE es_store_db;

2. Create an external Hive table so data can be pulled in from Elasticsearch.

CREATE EXTERNAL TABLE storeopenings (
    store_num int,
    open_date date,
    street_adr string,
    str_city string,
    str_state string,
    str_zipcode bigint)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'storeopenings/mappings',
'es.nodes' = '<hostname>:9200',
'es.index.auto.create' = 'TRUE',
'es.mapping.names' = 'store_num:storenum, open_date:OPENDATE, street_adr:STREETADR, str_city:STRCITY, str_state:STRSTATE, str_zipcode:ZIPCODE');

3. Query the table.

SELECT * FROM storeopenings LIMIT 50;
1,7/1/62,2110 WEST WALNUT,Rogers,AR,72756
2,8/1/64,1417 HWY 62/65 N,Harrison,AR,72601
3,4/12/88,30983 HWY 441 SOUTH,Commerce,GA,30529
4,8/1/65,2901 HWY 412 EAST,Siloam Springs,AR,72761
5,5/1/72,1155 HWY 65 NORTH,Conway,AR,72032
7,10/1/67,3801 CAMP ROBINSON RD.,North Little Rock,AR,72118
8,10/1/67,1621 NORTH BUSINESS 9,Morrilton,AR,72110
9,3/1/68,1303 SOUTH MAIN,Sikeston,MO,63801
10,7/1/68,2020 SOUTH MUSKOGEE,Tahlequah,OK,74464
11,3/1/68,65 WAL-MART DRIVE,Mountain Home,AR,72653
12,7/1/68,1500 LYNN RIGGS BLVD,Claremore,OK,74017
13,11/1/68,2705 GRAND AVE,Carthage,MO,64836
14,4/1/69,1800 S JEFFERSON,Lebanon,MO,65536
15,5/1/69,1310 PREACHER RD/HGWY 160,West Plains,MO,65775
16,4/1/69,2214 FAYETTEVILLE RD,Van Buren,AR,72956
17,5/1/69,3200 LUSK DRIVE,Neosho,MO,64850
18,11/1/69,2500 MALCOLM ST/HWY 67 NORTH,Newport,AR,72112
19,4/3/90,333 S WESTWOOD,Poplar Bluff,MO,63901
20,10/1/70,1712 EAST OHIO,Clinton,MO,64735
21,3/1/70,185 ST ROBERT BLVD,St. Robert,MO,65584
22,10/1/70,4901 SO. MILL ROAD,Pryor,OK,74361
23,11/1/70,1201 N SERVICE ROAD EAST,Ruston,LA,71270
24,4/1/71,2000 JOHN HARDEN DR,Jacksonville,AR,72076
25,2/1/71,4820 SO. CLARK ST,Mexico,MO,65265
26,11/1/70,3450 S. 4TH TRAFFICWAY,Leavenworth,KS,66048
27,2/1/71,1101 HWY 32 WEST,Salem,MO,65560
28,5/1/71,2415 N.W. MAIN ST,Miami,OK,74354
29,8/1/71,724 STADIUM WEST BLVD,Jefferson City,MO,65109
30,11/1/71,2025 BUS. HWY 60 WEST,Dexter,MO,63841
31,6/1/71,3108 N BROADWAY,Poteau,OK,74953
32,6/1/71,2050 WEST HWY 76,Branson,MO,65616
33,6/1/71,1710 SO. 4TH ST,Nashville,AR,71852
34,11/1/71,2250 LINCOLN AVENUE,Nevada,MO,64772
35,10/1/71,101 EAST BLUEMONT AVENUE,Manhattan,KS,66502
36,11/1/71,2802 WEST KINGS HIGHWAY,Paragould,AR,72450
37,9/1/71,701 WALTON DRIVE,Farmington,MO,63640
38,12/9/71,1907 SE WASHINGTON ST.,Idabel,OK,74745
39,3/1/72,2400 SOUTH MAIN,Fort Scott,KS,66701
40,11/1/71,1301 HWY 24 EAST,Moberly,MO,65270
41,5/1/72,4000 GREEN COUNTRY RD,Bartlesville,OK,74006
42,2/1/73,1705 WEST 11TH STREET,Coffeyville,KS,67337
43,5/1/72,521 EAST CHESTNUT ST,Junction City,KS,66441
44,2/1/72,1802 SOUTH BUSINESS HWY 54,Eldon,MO,65026
45,8/1/72,1815 EAST HIGHLAND DRIVE,Jonesboro,AR,72401
46,8/1/72,2451 S. SPRINGFIELD,Bolivar,MO,65613
47,7/1/72,1101 WEST RUTH AVE,Sallisaw,OK,74955
48,11/1/72,885 HIGHWAY 60,Monett,MO,65708
49,10/1/72,1911 E.JACKSON ST.,Hugo,OK,74743
50,9/1/73,268 SO. 7TH ST.,Vinita,OK,74301
Time taken: 0.043 seconds, Fetched: 11 row(s)
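
With the external table in place, you are not limited to simple SELECTs. As a sketch of the kind of batch analysis Hive enables, here is an aggregation over the same Elasticsearch-backed table, using the column names defined above:

SELECT str_state, COUNT(*) AS num_stores
FROM storeopenings
GROUP BY str_state
ORDER BY num_stores DESC;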

Summary

In this tutorial we learned how to import an existing CSV file into Elasticsearch with Logstash and then analyze it with Hadoop’s Hive. Although Hive offers the convenience of running familiar SQL-like queries on top of Hadoop’s core MapReduce component, it can get slow for extremely large datasets.

For datasets in the terabyte to petabyte range and beyond, Apache Spark delivers a highly performant, real-time processing layer to Hadoop that can be up to 100x faster than MapReduce. 


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.