
How to Import a CSV into Elasticsearch and Run SQL Queries with Hive (Hadoop)
Posted by Chuong Nguyen, December 14, 2017

In the previous tutorial, we learned how to set up a Qbox cluster with the ES-Hadoop connector so that Hadoop’s data warehouse component, Hive, could run SQL queries on top of Elasticsearch. Offloading and manipulating ES indices with Hive opens up a multitude of possibilities for high-performing, deeper analysis across large data sets.
In this tutorial we will take it a step further, by using Logstash to import an existing data set in the form of a CSV file into Elasticsearch in order to perform later batch-analytics in Hadoop’s powerful ecosystem.
Spin Up a Qbox Elasticsearch Cluster
Setting up and configuring Elasticsearch can be quite an undertaking, and Hadoop is an even greater one. For this reason, we prefer to spin up a Qbox cluster to handle the Elasticsearch heavy lifting. You can get started with a Qbox Cluster here. For this tutorial we will be using Elasticsearch major version 5, and we assume you already have an ELK stack set up as well.
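Before moving on, it is worth confirming that the cluster is reachable and actually running a 5.x release. A minimal check, using a placeholder endpoint and the credentials generated for your Qbox cluster:

$ curl -u <username>:<password> 'https://<your-qbox-endpoint>'

The JSON response should report a version.number in the 5.x range.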
Download and Install a Hadoop Distribution
There are plenty of Hadoop distributions to choose from, and a plethora of ways to deploy them (Ansible, Whirr, etc.), however for the sake of simplicity we prefer to use the Cloudera distribution and its Cloudera Manager (Version 5) and are running Ubuntu 16.04.
Add the CDH5 Repo (Ubuntu 16.04)
$ sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list' -O /etc/apt/sources.list.d/cloudera.list
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update
Install Cloudera Manager
$ sudo apt-get -y install cloudera-manager-daemons cloudera-manager-server cloudera-manager-agent
Start the Agents on All Hosts in the Cluster
$ sudo service cloudera-scm-agent start
Start the Cloudera Manager Server Host
$ sudo service cloudera-scm-server start
Log into Cloudera Manager Admin Console
In a web browser, go to http://<serverhost>:7180 and use admin as both the username and password. Once logged in, accept the Terms & Conditions and select the Basic Edition license type.
When you get to the Cluster Setup, be sure to select the Core Hadoop services:

Follow the Cloudera wizard’s on-screen instructions. For more granular configuration and tuning settings, please refer to the official Cloudera documentation.
Download the ES Hadoop Connector
Go ahead and download the ES Hadoop Connector so that the Qbox ES cluster can communicate with your Hadoop cluster.
$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.5.1.zip && unzip /tmp/elasticsearch-hadoop-5.5.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp/elasticsearch-hadoop-5.5.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.5.1/dist/elasticsearch-hadoop-5.5.1.jar /tmp
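If you want to confirm that the connector JAR actually landed in HDFS before wiring it into Hive, a quick listing of the same /tmp path used above should show it:

$ hdfs dfs -ls /tmp/elasticsearch-hadoop-5.5.1.jar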
Register the JAR
- Register the JAR through the hive.aux.jars.path parameter of hive-site.xml in /usr/lib/hive/lib:

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/elasticsearch-hadoop.jar</value>
</property>

- Restart the hiveserver2 service from the Cloudera Manager.
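Alternatively, if you only need the connector for an interactive session rather than registering it cluster-wide, you can add it ad hoc from the Hive prompt (the path below assumes the /tmp location used earlier):

ADD JAR /tmp/elasticsearch-hadoop-5.5.1.jar;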
Grab a Sample Dataset
We’ll need some data to play with, so in this example we will use a sample data set of historical Wal-Mart store openings. Download the CSV to the host where Logstash will run (in our case, the ES master):
$ wget http://users.econ.umn.edu/~holmes/data/WalMart/store_openings.csv
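A quick sanity check that the file downloaded intact, for example by counting its rows, doesn’t hurt before we start parsing it:

$ wc -l store_openings.csv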
Determine the Columns We Want to Work With
Run the following command to show the first 10 lines of the CSV file, noting the column names in the first row:
$ head store_openings.csv
"storenum","OPENDATE","date_super","conversion","st","county","STREETADDR","STRCITY","STRSTATE","ZIPCODE","type_store" 1,7/1/1962 ,3/1/1997 ,1,"05","007","2110 WEST WALNUT","Rogers","AR","72756","Supercenter" 2,8/1/1964 ,3/1/1996 ,1,"05","009","1417 HWY 62/65 N","Harrison","AR","72601","Supercenter" 3,4/12/1988 ,3/1/2000 ,1,"13","011","30983 HWY 441 SOUTH","Commerce","GA","30529","Supercenter" 4,8/1/1965 ,3/1/2002 ,1,"05","007","2901 HWY 412 EAST","Siloam Springs","AR","72761","Supercenter" 5,5/1/1972 ,3/1/1994 ,1,"05","045","1155 HWY 65 NORTH","Conway","AR","72032","Supercenter" 7,10/1/1967 ,,,"05","119","3801 CAMP ROBINSON RD.","North Little Rock","AR","72118","Wal-Mart" 8,10/1/1967 ,3/1/1993 ,1,"05","029","1621 NORTH BUSINESS 9","Morrilton","AR","72110","Supercenter" 9,3/1/1968 ,3/1/2000 ,1,"29","143","1303 SOUTH MAIN","Sikeston","MO","63801","Supercenter" 10,7/1/1968 ,3/1/1998 ,1,"40","021","2020 SOUTH MUSKOGEE","Tahlequah","OK","74464","Supercenter"
Create the ES Index and Mappings
First create the index, then register a mapping for the type we will index into (named mappings here to match the Logstash and Hive configuration below):

curl -v -XPUT "http://<hostname>:9200/storeopenings"

curl -XPUT "http://<hostname>:9200/storeopenings/_mapping/mappings" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "storenum":  { "type": "integer" },
    "OPENDATE":  { "type": "date", "format": "strict_date_optional_time" },
    "STREETADR": { "type": "keyword" },
    "STRCITY":   { "type": "keyword" },
    "STRSTATE":  { "type": "keyword" },
    "ZIPCODE":   { "type": "integer" }
  }
}'

Note that OPENDATE expects ISO-style dates; the Logstash configuration below converts the raw M/d/yyyy values from the CSV into that format before indexing.
Import Dataset to ES with Logstash
To import the CSV into our Elasticsearch cluster, we will use Logstash. See this article for the details about installing and configuring Logstash. We’ll need the following configuration file, which typically lives under /etc/logstash/conf.d on Linux distributions:
input {
  file {
    # The file input requires an absolute path to the CSV on the Logstash host
    path => ["/path/to/store_openings.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    # Name every column in file order so the fields we keep line up correctly
    columns => [
      "storenum", "OPENDATE", "date_super", "conversion", "st", "county",
      "STREETADR", "STRCITY", "STRSTATE", "ZIPCODE", "type_store"
    ]
  }
  # Skip the header row
  if [storenum] == "storenum" {
    drop { }
  }
  mutate {
    # Trim the trailing space in the date values and drop the columns we don't need
    strip => ["OPENDATE"]
    remove_field => ["date_super", "conversion", "st", "county", "type_store"]
  }
  # Convert the M/d/yyyy dates into ISO 8601 so they match the strict_date_optional_time mapping
  date {
    match => ["OPENDATE", "M/d/yyyy"]
    target => "OPENDATE"
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    action => "index"
    hosts => ["<hostname>:9200"]
    index => "storeopenings"
    document_type => "mappings"
    workers => 1
  }
}
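Assuming you saved this as, say, /etc/logstash/conf.d/store_openings.conf (the filename is up to you) and used the standard package install, where the binary lives under /usr/share/logstash/bin, you can check the syntax and then run the pipeline:

$ sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f /etc/logstash/conf.d/store_openings.conf --config.test_and_exit
$ sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f /etc/logstash/conf.d/store_openings.conf

Afterwards, a document count against the index is an easy way to confirm the import:

$ curl "http://<hostname>:9200/storeopenings/_count?pretty"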
Once Logstash is up and running, it will import our CSV records to the Elasticsearch index based on the rules specified in this configuration.
Note: don’t forget to fill in the correct host, port, and any credentials needed to connect to your Elasticsearch cluster in the configuration file.
Next, we need to set up Hive to communicate with our Elasticsearch cluster.
From the hive> prompt:
1. Create a Hive Database.
CREATE DATABASE es_store_db;
USE es_store_db;
2. Create an external Hive table so data can be pulled in from Elasticsearch.
CREATE EXTERNAL TABLE storeopenings (
  store_num int,
  open_date date,
  street_adr string,
  str_city string,
  str_state string,
  str_zipcode bigint)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'storeopenings/mappings',
  'es.nodes' = '<hostname>:9200',
  'es.index.auto.create' = 'true',
  'es.mapping.names' = 'store_num:storenum, open_date:OPENDATE, street_adr:STREETADR, str_city:STRCITY, str_state:STRSTATE, str_zipcode:ZIPCODE');
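If your Qbox cluster requires HTTP basic authentication (most do), the connector can pass credentials as additional entries inside the same TBLPROPERTIES clause; shown here with placeholder values:

'es.net.http.auth.user' = '<username>',
'es.net.http.auth.pass' = '<password>'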
3. Query the table.
SELECT * FROM storeopenings LIMIT 50;
1,7/1/62,2110 WEST WALNUT,Rogers,AR,72756
2,8/1/64,1417 HWY 62/65 N,Harrison,AR,72601
3,4/12/88,30983 HWY 441 SOUTH,Commerce,GA,30529
4,8/1/65,2901 HWY 412 EAST,Siloam Springs,AR,72761
5,5/1/72,1155 HWY 65 NORTH,Conway,AR,72032
7,10/1/67,3801 CAMP ROBINSON RD.,North Little Rock,AR,72118
8,10/1/67,1621 NORTH BUSINESS 9,Morrilton,AR,72110
9,3/1/68,1303 SOUTH MAIN,Sikeston,MO,63801
10,7/1/68,2020 SOUTH MUSKOGEE,Tahlequah,OK,74464
11,3/1/68,65 WAL-MART DRIVE,Mountain Home,AR,72653
12,7/1/68,1500 LYNN RIGGS BLVD,Claremore,OK,74017
13,11/1/68,2705 GRAND AVE,Carthage,MO,64836
14,4/1/69,1800 S JEFFERSON,Lebanon,MO,65536
15,5/1/69,1310 PREACHER RD/HGWY 160,West Plains,MO,65775
16,4/1/69,2214 FAYETTEVILLE RD,Van Buren,AR,72956
17,5/1/69,3200 LUSK DRIVE,Neosho,MO,64850
18,11/1/69,2500 MALCOLM ST/HWY 67 NORTH,Newport,AR,72112
19,4/3/90,333 S WESTWOOD,Poplar Bluff,MO,63901
20,10/1/70,1712 EAST OHIO,Clinton,MO,64735
21,3/1/70,185 ST ROBERT BLVD,St. Robert,MO,65584
22,10/1/70,4901 SO. MILL ROAD,Pryor,OK,74361
23,11/1/70,1201 N SERVICE ROAD EAST,Ruston,LA,71270
24,4/1/71,2000 JOHN HARDEN DR,Jacksonville,AR,72076
25,2/1/71,4820 SO. CLARK ST,Mexico,MO,65265
26,11/1/70,3450 S. 4TH TRAFFICWAY,Leavenworth,KS,66048
27,2/1/71,1101 HWY 32 WEST,Salem,MO,65560
28,5/1/71,2415 N.W. MAIN ST,Miami,OK,74354
29,8/1/71,724 STADIUM WEST BLVD,Jefferson City,MO,65109
30,11/1/71,2025 BUS. HWY 60 WEST,Dexter,MO,63841
31,6/1/71,3108 N BROADWAY,Poteau,OK,74953
32,6/1/71,2050 WEST HWY 76,Branson,MO,65616
33,6/1/71,1710 SO. 4TH ST,Nashville,AR,71852
34,11/1/71,2250 LINCOLN AVENUE,Nevada,MO,64772
35,10/1/71,101 EAST BLUEMONT AVENUE,Manhattan,KS,66502
36,11/1/71,2802 WEST KINGS HIGHWAY,Paragould,AR,72450
37,9/1/71,701 WALTON DRIVE,Farmington,MO,63640
38,12/9/71,1907 SE WASHINGTON ST.,Idabel,OK,74745
39,3/1/72,2400 SOUTH MAIN,Fort Scott,KS,66701
40,11/1/71,1301 HWY 24 EAST,Moberly,MO,65270
41,5/1/72,4000 GREEN COUNTRY RD,Bartlesville,OK,74006
42,2/1/73,1705 WEST 11TH STREET,Coffeyville,KS,67337
43,5/1/72,521 EAST CHESTNUT ST,Junction City,KS,66441
44,2/1/72,1802 SOUTH BUSINESS HWY 54,Eldon,MO,65026
45,8/1/72,1815 EAST HIGHLAND DRIVE,Jonesboro,AR,72401
46,8/1/72,2451 S. SPRINGFIELD,Bolivar,MO,65613
47,7/1/72,1101 WEST RUTH AVE,Sallisaw,OK,74955
48,11/1/72,885 HIGHWAY 60,Monett,MO,65708
49,10/1/72,1911 E.JACKSON ST.,Hugo,OK,74743
50,9/1/73,268 SO. 7TH ST.,Vinita,OK,74301
Time taken: 0.043 seconds, Fetched: 49 row(s)
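With the external table in place, you can run more interesting aggregations directly against the Elasticsearch-backed data, which is where pushing the work into Hadoop starts to pay off. For example, counting store openings by state:

SELECT str_state, count(*) AS num_stores FROM storeopenings GROUP BY str_state ORDER BY num_stores DESC;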
Summary
In this tutorial we learned how to import an existing CSV file into Elasticsearch with Logstash so that it can later be analyzed with Hadoop’s Hive. Although Hive conveniently lets you run familiar SQL-like queries on top of Hadoop’s core MapReduce component, it can get slow for extremely large datasets.
For datasets in the terabyte to petabyte range and beyond, Apache Spark delivers a highly performant, real-time processing layer to Hadoop that can be up to 100x faster than MapReduce.
Other Helpful Tutorials
- How to Ship MySQL Logs to Elasticsearch with Filebeat
- Data Retention Techniques in Logstash via Elasticsearch-Curator
- REST Calls Made Rustic: RS-ES in Idiomatic Rust Tutorial
- How to: Slack Alerting for Elasticsearch with ElastAlert
- How to Use Elasticsearch, Logstash, and Kibana to Manage Apache Logs
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.