In this first article, we're going to set up some basic tools for doing fundamental data science exercises. Our goal is to run machine-learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. Keep in mind that a major advantage of the approach that we take here is that the same techniques can scale up or down to data sets of varying size. We'll therefore start small.

First, we need to set up a local Ubuntu virtual machine, install Elasticsearch, and use Python to build an index from a small data set. We can then move on to bigger things. All the code for this post and future posts will be available in this GitHub repository. Let's begin.

Introductory note: Sloan Ahrens is a freelance data consultant and a co-founder of Qbox. In this, the first in a series of guest posts, he demonstrates how to set up a large-scale machine learning infrastructure using Apache Spark and Elasticsearch. –Mark Brandon

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."


Step 1: Set up a Local Ubuntu VM

We generally recommend using a fresh Ubuntu virtual machine when starting a new project because doing so minimizes potential conflicts between various libraries and often results in better organization. Another benefit is that the same code we build locally can also be deployed to Ubuntu VMs in the cloud: everything built locally will work the same on the servers (or nearly so, with little extra effort). It's also good to keep a list of all the commands that you use in setup, along with some relevant notes; this is a time saver if you come back months later to try the same thing again. Much of it can be put into scripts that automate parts of the deployment process.

Much of what we're going to do below should work on most systems, but we only vouch that it will work on a fresh Ubuntu 14 install. We suggest using Oracle's VirtualBox because it's free and easy to use. We'll also need an Ubuntu ISO image.

After installing VirtualBox, open the Oracle VM VirtualBox Manager and click the New button. Give the new VM a name, select Linux for the Type, and choose Ubuntu (64 bit) for the Version, as shown in the figure below:

[Figure: Creating a new Ubuntu (64-bit) VM in the VirtualBox Manager]

Continuing with the setup, we'll need to select the amount of RAM for the VM (we recommend at least 6144MB). Next, we'll select Create a virtual hard drive now, and select VDI (VirtualBox Disk Image). Then we choose Dynamically allocated and also the size for our virtual hard drive (48GB is adequate). Click Create to initiate the new VM.

After it's ready, choose the new VM from the list and click Start. At the prompt to choose a virtual disk from which to install, use the Ubuntu ISO that you downloaded from the link above:

[Figure: Selecting the Ubuntu ISO as the start-up disk for the new VM]

The installation process may take a bit of time to complete. There are a number of things to finish, some of which will differ from user to user. Our selections are Download updates while installing, Install this third-party software, and Erase disk and install Ubuntu. Other installation tasks include choosing the username and password. You'll need to reboot at the end of the process.

Next, install the VirtualBox Guest Additions: from the VirtualBox menu, with the new VM running, click Devices->Insert Guest Additions CD image. These additions will make for a much more user-friendly virtual machine. After rebooting the VM, it will be ready to use. We also recommend enabling the shared clipboard: select the VM, click Settings, then set General->Advanced->Shared Clipboard to Bidirectional.
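
If you prefer the command line, the same clipboard setting can be applied with VBoxManage from the host OS. This is just a sketch; the VM name "ubuntu-vm" is an assumption (substitute your own), and the VM must be powered off when you run it:

# enable the bidirectional shared clipboard from the host (VirtualBox 4.x syntax)
VBoxManage modifyvm "ubuntu-vm" --clipboard bidirectional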

Now we have a fresh Ubuntu VM.

We also recommend a couple of other things that go along with setting up a new VM, such as installing SublimeText, configuring Git, and setting up a shared folder with the host OS. Look for the instructions and required commands in the GitHub repo that corresponds to this article.
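
As a rough sketch of what those extra steps look like inside the VM (the shared-folder name "shared" and the mount point are assumptions; the exact commands we used are in the repo):

# install git and set your identity
sudo apt-get install git
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# mount a VirtualBox shared folder named "shared" (requires Guest Additions)
mkdir -p ~/shared
sudo mount -t vboxsf shared ~/shared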


Step 2: Install Elasticsearch

We'll use the latest stable release, which can be found here. Elasticsearch recommends using Java 8, so we can use the Oracle Java 8 installer and follow the instructions here. These are the commands to get it done:

# install oracle java 8
sudo apt-get purge openjdk*   # just in case
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
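
Once the installer finishes, it's worth confirming that Java 8 is the active version:

# verify the Java installation; the output should report version 1.8.x
java -version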

We will also need cURL, which is quite easy to install:

sudo apt-get install curl

And we're now ready to install Elasticsearch! There are several ways to do this, including the "hard way," which means running these commands:

# install Elasticsearch 1.3.2
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.tar.gz -O elasticsearch.tar.gz
tar -xf elasticsearch.tar.gz
rm elasticsearch.tar.gz
sudo mv elasticsearch-* elasticsearch
sudo mv elasticsearch /usr/local/share

Next, we need to set up Elasticsearch as a service, using these commands:

# set up ES as service
curl -L http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
sudo mv *servicewrapper*/service /usr/local/share/elasticsearch/bin/
rm -Rf *servicewrapper*
sudo /usr/local/share/elasticsearch/bin/service/elasticsearch install
sudo ln -s "$(readlink -f /usr/local/share/elasticsearch/bin/service/elasticsearch)" /usr/local/bin/rcelasticsearch
 
# start ES service
sudo service elasticsearch start

With Elasticsearch running as a service, it will be accessible at http://localhost:9200. We can use cURL to ensure ES is up and running with this command:

curl localhost:9200

The response should be similar to this:

{
  "status" : 200,
  "name" : "Shadrac",
  "version" : {
    "number" : "1.3.2",
    "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
    "build_timestamp" : "2014-08-13T14:29:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}
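
Another quick sanity check is the cluster health endpoint, which reports the overall status of the node and its shards:

# check cluster health; expect "status" : "green" (or "yellow" once indices with replicas exist)
curl 'localhost:9200/_cluster/health?pretty'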

NOTE: If you would like to run all of the commands together, you can use install_es.sh to do so by running:

bash install_es.sh

Step 3: Build an Index

It's time to give Elasticsearch some data. For the purposes of illustration, we're going to use a small data set from Kaggle. If you have any interest in data science and/or machine learning, we encourage you to explore Kaggle. The Titanic Survivors competition has some very nice tutorials.

Have a look at the description of the fields in the data set here (you'll need a Kaggle account). The data set contains both a training set and a test set; here we'll use the test set to create an index in Elasticsearch. You can access the file in an Amazon S3 bucket at http://apps.sloanahrens.com/qbox-blog-resources/kaggle-titanic-data/test.csv. We look more closely at this below.
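
To get a feel for the file before indexing it, we can pull down the first few rows with cURL (just a quick look; nothing here is required for the rest of the post):

# preview the header and first two data rows of the CSV
curl -s http://apps.sloanahrens.com/qbox-blog-resources/kaggle-titanic-data/test.csv | head -3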

NOTE: You'll notice that this is a very small data set (about 60KB), and we could easily include it in the GitHub repo given above. But our objective is to configure tools for managing large data sets, which don't fit well in a GitHub repo. Amazon S3 is a good place to store large files that need to be accessible from other servers, such as a Hadoop cluster, so it's good to get comfortable with such storage environments now.

Here, we'll use Python to quickly scan the CSV and use the data to build an Elasticsearch index. The version of Python that comes with our Ubuntu release is 2.7.6, which is great for our purposes. We'll need to use the Python Elasticsearch client, which can be installed as follows:

# install pip and the python ES client
sudo apt-get install python-setuptools
sudo easy_install pip
sudo pip install elasticsearch

NOTE: We actually could have done this with a single line: sudo apt-get install python-elasticsearch. However, both setuptools and pip are worth having onboard if you're going to work with Python.

Although we're going to walk through it step-by-step, you can find the full code for creating the index here.

First, we'll assign a few variables that we'll need, including the URL of the data file, details of the Elasticsearch host (running locally in our VM), and some meta-data for the index:

# build_index.py
FILE_URL = "http://apps.sloanahrens.com/qbox-blog-resources/kaggle-titanic-data/test.csv"
ES_HOST = {"host" : "localhost", "port" : 9200}
INDEX_NAME = 'titanic'
TYPE_NAME = 'passenger'
ID_FIELD = 'passengerid'

Next, we'll read in the data from the file and capture the information in the header to use when building our index:

import csv
import urllib2
response = urllib2.urlopen(FILE_URL)
csv_file_object = csv.reader(response)
 
header = csv_file_object.next()
header = [item.lower() for item in header]

We'll now build up a Python dictionary of our data set in a format that the Python ES client can use. We are going to load the data by means of bulk indexing. According to the Elasticsearch Bulk API docs, the body of the bulk index request must consist of two lines for each operation: one specifying the meta-data for the operation; and one specifying the actual data that it will index. The code below will build a dictionary that meets these requirements for our data:

bulk_data = [] 
for row in csv_file_object:
    data_dict = {}
    for i in range(len(row)):
        data_dict[header[i]] = row[i]
    op_dict = {
        "index": {
            "_index": INDEX_NAME, 
            "_type": TYPE_NAME, 
            "_id": data_dict[ID_FIELD]
        }
    }
    bulk_data.append(op_dict)
    bulk_data.append(data_dict)
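
To make the bulk format concrete, we can print the first action/document pair after the loop runs; the values shown in the comments are illustrative, taken from the first row of the test set:

# inspect the first bulk pair
print(bulk_data[0])  # action meta-data, e.g. {'index': {'_index': 'titanic', '_type': 'passenger', '_id': '892'}}
print(bulk_data[1])  # the document itself, e.g. {'passengerid': '892', 'name': 'Kelly, Mr. James', ...}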

Let's create our index using the Python ES client, deleting any index that might already exist—ensuring that our code is idempotent. We print out the responses as we go along so we can troubleshoot if necessary:

from elasticsearch import Elasticsearch
# create ES client, create index
es = Elasticsearch(hosts = [ES_HOST])
if es.indices.exists(INDEX_NAME):
    print("deleting '%s' index..." % (INDEX_NAME))
    res = es.indices.delete(index = INDEX_NAME)
    print(" response: '%s'" % (res))
# since we are running locally, use one shard and no replicas
request_body = {
    "settings" : {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}
print("creating '%s' index..." % (INDEX_NAME))
res = es.indices.create(index = INDEX_NAME, body = request_body)
print(" response: '%s'" % (res))

We're all set to perform a bulk index on our data. To ensure that our data will be immediately available, let's specify refresh in the request. This refreshes the relevant primary and replica shards as soon as the bulk operation completes, making the newly indexed documents immediately searchable.

# bulk index the data
print("bulk indexing...")
res = es.bulk(index = INDEX_NAME, body = bulk_data, refresh = True)
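
The bulk response includes an errors flag and a per-item status, so a quick check here can save debugging time later (a small addition, not part of the original script):

# fail loudly if any individual index operation was rejected
if res.get("errors"):
    print("bulk indexing reported errors; inspect res['items'] for details")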

Let's also run a simple match_all query to ensure that everything is in order:

# sanity check
res = es.search(index = INDEX_NAME, size=2, body={"query": {"match_all": {}}})
print(" response: '%s'" % (res))

To display the results more clearly, we can loop through them:

print("results:")
for hit in res['hits']['hits']:
    print(hit["_source"])

The output will be similar to this:

results:
{
u'fare': u'7.8292', u'name': u'Kelly, Mr. James', u'embarked': u'Q', u'age': u'34.5', u'parch': u'0', u'pclass': u'3', u'sex': u'male', u'sibsp': u'0', u'passengerid': u'892', u'ticket': u'330911', u'cabin': u''
}
{
u'fare': u'7', u'name': u'Wilkes, Mrs. James (Ellen Needs)', u'embarked': u'S', u'age': u'47', u'parch': u'0', u'pclass': u'3', u'sex': u'female', u'sibsp': u'1', u'passengerid': u'893', u'ticket': u'363272', u'cabin': u''
}
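
As one more optional sanity check, we can ask Elasticsearch how many documents ended up in the index; the count should match the number of data rows in the CSV:

# confirm the number of indexed documents
res = es.count(index = INDEX_NAME)
print("indexed %s documents" % res['count'])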

And we're done!

In summary, we configured a new Ubuntu virtual machine with Elasticsearch, and then, with Python, we built a simple index for a small data set. In subsequent articles in this series, we'll use these to build up tools that have more sophistication and power. From within Apache Spark running on our VM, we'll read and write against an Elasticsearch index, and then deploy both Elasticsearch and Spark to the cloud to run on larger clusters. We'll also grapple with larger data sets. After that, we'll build up the basics from this article to explore classification with several different machine-learning algorithms.

We invite you to continue to the next article in this series, Elasticsearch in Apache Spark with Python, Machine Learning Series, Part 2.


A note about hosted Elasticsearch: We didn't set up a Qbox cluster in this post because it's good to learn how to do a local installation of Elasticsearch within a virtual machine. We'll cover the provisioning of a Qbox cluster in a future article, but we welcome you to create a cluster now by following the steps in our 3-minute video tutorial.

