In this post we’re going to continue setting up some basic tools for doing data science. We began the setup in the first article in this series, Building an Elasticsearch Index with Python, Machine Learning Series, Part 1. The goal of the series is to run machine learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. In the first article, we set up a VirtualBox Ubuntu 14 virtual machine, installed Elasticsearch, and built a simple index using Python.
Here we will complete our setup by installing Spark on the VM that we built in the first article. Then we’ll practice some simple operations: reading data from an Elasticsearch index, transforming that data, and writing the results into another Elasticsearch index. All the code for the posts in this series will be available in this GitHub repository.
For this second segment, we’ll remain local on our Ubuntu 14 VM. Our plan for the next article is to migrate our setup to the cloud.