There is growing interest in the power of Apache Spark to do large-scale data analytics, including tests of machine-learning algorithms against large datasets. We also take interest in Spark as part of a larger technical solution featuring a web front end that allows users to start jobs on the back end. In this article, we walk through building a software-as-a-service application.
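To make the architecture concrete, here is a minimal sketch (not the application built in the article) of a web front end handing a job to Spark on the back end. It assumes Flask is installed and spark-submit is on the PATH; the endpoint, driver script name, and request fields are all hypothetical.

```python
# Minimal sketch of a web front end that starts a Spark job on the back
# end (assumes Flask is installed and spark-submit is on the PATH; the
# endpoint, script path, and parameter names are hypothetical).
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/jobs", methods=["POST"])
def start_job():
    # Pull a job parameter from the request body (hypothetical field name).
    params = request.get_json(silent=True) or {}
    dataset = params.get("dataset", "default")
    # Hand off to Spark without blocking the web request;
    # analytics_job.py is a hypothetical driver script.
    proc = subprocess.Popen(["spark-submit", "analytics_job.py", dataset])
    return jsonify({"pid": proc.pid, "dataset": dataset}), 202

if __name__ == "__main__":
    app.run(port=5000)
```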

Keep reading

In the last article, Sparse Matrix Multiplication with Elasticsearch and Apache Spark, we went through a method of doing large-scale, sparse matrix multiplication using Apache Spark and Elasticsearch clusters in the cloud. This article extends that tutorial by generalizing the method to rectangular matrices. This matters because an upcoming article in this series will make use of the technique, and it must work for non-square matrices.
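As a preview of the approach, here is a minimal sketch of the join-based multiplication, assuming a SparkContext and matrices stored as (row, col, value) triples; the data is illustrative, not the article's. Note that nothing in the join requires the matrices to be square, only that the inner dimensions agree.

```python
# Minimal sketch of join-based sparse matrix multiplication in PySpark
# (assumes PySpark is available; the matrices here are illustrative).
# For A (m x k) and B (k x n), C = A * B requires only that A's column
# count equal B's row count, so the same join handles rectangular matrices.
from pyspark import SparkContext

sc = SparkContext(appName="sparse-matmul-sketch")

# A is 2 x 3, B is 3 x 2; only nonzero entries are stored.
A = sc.parallelize([(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)])
B = sc.parallelize([(0, 1, 4.0), (2, 0, 5.0), (1, 0, 6.0)])

# Key A by its column index and B by its row index, join on that shared
# inner dimension, multiply, then sum partial products per (i, j).
a_by_col = A.map(lambda t: (t[1], (t[0], t[2])))   # (k, (i, a_ik))
b_by_row = B.map(lambda t: (t[0], (t[1], t[2])))   # (k, (j, b_kj))

C = (a_by_col.join(b_by_row)
     .map(lambda kv: ((kv[1][0][0], kv[1][1][0]),   # output coordinate (i, j)
                      kv[1][0][1] * kv[1][1][1]))   # partial product a_ik * b_kj
     .reduceByKey(lambda x, y: x + y))

# [((0, 0), 10.0), ((0, 1), 4.0), ((1, 0), 18.0)], in some order
print(C.collect())
```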

Keep reading

In this article, we continue the work from Deploying Elasticsearch and Apache Spark to the Cloud, Machine Learning Series, Part 3.

In previous posts, we’ve gone through a number of steps for creating a basic infrastructure for large-scale data analytics using Apache Spark and Elasticsearch clusters in the cloud. In this post we’ll use that infrastructure to perform a task common in machine learning and data mining: sparse matrix multiplication.
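For readers new to the task itself, here is a tiny single-machine illustration of what sparse matrix multiplication computes, using SciPy rather than the distributed Spark-and-Elasticsearch method this post develops; the matrices are illustrative.

```python
# Single-machine illustration of sparse matrix multiplication (assumes
# SciPy is installed; the post does the same computation distributed).
import numpy as np
from scipy.sparse import csr_matrix

# Store only the nonzero entries in coordinate form, then convert to CSR.
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
vals = np.array([1.0, 2.0, 3.0])
A = csr_matrix((vals, (rows, cols)), shape=(2, 3))

B = csr_matrix(np.array([[0.0, 4.0],
                         [6.0, 0.0],
                         [5.0, 0.0]]))

C = A @ B            # sparse product; only nonzero entries are stored
print(C.toarray())   # [[10.  4.]
                     #  [18.  0.]]
```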

Keep reading

In this article, we continue the work from Elasticsearch in Apache Spark with Python, Machine Learning Series, Part 2. We are building some basic tools for doing data science, with the goal of running machine-learning classification algorithms against large data sets using Apache Spark and Elasticsearch clusters in the cloud.

Keep reading

In this post we’re going to continue setting up some basic tools for doing data science. We began the setup in the first article in this series, Building an Elasticsearch Index with Python, Machine Learning Series, Part 1. The goal throughout the series is to run machine-learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. In the first article, we set up a VirtualBox Ubuntu 14 virtual machine, installed Elasticsearch, and built a simple index using Python.

Here we will complete our setup by installing Spark on the VM we established in the first article. Then we’ll perform some simple operations to practice reading data from an Elasticsearch index, transforming that data, and writing the results into another Elasticsearch index. All the code for the posts in this series will be available in this GitHub repository.
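As a condensed preview, that round trip looks roughly like the sketch below, which uses the elasticsearch-hadoop connector’s Map/Reduce input and output formats. It assumes the connector jar is on Spark’s classpath, and the index names and transformation are placeholders rather than the post’s actual code.

```python
# Condensed sketch of the read-transform-write round trip via the
# elasticsearch-hadoop connector (assumes the connector jar is on the
# Spark classpath; index names and field logic are placeholders).
from pyspark import SparkContext

sc = SparkContext(appName="es-spark-sketch")

read_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "source_index/doc",   # source index/type (placeholder)
}

# Each record comes back as (document_id, dict_of_fields).
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=read_conf,
)

# A trivial transformation: keep one field and tag each document.
transformed = rdd.mapValues(lambda doc: {"name": doc.get("name"),
                                         "seen": True})

write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "dest_index/doc",     # destination index/type (placeholder)
}

transformed.saveAsNewAPIHadoopFile(
    path="-",  # required by the API but unused by the connector
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=write_conf,
)
```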

For this second segment, we’ll remain local on our Ubuntu 14 VM. Our plan for the next article is to migrate our setup to the cloud.

Keep reading

In this first article, we’re going to set up some basic tools for doing fundamental data science exercises. Our goal is to run machine-learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. A major advantage of the approach we take here is that the same techniques scale up or down to data sets of varying size, so we’ll start small.

First, we need to set up a local Ubuntu virtual machine, install Elasticsearch, and use Python to build an index from a small data set. We can then move on to bigger things. All the code for this post and future posts will be available in this GitHub repository. Let’s begin.
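To give a flavor of the indexing step, here is a minimal sketch using the official elasticsearch Python client; it assumes Elasticsearch is running locally, and the index name and documents are illustrative rather than the data set the article uses.

```python
# Minimal sketch of building an index with the official Python client
# (assumes Elasticsearch is running on localhost:9200; the index name
# and documents are illustrative).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

docs = [
    {"title": "First document", "views": 10},
    {"title": "Second document", "views": 25},
]

for i, doc in enumerate(docs):
    # Note: pre-7.x clients also require a doc_type argument here.
    es.index(index="demo-index", id=i, body=doc)

# Make the new documents searchable, then verify the count.
es.indices.refresh(index="demo-index")
print(es.count(index="demo-index")["count"])  # 2
```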

Keep reading