Recent Posts by Sloan Ahrens

Elasticsearch 2.0.0 introduced a number of breaking changes. When I set out to install ES 2.0 to do some local testing, I found that the techniques that I had been using to set up virtual machines for local development (for instance, here) were no longer adequate. So I set out to discover what the “proper” method should be, and along the way I ran into a few problems. I’ll outline those issues here, hopefully saving some other people a little bit of trouble.

The instructions that follow will assume you are using OSX. It should be straightforward to adapt them to other operating systems, but I will not address those considerations here.

Keep reading

There is growing interest in the power of Apache Spark for large-scale data analytics, including running machine-learning algorithms against large datasets. We are also interested in Spark as part of a larger technical solution featuring a web front end that allows users to start jobs on the back end. In this article, we take you through building such a software-as-a-service application.

Keep reading

After answering a question about autocomplete on StackOverflow, we thought it best to come over to the Qbox blog and write more extensively about the different ways of approaching autocomplete.

In this article, we include an example of how to get autocomplete up and running quickly in Elasticsearch with the Completion Suggest feature. We don't intend for this to be a complete treatment of the topic, but we do aim to give you enough information to get going as painlessly as possible.
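To give a feel for what's involved, here is a minimal sketch of the request bodies used with the Completion Suggest feature. The index, type, and field names (`place`, `name_suggest`) are hypothetical, and the suggest syntax varies between Elasticsearch versions; this follows the older `_suggest`-endpoint style in use around the time of these posts.

```python
# Hypothetical mapping: a "completion" field backs the suggester.
mapping = {
    "mappings": {
        "place": {
            "properties": {
                "name_suggest": {"type": "completion"}
            }
        }
    }
}

# Hypothetical suggest request: match prefixes of the indexed suggestions.
suggest_body = {
    "place-suggest": {
        "text": "new y",
        "completion": {"field": "name_suggest"}
    }
}
```

You would PUT the mapping when creating the index, index documents with a `name_suggest` value, and then POST the suggest body to the suggest endpoint to get prefix completions back.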

Keep reading

In this post we will walk through the basics of using ngrams in Elasticsearch.

Wikipedia has this to say about ngrams:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

In the fields of machine learning and data mining, "ngram" will often refer to sequences of n words. In Elasticsearch, however, an "ngram" is a sequence of n characters. There are various ways these sequences can be generated and used. We'll take a look at some of the most common.
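To make the idea concrete, here is a rough pure-Python sketch of what a character-ngram tokenizer produces (the exact output of Elasticsearch's ngram tokenizer depends on its settings, but the substrings are the same idea):

```python
def char_ngrams(text, min_gram=2, max_gram=3):
    """Generate all contiguous character substrings of length
    min_gram..max_gram, starting from each position in the text."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

print(char_ngrams("spark"))
# → ['sp', 'spa', 'pa', 'par', 'ar', 'ark', 'rk']
```

Indexing these substrings is what lets Elasticsearch match partial words, at the cost of a larger index.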

Keep reading

In the last article, Sparse Matrix Multiplication with Elasticsearch and Apache Spark, we went through a method of doing large-scale, sparse matrix multiplication using Apache Spark and Elasticsearch clusters in the cloud. This article extends that tutorial by generalizing the method to rectangular matrices. This matters because a future article will make use of the technique, and we need it to work for non-square matrices.

Keep reading

In this article, we continue the work from Deploying Elasticsearch and Apache Spark to the Cloud, Machine-Learning Series, Part 3.

In previous posts, we've gone through a number of steps for creating a basic infrastructure for large-scale data analytics using Apache Spark and Elasticsearch clusters in the cloud. In this post, we will use that infrastructure for a task that is common in machine learning and data mining: sparse matrix multiplication.
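The core idea can be sketched without a cluster. Representing each sparse matrix as (row, column, value) triples, the multiplication becomes a join on the shared index followed by a reduce, which is the same join-and-reduce pattern one would express with Spark RDDs. This is a toy pure-Python illustration, not the article's actual Spark code:

```python
from collections import defaultdict

def sparse_multiply(A, B):
    """Multiply sparse matrices given as (row, col, value) triples,
    mimicking the join-and-reduce pattern used with Spark RDDs."""
    # Group B's entries by row index (the index shared with A's columns).
    b_by_row = defaultdict(list)
    for j, k, v in B:
        b_by_row[j].append((k, v))
    # "Join" A's (i, j) entries with B's (j, k) entries and
    # accumulate partial products into C[i, k].
    C = defaultdict(float)
    for i, j, a in A:
        for k, b in b_by_row.get(j, []):
            C[(i, k)] += a * b
    return dict(C)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # 2x2 sparse matrix
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # 2x2 sparse matrix
print(sparse_multiply(A, B))
# → {(0, 0): 14.0, (1, 0): 12.0, ...} keyed by (row, col)
```

Because only nonzero entries are stored and joined, the work scales with the number of nonzeros rather than the full matrix dimensions.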

Keep reading

In this article, we continue the work from Elasticsearch in Apache Spark with Python, Machine Learning Series, Part 2. We are making some basic tools for doing data science, in which our goal is to be able to run machine-learning classification algorithms against large data sets using Apache Spark and Elasticsearch clusters in the cloud.

Keep reading

In this post we're going to continue setting up some basic tools for doing data science. We began the setup in our first article in this series, Building an Elasticsearch Index with Python, Machine Learning Series, Part 1. Our goal throughout the series is to run machine-learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. In the first article, we set up a VirtualBox Ubuntu 14 virtual machine, installed Elasticsearch, and built a simple index using Python.

Here we will complete our setup by installing Spark on the VM that we established with the steps given in the first article. Then, we'll perform some simple operations to practice reading data from an Elasticsearch index, transforming that data, and writing the results into another Elasticsearch index. All the code for the posts in this series will be available in this GitHub repository.
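As a preview, connecting Spark to Elasticsearch is mostly a matter of configuration for the elasticsearch-hadoop connector. The host, port, and index/type names below are placeholders for illustration; adjust them to your own setup:

```python
# Placeholder configuration for reading from / writing to Elasticsearch
# via the elasticsearch-hadoop connector.
es_read_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "titanic/passenger",     # index/type to read from
}
es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "titanic/value_counts",  # index/type to write to
}

# With a running SparkContext `sc` and the es-hadoop jar on the
# classpath, the read looks roughly like:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_read_conf)
```

The resulting RDD yields (id, document) pairs that can be transformed with ordinary Spark operations and written back out with a matching write configuration.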

For this second segment, we'll remain local on our Ubuntu 14 VM. Our plan for the next article is to migrate our setup to the cloud.

Keep reading

In this first article, we're going to set up some basic tools for doing fundamental data science exercises. Our goal is to run machine-learning classification algorithms against large data sets, using Apache Spark and Elasticsearch clusters in the cloud. Keep in mind that a major advantage of the approach that we take here is that the same techniques can scale up or down to data sets of varying size. We'll therefore start small.

First, we need to set up a local Ubuntu virtual machine, install Elasticsearch, and use Python to build an index from a small data set. We can then move on to bigger things. All the code for this post and future posts will be available in this GitHub repository. Let's begin.
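To hint at the indexing step, here is a minimal sketch of building a payload for Elasticsearch's `_bulk` endpoint in pure Python. The index and type names (`titanic`, `passenger`) and the sample documents are illustrative placeholders:

```python
import json

def bulk_body(index, doc_type, docs):
    """Build the newline-delimited payload for Elasticsearch's _bulk
    endpoint: an action line followed by a document line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

docs = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 29}]
payload = bulk_body("titanic", "passenger", docs)

# The payload would then be POSTed to the cluster, e.g. with requests:
# requests.post("http://localhost:9200/_bulk", data=payload)
```

Batching documents this way is far faster than indexing them one HTTP request at a time, which matters once the data set grows.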

Keep reading

An Elasticsearch Primer

Posted by Sloan Ahrens January 31, 2014

NWA TechFest Talk

This post consists of the materials used for my talk at NWA TechFest 2014. I hate building slide decks, and I love writing blog posts, so I decided to use a blog post for my slides. It's sort of an experiment, so if you attended my talk, feel free to leave me some feedback in the comment section below.

Keep reading