An Elasticsearch Primer

Posted by Sloan Ahrens January 31, 2014

NWA TechFest Talk

This post consists of the materials used for my talk at NWA TechFest 2014. I hate building slide decks, and I love writing blog posts, so I decided to use a blog post for my slides. It’s sort of an experiment, so if you attended my talk, feel free to leave me some feedback in the comment section below.

Who Am I?

http://www.linkedin.com/in/sloanahrens

Sloan Ahrens
CTO and Co-founder
StackSearch, Inc.

StackSearch is an Ark Challenge alumnus company. We do hosted Elasticsearch at https://qbox.io/, as well as Elasticsearch-based data management and web integration consulting.


What is Elasticsearch?

According to elasticsearch.org, Elasticsearch is:

distributed RESTful search and analytics

Wikipedia says:

“Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.”

Basically, Elasticsearch is a really powerful way to build a search engine. It is designed to scale seamlessly both horizontally (more servers) and vertically (bigger servers), and was built from the ground up to be a distributed system. If you set your system up carefully, Elasticsearch can handle as much data as you want to give it.


Inverted Index

Elasticsearch is built on top of Lucene. Without going into much detail, Lucene is the most advanced open-source full-text search library out there. In Elasticsearch, an index consists of multiple shards, with zero or more replicas of each shard, and these shards are allocated across the servers in the cluster in a way that improves fault tolerance and performance. Each shard (replicas are shards too) is a Lucene index.

While the actual implementation is quite complex, at its heart a full text search index consists of an inverted index (or maybe several). When a particular string of text is indexed, it is divided into tokens, and each token becomes an entry in the lookup table for the index. A token is associated with pointers to the documents in which that token appears. How the text is divided into tokens depends on the analyzer used, and Elasticsearch gives us plenty of options for analysis. We’ll come back to this issue later.
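
To make the idea concrete, here is a toy inverted index in Python. It is only a sketch: the `tokenize` and `build_inverted_index` helpers are names I made up for illustration, the tokenizer is far cruder than anything Lucene does, and the sample text is two movie titles that show up later in this post.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_inverted_index(docs):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


docs = {
    7672734: "Chicken Little (DVD)",
    2868671: "Indiana Jones and the Temple of Doom (Blu-ray Disc)",
}
index = build_inverted_index(docs)

# Looking up a token returns the ids of the documents that contain it.
print(sorted(index["chicken"]))
print(sorted(index["disc"]))
```

The lookup is a single dictionary access no matter how many documents there are, which is the whole point of inverting the document-to-text relationship.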


A Note on Tools/Props

Endpoint. Using Elasticsearch requires an endpoint, which is nothing but a URL. If you download and install Elasticsearch locally on your own machine, the endpoint will be:

http://localhost:9200

That is the easiest way to get started, and I encourage you to do so if you want to play around with Elasticsearch. I use Elasticsearch a lot, though, and I prefer not to clutter up my development machine with different versions and lots of test indexes. Since we do hosted Elasticsearch at StackSearch, it’s easy for me to use a Qbox cluster endpoint in the cloud instead. The endpoint I will be using for this talk is:

https://6c555678fc0fa941000.qbox.io/

In case you’re curious, this cluster consists of a single m1.small in Amazon EC2, running the Elasticsearch 1.0 Release Candidate. I’m not really going to get into the DevOps side of Elasticsearch in this talk (even though that is the name of the track), though I’d be happy to talk about that off-line later if anybody wants to.

cURL. The most straightforward way to interact with Elasticsearch is with curl. If you aren’t familiar with curl, don’t worry about it; it’s just a command-line utility for sending HTTP requests. The simplest request you can send to Elasticsearch looks like this:

curl -XGET "https://6c555678fc0fa941000.qbox.io/"

and the response (in this case) looks like this:

{
   "status": 200,
   "name": "qbox-52dc89b571626f5bc1030000-p",
   "version": {
      "number": "1.0.0.RC1",
      "build_hash": "c6155c5142f7995e054f9f3b7d82f923dd3620bc",
      "build_timestamp": "2014-01-15T17:02:32Z",
      "build_snapshot": false,
      "lucene_version": "4.6"
   },
   "tagline": "You Know, for Search"
}

Sense. Sense is a Chrome extension that provides a handy user interface for interacting with Elasticsearch. It’s available in the Chrome web store. You can paste in curl commands and it will parse them for you, and you can also export Sense code blocks as curl commands, which can be very useful for blog posts. 🙂

Sense Gisterator. At StackSearch we needed a handy way to share Elasticsearch code via a URL. Sense is a fantastic tool, and open-source, so we decided to extend it to use as a sort of “ES Fiddle” that lets us share runnable Elasticsearch code via links. I just cloned the repository and made a few simple code changes to allow code to be saved to an Elasticsearch index, and loaded by the ID in the URL. It’s still something of a work in progress, but if you’re interested in seeing what I did, you can find my fork here: https://github.com/sloanahrens/sense. You can see the tool in action here: https://sense.qbox.io/gist/, and I will be using it to illustrate code for the remainder of the post, though in the post itself I will use the curl syntax. The above example looks very simple in Sense:

http://sense.qbox.io/gist/d20f64acc6e70b6079845f2fe357732929550ae1

GET /

Implicit Schema Index

For the first few examples we will use this gist:

https://sense.qbox.io/gist/9534e69b0bcc324454feffe54db6a9c437c7ae30

As with many NoSQL datastores, you do not have to define a schema before you save documents. Elasticsearch calls its schemas “mappings”, and if you do not define one explicitly, Elasticsearch will define one implicitly for you. It chooses what are usually sensible defaults. This can be a good thing or a bad thing, depending on what you are trying to do.

To create an index, I need only do something as simple as:

curl -XPUT "https://6c555678fc0fa941000.qbox.io/bestbuy1/"

If I then take a look at the mapping, there isn’t a whole lot going on yet:

curl -XGET "https://6c555678fc0fa941000.qbox.io/bestbuy1/_mapping"

response:

{}

Next we will index a couple of documents, and thereby define an implicit mapping as well.

The URL used for creating/updating documents has the following form:

curl -XPUT "[endpoint]/[index_name]/[type_name]/[id]" -d'
{
    [document body]
}'
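
If you would rather script this than type curl commands, the same URL pattern can be assembled with Python's standard library. This is a minimal sketch, not an official client: `build_index_request` is a hypothetical helper of my own, the local endpoint is an assumption, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_index_request(endpoint, index_name, type_name, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{endpoint.rstrip('/')}/{index_name}/{type_name}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


# Assumed local endpoint; swap in your own cluster URL.
request = build_index_request(
    "http://localhost:9200",
    "bestbuy1",
    "movies",
    7672734,
    {"name": "Chicken Little (DVD)", "salePrice": 10.99},
)
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# response_body = urllib.request.urlopen(request).read()
```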

Now let’s go ahead and index a document:

curl -XPUT "https://6c555678fc0fa941000.qbox.io/bestbuy1/movies/7672734" -d'
{
   "studio":"Walt Disney Video",
   "genre":"Childrens and Family",
   "mpaaRating":"G",
   "format":"DVD",
   "sku":7672734,
   "releaseDate":"2006-03-21",
   "name":"Chicken Little (DVD)",
   "salePrice":10.99
}'

The following response is returned:

{
   "_index": "bestbuy1",
   "_type": "movies",
   "_id": "7672734",
   "_version": 1,
   "created": true
}

This response gives us some useful information, including the index name, type name, id of the document, its version number, and whether it was created. If we run exactly the same request a second time, we get back:

{
   "_index": "bestbuy1",
   "_type": "movies",
   "_id": "7672734",
   "_version": 2,
   "created": false
}

As you can see, Elasticsearch is smart enough to create the document if it doesn’t already exist, and update it if it does.
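
The create-or-update behavior is easy to model. The toy class below is purely illustrative (`ToyIndex` is my own invention and has nothing to do with how Elasticsearch stores anything); it only mimics the fields of the two responses above.

```python
class ToyIndex:
    """In-memory stand-in that mimics the create-or-update response fields."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # (type name, id) -> (version, source)

    def put(self, type_name, doc_id, source):
        key = (type_name, str(doc_id))
        previous_version = self.docs.get(key, (0, None))[0]
        version = previous_version + 1
        self.docs[key] = (version, source)
        return {
            "_index": self.name,
            "_type": type_name,
            "_id": str(doc_id),
            "_version": version,
            # A document is "created" only the first time its id is seen.
            "created": previous_version == 0,
        }


toy = ToyIndex("bestbuy1")
first = toy.put("movies", 7672734, {"name": "Chicken Little (DVD)"})
second = toy.put("movies", 7672734, {"name": "Chicken Little (DVD)"})
print(first["_version"], first["created"])
print(second["_version"], second["created"])
```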

Now if we take a look at the mapping:

curl -XGET "http://6c555678fc0fa941000.qbox.io/bestbuy1/_mapping"

response:

{
   "bestbuy1": {
      "mappings": {
         "movies": {
            "properties": {
               "format": {
                  "type": "string"
               },
               "genre": {
                  "type": "string"
               },
               "mpaaRating": {
                  "type": "string"
               },
               "name": {
                  "type": "string"
               },
               "releaseDate": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "salePrice": {
                  "type": "double"
               },
               "sku": {
                  "type": "long"
               },
               "studio": {
                  "type": "string"
               }
            }
         }
      }
   }
}

We can see that Elasticsearch did a pretty good job of figuring out what types we needed from the context. Sometimes this can cause us problems later, though, as we will see.
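
To get a feel for what that inference involves, here is a rough imitation in Python. Treat it as a sketch under loud assumptions: `guess_type` and `guess_mapping` are hypothetical helpers, the date check only recognizes plain `YYYY-MM-DD` strings, and the real detection rules in Elasticsearch are more involved than this.

```python
from datetime import datetime


def guess_type(value):
    """Guess a mapping type for one JSON value (a rough imitation only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


def guess_mapping(document):
    """Apply guess_type to every field of a flat document."""
    return {field: guess_type(value) for field, value in document.items()}


doc = {
    "studio": "Walt Disney Video",
    "genre": "Childrens and Family",
    "mpaaRating": "G",
    "format": "DVD",
    "sku": 7672734,
    "releaseDate": "2006-03-21",
    "name": "Chicken Little (DVD)",
    "salePrice": 10.99,
}
mapping = guess_mapping(doc)
for field in sorted(mapping):
    print(field, mapping[field])
```

For this one document the guesses line up with the types in the mapping above, which is all the sketch is meant to show.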


Bulk Indexing

In the previous example we created one document at a time, but it is also possible to do many CRUD operations at once. In the following example I will index two documents, one that already exists and one that doesn’t:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_bulk" -d'
{"index":{"_index":"bestbuy1","_type":"movies","_id":7672734}}
{"studio":"Walt Disney Video","genre":"Childrens and Family","mpaaRating":"G","format":"DVD","sku":7672734,"releaseDate":"2006-03-21","name":"Chicken Little (DVD)","salePrice":10.99}
{"index":{"_index":"bestbuy1","_type":"movies","_id":2868671}}
{"studio":"Paramount","genre":"Action and Adventure","mpaaRating":"PG","format":"Blu-ray Disc","sku":2868671,"releaseDate":"2013-12-17","name":"Indiana Jones and the Temple of Doom (Blu-ray Disc)","salePrice":19.99}
'

response:

{
   "took": 4,
   "errors": false,
   "items": [
      {
         "index": {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "7672734",
            "_version": 3,
            "status": 200
         }
      },
      {
         "index": {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "2868671",
            "_version": 2,
            "status": 200
         }
      }
   ]
}

Notice that the bulk request body is a sequence of alternating JSON objects that are not separated by commas: each document to be indexed is preceded by a metadata object describing the operation. The bulk API reads the body line by line, so each object has to sit on a single line; if a tool such as Sense displays the body indented, collapse each object back onto one line before sending it with curl. You can do bulk CRUD operations across several indices and types, and even do deletions. One gotcha here: the trailing newline character is required, and you will get an error without it. You can read more about bulk indexing in the Elasticsearch docs.
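
Because the body format is so picky, I find it safer to generate it than to type it. Here is a minimal sketch in Python; `build_bulk_body` is a hypothetical helper, and using the `sku` field as the document id is my assumption, chosen to match the examples in this post.

```python
import json


def build_bulk_body(index_name, type_name, docs, id_field="sku"):
    """Return a bulk body: one action line and one source line per document."""
    lines = []
    for doc in docs:
        action = {
            "index": {
                "_index": index_name,
                "_type": type_name,
                "_id": doc[id_field],
            }
        }
        # json.dumps without an indent argument keeps each object on one line.
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The trailing newline after the last line is required.
    return "\n".join(lines) + "\n"


docs = [
    {"sku": 7672734, "name": "Chicken Little (DVD)", "salePrice": 10.99},
    {"sku": 2868671, "name": "Indiana Jones and the Temple of Doom (Blu-ray Disc)", "salePrice": 19.99},
]
body = build_bulk_body("bestbuy1", "movies", docs)
print(body, end="")
```

The resulting string can be written to a file and posted to the `_bulk` endpoint as the request body.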


Simple GET Queries

Now that we have an index with some documents saved, let’s do some simple querying. The simplest possible search query is just:

curl -XGET "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search"

which in our case will return both docs:

{
   "took": 0,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "7672734",
            "_score": 1,
            "_source": {
               "studio": "Walt Disney Video",
               "genre": "Childrens and Family",
               "mpaaRating": "G",
               "format": "DVD",
               "sku": 7672734,
               "releaseDate": "2006-03-21",
               "name": "Chicken Little (DVD)",
               "salePrice": 10.99
            }
         },
         {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "2868671",
            "_score": 1,
            "_source": {
               "studio": "Paramount",
               "genre": "Action and Adventure",
               "mpaaRating": "PG",
               "format": "Blu-ray Disc",
               "sku": 2868671,
               "releaseDate": "2013-12-17",
               "name": "Indiana Jones and the Temple of Doom (Blu-ray Disc)",
               "salePrice": 19.99
            }
         }
      ]
   }
}

Suppose we want to return only the documents that contain “chicken”. We could do:

curl -XGET "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search?q=chicken"

response:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.25,
      "hits": [
         {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "7672734",
            "_score": 0.25,
            "_source": {
               "studio": "Walt Disney Video",
               "genre": "Childrens and Family",
               "mpaaRating": "G",
               "format": "DVD",
               "sku": 7672734,
               "releaseDate": "2006-03-21",
               "name": "Chicken Little (DVD)",
               "salePrice": 10.99
            }
         }
      ]
   }
}

Query DSL, Match vs. Term

Using only the URL query string will only get you so far. It quickly becomes easier to use Elasticsearch’s powerful domain-specific language for querying (the query DSL). The first query we ran is equivalent to:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "query": {
        "match_all": {}
    }
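}'

"match_all" does what you might expect: it matches all documents and returns the first n documents it finds (we will look at defining n a little later). The second query we ran can be written in the query DSL as shown next; note that this version searches only the "name" field, whereas the query-string version searched all fields by default: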

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "query": {
        "match": {
           "name": "chicken"
        }
    }
}'

The "match" query is a very powerful type of query that is also quite robust. I will not go into too much detail here, except to say that the query text is analyzed, and we will see why that matters below. The "match" query is often the type of query you want to use in user-facing search functions, like the “search box” on a website.

Another type of query that is used a lot (though often as a filter rather than a query; we’ll get to that in a minute) is the "term" query. This query does not analyze the search text, but only tries to match it as-is against terms in the lookup table for the given field. So, while this query returns the result we would expect:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "query": {
        "term": {
           "name": {
              "value": "chicken"
           }
        }
    }
}'

response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.5,
      "hits": [
         {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "7672734",
            "_score": 0.5,
            "_source": {
               "studio": "Walt Disney Video",
               "genre": "Childrens and Family",
               "mpaaRating": "G",
               "format": "DVD",
               "sku": 7672734,
               "releaseDate": "2006-03-21",
               "name": "Chicken Little (DVD)",
               "salePrice": 10.99
            }
         }
      ]
   }
}

This one does not:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "query": {
        "term": {
           "name": {
              "value": "Chicken"
           }
        }
    }
}'

response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

What’s going on here? It has to do with how the document text is analyzed and broken down into terms. Since we did not specify an analyzer when creating the mapping, the standard analyzer is used. This analyzer splits text on whitespace (also on punctuation and special characters), then normalizes the resulting tokens to lower-case (and does a couple more things) before inserting them into the lookup table. So, while “chicken” is a term in the lookup table, “Chicken” is not. The term query does not do any analysis on the search term, so the first query retrieves a result but the second one does not. If we use the match query, however, which does analyze the search text (with the standard analyzer, in this case), we will get a result:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "query": {
        "match": {
           "name": "Chicken"
        }
    }
}'

response:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.5,
      "hits": [
         {
            "_index": "bestbuy1",
            "_type": "movies",
            "_id": "7672734",
            "_score": 0.5,
            "_source": {
               "studio": "Walt Disney Video",
               "genre": "Childrens and Family",
               "mpaaRating": "G",
               "format": "DVD",
               "sku": 7672734,
               "releaseDate": "2006-03-21",
               "name": "Chicken Little (DVD)",
               "salePrice": 10.99
            }
         }
      ]
   }
}

Don’t worry if this is confusing. We’re going to come back to the issue of terms in a minute.
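
In the meantime, the difference between the two query types can be reproduced in a few lines of Python. This is a sketch only: `analyze` is a crude stand-in for the standard analyzer, and `term_query` and `match_query` are hypothetical functions that ignore scoring entirely.

```python
import re


def analyze(text):
    """Crude stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


names = {
    7672734: "Chicken Little (DVD)",
    2868671: "Indiana Jones and the Temple of Doom (Blu-ray Disc)",
}

# Lookup table for the "name" field: term -> ids of documents containing it.
lookup = {}
for doc_id, name in names.items():
    for term in analyze(name):
        lookup.setdefault(term, set()).add(doc_id)


def term_query(value):
    """Look the value up exactly as given, with no analysis."""
    return lookup.get(value, set())


def match_query(text):
    """Analyze the query text first, then collect documents matching any term."""
    hits = set()
    for term in analyze(text):
        hits |= lookup.get(term, set())
    return hits


print(term_query("chicken"))   # the lowercase term is in the lookup table
print(term_query("Chicken"))   # the capitalized form is not
print(match_query("Chicken"))  # analysis lowercases the query text first
```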


Facets

Let’s suppose we want to see a list of the values that have been stored for a particular field, along with the number of documents corresponding to each value. This is especially useful for a field like "studio" or "genre" that might be used to filter results on an e-commerce website. Facets are a handy way to accomplish this. Take a look at the example below.

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy1/_search" -d'
{
    "size": 0, 
    "query": {
        "match_all": {}
    },
    "facets": {
       "studio_terms": {
          "terms": {
             "field": "studio",
             "size": 10
          }
       }
    }
}'

response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "facets": {
      "studio_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 4,
         "other": 0,
         "terms": [
            {
               "term": "walt",
               "count": 1
            },
            {
               "term": "video",
               "count": 1
            },
            {
               "term": "paramount",
               "count": 1
            },
            {
               "term": "disney",
               "count": 1
            }
         ]
      }
   }
}

I used "size": 0 here, since we are interested in facet results, not document results. Notice that the response tells us how many document hits there were (all 2 of them, in this case), but does not return any since we asked for zero results.

If you look at the results we are getting for "studio", it may be apparent that this is not really what we want. If we were going to use these to construct navigation links on an e-commerce site, say, we would probably want the full, unmodified value “Walt Disney Video”, rather than the analyzed terms we are seeing. How can we accomplish this? We will have to define an explicit mapping.


Mappings

For the next few examples, I’ll be using the code in this gist:

http://sense.qbox.io/gist/84ba2dac5bf76a6ae0ec3984f7ff1791ecea887f

Elasticsearch provides a pretty simple way to be sure that only the exact, complete text of a field gets added as a term in the lookup table: all we have to do is add "index": "not_analyzed" to the definition for that field. But to do this, we have to define an explicit mapping, and we have to do it before we add any documents to the index. So next I’m going to create a new index with an explicit mapping; this mapping is very similar to the implicit one we saw before, but certain fields (the ones I want to facet on) will no longer be analyzed:

curl -XPUT "http://6c555678fc0fa941000.qbox.io/bestbuy2/"
curl -XPUT "http://6c555678fc0fa941000.qbox.io/bestbuy2/movies/_mapping" -d'
{
   "movies": {
      "properties": {
         "format": {
            "type": "string",
            "index": "not_analyzed"
         },
         "genre": {
            "type": "string",
            "index": "not_analyzed"
         },
         "mpaaRating": {
            "type": "string",
            "index": "not_analyzed"
         },
         "name": {
            "type": "string"
         },
         "releaseDate": {
            "type": "date",
            "format": "dateOptionalTime"
         },
         "salePrice": {
            "type": "double"
         },
         "sku": {
            "type": "long"
         },
         "studio": {
            "type": "string",
            "index": "not_analyzed"
         }
      }
   }
}'

So now, once the same two documents have been indexed into "bestbuy2", when I facet on the "studio" field I get the results that I expect:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy2/_search" -d'
{
   "size": 0,
   "query": {
      "match_all": {}
   },
   "facets": {
      "studio_terms": {
         "terms": {
            "field": "studio",
            "size": 10
         }
      }
   }
}'

response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "facets": {
      "studio_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 2,
         "other": 0,
         "terms": [
            {
               "term": "Walt Disney Video",
               "count": 1
            },
            {
               "term": "Paramount",
               "count": 1
            }
         ]
      }
   }
}
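
The two facet responses differ only because of what ended up in the lookup table. A small Python sketch makes that visible; as before, `analyze` is a crude stand-in for the standard analyzer and `terms_facet` is a hypothetical helper, not how Elasticsearch computes facets internally.

```python
import re
from collections import Counter


def analyze(text):
    """Crude stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def terms_facet(values, analyzed):
    """Count, per term, how many field values (one per document) contain it."""
    counts = Counter()
    for value in values:
        # An analyzed field contributes its tokens; a not_analyzed field
        # contributes the whole value as a single term.
        terms = set(analyze(value)) if analyzed else {value}
        counts.update(terms)
    return counts


studios = ["Walt Disney Video", "Paramount"]

print(terms_facet(studios, analyzed=True))
print(terms_facet(studios, analyzed=False))
```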

Filtered Queries

With the explicitly mapped index that I just created, I can now filter on the "studio" field using a filtered query:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy2/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "studio": "Walt Disney Video"
            }
         }
      }
   }
}'

response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "bestbuy2",
            "_type": "movies",
            "_id": "7672734",
            "_score": 1,
            "_source": {
               "studio": "Walt Disney Video",
               "genre": "Childrens and Family",
               "mpaaRating": "G",
               "format": "DVD",
               "sku": 7672734,
               "releaseDate": "2006-03-21",
               "name": "Chicken Little (DVD)",
               "salePrice": 10.99
            }
         }
      ]
   }
}

More Facets and Filters

For the next few examples, I’ll be using the code in this gist:

http://sense.qbox.io/gist/dbaa89bf0690a6a6dc2ae767491a231c743de3cc

I will use the same mapping but in a slightly different form, and a few more documents.

First take a look at the index definition:

curl -XPUT "http://6c555678fc0fa941000.qbox.io/bestbuy3/" -d'
{
   "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
   },
   "mappings": {
      "movies": {
         "properties": {
            "format": {
               "type": "string",
               "index": "not_analyzed"
            },
            "genre": {
               "type": "string",
               "index": "not_analyzed"
            },
            "mpaaRating": {
               "type": "string",
               "index": "not_analyzed"
            },
            "name": {
               "type": "string"
            },
            "releaseDate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "salePrice": {
               "type": "double"
            },
            "sku": {
               "type": "long"
            },
            "studio": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}'

Here we are defining the index with a single command instead of two. Also, we are defining how many shards and replicas we want our index to have. As I mentioned earlier, each shard/replica is a single Lucene index. The idea is that the index should be divided across several servers for performance reasons. The replica of each primary shard will live on a different server than the primary, so if a server is lost, a replica shard can be promoted to a primary, and we will still have a copy of all the data. Elasticsearch handles all this for us in the background. You can change the number of replicas in an existing index, but you cannot change the number of shards (without building a new index). So it’s important to estimate how many shards you will need when you create an index.
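
Why is the shard count fixed? Elasticsearch decides which shard a document lives on from a hash of its routing value (the document id by default) modulo the number of primary shards, so changing that number would change where existing documents are expected to be. The sketch below illustrates the idea only: `shard_for` is a hypothetical helper, and `zlib.crc32` is a stand-in hash I picked for convenience, not the function Elasticsearch uses.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard by hashing the id and taking it modulo the shard count."""
    # crc32 is only a stand-in for the real hash function.
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_shards


ids = [7672734, 9270898, 2868671, 2119035, 2374372, 9877862, 6502332]

# Placement with the 2 shards configured above.
placement = {doc_id: shard_for(doc_id, 2) for doc_id in ids}
for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)

# With a different shard count the same ids generally map differently,
# which is why the setting cannot be changed without reindexing.
placement_with_three = {doc_id: shard_for(doc_id, 3) for doc_id in ids}
```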

After creating our new index, we will add some data (7 docs instead of 2, this time):

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy3/_bulk" -d'
{"index":{"_index":"bestbuy3","_type":"movies","_id":7672734}}
{"studio":"Walt Disney Video","genre":"Childrens and Family","mpaaRating":"G","format":"DVD","sku":7672734,"releaseDate":"2006-03-21","name":"Chicken Little (DVD)","salePrice":10.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":9270898}}
{"studio":"Walt Disney Video","genre":"Fantasy","mpaaRating":"G","format":"Blu-ray Disc","sku":9270898,"releaseDate":"2009-05-19","name":"A Bug\"s Life (2 Disc) (Blu-ray Disc)","salePrice":29.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":2868671}}
{"studio":"Paramount","genre":"Action and Adventure","mpaaRating":"PG","format":"Blu-ray Disc","sku":2868671,"releaseDate":"2013-12-17","name":"Indiana Jones and the Temple of Doom (Blu-ray Disc)","salePrice":19.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":2119035}}
{"studio":"Funimation Prod","genre":"Sci-Fi","mpaaRating":"R","format":"Blu-ray Disc","sku":2119035,"releaseDate":"2013-11-12","quantityLimit":3,"name":"Akira (2 Disc) (Blu-ray Disc)","salePrice":24.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":2374372}}
{"studio":"Warner Home Video","genre":"Sci-Fi","mpaaRating":"R","format":"Blu-ray Disc","sku":2374372,"releaseDate":"2011-05-31","name":"A Clockwork Orange (2 Disc) (Anniversary Edition) (Blu-ray Disc)","salePrice":9.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":9877862}}
{"studio":"Sony Pictures","genre":"Sci-Fi","mpaaRating":"R","format":"DVD","sku":9877862,"releaseDate":"2010-05-25","name":"The Road (DVD)","salePrice":7.99}
{"index":{"_index":"bestbuy3","_type":"movies","_id":6502332}}
{"studio":"Warner Home Video","genre":"Westerns","mpaaRating":"NR","format":"DVD","sku":6502332,"releaseDate":"2010-11-09","name":"Rio Bravo (DVD)","salePrice":11.99}
'

And now let’s explore facets and filtering a bit more. We’re going to request terms facets on "studio" and "genre", and a histogram facet on "salePrice" with an interval of 5:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy3/_search" -d'
{
   "size": 0,
   "query": {
      "match_all": {}
   },
   "facets": {
      "studio_terms": {
         "terms": {
            "field": "studio",
            "size": 10
         }
      },
      "genre_terms": {
         "terms": {
            "field": "genre",
            "size": 10
         }
      },
      "price_hist": {
         "histogram": {
            "field": "salePrice",
            "interval": 5
         }
      }
   }
}'

The response is:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 7,
      "max_score": 0,
      "hits": []
   },
   "facets": {
      "studio_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 7,
         "other": 0,
         "terms": [
            {
               "term": "Warner Home Video",
               "count": 2
            },
            {
               "term": "Walt Disney Video",
               "count": 2
            },
            {
               "term": "Sony Pictures",
               "count": 1
            },
            {
               "term": "Paramount",
               "count": 1
            },
            {
               "term": "Funimation Prod",
               "count": 1
            }
         ]
      },
      "genre_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 7,
         "other": 0,
         "terms": [
            {
               "term": "Sci-Fi",
               "count": 3
            },
            {
               "term": "Westerns",
               "count": 1
            },
            {
               "term": "Fantasy",
               "count": 1
            },
            {
               "term": "Childrens and Family",
               "count": 1
            },
            {
               "term": "Action and Adventure",
               "count": 1
            }
         ]
      },
      "price_hist": {
         "_type": "histogram",
         "entries": [
            {
               "key": 5,
               "count": 2
            },
            {
               "key": 10,
               "count": 2
            },
            {
               "key": 15,
               "count": 1
            },
            {
               "key": 20,
               "count": 1
            },
            {
               "key": 25,
               "count": 1
            }
         ]
      }
   }
}

The histogram facet works on a numeric field, and segments the data into “buckets”, with the interval each bucket represents defined by the "interval" setting in the request. The key of each result, by default, is the minimum value for that bucket. Histogram facets are quite powerful; you can even define scripts to create custom keys and values for the histogram buckets.
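
The bucketing itself is simple arithmetic: round each value down to the nearest multiple of the interval and count. Here is a sketch in Python using the seven sale prices from the documents above; `histogram_facet` is a hypothetical helper that only mirrors the shape of the "entries" list in the response.

```python
import math
from collections import Counter


def histogram_facet(values, interval):
    """Bucket values by rounding each one down to a multiple of the interval."""
    counts = Counter(math.floor(value / interval) * interval for value in values)
    return [{"key": key, "count": counts[key]} for key in sorted(counts)]


sale_prices = [10.99, 29.99, 19.99, 24.99, 9.99, 7.99, 11.99]
entries = histogram_facet(sale_prices, 5)
for entry in entries:
    print(entry)
```

For example, 7.99 and 9.99 both round down to 5, which is why the bucket with key 5 has a count of 2.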

Facets are often used to present a user with filtering options. We know from our facet results that there are two documents with a price between $5 and $10, so let’s retrieve those documents with a range filter:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy3/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "range": {
               "salePrice": {
                  "from": 5,
                  "to": 10
               }
            }
         }
      }
   }
}'

response:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "bestbuy3",
            "_type": "movies",
            "_id": "9877862",
            "_score": 1,
            "_source": {
               "studio": "Sony Pictures",
               "genre": "Sci-Fi",
               "mpaaRating": "R",
               "format": "DVD",
               "sku": 9877862,
               "releaseDate": "2010-05-25",
               "name": "The Road (DVD)",
               "salePrice": 7.99
            }
         },
         {
            "_index": "bestbuy3",
            "_type": "movies",
            "_id": "2374372",
            "_score": 1,
            "_source": {
               "studio": "Warner Home Video",
               "genre": "Sci-Fi",
               "mpaaRating": "R",
               "format": "Blu-ray Disc",
               "sku": 2374372,
               "releaseDate": "2011-05-31",
               "name": "A Clockwork Orange (2 Disc) (Anniversary Edition) (Blu-ray Disc)",
               "salePrice": 9.99
            }
         }
      ]
   }
}

Now suppose we want to further filter our results by "studio". We can use a bool filter to combine our two constraints:

curl -XPOST "http://6c555678fc0fa941000.qbox.io/bestbuy3/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "range": {
                        "salePrice": {
                           "from": 5,
                           "to": 10
                        }
                     }
                  },
                  {
                     "term": {
                        "studio": "Sony Pictures"
                     }
                  }
               ]
            }
         }
      }
   }
}'

response:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "bestbuy3",
            "_type": "movies",
            "_id": "9877862",
            "_score": 1,
            "_source": {
               "studio": "Sony Pictures",
               "genre": "Sci-Fi",
               "mpaaRating": "R",
               "format": "DVD",
               "sku": 9877862,
               "releaseDate": "2010-05-25",
               "name": "The Road (DVD)",
               "salePrice": 7.99
            }
         }
      ]
   }
}
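
If it helps to think of filters as plain yes/no predicates, the two filtered queries above can be mimicked in Python. This is a sketch for intuition only: `range_filter`, `term_filter`, `bool_must`, and `search` are hypothetical helpers, the documents are trimmed to three fields, and the range is treated as inclusive at both ends.

```python
movies = [
    {"sku": 7672734, "studio": "Walt Disney Video", "salePrice": 10.99},
    {"sku": 9270898, "studio": "Walt Disney Video", "salePrice": 29.99},
    {"sku": 2868671, "studio": "Paramount", "salePrice": 19.99},
    {"sku": 2119035, "studio": "Funimation Prod", "salePrice": 24.99},
    {"sku": 2374372, "studio": "Warner Home Video", "salePrice": 9.99},
    {"sku": 9877862, "studio": "Sony Pictures", "salePrice": 7.99},
    {"sku": 6502332, "studio": "Warner Home Video", "salePrice": 11.99},
]


def range_filter(field, low, high):
    """Keep documents whose field value lies between low and high, inclusive."""
    return lambda doc: low <= doc[field] <= high


def term_filter(field, value):
    """Keep documents whose field value equals the given value exactly."""
    return lambda doc: doc[field] == value


def bool_must(*filters):
    """Keep documents that pass every one of the given filters."""
    return lambda doc: all(f(doc) for f in filters)


def search(docs, predicate):
    """Return the skus of the documents that pass the predicate."""
    return [doc["sku"] for doc in docs if predicate(doc)]


price_only = search(movies, range_filter("salePrice", 5, 10))
price_and_studio = search(
    movies,
    bool_must(range_filter("salePrice", 5, 10), term_filter("studio", "Sony Pictures")),
)
print(price_only)
print(price_and_studio)
```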

Conclusion

We have covered a lot of ground here. Don’t worry if it didn’t all completely sink in. Elasticsearch is a very rich topic, and worthy of extended study. That’s one reason I wanted to write this talk up as a blog post, so anybody who wants to can refer back to it later. If you have any questions or feedback for me, don’t hesitate to use the comments below. Thank you for listening (and/or reading)!


Bonus: Autocomplete

I doubt there will be enough time, but if there is, we can dive into a quick example of using Elasticsearch to implement autocomplete.