In this tutorial, we step through an introduction to a very popular Elasticsearch feature: the Percolator.

When most ES developers think conventionally, they design documents according to the structure of their data and store them in an index. When they subsequently want to retrieve those documents, they define queries through the search API. The Percolator works in the opposite direction: first you store queries in an index, and then, through the Percolate API, you submit documents in order to retrieve the queries that match them. Continue reading to see how you can use Percolate to perform these reverse searches.

Setup

If you've never done an install and basic setup of Elasticsearch, we recommend that you invest 15 minutes to acquaint yourself with our Elasticsearch tutorial. After installing it, you can run any of the code that we provide in the examples below.
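
If you want a quick sanity check before running the examples (this assumes a local node listening on the default port 9200), you can ask the cluster for its basic info:

curl -XGET 'http://localhost:9200/'

A small JSON response with the node name and version number confirms that Elasticsearch is up and reachable.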

Percolate

The Percolate API is a commonly-used utility in Elasticsearch for alerting and monitoring documents. A good way to think about the main function of Percolate is "search in reverse." Elasticsearch usually queries a set of documents, looking for relevance of each one to a specific search request. Percolate works in the opposite way, running your documents up against registered queries (percolators) for matches.


A Little History

Release 1.0 of Elasticsearch brought a major change to how the Percolate API distributes its registered queries. Percolator 0.90.x and earlier versions had a single-shard index restriction, and with a single shard, performance degrades steadily as the number of registered queries grows.

To get around this bottleneck, you could either partition queries against multiple single-shard indices or manipulate Percolate queries to reduce the execution time. However, using these methods would still cause scaling limitations for any Percolator index shard.

Having to “get around” the bottleneck was a concern for the Elasticsearch team, and they had been working on a distributed enhancement to the Percolator. Since release 1.0.0, distributed Percolation has done away with these concerns, dropping the previous _percolator index shard restriction in favor of a .percolator type within an ordinary index.


Distributed Percolation

The .percolator type gives users a fully distributed Percolator API environment. You can now configure the number of shards for your Percolator queries, moving from restricted single-shard execution to parallelized execution across all shards within that index. Multiple shards also mean support for routing and preference, just like the other Search APIs (except the Explain API).
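
As a minimal sketch, assuming you are creating the sports index used throughout this article, you could set the shard count explicitly at index creation time (the numbers here are only illustrative):

curl -XPUT 'localhost:9200/sports' -d '{
   "settings" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
   }
}'

Every .percolator query registered in this index is then spread across those five shards, and percolation requests execute against all of them in parallel.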


Dropping the old _percolator index shard restriction does break backwards compatibility with the 0.90.x Percolator, but breaking changes of this kind are what make room for real renovations and new features.

Structure of a Percolator

Registering a .percolator is easy. In this example, we register a match query for the sport field containing “baseball.”

curl -XPUT 'localhost:9200/sports/.percolator/1' -d '{
   "query" : { 
       "match" : {
           "sport" : "baseball"
       }
   }
}'

The default mapping for a .percolator type is a query field of type object, with enabled set to false. (Setting enabled to false disables parsing and indexing on a named object.) It is also worth noting that this type can live in a dedicated Percolator index. When you use a dedicated Percolator index, remember to include the mapping of the documents that you _percolate; without the correct document mapping, your .percolator queries are likely to be parsed incorrectly.
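
As a sketch of that dedicated setup (the index name sports-percolators here is purely illustrative), you would create the index with the mapping of the document type you intend to percolate, and then register your .percolator queries against it:

curl -XPUT 'localhost:9200/sports-percolators' -d '{
   "mappings" : {
      "athlete" : {
         "properties" : {
            "sport" : { "type" : "string" },
            "birthdate" : { "type" : "date" }
         }
      }
   }
}'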

To inspect the default .percolator mapping on our sports index, here is our example request:

curl -XGET "http://localhost:9200/sports/_mapping"

Response:

{
  "sports" : {
    "mappings" : {
      ".percolator" : {
        "_id" : {
          "index" : "not_analyzed"
        },
        "properties" : {
          "query" : {
            "type" : "object",
            "enabled" : false
          }
        }
      }
    }
  }
}

Percolate

Running a document through _percolate returns a match for every registered .percolator query that the document satisfies. There are several ways we can run our documents against our Percolator. First, we will use the standard “doc” body to execute the _percolate API. We would typically use this method on documents that do not already exist in the index.

Here's the percolator:

curl -XPUT 'localhost:9200/sports/.percolator/1' -d '{
   "query" : {
       "match" : {
           "sport" : "baseball"
       }
   }
}'


Percolating on a “doc” body:

curl -XPOST "http://localhost:9200/sports/athlete/_percolate/" -d '{
  "doc": {
     "name": "Jeff",
     "birthdate": "1990-4-1",
     "sport": "Baseball",
     "rating": 2,
     "location": "46.12,-68.55"
  }
}'


This sports index has a single .percolator with "_id": "1", which our document matches. You can see in the response below that the request took 1 ms, that 5 out of 5 shards were successful, and that we get one match: the Percolator with "_id": "1" in the sports index.

Response:

{
  "took": 1,
  "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
  },
  "total": 1,
  "matches": [
     {
        "_index": "sports",
        "_id": "1"
     }
  ]
}

You can achieve bulk document percolation with the multi-percolate API, which is similar to the bulk API. Each multi-percolate item begins with a header line in which you specify the index, type, and (optionally) id, followed by a JSON document body. Keep in mind that no JSON document is necessary when you're percolating an existing document; you only reference its _id in the header, and an empty body line ({}) takes its place.

Request:

curl -XGET 'localhost:9200/sports/athlete/_mpercolate' --data-binary @multi-percolate.txt; echo


multi-percolate.txt:

{"percolate" : {"index" :”sport", "type" : "athlete"}}
{"doc" : {"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"}}
{"percolate" : {"index" : twitter", "type" : "tweet", "id" : "1"}}
{}


To _percolate a single existing document, simply include the _id of the document:

curl -XGET 'localhost:9200/sports/athlete/1/_percolate'


Another format for the standard _percolate response is count, which only responds with the total number of matches.

curl -XPOST "http://localhost:9200/sports/athlete/_percolate/count" -d '{
  "doc": {
     "name": "Jeff",
     "birthdate": "1990-4-1",
     "sport": "Baseball",
     "rating": 2,
     "location": "46.12,-68.55"
  }
}'

This is the response:

{
  "took": 3,
  "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
  },
  "total": 1
}

One method to percolate only specific athletes with the sport baseball would be to apply a filter, which is matched against metadata stored alongside the registered queries. We could also, for example, register a second .percolator on a specific field such as birthdate.

curl -XPOST "http://localhost:9200/sports/athlete/_percolate/" -d '{
  "doc": {
     "name": "Jeff",
     "birthdate": "1990-4-1",
     "sport": "Baseball",
     "rating": 2,
     "location": "46.12,-68.55"
  },
  "filter": {
     "term": {
        "sport": "baseball"
     }
  }
}'
curl -XPUT "http://localhost:9200/sports/.percolator/2" -d '{
   "query":{
       "match": {
          "birthdate": "1990-4-1"
       }
   }
}'


Other options you can pass in a _percolate request body include size, track_scores, sort, facets, aggs, and highlight. The query and filter options differ only in whether a score is computed: a query also scores each match, based on how well the Percolate query’s metadata matches, and that score can then be shown with track_scores or used with sort. You can also use highlight, facets, or aggregations on these request bodies, and use size to specify the number of matches to return (the default is unlimited).
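
Here is a rough sketch of a percolate request that combines a few of these options; it assumes the same sports index as above, and the specific values are only illustrative:

curl -XPOST "http://localhost:9200/sports/athlete/_percolate" -d '{
  "doc": {
     "name": "Jeff",
     "sport": "Baseball"
  },
  "size": 10,
  "track_scores": true,
  "highlight": {
     "fields": {
        "sport": {}
     }
  }
}'

With size set, at most ten matching Percolators come back, each carrying a score and a highlighted snippet of the field that matched.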

If properly configured, distributed percolation can be a robust solution for some of the most active databases in production today.

If you enjoyed this post, you’ll want to check out some of our other tutorials, such as An Introduction to Elasticsearch Aggregations or Quick and Dirty Autocomplete with Elasticsearch Completion Suggest.


