Autocomplete is everywhere. Users have come to expect it in almost any search experience, and an elegant way to implement it belongs in every software developer's toolbox.

There are at least two broad types of autocomplete, which I will call Search Suggest and Result Suggest. Search Suggest returns suggestions for search phrases, usually based on previously logged searches, ranked by popularity or some other metric. This is what Google does, and it is what you will see on many large e-commerce sites. This approach requires logging users' searches and ranking them so that the autocomplete suggestions evolve over time. There really isn't a good way to implement this sort of feature without logging searches, which is why DuckDuckGo doesn't have autocomplete. It also suffers from a chicken-and-egg problem: it will not work well at first unless you have a good set of seed data.

The second type of autocomplete is Result Suggest. In this case the suggestions are actual results rather than search phrase suggestions. An example of this is the Elasticsearch documentation guide. This style of autocomplete works well with a reasonably small data set, and it has the advantage of not requiring a large set of previously logged searches in order to be useful. In this post I'm going to describe a method of implementing Result Suggest using Elasticsearch.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Completion Suggest

Elasticsearch provides a convenient way to get autocomplete up and running quickly with its completion suggester feature. This feature is very powerful, very fast, and very easy to use. It can be used to implement either type of autocomplete (although for Search Suggest you will need a second index for storing logged searches). I made a short post about completion suggest last week, and if you need to get up and running quickly you should definitely try it out.
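
For reference, here is a minimal sketch of what a completion suggest setup looks like in the Elasticsearch 1.x API used throughout this post. The index, type, and field names ("movies", "movie", "suggest", "movie_suggest") are hypothetical: the mapping declares a field of type "completion", and suggestions are retrieved through the _suggest endpoint.

# Hypothetical index; any field of type "completion" works this way.
curl -XPUT "http://localhost:9200/movies" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "suggest": {
               "type": "completion"
            }
         }
      }
   }
}'

# Ask for suggestions matching the prefix "wal".
curl -XPOST "http://localhost:9200/movies/_suggest" -d'
{
   "movie_suggest": {
      "text": "wal",
      "completion": {
         "field": "suggest"
      }
   }
}'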

Completion suggest has a few constraints, however, due to the nature of how it works.

Single field. To use completion suggest, you have to specify a field of type "completion" in your mapping (here is an example). While you can specify many inputs and a single unified output, only this field can be used with a _suggest query. If you want the _suggest results to correspond to search inputs from many different fields in your document, you have to provide all of those values as inputs at index time.

Duplicate data. In order for completion suggest to be useful, it has to return results that match the text query, and these matches are determined at index time by the inputs specified in the "completion" field (and stemming of those inputs). This usually means that, as in this example, you end up with duplicated data (see the sketch following this list).

Prefix query only. Completion suggest is designed to be a powerful and easily implemented solution for autocomplete that works well in most cases. Most of the time autocomplete need only work as a prefix query. If I type "word" then I expect "wordpress" as a suggestion, but not "afterword." If I want more general partial word completion, however, I must look elsewhere.

No filtering or advanced queries. In many, and perhaps most, autocomplete applications, no advanced querying is required. Let's suppose, however, that I only want autocomplete results to conform to some set of filters that have already been established (by the selection of category facets on an e-commerce site, for example). There is no way to handle this with completion suggest. If I need more advanced querying capabilities I will need to set up a different sort of autocomplete system.
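
To illustrate the first two constraints, here is a sketch (continuing the hypothetical "movies" index from above) of what indexing a document for completion suggest might look like. Every value we want to be searchable has to be copied into the "input" list of the "completion" field, duplicating data that already lives in other fields:

# Values from "name" and "studio" must be duplicated as inputs.
curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
   "name": "The Sword in the Stone (DVD)",
   "studio": "Walt Disney Video",
   "suggest": {
      "input": ["The Sword in the Stone", "Walt Disney Video"],
      "output": "The Sword in the Stone (DVD)"
   }
}'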

I'm going to explain a technique for implementing autocomplete (it also works for standard search functionality) that does not suffer from these limitations. But first I want to show you the dataset I will be using and a demonstration site that uses the technique I will be explaining.

Best Buy Movies Demo and Index

[Screenshot: the Best Buy movies demo]

http://demos.qbox.io/bestbuy/

For the remainder of this post I will refer to the demo at the link above as well as the Elasticsearch index it uses to provide both search results and autocomplete. The demo is useful because it shows a real-world (well, close to real-world) example of the issues we will be discussing. It is a single-page e-commerce search application that pulls its data from an Elasticsearch index. The index lives here (on a Qbox hosted Elasticsearch cluster, of course!):

http://be6c2e3260c3e2af000.qbox.io/blurays/

The index was constructed using the Best Buy Developer API. Here is an example document, so you can see what the structure looks like:

{
   "studio":"Walt Disney Video",
   "genre":"Fantasy",
   "mpaaRating":"G",
   "image":"http://images.bestbuy.com/BestBuy_US/en_US/images/musicmoviegame//pdpimages/1008628.jpg",
   "format":"DVD",
   "sku":1008628,
   "plot":"1963's The Sword in the Stone is Disney's animated take on Arthurian legend. In the midst of the Dark Ages, when England has no rightful ruler, a sword imbedded in a stone mysteriously appears in a London churchyard, bearing the inscription \"Whoso pulleth out the sword of this stone and anvil is rightwise king born of England.\" Scores of would-be kings travel to London to attempt the feat and thereby claim the throne. They all fail. Years later, in the English countryside, an 11-year-old squire nicknamed Wart (Rickie Sorensen) is devotedly helping his incompetent foster brother, Kay (Norman Alden), train to become a knight, when he meets the great magician Merlin (Karl Swenson). The well meaning, but absentminded, wizard declares himself Wart's mentor and claims that he will lead the boy to his destiny. Spirited and full of spunk, Wart (whose real name is Arthur) approaches Merlin's lessons with the same determination that he applies to Kay's hopeless training and to the monotonous chores he is assigned by his guardian. He soon finds himself accompanying Kay to London for a jousting tournament that will determine England's new king. There, Wart forgets to bring Kay's weapon to the joust, but finds an abandoned sword in a nearby churchyard -- which he effortlessly pulls out of a stone. ~ Aubry Anne D'Arminio, Rovi",
   "releaseDate":"2013-08-06",
   "quantityLimit":3,
   "name":"The Sword in the Stone (DVD)",
   "addToCartUrl":"http://www.bestbuy.com/site/olspage.jsp?id=pcmcat152200050035&type=category&cmp=RMX&ky=2f5IBQ5bGt0gS6BfPZApsjVfKmtgVmI14&qvsids=1008628",
   "salePrice":21.99
}

Project Requirements

Suppose that we are given the following requirements for autocomplete (and/or search) from a manager or client:

Partial word matching. The query must match partial words. So typing "disn" should return results containing "Disney". Matches should be returned even if the search term occurs in the middle of a word. So for example, "day" should return results containing "holiday".

Multiple search fields. The query must match across several fields. So typing "Disney 2013" should match Disney movies with a 2013 release date. For concreteness, the fields that queries must be matched against are: ["name", "genre", "studio", "sku", "releaseDate"].

Filtered search. The results returned should match the currently selected filters. For example, if a user of the demo site given above has already selected Studio: "Walt Disney Video", MPAA Rating: "G", and Genre: "Sci-Fi" and then types "wall", she should easily be able to find "Wall-E" (you can see this in action here).

nGrams

What is an n-gram? Well, in this context an n-gram is just a sequence of n consecutive characters taken from a given string, that is, a substring of length n. To understand why this is important, we need to talk about analyzers, tokenizers, and token filters.
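
To make that concrete: the complete set of 2-grams (substrings of length 2) that can be taken from "disney" is ["di", "is", "sn", "ne", "ey"], and the set of 4-grams is ["disn", "isne", "sney"].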

When you index documents with Elasticsearch, it uses them to build an inverted index. This is basically a dictionary containing a list of terms (a.k.a. "tokens"), together with references to the documents in which those terms appear. Each field in the mapping (whether the mapping is explicit or implicit) is associated with an "analyzer", and an analyzer consists of a "tokenizer" and zero or more "token filters." The analyzer is responsible for transforming the text of a given document field into the tokens in the lookup table used by the inverted index. When a text search is performed, the search text is also analyzed (usually), and the resulting tokens are compared against those in the inverted index; if matches are found, the referenced documents are returned.

For example, given the document above, if the "studio" field is analyzed using the standard analyzer (the default, when no analyzer is specified), then the text "Walt Disney Video" would be transformed into the tokens ["walt", "disney", "video"], and so a search for the term "disney" would match one of the terms listed for that document, and the document would be returned.
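
You can watch an analyzer at work with the _analyze API. Here is a sketch of that call against a local cluster (the localhost URL is a placeholder; the endpoint and the standard analyzer are stock Elasticsearch):

# Show the tokens the standard analyzer produces for this text.
curl -XGET "http://localhost:9200/_analyze?analyzer=standard" -d 'Walt Disney Video'

This returns the tokens "walt", "disney", and "video", together with their positions and character offsets.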

The "nGram" tokenizer and token filter can be used to generate tokens from substrings of the field value. There are edgeNGram versions of both, which only generate tokens that start at the beginning of words ("front") or end at the end of words ("back"). I will be using nGram token filter in my index analyzer below.

The Analyzers

Now I'm going to show you my solution to the project requirements given above, for the Best Buy movie data we've been looking at. Here is the first part of the settings used by the index (in curl syntax):

curl -XPUT "http://localhost:9200/blurays " -d'
{
   "settings": {
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      ...
   }
}'

I'll get to the mapping in a minute, but first let's take a look at the analyzers. Notice that there are two analyzers in the index settings: "whitespace_analyzer" and "nGram_analyzer" (these are names that I defined; I could have called them anything I wanted to).

"whitespace_analyzer." The "whitespace_analyzer" will be used as the search analyzer (the analyzer that tokenizes the search text when a query is executed). I'll discuss why that is important in a minute, but first let's look at how it works. This analyzer uses the whitespace tokenizer, which simply splits text on whitespace, and then applies two token filters. The lowercase token filter normalizes all the tokens to lower-case, and the ascii folding token filter cleans up non-standard characters that might otherwise cause problems.

"nGram_analyzer." The "nGram_analyzer" does everything the "whitespace_analyzer" does, but then it also applies the "nGram_filter." The "nGram_filter" is what generates all of the substrings that will be used in the index lookup table. It is a token filter of "type": "nGram". "min_gram": 2 and "max_gram": 20 set the minimum and maximum length of substrings that will be generated and added to the lookup table. "token_chars" specifies what types of characters are allowed in tokens. Punctuation and special characters will normally be removed from the tokens (for example, with the standard analyzer), but specifying "token_chars" the way I have means we can do fun stuff like this (to, ahem, depart from the Disney theme for a moment).

The "_all" Field

We want to be able to search across multiple fields, and the easiest way to do that is with the "_all" field, as long as some care is taken in the mapping definition. One of our requirements is that queries be matched against only certain fields, so we keep the other fields out of the "_all" field by setting "include_in_all": false on the fields we don't want to search against. Here is a simplified version of the mapping being used in the demonstration index:

curl -XPUT "http://localhost:9200/blurays" -d'
{
   "settings": {
      "analysis": {
    ...
      }
   },
   "mappings": {
      "movies": {
         "_all": {
            "index_analyzer": "nGram_analyzer",
            "search_analyzer": "whitespace_analyzer"
         },
         "properties": {
            "addToCartUrl": {
               "type": "string",
               "index": "no",
               "include_in_all": false
            },
            "format": {
               "type": "string",
               "index": "not_analyzed"
            },
            "genre": {
               "type": "string",
               "index": "not_analyzed"
            },
            "image": {
               "type": "string",
               "index": "no",
               "include_in_all": false
            },
            "mpaaRating": {
               "type": "string",
               "index": "not_analyzed",
               "include_in_all": false
            },
            "name": {
               "type": "string",
               "index": "not_analyzed"
            },
            "plot": {
               "type": "string",
               "index": "no",
               "include_in_all": false
            },
            "price": {
               "type": "double",
               "include_in_all": false
            },
            "quantityLimit": {
               "type": "integer",
               "include_in_all": false
            },
            "releaseDate": {
               "type": "date",
               "format": "yyyy-MM-dd"
            },
            "sku": {
               "type": "string",
               "index": "not_analyzed"
            },
            "studio": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}'

There are several things to notice here. The first is that the fields we do not want to search against have "include_in_all": false set in their definitions. Second, notice the "index" setting. Setting "index": "not_analyzed" means that Elasticsearch will not perform any analysis on that field when building the tokens for the lookup table, so the text "Walt Disney Video" will be indexed as the single, unchanged term "Walt Disney Video". This is useful for faceting. Setting "index": "no" means the field will not be indexed at all. Since we do nothing with the "plot" field except display it when we show results in the UI, there is no reason to index it (build a lookup table from it), so we can save some space by not doing so.

Finally, take a look at the definition of the "_all" field. This is where we put our analyzers to use. Notice that both an "index_analyzer" and a "search_analyzer" have been defined. The "index_analyzer" is the one used to construct the tokens in the lookup table for the index; this work is done at index time. The "search_analyzer" is the one used to analyze the search text that we send in a search query. We don't want to tokenize our search text into nGrams because doing so would generate lots of false-positive matches. For example, if we search for "disn", we probably don't want to match every document that contains "is"; we only want to match documents that contain the full string "disn". We do want a little bit of simple analysis, though, namely splitting on whitespace, lower-casing, and asciifolding.
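
As a sanity check, we can run some search text through "whitespace_analyzer" with the same _analyze API used earlier (placeholder URL again):

# Show how the search side tokenizes a query string.
curl -XGET "http://localhost:9200/blurays/_analyze?analyzer=whitespace_analyzer" -d 'Disney 2013'

This produces exactly two tokens, "disney" and "2013", with no substrings generated, which is just what we want on the search side.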

Full Example

Now that we've explained all the pieces, it's time to put them together. If you go to the demo and type in "disn 123 2013", you will see the following:

[Screenshot: search results for "disn 123 2013", with the matching text highlighted]

As you can see from the highlighting (that part is being done with JavaScript, not Elasticsearch, although it is possible to do highlighting with Elasticsearch), the search text has been matched against several different fields: "disn" matches on the "studio" field, "123" matches on "sku", and "2013" matches on "releaseDate".

Here is what the query looks like (translated to curl):

curl -XPOST "http://be6c2e3260c3e2af000.qbox.io/blurays/_search" -d'
{
   "size": 10,
   "query": {
      "match": {
         "_all": {
            "query": "disn 123 2013",
            "operator": "and"
         }
      }
   }
}'

Notice how simple this query is. We've already done all the hard work at index time, so writing the search query itself is quite simple. We just do a "match" query against the "_all" field, being sure to specify "and" as the operator ("or" is the default).

Now, suppose we have selected the filter "genre":"Cartoons and Animation", and then type in the same search query; this time we only get two results:

[Screenshot: the same search with the "Cartoons and Animation" genre filter applied, returning two results]

This is because the JavaScript constructing the query knows we have selected the filter, and applies it to the search query.

In this case, the query looks like:

curl -XPOST "http://be6c2e3260c3e2af000.qbox.io/blurays/_search" -d'
{
   "size": 10,
   "query": {
      "filtered": {
         "query": {
            "match": {
               "_all": {
                  "query": "disn 123 2013",
                  "operator": "and"
               }
            }
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "genre": "Cartoons and Animation"
                     }
                  }
               ]
            }
         }
      }
   }
}'

Conclusion

This has been a long post, and we've covered a lot of ground. We have seen how to create autocomplete functionality that matches multi-word text queries across several fields without requiring any duplication of data, that matches partial words at the beginning, middle, or end of the words in the target fields, and that can be combined with filters to limit the possible document matches to only the most relevant. This system can be used to provide robust and user-friendly autocomplete in a production setting, and it can be modified to meet the needs of most situations. I hope this post has been useful for you, and happy Elasticsearching!
