In this post we will walk through the basics of using ngrams in Elasticsearch.

Wikipedia has this to say about ngrams:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

In the fields of machine learning and data mining, "ngram" will often refer to sequences of n words. In Elasticsearch, however, an "ngram" is a sequence of n characters. For example, the character 3-grams of "hello" are "hel", "ell", and "llo". There are various ways these sequences can be generated and used. We'll take a look at some of the most common.

Note to the impatient: Need some quick ngram code to get a basic version of autocomplete working? See the TL;DR at the end of this blog post.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."


Code

All the code used in this post can be found here:

http://sense.qbox.io/gist/6f5519cc3db0772ab347bb85d969db14d85858f2


A Quick Note on Analysis

Understanding ngrams in Elasticsearch requires a passing familiarity with the concept of analysis in Elasticsearch. There are a great many options for indexing and analysis, and covering them all would be beyond the scope of this blog post, but I'll try to give you a basic idea of the system as it's commonly used.

When a document is "indexed," there are actually (potentially) several inverted indexes created, one for each field (unless the field mapping has the setting "index": "no"). The inverted index for a given field consists, essentially, of a list of terms for that field, and pointers to documents containing each term. Therefore, when a search query matches a term in the inverted index, Elasticsearch returns the documents corresponding to that term.
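Conceptually, the inverted index for a field looks something like this (a simplified sketch; the terms and document IDs here are invented for illustration):

term        documents containing the term
"hello"  -> 1, 4, 7
"world"  -> 1, 2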

For example, suppose that I've indexed the following document (I took the primary definition from Dictionary.com):

{
    "word": "democracy",
    "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."
}

If I use the standard analyzer in the mapping for the "word" field, then the inverted index for that field will contain the term "democracy" with a pointer to this document, and "democracy" will be the only term in that field's inverted index that points to this document.

On the other hand, for the "definition" field of this document, the standard analyzer will produce many terms, one for each word in the text, minus spaces and punctuation.

How are these terms generated? As the ES documentation tells us:

Analyzers are composed of a single Tokenizer and zero or more TokenFilters. The tokenizer may be preceded by one or more CharFilters.

CharFilters remove or replace characters in the source text; this can be useful for stripping HTML tags, for example. That's all I'll say about them here. Tokenizers divide the source text into sub-strings, or "tokens" (more about this in a minute). Token filters perform various kinds of operations on the tokens supplied by the tokenizer to generate new tokens. Now we're almost ready to talk about ngrams.
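First, though, to make the pipeline concrete, here is a sketch of a custom analyzer that wires all three stages together. The index and analyzer names are placeholders I've made up; "html_strip", "standard", and "lowercase" are all built into Elasticsearch:

PUT /analyzer_example
{
   "settings": {
      "analysis": {
         "analyzer": {
            "html_text_analyzer": {
               "type": "custom",
               "char_filter": [ "html_strip" ],
               "tokenizer": "standard",
               "filter": [ "lowercase" ]
            }
         }
      }
   }
}

Applied to the text "<p>Hello, World!</p>", an analyzer like this should strip the tags, split on whitespace and punctuation, and lower-case the tokens, yielding "hello" and "world".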


Ngram Tokenizer versus Ngram Token Filter

At first glance the distinction between using the ngram tokenizer or the ngram token filter can be a bit confusing. The difference is perhaps best explained with examples, so I'll show how the text "Hello, World!" can be analyzed in a few different ways.

Term Vectors

Term vectors can be a handy way to take a look at the results of an analyzer applied to a specific document. (Another way is the analyze API.) I will use them here to help us see what our analyzers are doing. Term vectors do add some overhead, so you may not want to use them in production if you don't need them, but they can be very useful for development.
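If you just want to see how an analyzer treats a piece of text without indexing a document first, the analyze API is even quicker. Here's a minimal sketch using the query-string form of the API:

GET /_analyze?analyzer=standard&text=Hello, World!

This should return the tokens "hello" and "world", along with their positions and offsets.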

Mapping for the First Example

For this first set of examples, I'm going to use a very simple mapping with a single field, and index only a single document, then ask Elasticsearch for the term vector for that document and field. As a reference, I'll start with the standard analyzer. Here is the mapping:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "term_vector": "yes"
            }
         }
      }
   }
}

(I used a single shard because that's all I need, and it also makes it easier to read errors if any come up.)

Now I index a single document with a PUT request:

PUT /test_index/doc/1
{
    "text_field": "Hello, World!"
}

And now I can take a look at the terms that were generated when the document was indexed, using a term vector request:

GET /test_index/doc/1/_termvector?fields=text_field
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text_field": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 1,
            "sum_ttf": 2
         },
         "terms": {
            "hello": {
               "term_freq": 1
            },
            "world": {
               "term_freq": 1
            }
         }
      }
   }
}

The two terms "hello" and "world" are returned. (Hopefully this isn't too surprising.)

Ngram Tokenizer

Next let's take a look at the same text analyzed using the ngram tokenizer. For simplicity and readability, I've set up the analyzer to generate only ngrams of length 4 (also known as 4-grams). In the mapping, I define a tokenizer of type "nGram" and an analyzer that uses it, and then specify that the "text_field" field in the mapping use that analyzer. So I delete and rebuild the index with the new mapping:

DELETE /test_index
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "tokenizer": {
            "ngram_tokenizer": {
               "type": "nGram",
               "min_gram": 4,
               "max_gram": 4
            }
         },
         "analyzer": {
            "ngram_tokenizer_analyzer": {
               "type": "custom",
               "tokenizer": "ngram_tokenizer"
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "term_vector": "yes",
               "analyzer": "ngram_tokenizer_analyzer"
            }
         }
      }
   }
}

Now I reindex the document, and request the term vector again:

PUT /test_index/doc/1
{
    "text_field": "Hello, World!"
}
GET /test_index/doc/1/_termvector?fields=text_field

And this time the term vector is rather longer:

{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text_field": {
         "field_statistics": {
            "sum_doc_freq": 10,
            "doc_count": 1,
            "sum_ttf": 10
         },
         "terms": {
            " Wor": {
               "term_freq": 1
            },
            ", Wo": {
               "term_freq": 1
            },
            "Hell": {
               "term_freq": 1
            },
            "Worl": {
               "term_freq": 1
            },
            "ello": {
               "term_freq": 1
            },
            "llo,": {
               "term_freq": 1
            },
            "lo, ": {
               "term_freq": 1
            },
            "o, W": {
               "term_freq": 1
            },
            "orld": {
               "term_freq": 1
            },
            "rld!": {
               "term_freq": 1
            }
         }
      }
   }
}

Notice that the ngram tokens have been generated without regard to the type of character; the terms include spaces and punctuation characters, and the characters have not been converted to lower-case. There are times when this behavior is useful; for example, you might have product names that contain weird characters and you want your autocomplete functionality to account for them.

Lowercase, Alphanumeric Ngrams

I can adjust both of these issues pretty easily (assuming I want to). The ngram tokenizer takes a parameter called token_chars that accepts five character classes (letter, digit, whitespace, punctuation, and symbol) specifying which characters to "keep." Elasticsearch will tokenize ("split") on any characters not in the specified classes. If you don't specify any character classes, then all characters are kept (which is what happened in the previous example). In the next example I'll tell Elasticsearch to keep only alphanumeric characters and discard the rest.

If I want the tokens to be converted to all lower-case, I can add the lower-case token filter to my analyzer. Here is the mapping with both of these refinements made:

DELETE /test_index
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "tokenizer": {
            "ngram_tokenizer": {
               "type": "nGram",
               "min_gram": 4,
               "max_gram": 4,
               "token_chars": [ "letter", "digit" ]
            }
         },
         "analyzer": {
            "ngram_tokenizer_analyzer": {
               "type": "custom",
               "tokenizer": "ngram_tokenizer",
               "filter": [
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "term_vector": "yes",
               "analyzer": "ngram_tokenizer_analyzer"
            }
         }
      }
   }
}

Indexing the document again, and requesting the term vector, I get:

PUT /test_index/doc/1
{
    "text_field": "Hello, World!"
}
GET /test_index/doc/1/_termvector?fields=text_field
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text_field": {
         "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 1,
            "sum_ttf": 4
         },
         "terms": {
            "ello": {
               "term_freq": 1
            },
            "hell": {
               "term_freq": 1
            },
            "orld": {
               "term_freq": 1
            },
            "worl": {
               "term_freq": 1
            }
         }
      }
   }
}

Ngram Token Filter

I can achieve the same effect using an ngram token filter instead, together with the standard tokenizer and the lower-case token filter again. So in this case, the raw text is tokenized by the standard tokenizer, which (roughly speaking) splits on whitespace and punctuation. The tokens are then passed through the lowercase filter, and finally through the ngram filter, where the four-character grams are generated.

Here is the code:

DELETE /test_index
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "nGram",
               "min_gram": 4,
               "max_gram": 4
            }
         },
         "analyzer": {
            "ngram_filter_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "term_vector": "yes",
               "analyzer": "ngram_filter_analyzer"
            }
         }
      }
   }
}
PUT /test_index/doc/1
{
    "text_field": "Hello, World!"
}
GET /test_index/doc/1/_termvector?fields=text_field
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text_field": {
         "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 1,
            "sum_ttf": 4
         },
         "terms": {
            "ello": {
               "term_freq": 1
            },
            "hell": {
               "term_freq": 1
            },
            "orld": {
               "term_freq": 1
            },
            "worl": {
               "term_freq": 1
            }
         }
      }
   }
}

For this example the last two approaches are equivalent. Depending on the circumstances one approach may be better than the other. As I mentioned, if you need special characters in your search terms, you will probably need to use the ngram tokenizer in your mapping. It's useful to know how to use both. I'm going to use the token filter approach in the examples that follow.


Matching Partial Words Across Fields

The previous set of examples was somewhat contrived because the intention was to illustrate basic properties of the ngram tokenizer and token filter. In the examples that follow I'll use a slightly more realistic data set and query the index in a more realistic way.

The Mapping

Here is the mapping I'll be using for the next example. I'll explain it piece by piece.

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "type": "string",
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "word": {
               "type": "string",
               "include_in_all": true,
               "term_vector": "yes",
               "index_analyzer": "ngram_analyzer",
               "search_analyzer": "standard"
            },
            "definition": {
               "type": "string",
               "include_in_all": true,
               "term_vector": "yes"
            }
         }
      }
   }
}

min_gram/max_gram Size

Notice that the minimum ngram size I'm using here is 2, and the maximum size is 20. These are values that have worked for me in the past, but the right numbers depend on the circumstances.

A common use of ngrams is for autocomplete, and users tend to expect to see suggestions after only a few keystrokes. Single-character tokens match so many things that the suggestions are often unhelpful, especially when searching against a large dataset, so 2 is usually the smallest useful value of min_gram. On the other hand, what is the longest ngram against which we should match search text? 20 is a little arbitrary, so you may want to experiment to find out what works best for you.

Another issue to consider is performance. Generating a lot of ngrams takes up a lot of disk space and uses more CPU cycles at search time, so be careful not to set min_gram any lower, or max_gram any higher, than you really need (at least if you have a large dataset).
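To get a sense of the numbers: a token of length L yields L - n + 1 ngrams of each size n, so with a min_gram of 2 and a max_gram of 20, the nine-letter token "democracy" produces 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 terms (you can count them in the term vector shown later in this post). Multiply that by every token in every document, and the extra space adds up quickly.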

The _all Field

If you want to search across several fields at once, the _all field can be a convenient way to do so, as long as you know at mapping time which fields you will want to search together. You can tell Elasticsearch which fields to include in the _all field using the "include_in_all" parameter (which defaults to true). Here I've included both fields, which is redundant since that's the default behavior, but I wanted to make it explicit.

index_analyzer versus search_analyzer

This distinction is a bit subtle, and overlooking it can sometimes cause problems. If only "analyzer" is specified in the mapping for a field, then that analyzer will be used for both indexing and searching. If I want a different analyzer to be used for searching than for indexing, I have to specify both.

An added complication is that some types of queries are analyzed, and others are not. For example, a match query uses the search analyzer to analyze the query text before attempting to match it to terms in the inverted index. On the other hand, a term query (or filter) does NOT analyze the query text but instead attempts to match it verbatim against terms in the inverted index. Neglecting this subtlety can sometimes lead to confusing results.

In the above mapping, I'm using the custom ngram_analyzer as the index_analyzer, and the standard analyzer as the search_analyzer. This setup works well in many situations. If you need to be able to match symbols or punctuation in your queries, you might have to get a bit more creative.

Some Example Data

Here are a few example documents I put together from Dictionary.com that we can use to illustrate ngram behavior:

PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"word":"democracy", "definition":"government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"word":"republic", "definition":"a state in which the supreme power rests in the body of citizens entitled to vote and is exercised by representatives chosen directly or indirectly by them."}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"word":"oligarchy", "definition":"a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"word":"plutocracy", "definition":"the rule or power of wealth or of the wealthy."}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"word":"theocracy", "definition":"a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."}
{"index":{"_index":"test_index","_type":"doc","_id":6}}
{"word":"monarchy", "definition":"a state or nation in which the supreme power is actually or nominally lodged in a monarch."}
{"index":{"_index":"test_index","_type":"doc","_id":7}}
{"word":"capitalism", "definition":"an economic system in which investment in and ownership of the means of production, distribution, and exchange of wealth is made and maintained chiefly by private individuals or corporations, especially as contrasted to cooperatively or state-owned means of wealth."}
{"index":{"_index":"test_index","_type":"doc","_id":8}}
{"word":"socialism", "definition":"a theory or system of social organization that advocates the vesting of the ownership and control of the means of production and distribution, of capital, land, etc., in the community as a whole."}
{"index":{"_index":"test_index","_type":"doc","_id":9}}
{"word":"communism", "definition":"a theory or system of social organization based on the holding of all property in common, actual ownership being ascribed to the community as a whole or to the state."}
{"index":{"_index":"test_index","_type":"doc","_id":10}}
{"word":"feudalism", "definition":"the feudal system, or its principles and practices."}
{"index":{"_index":"test_index","_type":"doc","_id":11}}
{"word":"monopoly", "definition":"exclusive control of a commodity or service in a particular market, or a control that makes possible the manipulation of prices."}
{"index":{"_index":"test_index","_type":"doc","_id":12}}
{"word":"oligopoly", "definition":"the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."}

A Few Example Queries

Now let's take a look at the results we get from a few different queries. As I mentioned before, match queries are analyzed, and term queries are not.

So if I run a simple match query for the text "go," I'll get back the documents that have that text anywhere in either of the two fields:

POST /test_index/_search
{
    "query": {
        "match": {
           "_all": "go"
        }
    }
}
...
{
   "took": 108,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0.6090763,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.6090763,
            "_source": {
               "word": "theocracy",
               "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 0.49730873,
            "_source": {
               "word": "oligarchy",
               "definition": "a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.41442394,
            "_source": {
               "word": "democracy",
               "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "12",
            "_score": 0.3516504,
            "_source": {
               "word": "oligopoly",
               "definition": "the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."
            }
         }
      ]
   }
}

This also works if I use the text "Go," because a match query applies the search_analyzer to the search text. In our case that's the standard analyzer, so the text gets converted to "go", which matches terms as before:

POST /test_index/_search
{
    "query": {
        "match": {
           "_all": "Go"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0.6090763,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.6090763,
            "_source": {
               "word": "theocracy",
               "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 0.49730873,
            "_source": {
               "word": "oligarchy",
               "definition": "a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.41442394,
            "_source": {
               "word": "democracy",
               "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "12",
            "_score": 0.3516504,
            "_source": {
               "word": "oligopoly",
               "definition": "the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."
            }
         }
      ]
   }
}

On the other hand, if I try the text "Go" with a term query, I get nothing:

POST /test_index/_search
{
    "query": {
        "term": {
           "_all": "Go"
        }
    }
}
...
{
   "took": 0,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

However, a term query for "go" works as expected:

POST /test_index/_search
{
    "query": {
        "term": {
           "_all": "go"
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0.6090763,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.6090763,
            "_source": {
               "word": "theocracy",
               "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 0.49730873,
            "_source": {
               "word": "oligarchy",
               "definition": "a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.41442394,
            "_source": {
               "word": "democracy",
               "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "12",
            "_score": 0.3516504,
            "_source": {
               "word": "oligopoly",
               "definition": "the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."
            }
         }
      ]
   }
}

Term Vector

For reference, let's take a look at the term vector for the text "democracy." I'll use this for comparison in the next section. It's pretty long, so hopefully you can scroll fast.

GET /test_index/doc/1/_termvector?fields=word
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "word": {
         "field_statistics": {
            "sum_doc_freq": 425,
            "doc_count": 12,
            "sum_ttf": 426
         },
         "terms": {
            "ac": {
               "term_freq": 1
            },
            "acy": {
               "term_freq": 1
            },
            "cr": {
               "term_freq": 1
            },
            "cra": {
               "term_freq": 1
            },
            "crac": {
               "term_freq": 1
            },
            "cracy": {
               "term_freq": 1
            },
            "cy": {
               "term_freq": 1
            },
            "de": {
               "term_freq": 1
            },
            "dem": {
               "term_freq": 1
            },
            "demo": {
               "term_freq": 1
            },
            "democ": {
               "term_freq": 1
            },
            "democr": {
               "term_freq": 1
            },
            "democra": {
               "term_freq": 1
            },
            "democrac": {
               "term_freq": 1
            },
            "democracy": {
               "term_freq": 1
            },
            "em": {
               "term_freq": 1
            },
            "emo": {
               "term_freq": 1
            },
            "emoc": {
               "term_freq": 1
            },
            "emocr": {
               "term_freq": 1
            },
            "emocra": {
               "term_freq": 1
            },
            "emocrac": {
               "term_freq": 1
            },
            "emocracy": {
               "term_freq": 1
            },
            "mo": {
               "term_freq": 1
            },
            "moc": {
               "term_freq": 1
            },
            "mocr": {
               "term_freq": 1
            },
            "mocra": {
               "term_freq": 1
            },
            "mocrac": {
               "term_freq": 1
            },
            "mocracy": {
               "term_freq": 1
            },
            "oc": {
               "term_freq": 1
            },
            "ocr": {
               "term_freq": 1
            },
            "ocra": {
               "term_freq": 1
            },
            "ocrac": {
               "term_freq": 1
            },
            "ocracy": {
               "term_freq": 1
            },
            "ra": {
               "term_freq": 1
            },
            "rac": {
               "term_freq": 1
            },
            "racy": {
               "term_freq": 1
            }
         }
      }
   }
}

Edge Ngrams

For many applications, only ngrams that start at the beginning of words are needed. When that is the case, it makes more sense to use edge ngrams instead. To illustrate, I can use exactly the same mapping as the previous example, except that I use edge_ngram instead of ngram as the token filter type:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "type": "string",
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "word": {
               "type": "string",
               "include_in_all": true,
               "term_vector": "yes",
               "index_analyzer": "ngram_analyzer",
               "search_analyzer": "standard"
            },
            "definition": {
               "type": "string",
               "include_in_all": true,
               "term_vector": "yes"
            }
         }
      }
   }
}

After running the same bulk index operation as in the previous example, if I run my match query for "go" again, I get back only documents in which one of the words begins with "go":

POST /test_index/_search
{
    "query": {
        "match": {
           "_all": "go"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0.68154424,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.68154424,
            "_source": {
               "word": "theocracy",
               "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 0.5564785,
            "_source": {
               "word": "oligarchy",
               "definition": "a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.46373212,
            "_source": {
               "word": "democracy",
               "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."
            }
         }
      ]
   }
}

If we take a look at the term vector for the "word" field of the first document again, the difference is pretty clear:

GET /test_index/doc/1/_termvector?fields=word
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "word": {
         "field_statistics": {
            "sum_doc_freq": 95,
            "doc_count": 12,
            "sum_ttf": 95
         },
         "terms": {
            "de": {
               "term_freq": 1
            },
            "dem": {
               "term_freq": 1
            },
            "demo": {
               "term_freq": 1
            },
            "democ": {
               "term_freq": 1
            },
            "democr": {
               "term_freq": 1
            },
            "democra": {
               "term_freq": 1
            },
            "democrac": {
               "term_freq": 1
            },
            "democracy": {
               "term_freq": 1
            }
         }
      }
   }
}

This (mostly) concludes the post. I hope I've helped you learn a little bit about how to use ngrams in Elasticsearch. Please leave us your thoughts in the comments!


TL;DR: General-purpose Autocomplete

Here is a mapping that will work well for many implementations of autocomplete, and it is usually a good place to start. It's not elaborate -- just the basics:

PUT /test_index
{
   "settings": {
      "analysis": {
         "filter": {
            "edge_ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "edge_ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "edge_ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "index_analyzer": "edge_ngram_analyzer",
               "search_analyzer": "standard"
            }
         }
      }
   }
}
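
To sanity-check the mapping, I can index a document and search for a prefix of one of its words (the document text here is just a placeholder):

PUT /test_index/doc/1
{
    "text_field": "Hello, World!"
}
POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "hel"
        }
    }
}

With the edge ngram mapping above, the indexed terms include "he", "hel", "hell", and "hello", so this query should return the document; because the search_analyzer is the standard analyzer, a user typing "Hel" would get the same result.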

And that's a wrap. You're welcome! Come back and check the Qbox blog again soon!