Welcome to Episode #3 of our Elasticsearch tutorial. In our last episode we searched our documents and learned some of the Elasticsearch Query DSL. Today we’ll build unstructured search in Elasticsearch using analyzers. Once you have an Elasticsearch cluster running per the instructions in Episode #1, we’ll get started.

Install Elasticsearch

http://elasticsearch.org/download/

If you don’t yet have Elasticsearch on your machine, download version 1.1.0, the current stable release, from the link above. Breaking changes between Elasticsearch versions are well documented, so if you’re having any trouble with this tutorial, feel free to add a comment down below. Or, if you have a new cluster on Qbox, feel free to use our support for your questions.
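
If you prefer the command line, the steps look roughly like this (the archive URL below follows the download pattern in use at the time of writing, so adjust it if the file has moved):

# Download and unpack Elasticsearch 1.1.0, then start a node in the foreground
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.tar.gz
tar -xzf elasticsearch-1.1.0.tar.gz
cd elasticsearch-1.1.0
./bin/elasticsearch

Once the node is up, curl localhost:9200 should answer with the cluster name and version number.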

Install Ruby

http://rvm.io/rvm/install

If you don’t yet have rvm and Ruby, the link above lists the commands for installation. Once you’ve installed Ruby, grab the retire gem. It lets the athlete-import.rb script import all of the files in new-sports-data into Elasticsearch. The documents themselves aren’t too important, because this tutorial focuses on the analyzers in your mappings and settings.
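
For reference, once rvm and Ruby are installed the remaining steps look roughly like this (the gem and script names come from the repository; check README.md for the exact invocation):

# Install the retire gem the import script depends on
gem install retire

# From the repository root, load the documents in new-sports-data
# into the local Elasticsearch cluster
ruby athlete-import.rb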

Episode

https://github.com/StackSearchInc/qbox-elasticsearch-tutorial/tree/episode-3

Download the GitHub repository linked above (or clone it from the command line, as shown below) and open README.md. At the bottom of the file you’ll find an import for a new, larger set of documents. As in Episode #2, the index is still sports and the type is athlete. Every athlete has a name, birthdate, sport, rating, and location.
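
Cloning the episode-3 branch directly looks like this:

# Clone just the episode-3 branch of the tutorial repository
git clone -b episode-3 https://github.com/StackSearchInc/qbox-elasticsearch-tutorial.git
cd qbox-elasticsearch-tutorial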

(Image: mapping-visual.png)

Also in the README.md file you’ll find the new settings for the index, but we need to understand what they’re doing before we apply them. For this tutorial we will explain and use the settings shown below.

curl -XPUT 'localhost:9200/sports' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}'

Analyzers

Before we explain how we use analyzers, we need to understand how Elasticsearch builds the index it searches on. Indexing documents into Elasticsearch as JSON is simple to understand; how Elasticsearch searches every document is far more complicated, and it’s important to understanding Elasticsearch. The core term to understand here is the Inverted Index.

(Image: inverted-index2.png)

In the image above we have a simple diagram to help illustrate the concept of an Inverted Index. Taking each document’s field value and splitting it into separate tokens, we build a list of unique tokens and record which documents contain those values. In this example we’re using two documents featured in this tutorial’s document set. When we need to find a match for “Owen Mason”, we look up which documents match those terms.

Each document matches some of the terms we’re matching on, but one has more matching values than the other, which gives that document more relevance. As you may have noticed, we have multiple terms whose values we would want to be equivalent in our search. “Mason” and “mason” should be considered the same for our use case, which is one of the uses for an analyzer.

An analyzer is a package made up of a single Tokenizer, which splits the value into tokens; zero or more CharFilters, which process the character stream before it reaches the tokenizer; and zero or more TokenFilters, which modify the stream of tokens produced by the tokenizer.
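
A quick way to watch an analyzer at work is Elasticsearch’s _analyze API. For example, running our sample name through the standard analyzer shows the tokenizer and the lowercase filter in action:

# Ask Elasticsearch how the standard analyzer tokenizes this text
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Owen Mason'

The response lists two tokens, “owen” and “mason”: the text was split into words and lowercased, so “Mason” and “mason” end up as the same term in the inverted index.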

(Image: analyzer.png)

By combining these pieces we can create several hundred different analyzers. The index settings are where we specify the analysis our index will have. After opening the settings request on the sports index, we define a custom TokenFilter (we can use any name) called “nGram_filter”. We create this custom TokenFilter so that the analyzers we define next can use it.

curl -XPUT 'localhost:9200/sports' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }...

Our filter is an “nGram” filter, which accepts several settings. “min_gram” and “max_gram” set the minimum and maximum length of the n number “gram” values that will be created from each token. “token_chars” accepts the letter, digit, whitespace, punctuation, and symbol character classes; characters that don’t belong to one of the provided classes are treated as boundaries to split on.
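
Once the settings above are applied to the sports index, you can sanity-check the filter with the index-level _analyze endpoint, combining a tokenizer and filters ad hoc (in 1.x the query parameter is named filters):

# Run a single word through the whitespace tokenizer, lowercase,
# and the custom nGram_filter defined in the sports index settings
curl -XGET 'localhost:9200/sports/_analyze?tokenizer=whitespace&filters=lowercase,nGram_filter&pretty' -d 'Mason'

With min_gram 2 and max_gram 20, “Mason” should come back as grams such as “ma”, “as”, “so”, “on”, “mas”, “aso”, “son”, and so on up to the full “mason”.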

We then create the analyzers our mapping will use as index and search analyzers (explained in the “Index and Search Analyzers” section below). Starting with “analyzer”, we first give each analyzer a name to reference it by: “nGram_analyzer” and “whitespace_analyzer”. Both use the whitespace tokenizer, which splits the value on (you guessed it) whitespace. We then provide the TokenFilters for each analyzer to use. For “nGram_analyzer” we use lowercase, asciifolding, and our custom filter “nGram_filter”. Lowercase changes character casing to lower case, and asciifolding converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters into their ASCII equivalents.
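
You can compare the two custom analyzers the same way; the only difference in their output should be the n-gram expansion:

# Index-time analysis: whitespace tokenizer, lowercase, asciifolding, n-grams
curl -XGET 'localhost:9200/sports/_analyze?analyzer=nGram_analyzer&pretty' -d 'Owen Mason'

# Search-time analysis: the same pipeline minus the n-gram filter
curl -XGET 'localhost:9200/sports/_analyze?analyzer=whitespace_analyzer&pretty' -d 'Owen Mason'

The first request returns the many two-to-twenty character grams of each name; the second returns just “owen” and “mason”.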

Mapping

curl -XPUT 'localhost:9200/sports/athlete/_mapping' -d '{
  "_all": {
    "index_analyzer": "nGram_analyzer",
    "search_analyzer": "whitespace_analyzer"
  },
  "properties": {
    "birthdate": {
      "type": "date",
      "format": "dateOptionalTime"
    },
    "location": {
      "type": "geo_point"
    },
    "name": {
      "type": "string"
    },
    "rating": {
      "type": "integer"
    },
    "sport": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}'

We need to reference these analyzers in our mapping so they are applied to our documents. The _all field in Elasticsearch concatenates the text of a document’s other fields into one big field, which is useful when you want a search that scores against all the text of a document at once.

This can be fairly inaccurate when your documents have dozens of very large fields, and it comes at the cost of CPU cycles and index size, which is why you’re able to disable the _all field entirely. For the purposes of this tutorial, though, it demonstrates a simple way to analyze all of the fields of a document.
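
If that trade-off isn’t worth it for your own data, a minimal sketch of turning _all off looks like the following; set it when the type’s mapping is first created (alongside the properties block shown earlier), rather than changing it on an existing type:

curl -XPUT 'localhost:9200/sports/athlete/_mapping' -d '{
  "_all": {
    "enabled": false
  }
}'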

Index and Search Analyzers

(Image: index-search-analyzer.png)

Field mappings allow analyzers to be set per field. When you provide an “analyzer”, you’re setting both the “index_analyzer” and the “search_analyzer” at once. As the names imply, the index analyzer runs when your documents are indexed (as shown above), and the search analyzer runs on your search requests. When I index the document above (“Owen Mason”), its text is analyzed into the tokens produced by the index_analyzer. When I search for “Owen Mason” on the _all field, my search text is analyzed (into “owen” and “mason”) using the search_analyzer we provided for it.
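
To see the two analyzers cooperating, try a partial-name search against _all. The query text is only lowercased (the search analyzer has no n-gram filter), while the indexed side already contains every two-to-twenty character gram of “owen” and “mason”, so a fragment like “mas” still finds the athlete:

# Match a partial name against the _all field
curl -XPOST 'localhost:9200/sports/athlete/_search?pretty' -d '{
  "query": {
    "match": {
      "_all": "owen mas"
    }
  }
}'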

As with Episode #2, I’ve included a simple AngularJS application in which to use this document set.

Thanks for following along with us in the three episodes of our Elasticsearch Tutorial series. We hope they are beneficial to you.

As always, we welcome your comments and suggestions!