Analyzers are made up of two main components: a Tokenizer and a set of Token Filters. The tokenizer splits text into tokens according to some set of rules, and the token filters each perform operations on those tokens. The result is a stream of processed tokens, which are either stored in the index or used to query results.
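
We can see this composition directly with the _analyze API, which accepts a tokenizer and a list of token filters without any index configuration. A minimal sketch (the sample text is ours, and the output is trimmed to the token values):

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase&pretty' -d 'Qbox Hosted Elasticsearch'

The standard tokenizer splits the text into three tokens and the lowercase filter rewrites them, yielding qbox, hosted, and elasticsearch.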

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster." 

The default analyzer is the Standard analyzer and contains these components:

  • Tokenizer:

    Standard tokenizer -- provides grammar-based tokenization

  • TokenFilters:
    • Standard token filter -- acts as a placeholder and does nothing
    • Lowercase token filter -- lowercases all tokens
    • Stop token filter -- removes tokens identified as stop words
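
Before moving on to the language analyzers, it's worth running the standard analyzer once to see these components working together; a quick sketch (sample text ours):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The QUICK brown foxes!'

The grammar-based tokenizer drops the punctuation and the lowercase filter normalizes the tokens, giving the, quick, brown, and foxes. Whether "the" survives depends on the stopword list configured for the stop filter; recent Elasticsearch versions default the standard analyzer to an empty stopword list.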

Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles:

  • Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]

  • Lowercase tokens: The → the

  • Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]

  • Stem tokens to their root form: foxes → fox

Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

  • The English analyzer removes the possessive 's: John's → john

  • The French analyzer removes elisions like l' and qu' and diacritics like ¨ or ^: l'église → eglis

  • The German analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others: äußerst → ausserst
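
To see two of these transformations at once, we can feed the English analyzer a possessive and a plural; a minimal sketch (sample text ours):

curl -XGET 'localhost:9200/_analyze?analyzer=english&pretty' -d "John's foxes"

The possessive 's is stripped and the plural is stemmed, leaving just the tokens john and fox.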

The English analyzer is really just Lucene's EnglishAnalyzer. Under the hood, that class chains the standard tokenizer with the possessive-English stemmer, lowercase, stop, keyword-marker, and English-stemmer token filters -- the same pipeline we will rebuild by hand later in this post.

The built-in language analyzers are available globally and need not be configured before being used. They can be specified directly in the field mapping:

curl -XPUT 'localhost:9200/english_analyzer_index' -d '{
  "mappings": {
    "book": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "english"        
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/english_analyzer_index/_analyze?field=name&text=%22The%20Quick%20for%20jumped%20over%20the%20lazy%20dog%22&pretty'
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 5,
      "end_offset": 10,
      "type": "",
      "position": 1
    },
    {
      "token": "jump",
      "start_offset": 15,
      "end_offset": 21,
      "type": "",
      "position": 3
    },
    {
      "token": "over",
      "start_offset": 22,
      "end_offset": 26,
      "type": "",
      "position": 4
    },
    {
      "token": "lazi",
      "start_offset": 31,
      "end_offset": 35,
      "type": "",
      "position": 6
    },
    {
      "token": "dog",
      "start_offset": 36,
      "end_offset": 39,
      "type": "",
      "position": 7
    }
  ]
}

The English analyzer increases recall, since stemmed, stopword-free tokens match more loosely, but it discards the exact word forms that help us rank documents accurately.

In order to get the best of both worlds, we can use multi-fields to index the name field twice: once with the English analyzer and once with the standard analyzer.

curl -XDELETE 'localhost:9200/english_analyzer_index'
curl -XPUT 'localhost:9200/english_analyzer_index' -d '{
  "mappings": {
    "book": {
      "properties": {
        "name": { // standard analyzer
          "type": "string",
          "fields": {
            "english": { 
              "type":     "string", // english analyzer
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}'

We can now index some test documents to demonstrate how to use both fields at query time:

curl -XPUT 'localhost:9200/english_analyzer_index/book/1' -d '{ "name": "Qbox Elasticsearch Hosting is not at all difficult" }'
curl -XPUT 'localhost:9200/english_analyzer_index/book/2' -d '{ "name": "Unconventional Elasticsearch Hosting is difficult" }'
curl -XPOST 'localhost:9200/english_analyzer_index/_refresh'
curl -XGET 'localhost:9200/english_analyzer_index/book/_search' -d '{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "not difficult Qbox host",
      "fields": [ "name", "name.english" ]
    }
  }
}'

The response is as follows:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.15171182,
    "hits": [
      {
        "_index": "my_index",
        "_type": "book",
        "_id": "1",
        "_score": 0.15171182,
        "_source": {
          "name": "Qbox Elasticsearch Hosting is not at all difficult"
        }
      },
      {
        "_index": "my_index",
        "_type": "book",
        "_id": "2",
        "_score": 0.035310008,
        "_source": {
          "name": "Unconventional Elasticsearch Hosting is difficult"
        }
      }
    ]
  }
}

Even though neither of our documents contains the word host, both documents are returned as results due to the word stemming on the name.english field. The first document is ranked as more relevant because it also matches "not" and "Qbox" on the name field.
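
If we only wanted the loose, stem-based matching, we could also query the name.english sub-field on its own. A sketch of such a query (not part of the original example):

curl -XGET 'localhost:9200/english_analyzer_index/book/_search' -d '{
  "query": {
    "match": { "name.english": "host" }
  }
}'

Since both host and hosting stem to host, this matches both documents even though neither contains the literal word host.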

Configuring Language Analyzers

  • Stopwords -- All language analyzers support custom stopwords, set either inline in the config or via an external file referenced by stopwords_path (see the sketch after this list).
  • Stemming -- The English analyzer supports english, light_english, minimal_english, possessive_english, porter2, and lovins as stemmers.
  • Excluding words from stemming -- The stem_exclusion parameter allows us to specify an array of lowercase words that should not be stemmed. Internally, this is implemented by adding the keyword_marker token filter, with its keywords set to the value of the stem_exclusion parameter.
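
Putting the first and third options together, here is a sketch of a language analyzer configured with custom stopwords and stemming exclusions (the index and analyzer names are our own); to swap the stemmer itself, use the custom-analyzer approach shown next:

curl -XPUT 'localhost:9200/configured_english_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": ["a", "an", "the", "for", "is"],
          "stem_exclusion": ["hosting", "conventional"]
        }
      }
    }
  }
}'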

The English analyzer could be reimplemented as a custom analyzer as follows:

curl -XDELETE 'localhost:9200/english_analyzer_index'
curl -XPUT 'localhost:9200/english_analyzer_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" // english language default stop words
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["conventional", "hosting"] // stem_exclusion
        },
        "english_stemmer": {
          "type": "stemmer",
          "Language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "Language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "Tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}'

Let's send a test request through the _analyze API to exercise our custom English analyzer:

curl -XGET 'localhost:9200/english_analyzer_index/_analyze?analyzer=english&text=%22Qbox%20provides%20good%20conventional%20hosting%20for%20elasticsearch%22&pretty'

The response is as follows:

{
  "tokens": [
    {
      "token": "qbox",
      "start_offset": 1,
      "end_offset": 5,
      "type": "",
      "position": 0
    },
    {
      "token": "provid", // reduced to stem
      "start_offset": 6,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "good",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    },
    {
      "token": "conventional", // did not stem
      "start_offset": 20,
      "end_offset": 32,
      "type": "",
      "position": 3
    },
    {
      "token": "hosting", // did not stem
      "start_offset": 33,
      "end_offset": 40,
      "type": "",
      "position": 4
    },
    {
      "token": "elasticsearch",
      "start_offset": 45,
      "end_offset": 58,
      "type": "",
      "position": 6
    }
  ]
}

Conclusion

All told, 10 tokenizers, 31 token filters, and 3 character filters ship with the Elasticsearch distribution -- a somewhat overwhelming number of options, and plugins can increase that number further still. Combinations of these tokenizers, token filters, and character filters create what's called an analyzer.

The English analyzer is one of many language analyzers that are predefined in Elasticsearch. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms in an inverted index. Choosing the appropriate analyzer for an Elasticsearch query can be as much art as science.

Analyzers are clearly powerful and essential tools for relevance engineering. When starting with Elasticsearch, it is very important to get acquainted with the different filters and tokenizers in order to use them to their full potential.

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, SoftLayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
