In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc. 

Stop words are a fundamental concept in core Information Retrieval. Common wisdom dictates that we should identify and remove stop words from our index. Doing so improves performance (fewer terms in the dictionary) and yields more relevant search results.

Stop words are the most frequent words in the English language. Thus, most sentences share a similar percentage of stop words. Stop words bloat the index without providing any extra value.

If they are both common and lacking in much useful information, why not remove them?

Removing stop words helps decrease the size of the index as well as the size of the query. Fewer terms are always a win with regard to performance, and since stop words are semantically empty, relevance scores are unaffected.

Stemming is important, not just for making searches broader and increasing recall, but also as a tool for compressing index size. Another way to reduce index size is simply to index fewer words. For search purposes, some words are more important than others. A significant reduction in index size can be achieved by indexing only the more important terms.

So which terms can be left out? We can divide terms roughly into two groups:

  • Low-frequency terms

Words that appear in relatively few documents in the collection. Because of their rarity, they have a high value, or weight.

  • High-frequency terms

Common words that appear in many documents in the index, such as the, and, and is. These words have a low weight and contribute little to the relevance score.

The default English stop words used in Elasticsearch are as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

The primary advantage of removing stopwords is performance. Imagine that we search an index with one million documents for the word qbox. Let’s say qbox appears in only 50 of them, which means that Elasticsearch has to calculate the relevance _score for just those 50 documents in order to return the top 10. Now, suppose we change the search query to ‘the OR qbox’. The word ‘the’ probably occurs in almost all the documents, which means that Elasticsearch has to calculate the _score for all one million documents. This second query simply cannot perform as well as the first.
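As a rough sketch of the two searches (the index name my_index and the field name content are assumptions for illustration), only the second one forces Elasticsearch to score nearly every document, because the default match operator is or:

# Matches only the ~50 documents that contain "qbox"
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": { "match": { "content": "qbox" } }
}'

# Equivalent to "the OR qbox": almost every document matches and must be scored
curl -XGET 'localhost:9200/my_index/_search' -d '{
  "query": { "match": { "content": "the qbox" } }
}'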

On the other hand, by removing stop words from the index, we reduce our ability to perform certain types of searches. Filtering out the words listed previously prevents us from doing the following:

  • Distinguishing cool from not cool.

  • Searching for the band “The Not”.

  • Finding Shakespeare’s quotation “To be, or not to be”.

  • Using the country code for Norway: “no”.

The removal of stopwords is handled by the stop token filter, which can be used when creating a custom analyzer. However, some out-of-the-box analyzers come with the stop filter pre-integrated:

  • Language analyzers - Each language analyzer defaults to using the appropriate stopwords list for that language. For instance, the english analyzer uses the _english_ stopwords list.
  • Standard analyzer - Defaults to the empty stopwords list: _none_, essentially disabling stopwords.
  • Pattern analyzer - Defaults to _none_, like the standard analyzer.
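We can see this difference directly with the _analyze API. The following is a quick sketch, assuming a local node on port 9200 and the pre-5.x query-string syntax used throughout this article:

# The english analyzer drops its default stopwords (and stems), so "the" disappears
curl -XGET 'localhost:9200/_analyze?analyzer=english' -d 'the quick fox'

# The standard analyzer defaults to _none_, so "the" is kept as a token
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'the quick fox'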

Specifying Stopwords

Stopwords can be passed inline by specifying an array:

"stopwords": [ "and", "the" ]
The default stopword list for a particular language can be specified using the _lang_ notation:
"stopwords": "_english_"
The predefined language-specific stopword lists available in Elasticsearch can be found in the stop token filter documentation.

Stopwords can be disabled by specifying the special list: _none_. For instance, to use the english analyzer without stop words, we can do the following:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "english",
          "stopwords": "_none_"
        }
      }
    }
  }
}'

Stop words can also be listed in a file with one word per line. The file must be present on all nodes in the cluster, and the path can be specified with the stopwords_path parameter.

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}'

Stopwords Analyzer

Let’s use custom stopwords in conjunction with the standard analyzer. The following curl command creates a configured version of the analyzer and passes in the list of stopwords that we require:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ]  // stopwords to filter out
        }
      }
    }
  }
}'

This same technique can be used to configure custom stopword lists for any of the language analyzers.
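For instance, here is a minimal sketch of the same configuration applied to the english language analyzer (the analyzer name my_english is purely illustrative):

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}'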

The output from the analyze API is quite interesting:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_analyzer' -d '"The quick fox and the lazy dog kept running"'
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lazy",
      "start_offset": 23,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "dog",
      "start_offset": 28,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "kept",
      "start_offset": 32,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "running",
      "start_offset": 37,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    }
  ]
}

The stop words have been filtered out, as expected, but the interesting part is that the positions of the remaining terms are unchanged: quick is still the second word in the original sentence (position 1, counting from zero), and lazy is still the sixth (position 5).

Stop Token Filter

The stop token filter can be combined with a tokenizer and other token filters in order to create a custom analyzer. For instance, let’s say that we want to create an English analyzer with the following:

  • A custom stopwords list

  • The light_english stemmer

  • The asciifolding filter to remove diacritics

Some English terms contain letters with diacritical marks. Most of these words are loanwords from French, with others coming from Spanish, German, or other languages; a few, however, are originally English, or at least their diacritics are. The asciifolding filter is more widely used with languages such as Spanish, German, or French.

We can set up a custom analyzer, my_english, as follows:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":        "stop",
          "stopwords":"_english_"
        },
        "light_english": {
          "type":     "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "english_stop",
            "light_english"
          ]
        }
      }
    }
  }
}'

We have placed the english_stop filter after the asciifolding filter. This means that words first have their diacritics removed, becoming plain English words, and are then dropped if they appear in the standard English stop words list. If stop words with diacritics are to be retained, the order of asciifolding and english_stop can be reversed, as shown below.
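As a sketch, only the filter order changes; with the same english_stop and light_english filters defined as above, the analyzer definition would then become:

"my_english": {
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "english_stop",
    "asciifolding",
    "light_english"
  ]
}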

Performance

Most words typically occur in much fewer than 0.1% of all documents, but a few words such as ‘the’ may occur in almost all of them. Imagine we have an index of five million documents. A query for ‘lazy dog’ may match fewer than 500 documents, but a query for ‘the lazy dog’ may have to score and sort almost all of the five million documents in the index, just in order to return the top 100.

These performance-limiting factors can be mitigated as follows:

and Operator

The problem is that ‘the lazy dog’ is really a query for ‘the OR lazy OR dog’. Any document that contains nothing more than the almost meaningless term ‘the’ is included in the result set. What we need is a way of reducing the number of documents that need to be scored.

The easiest way to reduce the number of documents is simply to use the and operator with the match query, in order to make all words required.

A match query like this

{
   "match": {
       "text": {
           "query":    "the lazy dog",
           "operator": "and"
       }
   }
}

can be rewritten as a bool query like this:

{
    "bool": {
        "must": [
            { "term": { "text": "the" }},
            { "term": { "text": "lazy" }},
            { "term": { "text": "dog" }}
        ]
    }
}

The bool query is intelligent enough to execute each term query in the optimal order: it starts with the least frequent term, since only documents that contain the least frequent term can possibly match.

minimum_should_match

The minimum_should_match parameter trims the long tail of less-relevant results and thus controls precision. It also has a nice side effect: it offers a performance benefit similar to that of the and operator.

Consider this match query:

{
    "match": {
        "text": {
            "query": "the lazy dog",
            "minimum_should_match": "60%"
        }
    }
}

In this query, at least two of the three terms must match, which means that only documents containing either the least frequent or the second least frequent term need to be considered. This offers a huge performance gain over a simple query with the default or operator.

Updating Stopwords

If you specify stopwords inline with the stopwords parameter, the only option for updating them is to close the index, update the analyzer configuration with the update index settings API, and then reopen the index.
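A rough sketch of that sequence, reusing the my_english_analyzer index from earlier (the replacement stopword list is purely illustrative):

# Close the index so its analyzers can be reconfigured
curl -XPOST 'localhost:9200/my_index/_close'

# Update the inline stopword list with the update index settings API
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "analysis": {
    "analyzer": {
      "my_english_analyzer": {
        "type": "english",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}'

# Reopen the index so the new configuration takes effect
curl -XPOST 'localhost:9200/my_index/_open'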

Stopwords can also be updated if we specify them in a file with the stopwords_path parameter. We can just update the file (on every node in the cluster) and then force the re-instantiation of analyzers by either of these actions:

  • Closing and reopening the index, or

  • Restarting each node in the cluster, one by one

The updated stopwords will apply only to searches and to new or updated documents. In order to apply the changes to existing documents, we will be required to reindex the data.
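If your Elasticsearch version provides the _reindex API, one way to rebuild existing documents against the updated analysis configuration is roughly the following; the destination index my_index_v2 is an assumption and must already exist with the new settings:

# Copy all documents into an index that uses the updated stopword configuration
curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": { "index": "my_index" },
  "dest":   { "index": "my_index_v2" }
}'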
