In this tutorial, we discuss stopwords: what they are, why they are needed, and the various types. We also show how to use them correctly, how to remove them, and how to create your own.

Stopwords

In a text, stopwords are the common words that search engines filter out during analysis. Dropping them keeps the index smaller, which saves disk space and memory and can improve performance.
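As a rough sketch (ours, not Elasticsearch's actual implementation), dropping stopwords before indexing shrinks the token stream that reaches the index:

```python
# Hypothetical illustration: dropping stopwords shrinks what gets indexed.
STOPWORDS = {"the", "a", "of", "in", "is"}

def tokens_to_index(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

raw = "The Eiffel Tower is in the heart of Paris"
kept = tokens_to_index(raw)
print(kept)                           # fewer tokens reach the index
print(len(raw.split()), "->", len(kept))
```

Here nine raw tokens shrink to four indexed terms; across millions of documents that difference adds up.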

Types of Stopwords

In newspapers, books, or other texts, you can grade words by their importance. The same idea applies to stopwords in Elasticsearch. Stopwords are divided roughly into two groups:

  1. Low-frequency -- words that appear in only a few documents in the whole collection. The word `Javascript` may not occur at all in the book `Learning Python`, but if it does, it will be a low-frequency term.
  2. High-frequency -- words that appear in almost all documents in the collection. These can be short function words, such as the, is, at, and with, or a domain word such as `dolphin` in the book `All About the Dolphin`.
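To make the split concrete, here is a small sketch (the function name is ours, not an Elasticsearch API) that measures a term's document frequency:

```python
# Hypothetical sketch: classify terms as high- or low-frequency
# by counting how many documents contain them.
docs = [
    "the dolphin swims in the sea",
    "the dolphin eats fish",
    "the dolphin is a mammal",
]

def doc_frequency(term, documents):
    """Number of documents containing the term."""
    return sum(term in d.split() for d in documents)

# "the" and "dolphin" appear in every document: high-frequency.
# "mammal" appears in only one document: low-frequency.
for term in ("the", "dolphin", "mammal"):
    print(term, doc_frequency(term, docs))
```

In this tiny collection, "dolphin" behaves exactly like a function word: it occurs everywhere, so it carries little power to discriminate between documents.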

Using Stopwords 

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

To start using stopwords, we need to create an index with a custom analyzer.

curl -XPUT 'localhost:9200/blogs?pretty' -d '
  
{
 "settings": {
   "analysis": {
     "analyzer": {
       "blogs_analyzer": {
         "type": "standard",
         "stopwords": "_english_"
       }
     }
   }
 }
}'

We create an analyzer named blogs_analyzer, set its type to standard, and add the _english_ stopwords list.

Options for the "stopwords" parameter:

In our case, we use "stopwords": "_english_". The default English Stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
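The effect of this list can be approximated outside Elasticsearch. The following is a simplification of the standard analyzer (which also handles punctuation and Unicode properly), using the default set quoted above:

```python
# The default _english_ stopword set quoted above.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def analyze(text):
    """Roughly mimic standard analysis: lowercase, strip surrounding
    punctuation, drop stopwords."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    return [t for t in tokens if t not in ENGLISH_STOPWORDS]

print(analyze("The quick brown fox jumps over the lazy dog."))
```

Both occurrences of "the" are removed; the remaining seven tokens are what would be indexed.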

Elasticsearch supports other languages, too: 

  _arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.

Or you can supply a custom list in the "stopwords" field:

"stopwords": [ "you", "use", "stopwords" ]

Another option is to reference a text file with "stopwords_path" instead of listing them in "stopwords":

"stopwords_path": "path_to_file/your_file_with_stopwords.txt"

Let’s check our analyzer.

curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'

You will get the following output:

[Image: _analyze output showing the tokens quick, brown, fox, jumps, over, lazy, dog]

As you can see, the word "the" does not appear in the result.

Let’s modify the analyzer. To do this, you need to close the index, update the analyzer, and open the index again.

curl -X POST 'http://localhost:9200/blogs/_close?pretty'

You should get a confirmation message:

{
 "acknowledged" : true
}

Now modify the stopwords.

curl -XPUT 'localhost:9200/blogs/_settings?pretty' -d '
  
{
 "settings": {
   "analysis": {
     "analyzer": {
       "blogs_analyzer": {
         "type": "standard",
         "stopwords": ["the", "brown", "fox", "dog"]
       }
     }
   }
 }
}'

And then open the index.

curl -X POST 'http://localhost:9200/blogs/_open?pretty'

Test your new analyzer.

curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'

[Image: _analyze output showing the tokens quick, jumps, over, lazy]

As you can see, the words "the", "brown", "fox", and "dog" are missing from the result because we added them as stopwords. Now, let's check how Elasticsearch works with a stopwords file. The file must live inside the Elasticsearch config folder.

[Image: the Elasticsearch config directory containing the stopwords file]

In the file, my_stopwords.txt, each stopword should be on its own line. The file is read as UTF-8.

[Image: contents of my_stopwords.txt, one stopword per line]
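Outside Elasticsearch, reading such a file is straightforward. This sketch (the helper name is ours) assumes a file in the same format, one UTF-8 word per line:

```python
# Hypothetical helper: load a stopword file, one UTF-8 word per line.
def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and strip surrounding whitespace.
        return {line.strip() for line in f if line.strip()}

# Example: write a small stopword file and read it back.
with open("my_stopwords.txt", "w", encoding="utf-8") as f:
    f.write("the\nquick\nlazy\n")

print(load_stopwords("my_stopwords.txt"))
```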

Now we are ready to update the analyzer. Don't forget to close the index first and open it again afterward!

curl -XPUT 'localhost:9200/blogs/_settings?pretty' -d '
  
{
 "settings": {
   "analysis": {
     "analyzer": {
       "blogs_analyzer": {
         "type": "standard",
         "stopwords_path": "stopwords/my_stopwords.txt"
       }
     }
   }
 }
}'

Relaunch Elasticsearch so that it picks up my_stopwords.txt. Then, run the test.

curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'

[Image: _analyze output with the custom stopwords from my_stopwords.txt removed]

Our analyzer is now working with my_stopwords.txt. Next, let's index a document. First, delete the old index and create a new one.

curl -XDELETE 'localhost:9200/blogs?pretty'
  
curl -XPUT 'localhost:9200/blogs?pretty' -d '
{
 "settings": {
   "analysis": {
     "analyzer": {
       "blogs_analyzer": {
         "type":       "stop",
         "stopwords":  "_english_"
       }
     }
   }
 },
 "mappings": {
   "post": {
     "properties": {
       "title":    { "type": "string" },
       "content": { "type": "string", "analyzer": "blogs_analyzer" }
     }
   }
}
}'

Add a Document

To index a document, you can use Postman, Kibana, or any other HTTP client.

[Image: posting the JSON document below via an HTTP client]

{
 "title": "Mount Everest",
 "content": "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft)."
}

Now, check which terms from our content are searchable.

curl -XGET 'http://127.0.0.1:9200/blogs/post/_search?pretty' -d '
{
 "fielddata_fields": ["title", "content"]
}'

[Image: search response showing the fielddata for the title and content fields]

In the content field, you won't find any of the _english_ stopwords. Only the remaining terms are available for search.
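The consequence for search can be sketched with a toy inverted index (ours, not Elasticsearch internals): a term dropped at index time simply has no postings to match, so queries for it return nothing.

```python
# Toy inverted index: stopwords removed at index time are unsearchable.
STOPWORDS = {"the", "is", "in", "as", "and", "its"}

def index_document(doc_id, text, inverted_index):
    """Tokenize crudely and record doc_id for every non-stopword term."""
    for token in text.lower().replace(",", "").split():
        if token not in STOPWORDS:
            inverted_index.setdefault(token, set()).add(doc_id)

def search(term, inverted_index):
    """Return the set of documents containing the term."""
    return inverted_index.get(term.lower(), set())

idx = {}
index_document(1, "Mount Everest is Earth's highest mountain", idx)

print(search("everest", idx))  # {1}
print(search("is", idx))       # set() -- never indexed
```

This is why stopword choices matter: whatever you drop at index time is invisible to term queries later.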

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, SoftLayer, Amazon, or Microsoft Azure data centers. And you can now provision a replicated cluster.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
