How to Use Elasticsearch to Remove Stopwords from a Query
Posted by Igor Bobriakov, February 14, 2017

In this tutorial, we discuss stopwords. We explain what they are, why they are needed, and the types of stopwords. We also show how to use them correctly, how to remove them from a query, and how to create your own lists.
Stopwords
In a text, stopwords are the common words that search engines filter out when processing it. Removing them is useful when disk space or RAM is limited, or when you want better performance: fewer indexed terms means a smaller index.
Types of Stopwords
In newspapers, books, or other texts, you can grade words by their importance. The same approach applies to stopwords in Elasticsearch. Stopwords fall roughly into two groups:
- Low-frequency — words that appear in only a few documents in the collection. The word `JavaScript` may not occur at all in the book `Learning Python`, but if it does, it will have a low frequency.
- High-frequency — words that appear in almost all of the documents in the collection. These can be short function words, such as the, is, at, and with, or a domain word such as `dolphin` in the book `All About the Dolphin`.
Using Stopwords
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”
To start using stopwords, we need to create an index with an analyzer.
```
curl -XPUT 'localhost:9200/blogs?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "blogs_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}'
```
We create our analyzer with the name `blogs_analyzer`, set `type` to `standard`, and add the `_english_` stopwords.
Examples of the `"type"` parameter:

- `"standard"` — The standard analyzer. It is built using the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter.
- `"simple"` — Built using the Lower Case Tokenizer.
- `"custom"` — Allows you to combine a Tokenizer with zero or more Token Filters and zero or more Char Filters (see the sketch after this list).
- `"keyword"` — "Tokenizes" an entire stream as a single token. It is useful for data like zip codes, IDs, and more.
- `"language"` — Aimed at analyzing text in a specific language. The following types are supported: arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
- `"pattern"` — Uses a regular expression to split the text into terms.
- `"snowball"` — Built using the Standard Tokenizer, with the Standard Filter, Lowercase Filter, Stop Filter, and Snowball Filter.
- `"stop"` — Built using a Lower Case Tokenizer, along with a Stop Token Filter.
- `"whitespace"` — Built using a Whitespace Tokenizer.
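To illustrate the `"custom"` type, here is a minimal sketch that combines the Standard Tokenizer with the lowercase and stop token filters. The index name `custom_demo` and analyzer name `my_custom_analyzer` are made up for this example:

```
# Illustrative only: the index and analyzer names are not from this tutorial.
curl -XPUT 'localhost:9200/custom_demo?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'
```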
In our case, we use `"stopwords": "_english_"`. The default English stopwords used in Elasticsearch are:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
Elasticsearch supports other languages, too:
_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.
Or you can specify the `"stopwords"` field yourself:

```
"stopwords": ["you", "use", "stopwords"]
```
Another option is to point to a text file instead of listing the stopwords inline:

```
"stopwords_path": "path_to_file/your_file_with_stopwords.txt"
```
Let’s check our analyzer.
```
curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'
```
You will get output like the following.
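The full response also includes offsets, positions, and token types; this abbreviated sketch shows only the tokens you should see:

```
{
  "tokens" : [
    { "token" : "quick" },
    { "token" : "brown" },
    { "token" : "fox" },
    { "token" : "jumps" },
    { "token" : "over" },
    { "token" : "lazy" },
    { "token" : "dog" }
  ]
}
```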
As you can see, we don’t have the word `"the"` in the result.
Let’s modify the analyzer. To do this, you need to close the index, update the analyzer, and open the index again.
```
curl -X POST 'http://localhost:9200/blogs/_close?pretty'
```
You should get a confirmation message:
{ "acknowledged" : true }
Now modify the stopwords.
```
curl -XPUT 'localhost:9200/blogs/_settings?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "blogs_analyzer": {
          "type": "standard",
          "stopwords": ["the", "brown", "fox", "dog"]
        }
      }
    }
  }
}'
```
And then open the index.
```
curl -X POST 'http://localhost:9200/blogs/_open?pretty'
```
Test your new analyzer.
```
curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'
```
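If everything worked, the response should contain only the remaining tokens (an abbreviated sketch, with offsets and positions omitted):

```
{
  "tokens" : [
    { "token" : "quick" },
    { "token" : "jumps" },
    { "token" : "over" },
    { "token" : "lazy" }
  ]
}
```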
As you can see, the words `"the"`, `"brown"`, `"fox"`, and `"dog"` are not in the result, because we added them as stopwords. Now, let’s check how Elasticsearch works with a stopwords file. The file must be placed in the config folder inside the Elasticsearch folder.
In the file, `my_stopwords.txt`, each stopword should be on its own line. The file is read in UTF-8 format.
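For example, to reproduce the stopword list we used above, the file could look like this (the contents are illustrative):

```
the
brown
fox
dog
```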
Now we are ready to update the analyzer. Don’t forget to close the index before updating and open it again afterward!
```
curl -XPUT 'localhost:9200/blogs/_settings?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "blogs_analyzer": {
          "type": "standard",
          "stopwords_path": "stopwords/my_stopwords.txt"
        }
      }
    }
  }
}'
```
Relaunch Elasticsearch so that it picks up `my_stopwords.txt`. Then, run the test.
```
curl -XGET 'localhost:9200/blogs/_analyze?pretty&analyzer=blogs_analyzer' -d 'The quick brown fox jumps over the lazy dog.'
```
Our analyzer is now working with `my_stopwords.txt`. Next, let’s add a document to the index so we have something to query. First, delete the old index and create a new one.
```
curl -XDELETE 'localhost:9200/blogs?pretty'

curl -XPUT 'localhost:9200/blogs?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "blogs_analyzer": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "title": { "type": "string" },
        "content": { "type": "string", "analyzer": "blogs_analyzer" }
      }
    }
  }
}'
```
Add a Document
To index a document, you can use Postman, Kibana, or another HTTP client (a curl sketch follows after the document below).
{ "title": "Mount Everest", "content": "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft)." }
Now, check how our content was indexed.
```
curl -XGET 'http://127.0.0.1:9200/blogs/post/_search?pretty' -d '
{
  "fielddata_fields": ["title", "content"]
}'
```
In the `content` field you won’t find the words from the `_english_` stopword list. Only the remaining words are available for search.
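To see the effect on search, try querying for a stopword; a match query like this sketch should return no hits, because `the` was never indexed:

```
curl -XGET 'http://127.0.0.1:9200/blogs/post/_search?pretty' -d '
{
  "query": { "match": { "content": "the" } }
}'
```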
Other Helpful Resources
- Applying Elasticsearch Custom Analyzers
- Improving Your Free Query Results By Using Elasticsearch
- How to Search for Singular and Plural Tenses with Elasticsearch Analyzers
- Introduction to Using Moloch and Elasticsearch
- Troubleshooting in Elasticsearch: Queries, Mappings, and Scoring
- Synonyms Dictionaries in Elasticsearch
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon or Microsoft Azure data centers. And you can now provision a replicated cluster.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.