Algorithmic Stemming in Elasticsearch
Posted by Adam Vanderbush May 11, 2017
Stemming, in linguistic morphology and information retrieval science, is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.
Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is ⇒ be
car, cars, car's, cars' ⇒ car
The result of this mapping of text will be something like:
the boy’s cars are different colors ⇒ the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
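To make the distinction concrete, here is a sketch of the crude, rule-based side using the analyze API. It assumes an Elasticsearch 5.x cluster, where _analyze accepts a JSON request body; exact stems may vary slightly by version:
curl -XGET 'localhost:9200/_analyze' -d '{
  "tokenizer": "standard",
  "filter": ["lowercase", "porter_stem"],
  "text": "am are is cars colors"
}'
The porter_stem filter trims cars and colors down to car and color, but it has no vocabulary, so am, are, and is are never mapped to the lemma be; that mapping is exactly what a lemmatizer's dictionary and morphological analysis would provide.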
Most languages of the world are inflected, meaning that words can change their form to express differences in the following:
- Tense: pay, paid, paying
- Gender: waiter, waitress
- Case: I, me, my
- Number: fox, foxes
While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.
Stemming has many implementations, but each of them suffers from two issues: understemming and overstemming.
Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi.
Understemming reduces recall, i.e., relevant documents are not returned.
Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision, i.e., irrelevant documents are returned.
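We can see overstemming directly by running the porter_stem token filter through the analyze API (the same sketch-style request, and the same version assumptions, as above):
curl -XGET 'localhost:9200/_analyze' -d '{
  "tokenizer": "standard",
  "filter": ["lowercase", "porter_stem"],
  "text": "general generate"
}'
Both tokens should come back as gener, so a query for generate would also match documents that only mention general, which is exactly the loss of precision described above.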
Algorithmic Stemmers
Algorithmic stemmers continue to have great utility in IR, despite the promise of out-performance by dictionary-based stemmers. Nevertheless, there are few algorithmic descriptions of stemmers, and even when they exist they are liable to misinterpretation. Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals.
These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don’t cope well with irregular words like be, are, and am, or mice and mouse.
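The same quick check makes the problem with irregular words visible:
curl -XGET 'localhost:9200/_analyze' -d '{
  "tokenizer": "standard",
  "filter": ["lowercase", "porter_stem"],
  "text": "mouse mice am are is be"
}'
Because an algorithmic stemmer has no dictionary, mice and mouse do not end up sharing a stem, and none of am, are, or is is reduced to be.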
One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball. There were two main reasons for creating Snowball. One was the lack of readily available stemming algorithms for languages other than English. The other was the consciousness of a certain failure in promoting exact implementations of the Porter stemming algorithm.
Let’s design a custom english analyzer using the following settings:
curl -XPUT 'localhost:9200/custom_english_analyzer' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["lazy"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}'
Our custom english analyzer is composed of:
- The english_stop token filter configures the default stop words for the English language.
- The english_keywords (keyword_marker) token filter lists words that should not be stemmed. This defaults to the empty list.
- The english analyzer uses two stemmers: the possessive_english and the english_stemmer. The possessive stemmer removes ‘s from any words before passing them on to the english_stop, english_keywords, and english_stemmer filters.
Let’s check the output from the analyze API:
curl -XGET 'localhost:9200/custom_english_analyzer/_analyze' -d '{
  "analyzer": "english",
  "text": "The quick fox jumped and the lazy dog kept snoring"
}'
The response for the above curl request is:
{ "tokens": [ { "token": "quick", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 }, { "token": "fox", "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 }, { "token": "jump", "start_offset": 15, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "lazy", "start_offset": 30, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }, { "token": "dog", "start_offset": 35, "end_offset": 38, "type": "<ALPHANUM>", "position": 7 }, { "token": "kept", "start_offset": 39, "end_offset": 43, "type": "<ALPHANUM>", "position": 8 }, { "token": "snore", "start_offset": 44, "end_offset": 51, "type": "<ALPHANUM>", "position": 9 } ] }
If the default stemmer used by the english analyzer is too aggressive, we can switch to the light_english stemmer. The english stemmer language maps to the porter_stem token filter, whereas light_english maps to the less aggressive kstem token filter.
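A minimal sketch of such a configuration follows; it reuses the structure of the settings above and changes only the stemmer language (the index name light_english_analyzer is just an example):
curl -XPUT 'localhost:9200/light_english_analyzer' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "light_english_stemmer"
          ]
        }
      }
    }
  }
}'
The kstem filter generally produces stems that stay closer to real dictionary words than the Porter stemmer does.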
Elasticsearch supports the following stemmers for the English language:
- english – the porter_stem token filter.
- light_english – the kstem token filter.
- minimal_english – the EnglishMinimalStemmer in Lucene, which removes plurals.
- lovins – the Snowball-based Lovins stemmer, the first stemmer ever produced.
- porter – the Snowball-based Porter stemmer.
- porter2 – the Snowball-based Porter2 stemmer.
- possessive_english – the EnglishPossessiveFilter in Lucene, which removes ‘s.
The stemmer documentation page highlights the recommended stemmer for each language in bold, usually because it offers a reasonable compromise between performance and quality. However, the recommended stemmer may not be appropriate for every use case, as the right choice depends very much on the requirements.
Controlling Stemming
The keyword_marker and stemmer_override token filters allow us to customize the stemming process.
The stem_exclusion parameter for language analyzers allows us to specify a list of words that should not be stemmed. Internally, these language analyzers use the keyword_marker token filter to mark the listed words as keywords, which prevents subsequent stemming token filters from touching these words.
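For example, here is a minimal sketch of the built-in english language analyzer configured with stem_exclusion (the index name stem_exclusion_example and the analyzer name my_english are only examples):
curl -XPUT 'localhost:9200/stem_exclusion_example' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": ["lazy"]
        }
      }
    }
  }
}'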
The keyword_marker token filter lists words that should not be stemmed. This defaults to the empty list. We used the keyword_marker token filter in our custom english analyzer to exclude the word ‘lazy’ from being stemmed.
"english_keywords": { "type": "keyword_marker", "Keywords": ["lazy"] // stem_exclusion }
Customizing Stemming
The stemmer_override token filter allows us to specify our own custom stemming rules. It overrides stemming algorithms by applying a custom mapping and then protecting those terms from being modified by stemmers. It must be placed before any other stemming filters.
curl -XPUT 'localhost:9200/custom_stemmer_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [
            "operation=>operation",
            "console=>console",
            "constantly=>constantly"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}'
Let’s try our custom english analyzer:
curl -XGET 'localhost:9200/custom_stemmer_index/_analyze' -d '{
  "analyzer": "english",
  "text": "Alex feet operation was being constantly monitored on health console"
}'
The above curl request emits the following tokens:
Alex, feet, operation, wa, be, constantly, monitor, on, health, console
Without the custom_stem rules, i.e., with only the lowercase and porter_stem token filters applied, the response tokens would have been:
Alex, feet, oper, wa, be, constantli, monitor, on, health, consol
NOTE: Just as for the keyword_marker token filter, the rules can be stored in a file whose location must then be specified with the rules_path parameter.
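As a sketch, the same custom_stem filter could read its rules from a file. This assumes a file named analysis/stemmer_override_rules.txt (the name is only an example) placed under the config directory of every node, with one word=>stem rule per line:
curl -XPUT 'localhost:9200/custom_stemmer_file_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules_path": "analysis/stemmer_override_rules.txt"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stem", "porter_stem"]
        }
      }
    }
  }
}'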
Other Helpful Tutorials
- Getting Started with Elasticsearch on Qbox
- How to Use Elasticsearch, Logstash, and Kibana to Manage Logs
- How to Use Elasticsearch, Logstash, and Kibana to Manage NGINX Logs
- The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)
- Using the ELK Stack and Python in Penetration Testing Workflow
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.