Stemming, in linguistic morphology and information retrieval, is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form, generally a written word form. For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing.

Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster." 

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is ⇒ be
car, cars, car's, cars' ⇒ car

The result of this mapping of text will be something like:

the boy's cars are different colors ⇒ the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
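The contrast can be sketched in a few lines of Python. This is a toy illustration only, not a real stemmer or lemmatizer: the suffix list and the lemma dictionary are made up for the example.

```python
# Toy contrast: stemming chops suffixes by rule; lemmatization looks
# words up in a vocabulary and returns the dictionary form (the lemma).

def toy_stem(word):
    """Crudely strip common suffixes -- Porter-like in spirit only."""
    for suffix in ("ization", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a vocabulary of known inflections instead.
TOY_LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("organizing"))   # organiz -- a stem, not a dictionary word
print(toy_lemmatize("are"))     # be      -- the dictionary form
```

Note that the stemmer happily emits "organiz", which is not an English word at all, while the lemmatizer only ever returns dictionary forms.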

Most languages of the world are inflected, meaning that words can change their form to express differences in the following:

  • Tense: pay, paid, paying

  • Gender: waiter, waitress

  • Case: I, me, my

  • Number: fox, foxes

While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.

Stemming has many implementations, but each of them suffers from two issues: understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi.

Understemming reduces recall, i.e., relevant documents are not returned.

Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision, i.e., irrelevant documents are returned.
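Both failure modes fall out of the same mechanism. The sketch below uses a deliberately crude suffix-stripping rule (hypothetical, not the actual Porter algorithm) to reproduce the examples above:

```python
# A crude suffix stripper that exhibits both stemming failure modes.

def crude_stem(word):
    for suffix in ("ate", "al", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# Overstemming: distinct meanings collapse to one stem (precision suffers).
print(crude_stem("general"))   # gener
print(crude_stem("generate"))  # gener

# Understemming: same-meaning forms land on different stems (recall suffers).
print(crude_stem("jumped"))    # jump
print(crude_stem("jumping"))   # jumping -- no rule matched, so the forms diverge
```

A query for "jumping" against this index would miss documents containing "jumped", while a query for "general" would match documents about generators.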

Algorithmic Stemmers

Algorithmic stemmers continue to have great utility in IR, despite the promise of out-performance by dictionary-based stemmers. Nevertheless, there are few algorithmic descriptions of stemmers, and even when they exist they are liable to misinterpretation. Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals.

These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don’t cope well with irregular words like be, are, and am, or mice and mouse.


One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball. There were two main reasons for creating Snowball. One was the lack of readily available stemming algorithms for languages other than English. The other was the consciousness of a certain failure in promoting exact implementations of the Porter stemming algorithm.

Let’s design a custom english analyzer using the following settings:

curl -XPUT 'localhost:9200/custom_english_analyzer' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker", 
          "Keywords": ["lazy"] // stem_exclusion
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english" 
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english" 
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}'

Our custom english analyzer is composed of:

  • The english_stop filter configures the default stop words for the English language.

  • The keyword_marker token filter lists words that should not be stemmed. This defaults to the empty list.

  • The english analyzer uses two stemmers: the possessive_english and the english_stemmer. The possessive stemmer removes 's from any words before passing them on to the english_stop, english_keywords, and english_stemmer.
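The filter chain above can be sketched in plain Python. Everything here is a stand-in (a tiny stop list, a toy stemmer, whitespace splitting instead of the standard tokenizer), so the output only approximates what Elasticsearch produces:

```python
# A rough model of the custom analyzer's filter chain.

STOP_WORDS = {"the", "and", "a", "an", "of"}   # tiny subset of _english_
KEYWORDS = {"lazy"}                            # the keyword_marker list

def possessive_stemmer(token):
    # english_possessive_stemmer: strip trailing 's
    return token[:-2] if token.endswith("'s") else token

def toy_stemmer(token):
    # crude stand-in for the english stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    out = []
    for tok in text.split():                   # stand-in for the standard tokenizer
        tok = possessive_stemmer(tok)          # english_possessive_stemmer
        tok = tok.lower()                      # lowercase
        if tok in STOP_WORDS:                  # english_stop drops the token
            continue
        if tok not in KEYWORDS:                # english_keywords protects "lazy"
            tok = toy_stemmer(tok)             # english_stemmer
        out.append(tok)
    return out

print(analyze("The quick fox jumped and the lazy dog kept snoring"))
# ['quick', 'fox', 'jump', 'lazy', 'dog', 'kept', 'snor']
```

The ordering matters: the possessive stemmer and lowercasing run before stop-word removal, and the keyword check shields listed words from the stemmer at the end of the chain. (The toy stemmer yields "snor" where the real filter yields "snore".)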

Let’s check the output from the analyze API:

curl -XGET 'localhost:9200/custom_english_analyzer/_analyze?analyzer=english&text=The+quick+fox+jumped+and+the+lazy+dog+kept+snoring'

The response for the above curl request is:

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "jump",
      "start_offset": 15,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "lazy",
      "start_offset": 30,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "dog",
      "start_offset": 35,
      "end_offset": 38,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "kept",
      "start_offset": 39,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "snore",
      "start_offset": 44,
      "end_offset": 51,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}

If the default stemmer used by the english analyzer is too aggressive, we can switch to the less aggressive light_english stemmer. The english stemmer maps to the porter_stem token filter, whereas light_english maps to the less aggressive kstem token filter.

Elasticsearch supports the following stemmers for the English language:

  • english - The porter_stem token filter.

  • light_english - The kstem token filter.

  • minimal_english - The EnglishMinimalStemmer in Lucene, which removes plurals.

  • lovins - The Snowball-based Lovins stemmer, the first stemmer ever produced.

  • porter - The Snowball-based Porter stemmer.

  • porter2 - The Snowball-based Porter2 stemmer.

  • possessive_english - The EnglishPossessiveFilter in Lucene, which removes 's.

The stemmer documentation page highlights the recommended stemmer for each language in bold, usually because it offers a reasonable compromise between performance and quality. But, the recommended stemmer may not be appropriate for all use cases as it depends very much on the requirements.

Controlling Stemming

The keyword_marker and stemmer_override token filters allow us to customize the stemming process.

The stem_exclusion parameter for language analyzers allows us to specify a list of words that should not be stemmed. Internally, these language analyzers use the keyword_marker token filter to mark the listed words as keywords, which prevents subsequent stemming token filters from touching these words.

The keyword_marker token filter lists words that should not be stemmed. This defaults to the empty list. We used keyword_marker token filter in our custom english analyzer to exclude the word ‘lazy’ from being stemmed.

"english_keywords": {
    "type": "keyword_marker", 
    "Keywords": ["lazy"] // stem_exclusion
}

Customizing Stemming

The stemmer_override token filter allows us to specify our own custom stemming rules. It overrides stemming algorithms, by applying a custom mapping, then protecting these terms from being modified by stemmers. It must be placed before any other stemming filters.

curl -XPUT 'localhost:9200/custom_stemmer_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ 
            "operation=>operation",
            "console=>console",
            "constantly=>constantly"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
       }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}'

Let’s try our custom english analyzer:

curl -XGET 'localhost:9200/custom_stemmer_index/_analyze?analyzer=english&text=Alex+feet+operation+was+being+constantly+monitored+on+health+console'

The response to the above curl request emits the following tokens:

Alex, feet, operation, wa, be, constantly, monitor, on, health, console

Without the custom_stem filter (i.e., with only the lowercase and porter_stem filters), the response tokens would have been:

Alex, feet, oper, wa, be, constantli, monitor, on, health, consol

NOTE: Just as for the keyword_marker token filter, rules can be stored in a file whose location must then be specified with the rules_path parameter.
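The effect of stemmer_override can be modeled in a few lines of Python. The override table and the toy stemmer below are illustrative stand-ins, not Lucene's implementations:

```python
# Toy model of stemmer_override: terms matching a rule are rewritten by
# the mapping and then protected from the downstream stemmer.

OVERRIDES = {                    # parsed from rules like "operation=>operation"
    "operation": "operation",
    "console": "console",
    "constantly": "constantly",
}

def toy_porter(token):
    # crude stand-in for porter_stem
    for suffix in ("ation", "ed", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def custom_stem(token):
    if token in OVERRIDES:
        return OVERRIDES[token]  # mapped and protected; stemmer is skipped
    return toy_porter(token)     # everything else is stemmed normally

print(toy_porter("operation"))   # oper      -- without the override
print(custom_stem("operation"))  # operation -- override keeps it intact
print(custom_stem("monitored"))  # monitor   -- still stemmed normally
```

Mapping a word to itself, as in the rules above, effectively turns stemmer_override into a protection list, which is exactly what the operation/console/constantly rules do.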


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
