Imagine we have a website where users can enter free-form queries in any number of formats. We need to map these queries to a standard form and to the words that actually appear in the fields of our data set.

The main problems we are likely to meet are:

– Incorrect words

– Plural/singular forms

– Too many stopwords

In this post we will show you how to solve these problems in order to improve your free-text query results.

Description of the method

We will use three main methods to correct free-text queries: filtering, stopword removal, and word analysis. Each of them has a mathematical grounding and is implemented in the Elasticsearch libraries.

Description of the process

1. Incorrect words

Elasticsearch provides analyzers that can solve this problem. What do they do? They reduce words to a common root, so a search matches all words with the same stem.

For this problem we use the English stemmer together with the standard tokenizer to build a filter_stemmer analyzer:

curl -XPUT 'http://localhost:9200/myindex/' -d '{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "stemmers": {
            "type": "stemmer",
            "language": "english"
          }
        },
        "analyzer": {
          "filter_stemmer": {
            "filter": [
              "standard",
              "lowercase",
              "stemmers"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}'

Note: if you want to update the settings of an already existing index, you should do the following:

  1. Close the index: curl -XPOST 'http://localhost:9200/myindex/_close'
  2. Define the new analyzers for the index (a minimal sketch is shown below)
  3. Reopen the index: curl -XPOST 'http://localhost:9200/myindex/_open'
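
For example, adding a hypothetical my_lowercase analyzer to an existing index might look like this (a sketch; the analyzer name is illustrative):

curl -XPOST 'http://localhost:9200/myindex/_close'

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "analysis": {
    "analyzer": {
      "my_lowercase": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}'

curl -XPOST 'http://localhost:9200/myindex/_open'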

Let’s check our analyzer:

curl -XGET 'http://localhost:9200/myindex/_analyze?analyzer=filter_stemmer&text=cloth+clothing+clothes+fine'

We receive the response:

{
  "tokens" : [ {
    "token" : "cloth",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "cloth",
    "start_offset" : 6,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "cloth",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "fine",
    "start_offset" : 23,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}

We can see that the words “clothing” and “clothes” are replaced by the word “cloth”.

Let’s start by putting some data in our index:

curl -X PUT "http://localhost:9200/myindex/words/word/1" -d '{ "text" : "shelves covered with bright red cloth" }'
curl -X PUT "http://localhost:9200/myindex/words/word/2" -d '{ "text" : "clothing shop" }'
curl -X PUT "http://localhost:9200/myindex/words/word/3" -d '{ "text" : "he stripped off his clothes" }'
curl -X PUT "http://localhost:9200/myindex/words/word/4" -d '{ "text" : "a man of the cloth" }'

Let’s try making queries with “cloth”, “clothing”, and “clothes”.

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "cloth",
                  "fields": [
                      "_all"
                  ],
                  "analyzer": "filter_stemmer"
              }
          }
      }
  }
}'

[Response screenshot: "total" : 2]

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "clothing",
                  "fields": [
                      "_all"
                  ],
                  "analyzer": "filter_stemmer"
              }
          }
      }
  }
}'
 "hits" : {
   "total" : 2,
   "max_score" : 0.26010898,
}


curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "clothes",
                  "fields": [
                      "_all"
                  ],
                  "analyzer": "filter_stemmer"
              }
          }
      }
  }
}'
"hits" : {
   "total" : 2,
   "max_score" : 0.26010898,
}

We can see that the search is effectively carried out with the stemmed keyword “cloth”, so all three queries retrieve the same two documents: "total" : 2.

Searching by the Damerau–Levenshtein distance between words

For example, “table” and “tabble” should be treated as the same word. We can search for terms that are similar to, but not exactly like, our search terms using the “fuzzy” operator. It uses the Damerau–Levenshtein distance to find all terms within a maximum of two changes, where a change is an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.

curl -XGET "http://localhost:9200/myindex/_search " -d'
{ "query":
{ "fuzzy":
{ "text" :
{ "value" : "ccloth",
  "fuzziness" : 1 }
}
}
  }'

[Response screenshot: "total" : 2]

Note that the total number of hits is the same as in the query with the keyword “cloth”.
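
For comparison, a plain (non-fuzzy) query with the misspelled term should return no hits, since no indexed token matches “ccloth” exactly. A quick sketch against the same index:

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
    "match": {
      "text": "ccloth"
    }
  }
}'

Here we would expect "total" : 0.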

The plural/singular forms problem can be solved in the same way, with the filter_stemmer analyzer.

Add more data:

curl -X PUT "http://localhost:9200/myindex/word/5" -d '{ "text" : "Traditional African dress" }'
curl -X PUT "http://localhost:9200/myindex/word/6" -d '{ "text" : "Her red dress made her stand out" }'

And try querying:

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "dress",
                  "fields": [
                      "_all"
                  ],
                  "analyzer": "filter_stemmer"
              }
          }
      }
  }
}'
"hits" : {
  "total" : 2,
   "max_score" : 0.09535168,
}
curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "dresses",
                  "fields": [
                      "_all"
                  ],
                  "analyzer": "filter_stemmer"
              }
          }
      }
  }
}'
"hits" : {
   "total" : 2,
   "max_score" : 0.09535168,
}

And it works well.
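
As a quick sanity check, we can also run the analyzer on both word forms directly; both tokens should come back as “dress”:

curl -XGET 'http://localhost:9200/myindex/_analyze?analyzer=filter_stemmer&text=dress+dresses'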

2. Stopword removal

Let’s look at the next small example. A user can search for “jacket”, “the jacket”, “all jackets”, “a jacket”, and other variations, but for our purposes these are all the same query. So we can use an analyzer with a stopwords filter.
We can use the predefined “_english_” stopword list from Elasticsearch:

curl -X PUT 'http://localhost:9200/myindex/_settings' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}'

But this list is not complete. For example, you will not find the word “all” in it.
So it is better to load your own dictionary in the following way:

curl -X PUT 'http://localhost:9200/myindex/_settings' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stopwords": {
          "type": "stop",
          "stopwords": ["all", "about", "above", "after"]
        }
      },
      "analyzer": {
        "filter_english_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["english_stopwords", "porter_stem"]
        }
      }
    }
  }
}'
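
Before testing with a search, we can sanity-check the analyzer itself (assuming the index was closed, updated, and reopened as described above):

curl -XGET 'http://localhost:9200/myindex/_analyze?analyzer=filter_english_stopwords&text=all+about+dress'

We would expect a single token, “dress”: “all” and “about” are removed by the english_stopwords filter, and “dress” passes through porter_stem unchanged.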

Now let’s test the analyzer in a search:

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "dress",
                  "fields": ["_all"],
                  "analyzer": "filter_english_stopwords"
              }
          }
      }
  }
}'

We receive:

"hits" : {
    "total" : 2,
   "max_score" : 0.375
 }

Now test with the query “all dress”:

curl -XGET 'http://localhost:9200/myindex/_search?pretty' -d '{
  "query": {
      "filtered": {
          "query": {
              "multi_match": {
                  "query": "all dress",
                  "fields": ["_all"],
"analyzer": "filter_english_stopwords"
              }
          }
      }
  }
}'

The result is the same: the stopword “all” is removed before the search, so we get exactly the same two hits.


Additional information

All analyzers, tokenizers, and token filters can be configured with a version parameter to control which Lucene version behavior they should use. Possible values are: 3.0 – 3.6, 4.0 – 4.3 (the highest version number is the default option).
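
For example, pinning a standard analyzer to Lucene 3.6 behavior might look like this (a sketch; otherindex and my_versioned are illustrative names):

curl -XPUT 'http://localhost:9200/otherindex/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_versioned": {
          "type": "standard",
          "version": "3.6"
        }
      }
    }
  }
}'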

About the author

Igor Bobriakov is a technology leader who constantly researches the latest trends in data science, big data, the Internet of Things, and other digital innovation areas. He has 13+ years of extensive IT experience in roles including data scientist, CTO, software engineer, and project manager.