A fuzzy search is a process that locates web pages or documents that are likely to be relevant to a search argument even when the argument does not exactly correspond to the desired information. 

A fuzzy search is done by means of a fuzzy matching query, which returns a list of results based on likely relevance even though search argument words and spellings may not exactly match. Exact and highly relevant matches appear near the top of the list.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Elasticsearch can be configured to provide fuzziness by mixing its built-in edit-distance matching and phonetic analysis with more generic analyzers and filters. However, this approach requires a complex query against multiple fields, and recall is completely determined by Lucene edit distance and Soundex/metaphone (phonetic similarity). 

As for precision, it is difficult to guarantee that the best results will be at the top, and Lucene document scores are notoriously unreliable to place a threshold on. No other types of variations (e.g., swapped name order) are taken into account in calculating the similarity.

If documents exist that do contain exactly what the user has queried, they should appear at the top of the result set, but weaker matches can be included further down the list. If no documents match exactly, at least we can show the user potential matches -- and they may even be what the user originally intended!

We have already looked at word stemming, stop word filtering, and synonyms matching, but all of those approaches presuppose that words are spelled correctly or that there is only one way to spell each word. Fuzzy matching allows for query-time matching of misspelled words, while phonetic token filters allows for  sounds-like matching at index time.

Different Types of Fuzzy Searches

Different types of fuzzy search are supported by Elasticsearch, and the differences can be confusing. The list below attempts to differentiate between these various types.

  • matchquery + fuzziness option: Adding the fuzziness parameter to a match query turns a plain match query into a fuzzy one. Analyzes the query text before performing the search.

  • fuzzy query: The Elasticsearch fuzzy query type should generally be avoided. Acts much like a term query. Does not analyze the query text first.

  • fuzzy_like_this/fuzzy_like_this_field: A more_like_this query, but supports fuzziness; has a tuned scoring algorithm that better handles the characteristics of fuzzy matched results.

  • suggesters: Suggesters are not an actual query type but rather a separate type of operation (internally built on top of fuzzy queries) that can be run either alongside a query or independently. Suggesters are great for 'did you mean' style functionality.

Fuzzy Query

The fuzzy query is the fuzzy equivalent of the term query. We will seldom use it directly ourselves, but understanding how it works will help us to use fuzziness in the higher-level match query.

To understand how it works, we will first index some documents:

curl -XPOST  'localhost:9200/index_name/index_type/_bulk -d '
{ "index": { "_id": 1 }}
{ "text": "The quick fox jumped over the lazy dog!"}
{ "index": { "_id": 2 }}
{ "text": "The quick fox kept jumping and screaming near the woods"}
{ "index": { "_id": 3 }}
{ "text": "The lazy dog kept snoring in his lawn but did not jump"}
’

Now if run a fuzzy query for the term jumpi:

curl -XGET 'localhost:9200/index_name/index_type/_search -d '{
  "query": {
    "fuzzy": {
      "text": "jumpi"
    }
  }
}'

In our example, jumpi is within an edit distance of 1 from ‘jump’ and within an edit distance of 2 from ‘jumped’ and ‘jumping’, so documents 1, 2, and 3 match. We could reduce the matches to just ‘jump’ with the following query:

curl -XGET  'localhost:9200/index_name/index_type/_search -d '{
  "query": {
    "fuzzy": {
      "text": {
        "value": "jomp",                                                
        "fuzziness": 1                                         
      }
    }
  }                                   
}'  

The fuzzy query is a term-level query, so it doesn’t do any analysis. It takes a single term and finds all terms in the term dictionary that are within the specified fuzziness. The default fuzziness is AUTO.

When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance, i.e.,  the number of one character changes that need to be made to one string to make it the same as another string.

The fuzziness parameter can be specified as:

  • 0, 1, 2 -- It is maximum allowed Levenshtein Edit Distance (or number of edits)
  • AUTO -- It generates an edit distance based on the length of the term. For lengths:
    • 0..2 -- must match exactly
    • 3..5 -- one edit allowed
    • > 5 -- two edits allowed

AUTO should generally be the preferred value for fuzziness.

The fuzzy query works by taking the original term and building a Levenshtein automaton. Levenshtein automata may be used for spelling correction to find words in a given dictionary that are close to a misspelled word. In this application, once a word is identified as being misspelled, its Levenshtein automaton may be constructed and then applied to all of the words in the dictionary to determine which ones are close to the misspelled word. 

Tutorial: Auto-Scaling Kubernetes Cluster on AWS EC2 with Supergiant

If the dictionary is stored in compressed form as a trie, the time for this algorithm (after the automaton has been constructed) is proportional to the number of nodes in the trie. This is significantly faster than using dynamic programming to compute the Levenshtein distance separately for each dictionary word.

The performance impact can be limited by following two factors:

  • prefix_length -- The number of prefix characters that will not be considered for fuzziness. Most spelling errors occur toward the end of the word, not toward the beginning. A prefix_length of 3 is sufficient enough.

  • max_expansions -- A fuzzy query expands to three or four fuzzy options or may expand to thousand options. If it produces 1,000 options, they may not be useful or essentially meaningless. max_expansions is used limit the total number of options that will be produced.

Fuzzy Match Query

The match query supports fuzzy matching out of the box:

curl -XGET 'localhost:9200/my_index/my_type/_search' -d '{
  "query": {
    "match": {                                             
      "text": {                                          
        "query": "jomped over me!",
        "fuzziness": "AUTO",
        "operator":  "and"
      }
    }                                              
  }                                                 
}'

The query string is first analyzed to produce the terms [jumped, over, me], and then each term is fuzzified using the specified fuzziness.

Fuzzy Multi-match Query

Similarly, the multi_match query also supports fuzziness but only when executing with type best_fields or most_fields:

curl -XGET ‘localhost:9200/my_index/my_type/_search’ -d ‘{
  "query": {
    "multi_match": {
      "fields":  [ "text", "title" ],
      "query":     "SURPRIZE ME!",
      "fuzziness": "AUTO"
    }
  }
}’

NOTE: Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches. There is however support for prefix_length and max_expansions in both the match and multi_match queries.

Scoring

Fuzzy matching should not be used for scoring purpose but only to widen the net of matching terms in case there are misspellings and typos.

The match query, by default, gives all fuzzy matches the constant score of 1. This is sufficient to add potential matches onto the end of the result list without interfering with the relevance scoring of non-fuzzy queries.

Performance Issues

Lucene's Levenshtein distance implementation is state of the art and quite fast, but it is still much slower than a plain match query. The runtime of the query grows with the number of unique terms in the index. Fuzzy queries use an advanced algorithm involving a DFA (<em>Deterministic finite automaton</em>) that must process a large number of terms. Processing the much larger number of terms required for a fuzzy search is always slower than a simple match(binary) search. We shall discuss here the two performance-limiting factors:

prefix_length -- The longer the prefix length, the faster the speedup. Specifying a prefix_length of 1 cuts the 300ms query down to a mere 100ms. Increasing this number yields further speed gains. Performance can be significantly improved by requiring that matches have exact prefix matches with the query. This dramatically shortens the search space at the cost of not finding words with a misspelling toward the front.

max_expansions -- It is important to understand that the max_expansions query limit works at the shard level, meaning that even if set to 1, multiple terms may match, all coming from different shards. Thus, counting unique returned terms is not a valid way to determine if max_expansions is working. Cutting down the query terms has a negative effect because some valid results may be pruned due to early termination of the query.

The basic insight behind Levenshtein automata is that it's possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton.

Thus, to sum up, Fuzzy matching can be enabled on Match and Multi-Match queries to catch spelling errors. The fuzziness value of "AUTO" is equivalent to specifying a value of 2 when the term length is greater than 5. However, assuming 80% of human misspellings have an edit distance of 1 and thus, setting the fuzziness to 1 may improve our overall search performance.

Related Helpful Resources

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.

comments powered by Disqus