The phrase suggester is a more advanced version of the term suggester. Instead of correcting individual words, it selects entire corrected phrases. It is built on n-gram language modeling, which lets it make better choices of tokens based on both frequency and co-occurrence.

In this tutorial, we show you how to use the phrase suggester to correct spelling in phrases, which provides "did you mean" search functionality in Elasticsearch.

Sample Data Indexing

To help demonstrate phrase suggestion, let's start this tutorial by indexing some sample data. Following are the four documents we are going to index:

Document 1

curl -XPOST localhost:9200/phrase-suggester/my-type/1 -d '{"tagline": "The windshield got misty"}'

Document 2

curl -XPOST localhost:9200/phrase-suggester/my-type/2 -d '{"tagline": "The misty windshield and the dash"}'

Document 3

curl -XPOST localhost:9200/phrase-suggester/my-type/3 -d '{"tagline": "windhshield was broken and moist"}'

Document 4

curl -XPOST localhost:9200/phrase-suggester/my-type/4 -d '{"tagline": "days of misty windshield"}' 
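
The four requests above could also be combined into a single call to the _bulk endpoint. The Python sketch below only builds the newline-delimited payload that endpoint expects. The helper name build_bulk_payload is hypothetical, and sending the payload (for example with curl --data-binary against localhost:9200/_bulk) is left out, so treat it as an illustration rather than a tested ingestion script.

```python
import json

# Sample taglines from this tutorial, keyed by document id.
TAGLINES = {
    1: "The windshield got misty",
    2: "The misty windshield and the dash",
    3: "windhshield was broken and moist",
    4: "days of misty windshield",
}


def build_bulk_payload(index, doc_type, taglines):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc_id, tagline in sorted(taglines.items()):
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"tagline": tagline}))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_payload("phrase-suggester", "my-type", TAGLINES), end="")
```

The misspelled "windhshield" in document 3 is kept on purpose, since it is part of the sample data.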

Phrase Suggester Working

Let's look at how the phrase suggester works. We will search for a three-word phrase with two typos: "windsheild got mitsy". The typos are in the first and third words. Query the index using the phrase suggester and see what is returned.

The query looks like this:

curl -XPOST localhost:9200/phrase-suggester/_search?pretty -d '{
 "size": 0,
 "suggest": {
   "text": "windsheild got mitsy",
   "phrase-suggestion-demo-01": {
     "phrase": {
       "field": "tagline"
     }
   }
 }
}'

As you can see, the query above is similar to the term suggester query which we used in the previous article, except that the "term" parameter is replaced by the "phrase" parameter.
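
When the request is generated from application code, the same body can be assembled as a plain dictionary. The following Python sketch is one way to do that. The function name build_suggest_body and its kind argument are assumptions made for illustration, and the suggestion name is an arbitrary label. Passing kind="term" shows how small the difference from the term suggester request is.

```python
import json


def build_suggest_body(text, field, kind="phrase", name="phrase-suggestion-demo-01"):
    """Build a basic suggester request body.

    kind is "phrase" or "term"; size=0 suppresses the search hits so that
    only the suggestions are returned.
    """
    if kind not in ("phrase", "term"):
        raise ValueError("kind must be 'phrase' or 'term'")
    return {
        "size": 0,
        "suggest": {
            "text": text,
            name: {kind: {"field": field}},
        },
    }


if __name__ == "__main__":
    body = build_suggest_body("windsheild got mitsy", "tagline")
    # The printed JSON can be passed to curl with -d.
    print(json.dumps(body, indent=1))
```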

The response for the above query is as below:

{
 "took": 192,
 "timed_out": false,
 "_shards": {
   "total": 5,
   "successful": 5,
   "failed": 0
 },
 "hits": {
   "total": 4,
   "max_score": 0,
   "hits": [
     
   ]
 },
 "suggest": {
   "phrase-suggestion-demo-01": [
     {
       "text": "windsheild got mitsy",
       "offset": 0,
       "length": 20,
       "options": [
         {
           "text": "windshield got misty",
           "score": 0.13930021
         },
         {
           "text": "windshield got mitsy",
           "score": 0.11107826
         },
         {
           "text": "windsheild got misty",
           "score": 0.1055392
         }
       ]
     }
   ]
 }
}

Much like the term suggester, the suggestions are listed in the "options" array of the response. In the first option, both misspelled words were corrected, so the intended phrase was returned, and it has the highest score of all the options. In the second option, only the first typo was corrected, and in the third, only the second typo was corrected.
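
To turn such a response into a "did you mean" hint, an application only needs the best option. Below is a minimal Python sketch, assuming the response has already been parsed into a dictionary. The function name did_you_mean is hypothetical, and SAMPLE_RESPONSE is an abbreviated copy of the response above.

```python
def did_you_mean(response, name):
    """Return the text of the highest-scoring option, or None if there is none."""
    for entry in response.get("suggest", {}).get(name, []):
        options = entry.get("options", [])
        if options:
            # Select by score explicitly instead of relying on list order.
            return max(options, key=lambda option: option["score"])["text"]
    return None


# Abbreviated copy of the response shown above.
SAMPLE_RESPONSE = {
    "suggest": {
        "phrase-suggestion-demo-01": [
            {
                "text": "windsheild got mitsy",
                "offset": 0,
                "length": 20,
                "options": [
                    {"text": "windshield got misty", "score": 0.13930021},
                    {"text": "windshield got mitsy", "score": 0.11107826},
                    {"text": "windsheild got misty", "score": 0.1055392},
                ],
            }
        ]
    }
}

if __name__ == "__main__":
    print(did_you_mean(SAMPLE_RESPONSE, "phrase-suggestion-demo-01"))
    # prints: windshield got misty
```

Returning None when there are no options lets the caller skip the hint entirely instead of echoing the user's own input.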

Phrase Suggester with Options

The above example shows the most basic usage of the phrase suggester. The query can be configured with many more settings to fine-tune the suggestions, such as highlighting, confidence, and collate. Let us explore a few of the most useful options.

Consider the following query for phrase suggestion:

curl -XPOST localhost:9200/phrase-suggester/_search -d '{
 "size": 0,
 "suggest": {
   "text": "windsheild got mitsy",
   "phrase-suggestion-demo-01": {
     "phrase": {
       "field": "tagline",
       "real_word_error_likelihood": 0.95,
       "max_errors": 0.5,
       "confidence": 0,
       "highlight": {
         "pre_tag": "<em>",
         "post_tag": "</em>"
       },
       "collate": {
         "query": {
           "inline": {
             "match": {
               "{{field_name}}": "{{suggestion}}"
             }
           }
         },
         "params": {
           "field_name": "tagline"
         },
         "prune": true
       }
     }
   }
 }
}'

In the above query, you can see new parameters included. Let us explore each:

  1. real_word_error_likelihood - The default value for this option is 0.95. It tells Elasticsearch that 5 percent of the terms in the index are assumed to be misspelled. As the value of this parameter gets lower, Elasticsearch treats more and more of the terms that exist in the index as misspelled, even though they may be correct.
  2. max_errors - The maximum number of terms that may be treated as misspelled in order to form a correction. A value below 1 is read as a fraction of the terms in the query text, and a value of 1 or more as an absolute number of terms. The default value is 1, meaning that only corrections with at most one misspelled term are returned.
  3. confidence - The default value is 1.0. This value is applied as a factor to the score of the input phrase, and the result acts as a threshold: only suggestions that score higher than it are returned. With the default of 1.0, only suggestions that score higher than the input phrase itself are shown. With 0, as in the query above, the top candidates are returned regardless of how they score against the input.
  4. highlight - Highlighting is one of the most helpful features in search, and it can be enabled in the phrase suggester as well. The corrected words are wrapped in the tags given by "pre_tag" and "post_tag" (here we have used the <em> tag).
  5. collate - Each suggestion is checked against the specified query, in this case a match query. Since this is a template query, the current suggestion is available to it as the {{suggestion}} variable. Further variables can be added in the "params" object. When the "prune" parameter is set to true, each suggestion in the response carries an additional "collate_match" field, indicating whether any document matched the collate query for that suggestion.
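
The effect of "confidence" is easier to see with numbers. The Python sketch below is a simplified model of the threshold described above, not Elasticsearch's actual implementation. The scores are the ones returned for this query in this article, and it assumes that the score of the unchanged phrase "windsheild got mitsy" stands in for the input phrase score. The function name apply_confidence is hypothetical.

```python
def apply_confidence(options, input_score, confidence):
    """Keep the options whose score exceeds confidence * input_score."""
    threshold = confidence * input_score
    return [option for option in options if option["score"] > threshold]


# Scores as returned for this query in this article.
OPTIONS = [
    {"text": "windshield got misty", "score": 0.13930021},
    {"text": "windshield got mitsy", "score": 0.11107826},
    {"text": "windsheild got misty", "score": 0.1055392},
    {"text": "windsheild got mitsy", "score": 0.08415717},
    {"text": "windhshield got mitsy", "score": 0.058400385},
]

# Assumption: the unchanged phrase's score represents the input phrase score.
INPUT_SCORE = 0.08415717

if __name__ == "__main__":
    for confidence in (0, 1.0, 1.5):
        kept = apply_confidence(OPTIONS, INPUT_SCORE, confidence)
        print(confidence, [option["text"] for option in kept])
```

Under this model, a confidence of 0 keeps all five options, 1.0 keeps only the three that beat the input phrase, and 1.5 keeps only the fully corrected phrase.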

The response for the above query can be found below:

{
 "suggest": {
   "phrase-suggestion-demo-01": [
     {
       "text": "windsheild got mitsy",
       "offset": 0,
       "length": 20,
       "options": [
         {
           "text": "windshield got misty",
           "highlighted": "<em>windshield</em> got <em>misty</em>",
           "score": 0.13930021,
           "collate_match": true
         },
         {
           "text": "windshield got mitsy",
           "highlighted": "<em>windshield</em> got mitsy",
           "score": 0.11107826,
           "collate_match": true
         },
         {
           "text": "windsheild got misty",
           "highlighted": "windsheild got <em>misty</em>",
           "score": 0.1055392,
           "collate_match": true
         },
         {
           "text": "windsheild got mitsy",
           "highlighted": "windsheild got mitsy",
           "score": 0.08415717,
           "collate_match": false
         },
         {
           "text": "windhshield got mitsy",
           "highlighted": "<em>windhshield</em> got mitsy",
           "score": 0.058400385,
           "collate_match": true
         }
       ]
     }
   ]
 }
}

In the above response we can see the highlighted text for the suggested corrections. Each option also has a "collate_match" field, which appears because the "prune" parameter was set to true in the query. The one option in which neither typo was corrected has the value false.

Change the values of the "confidence", "real_word_error_likelihood", and "max_errors" parameters and compare the results to get a better understanding of them.
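
On the client side, the "collate_match" flag can be used to drop suggestions that the collate query did not match. The following Python sketch does this for the options from the response above. The function name keep_collated is hypothetical, and the filtering only approximates what the server would return if "prune" were left at its default.

```python
def keep_collated(options):
    """Return the texts of the options whose collate query matched."""
    return [option["text"] for option in options if option.get("collate_match")]


# Options from the response above, reduced to the fields used here.
OPTIONS = [
    {"text": "windshield got misty", "collate_match": True},
    {"text": "windshield got mitsy", "collate_match": True},
    {"text": "windsheild got misty", "collate_match": True},
    {"text": "windsheild got mitsy", "collate_match": False},
    {"text": "windhshield got mitsy", "collate_match": True},
]

if __name__ == "__main__":
    for text in keep_collated(OPTIONS):
        print(text)
```

Using option.get means that options without the field, as in a response where "prune" was not set, are dropped instead of raising an error. Whether that is the right default depends on the application.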

Conclusion

In this post in our suggest API series, we have shown both basic and advanced usage of the phrase suggester.
