When working with thousands of documents, a common question is how to find documents that are similar to a given document or set of documents. There are often use cases where you would like to show documents similar to the one the user is viewing or is interested in. Elasticsearch has a query feature called “More Like This Query”, also known as the MLT Query, that tackles these cases.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Let’s populate some data and explore More Like This Query.

curl -XPOST "http://localhost:9200/_bulk" -d '
{ "index":  { "_index": "library", "_type": "book","_id":1 }}
{"title":"Magic Of Thinking Big", "description":"Millions of people throughout the world have improved their lives using The Magic of Thinking Big. Dr. David J. Schwartz, long regarded as one of the foremost experts on motivation, will help you sell better, manage better, earn more money, and—most important of all—find greater happiness and peace of mind." }
{ "index":  { "_index": "library", "_type": "book","_id":2 }}
{"title":"The Power of Positive Thinking", "description":"The book describes the power positive thinking has and how a firm belief in something, does actually help in achieving it" }
{ "index":  { "_index": "library", "_type": "book","_id":3 }}
{"title":"Think and Grow Rich", "description":"Think And Grow Rich has earned itself the reputation of being considered a textbook for actionable techniques that can help one get better at doing anything, not just by rich and wealthy, but also by people doing wonderful work in their respective fields. " }
{ "index":  { "_index": "library", "_type": "book","_id":4 }}
{"title":"The Magic of thinking Big", "description":"First published in 1959, David J Schwartz\u0027s classic teachings are as powerful today as they were then. Practical, empowering and hugely engaging, this book will not only inspire you, it will give you the tools to change your life for the better - starting from now." }
{ "index":  { "_index": "library", "_type": "book","_id":5 }}
{"title":"How to Stop Worrying and Start Living", "description":"The book is written to help readers by changing their habit of worrying. The author Dale Carnegie has shared his personal experiences, wherein he was mostly unsatisfied and worried about lot of life situations." }
{ "index":  { "_index": "library", "_type": "book","_id":6 }}
{"title":"Practicing The Power Of Now", "description":"To make the journey into The Power of Now we will need to leave our analytical mind and its false created self, the ego, behind." }
'
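
The bulk request above is newline-delimited JSON: one action line followed by one source line per document, each on a single physical line, with a newline after the last one. As a rough sketch of how such a body could be generated instead of typed by hand, the Python below builds it from plain dictionaries. The helper name build_bulk_body is hypothetical (it is not part of Elasticsearch or any client library), and the descriptions are short excerpts of the ones above.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Return an NDJSON bulk body: an action line plus a source line per document."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # json.dumps escapes line breaks inside strings, so each document
        # stays on one physical line as the bulk format requires.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Excerpts of two of the books above (descriptions shortened for illustration).
books = {
    3: {"title": "Think and Grow Rich",
        "description": "Think And Grow Rich has earned itself the reputation of being considered a textbook"},
    4: {"title": "The Magic of thinking Big",
        "description": "Practical,\nempowering and hugely engaging"},
}

body = build_bulk_body("library", "book", books)
print(body)
```

The resulting string can be saved to a file and posted to the _bulk endpoint with whatever HTTP client you prefer.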

The simplest use case of the MLT query is to find documents that are similar to a given text. In the query below, we ask for all books whose description attribute contains text similar to "Think Big" or "Positive Thinking".

curl -XGET "http://localhost:9200/_search" -d '
{
   "query": {
       "more_like_this" : {
           "fields" : ["description"],
           "like" : ["Think Big","Positive Thinking"],
           "min_term_freq" : 1,
           "min_doc_freq":1
       }
   }
}'

Response:

{
 "took": 23,
 "timed_out": false,
 "_shards": {
   "total": 5,
   "successful": 5,
   "failed": 0
 },
 "hits": {
   "total": 3,
   "max_score": 2.0129368,
   "hits": [
     {
       "_index": "library",
       "_type": "book",
       "_id": "2",
       "_score": 2.0129368,
       "_source": {
         "title": "The Power of Positive Thinking",
         "description": "The book describes the power positive thinking has and how a firm belief in something, does actually help in achieving it"
       }
     },
     {
       "_index": "library",
       "_type": "book",
       "_id": "1",
       "_score": 0.5257321,
       "_source": {
         "title": "Magic Of Thinking Big",
         "description": "Millions of people throughout the world have improved their lives using The Magic of Thinking Big. Dr. David J. Schwartz, long regarded as one of the foremost experts on motivation, will help you sell better, manage better, earn more money, and—most important of all—find greater happiness and peace of mind."
       }
     },
     {
       "_index": "library",
       "_type": "book",
       "_id": "3",
       "_score": 0.23977731,
       "_source": {
         "title": "Think and Grow Rich",
         "description": "Think And Grow Rich has earned itself the reputation of being considered a textbook for actionable techniques that can help one get better at doing anything, not just by rich and wealthy, but also by people doing wonderful work in their respective fields. "
       }
     }
   ]
 }
}

One key requirement of the MLT query is that it works only on fields of type “text” in ES 5.1 (“string” in older versions of ES), and either the field must be “stored” or the document must have “_source” enabled. The MLT query works by extracting the text from the input document and analyzing it, usually with the same analyzer as the field. It then selects the top K terms with the highest term frequency–inverse document frequency (tf-idf) and forms a disjunctive query from these terms. To speed up this analysis at search time, you can enable term vectors for the field.
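
To make the term selection step more concrete, here is a minimal Python sketch of the idea: tokenize the input, score each term with a textbook tf-idf, keep the top K, and combine them into a disjunctive (bool/should) query. This is an illustration only, not Lucene's actual implementation. The tokenize function is a crude stand-in for a real analyzer, the idf formula is one common variant, and the names top_terms and disjunctive_query as well as the tiny corpus are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters; a crude stand-in for a real analyzer."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(input_text, corpus, k=3):
    """Rank the input's terms by tf-idf against the corpus and keep the best k."""
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(input_text))
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue  # a term absent from the index cannot match anything
        idf = 1.0 + math.log(len(docs) / (df + 1.0))  # assumed idf variant
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically to keep output stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def disjunctive_query(field, terms):
    """OR the selected terms together as a bool/should query."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}

corpus = [
    "the power of positive thinking",
    "think and grow rich",
    "the magic of thinking big",
    "how to stop worrying and start living",
]
terms = top_terms("positive thinking and the power of belief", corpus, k=3)
query = disjunctive_query("description", terms)
print(terms)
print(query)
```

Rare terms from the input win over common ones, and terms that do not occur in the index at all are dropped, which is why the selected terms tend to be the distinctive words of the input document.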

Besides free text, the MLT query also accepts a list of documents as the input search criteria in the “like” clause. With the “unlike” clause, you can tell it not to select terms found in a chosen set of documents. By default the query is executed on the “_all” field; with the “fields” clause you can specify the list of fields used to identify the terms. All parameters of the MLT query are optional except “like”.

Below is another example showcasing MLT query:

curl -XGET "http://localhost:9200/_search" -d '
{
   "query": {
       "more_like_this" : {
           "like" : [
               {
                   "_index" : "library",
                   "_type" : "book",
                   "_id" : "3"
               }
           ],
           "unlike" : "chicken soul for",
           "min_term_freq" : 2,
           "min_doc_freq" : 2
       }
   }
}'
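
If you build such requests from application code, the same body can be assembled programmatically. The Python sketch below is one possible way to do it. The helper mlt_query is hypothetical and only produces the JSON body; sending it to the _search endpoint needs an HTTP client and a running cluster, which are not shown. It also mixes a document reference and free text in “like” and restricts the query to two fields, purely as an illustration.

```python
import json

def mlt_query(like, unlike=None, fields=None, **term_params):
    """Assemble a more_like_this request body; only `like` is required."""
    mlt = {"like": like}
    if unlike is not None:
        mlt["unlike"] = unlike
    if fields is not None:
        mlt["fields"] = fields
    # Term selection parameters such as min_term_freq are passed through as-is.
    mlt.update(term_params)
    return {"query": {"more_like_this": mlt}}

body = mlt_query(
    like=[{"_index": "library", "_type": "book", "_id": "3"}, "Positive Thinking"],
    unlike="chicken soul for",
    fields=["title", "description"],
    min_term_freq=1,
    min_doc_freq=1,
)
print(json.dumps(body, indent=2))
```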

Term parameters influence the term selection in the input document for the query. 

  • “min_term_freq” specifies the minimum term frequency below which the terms will be ignored from the input document. The default value is 2.  
  • “min_doc_freq” specifies the minimum document frequency below which the terms will be ignored from the input document. The default value is 5. 
  • “max_query_terms” specifies the maximum number of query terms that will be selected. Increasing this value gives greater accuracy at the expense of query execution speed. The default value is 25.
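
The effect of these three parameters can be sketched in a few lines of Python. The function below applies the two thresholds and the cap to a tokenized input. It is a simplification under stated assumptions: the name select_terms is hypothetical, the document frequencies are made-up numbers for a tiny index, and the surviving terms are ranked by term frequency only rather than by tf-idf.

```python
from collections import Counter

def select_terms(input_tokens, doc_freq, min_term_freq=2, min_doc_freq=5,
                 max_query_terms=25):
    """Apply the MLT term thresholds, then cap the number of terms kept."""
    tf = Counter(input_tokens)
    kept = [
        term for term, freq in tf.items()
        if freq >= min_term_freq and doc_freq.get(term, 0) >= min_doc_freq
    ]
    # Simplification: rank by term frequency only, ties broken alphabetically.
    kept.sort(key=lambda t: (-tf[t], t))
    return kept[:max_query_terms]

tokens = "think and grow rich think big".split()
# Hypothetical document frequencies in a six-document index.
doc_freq = {"think": 2, "and": 3, "grow": 1, "rich": 1, "big": 2}

with_defaults = select_terms(tokens, doc_freq)
relaxed = select_terms(tokens, doc_freq, min_term_freq=1, min_doc_freq=1)
capped = select_terms(tokens, doc_freq, min_term_freq=1, min_doc_freq=1,
                      max_query_terms=2)
print(with_defaults, relaxed, capped)
```

With the default thresholds nothing survives in such a small index, because no term appears in at least five documents. Lowering both thresholds to 1 lets every term through.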

Conclusion

If you are experimenting with the MLT query on an index with very few documents and wondering why you are not getting results, make sure you override the parameters discussed above. Having explored one more cool feature of Elasticsearch, it’s time to get your hands dirty.
