We have already considered many concepts from Elasticsearch and studied various filters and search nuances. In this article, we discuss sorting and relevance of documents.

Let's see how Elasticsearch calculates scores. To begin, let's find out what happens during a search. First, Elasticsearch finds all the matching documents. You can think of this as a Boolean test: each document gets 0 if it does not match the query and 1 if it does. Next, a score is calculated for every document that got a 1, and the documents are sorted by this value.

Now, let's take a look at how documents are scored. Scoring is a complex topic, but in general, the more of the query's terms a document matches, the higher its score.
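As a rough illustration of the two phases just described, the following sketch first applies a yes/no match and then scores and sorts only the matching documents. The scoring rule (a count of matched query terms) and the sample documents are assumptions made for illustration; they are not Elasticsearch's actual formula or data.

```python
# Toy two-phase search: match (yes/no), then score and sort.
# The scoring rule is a stand-in, not Elasticsearch's real formula.

def search(query_terms, documents):
    hits = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        matched = [term for term in query_terms if term in words]
        if not matched:
            # Phase 1: the Boolean test returned 0, so the document is skipped.
            continue
        # Phase 2: only matching documents receive a score.
        hits.append((doc_id, float(len(matched))))
    # Highest score first, like the default _score descending order.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical documents, used only for this illustration.
docs = {
    "a": "the quick brown fox",
    "b": "a quick red car",
    "c": "slow green turtle",
}
results = search(["quick", "brown"], docs)
print(results)
```

Document "c" never receives a score at all, which mirrors the idea that scoring happens only after the match phase.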

But even a full match does not guarantee that the document with the highest score is exactly what you are looking for. For example, a user who enters "Black Diamond" may be looking for information about diamonds, and not about the famous climbing gear and clothing brand.

The point we're trying to make here is that just getting a match to one or more terms in a document field does not equate to relevance. Likewise, the fact that we didn't get a match doesn't mean the document isn't relevant.

Typically, relevance is the numerical output of an algorithm that determines which documents are most textually similar to the query. Elasticsearch builds on standard scoring algorithms, and the function_score query (including its script_score function) lets you adjust or replace the computed score.

The standard algorithms combine several factors, described below, into the weight of a single term in a particular document. By default, Elasticsearch uses the Lucene scoring formula, which represents the relevance of each document as a positive floating-point number known as the _score. The higher the _score, the more relevant the document. A query clause generates a _score for each document, and the calculation of that score depends on the type of query clause.

Don't be afraid. We will consider, in detail, the main factors and some of the nuances that will help you understand scoring and relevance.

Tutorial

For this post, we will be using hosted Elasticsearch on Qbox.io. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Let's proceed with creating and preparing the data we will search. As an example, consider searching a library. To do that, we will create an index and a mapping for the library and then add a few books.

curl -X PUT "http://localhost:9200/lib" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "case_insensitive_sort": {
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}'
curl -X PUT "http://localhost:9200/lib/books/_mapping" -d '{
    "books": {
        "properties": {
            "author": {
                "type": "string",
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "keyword": {
                        "type": "string",
                        "analyzer": "case_insensitive_sort"
                    }
                }
            }
        }
    }
}'
curl -XPUT 'localhost:9200/lib/books/1?pretty' -d '{
"author": "Gromyko",
"title": "True enemies",
"language": "ru",
"year of publishing": 2014,
"genre": "fantastic"
}'
curl -XPUT 'localhost:9200/lib/books/2?pretty' -d '{
"author": "Strugatsky",
"title": "The Final Circle of Paradise",
"language": "en",
"year of publishing": 1965,
"genre": "fantastic"
}'
curl -XPUT 'localhost:9200/lib/books/3?pretty' -d '{
"author": "Marquez",
"title": "One Hundred Years of Solitude",
"language": "sp",
"year of publishing": 1967,
"genre": "magical realist"
}'
curl -XPUT 'localhost:9200/lib/books/4?pretty' -d '{
"author": "Hemingway",
"title": "For Whom the Bell Tolls",
"language": "en",
"year of publishing": 1940,
"genre": "realist"  
}'
curl -XPUT 'localhost:9200/lib/books/5?pretty' -d '{
"author": "Oldi",
"title": "I will Take It Myself",
"language": "ru",
"year of publishing": 1998,
"genre": "fantastic"  
}'

Now, back to theory. Let's start with a simple overview of the default formula from the relevance section of Elasticsearch - The Definitive Guide. It shows which mechanisms are at play in determining relevance:

score(q,d) =
        queryNorm(q)
      * coord(q,d)
      * SUM (
            tf(t in d)
          * idf(t)²
          * t.getBoost()
          * norm(t,d)
        ) (t in q)

  • score(q,d) is the relevance score of document d for the query q.

  • queryNorm(q) is the query normalization factor.

  • coord(q,d) is the coordination factor.

  • The sum of the weights for each term t in the query q for document d.

    • tf(t in d) is the term frequency for term t in document d.

    • idf(t) is the inverse document frequency for term t.

    • t.getBoost() is the boost that has been applied to the query.

    • norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any.

Source: Lucene's Practical Scoring Function
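To see how the pieces of the formula fit together, here is a minimal Python sketch of it. The function and parameter names are our own, and the definitions of queryNorm (one over the square root of the summed squared term weights) and coord (matched terms divided by query terms) follow the descriptions in The Definitive Guide. The sketch uses the exact norm value, whereas Lucene stores norms with reduced precision, so real scores may differ slightly.

```python
import math


def tf(freq):
    # Term frequency: square root of the number of occurrences.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def norm(num_field_terms):
    # Field-length norm: 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(num_field_terms)


def practical_score(query_terms, term_freqs, num_field_terms,
                    doc_freqs, max_docs, boosts=None):
    """Sketch of score(q,d) for one document and one field.

    term_freqs: occurrences of each term in the document's field.
    doc_freqs:  number of documents containing each query term.
    boosts:     optional per-term query boosts (default 1.0).
    """
    boosts = boosts or {}
    # queryNorm(q): normalizes by the summed squared term weights.
    sum_of_squares = sum(
        (idf(doc_freqs[t], max_docs) * boosts.get(t, 1.0)) ** 2
        for t in query_terms
    )
    query_norm = 1.0 / math.sqrt(sum_of_squares)
    # coord(q,d): rewards documents that match more of the query terms.
    matched = [t for t in query_terms if term_freqs.get(t, 0) > 0]
    coord = len(matched) / len(query_terms)
    # Sum of the weights of each matching term.
    total = sum(
        tf(term_freqs[t])
        * idf(doc_freqs[t], max_docs) ** 2
        * boosts.get(t, 1.0)
        * norm(num_field_terms)
        for t in matched
    )
    return query_norm * coord * total


# One query term that occurs once in a four-term field.
single = practical_score(["fantastic"], {"fantastic": 1}, 4,
                         {"fantastic": 1}, max_docs=2)
print(single)
```

Passing a larger boost for one term increases that term's contribution to the sum, while matching only some of the query terms lowers the result through coord.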

Term Frequency

Where does this value come from? A query often consists of several words, or terms, and the more of them a document matches, the more valuable it is for the search.


Some terms carry more "weight" than others, and their occurrence in a document matters more, especially if they appear more than once. A document that mentions the highest-weighted term 5 times may be more important than a document that matches every term in the query but mentions that term only once.

tf(t in d) = √frequency 

The term frequency (tf) for the term t in document d is the square root of the number of times the term appears in the document.
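A short sketch of this formula shows why repeated mentions help, but with diminishing returns. The function name and the sample frequencies are chosen for illustration only.

```python
import math


def tf(freq):
    # Square root of the number of times the term appears in the field.
    return math.sqrt(freq)


# Going from 1 to 4 mentions doubles tf; going from 4 to 9 adds only half as much again.
for freq in (1, 4, 9):
    print(freq, tf(freq))
```

Because of the square root, a document that repeats a term nine times gets three times the tf of a single mention, not nine times.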

If you don't care how often a term appears in a field, you can disable term frequencies by setting index_options to docs in the field mapping.

Inverse Document Frequency

Inverse document frequency (idf): This is one plus the natural logarithm of the number of documents in the index divided by the number of documents that contain the term plus one:

idf = 1 + ln(maxDocs/(docFreq + 1))

This parameter measures how common the term is across the index rather than within a single document. The more documents contain the term, the lower the value of idf, so rare terms carry more weight than common ones.
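To make the idf formula concrete, the sketch below counts how many documents contain a term and applies the formula. It assumes, purely for illustration, that the genre values of the five sample books are the whole corpus and live in a single shard; the helper names are our own.

```python
import math


def idf(doc_freq, max_docs):
    # 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Genre values of the five sample books, treated as one small corpus.
genres = ["fantastic", "fantastic", "magical realist", "realist", "fantastic"]
docs = [set(genre.split()) for genre in genres]
max_docs = len(docs)


def doc_freq(term):
    # Number of documents that contain the term at least once.
    return sum(1 for terms in docs if term in terms)


idf_fantastic = idf(doc_freq("fantastic"), max_docs)  # term in 3 of 5 docs
idf_magical = idf(doc_freq("magical"), max_docs)      # term in 1 of 5 docs
print(idf_fantastic, idf_magical)
```

The term that appears in fewer documents ends up with the larger idf, which is what lets a rare word outweigh a common one.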

Field Length Normalization

Field length normalization (norm): This is the inverse square root of the number of terms in the field:

norm = 1/sqrt(numFieldTerms)

The value of this parameter depends on the length of the field in which the match was found: the shorter the field, the greater the value. This makes sense if you consider that a query term appearing in a short field such as "title" is more significant than the same term appearing in a long "description". Field-length norms matter for full-text search; for other kinds of fields it is better to turn them off.
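The effect of field length can be sketched in a few lines. The description text below is a hypothetical field that does not exist in our sample data, and terms are counted by simple whitespace splitting, which is only an approximation of what an analyzer does. Lucene also stores norms with reduced precision, so values in real output are rounded.

```python
import math


def norm(num_field_terms):
    # 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_field_terms)


title = "For Whom the Bell Tolls"
# Hypothetical longer field, used only for comparison.
description = ("A novel about an American volunteer attached to a "
               "guerrilla unit during the Spanish Civil War")

title_norm = norm(len(title.split()))
description_norm = norm(len(description.split()))
print(title_norm, description_norm)
```

A match in the five-term title is weighted more heavily than the same match in the longer description.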

If you would like to learn more about the elements of the formula, refer to the official Elasticsearch documentation.

Let's try a simple search through our library. We will find all the books in the genre of ‘fantastic’.

curl -XGET 'localhost:9200/lib/_search?q=fantastic'

[Screenshot: results of the search for "fantastic"]

This simple search shows the importance of the score. When we created the books, we gave three of them the same genre, so why did Elasticsearch calculate a different score for each? To learn more, we will add the explain parameter to the request so that we can see the details of the calculation.

curl -XGET 'localhost:9200/lib/_search?explain' -d '{
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "query": "fantastic"
                }
            }
        }
    }
}'

Let's consider all the results.

{"took":106,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.3125,
"hits":[{"_shard":2,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"2","_score":0.3125,"_source":{
"author": "strugatsky",
"title": "The Final Circle of Paradise",
"language": "en",
"year of publishing": 1965,
"genre": "fantastic"
},
"_explanation":{"value":0.3125,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.3125,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":1.0,"description":"idf(docFreq=1, maxDocs=2)",
"details":[]},{"value":0.3125,"description":"fieldNorm(doc=0)","details":[]}]}]}},
{"_shard":3,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"1","_score":0.11506981,"_source":{
"author": "Gromyko",
"title": "true enemies",
"language": "ru",
"year of publishing": 2014,
"genre": "fantastic"
},
"_explanation":{"value":0.11506981,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.11506981,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":0.30685282,"description":"idf(docFreq=1, maxDocs=1)",
"details":[]},{"value":0.375,"description":"fieldNorm(doc=0)", "details":[]}]}]}},
{"_shard":1,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"5","_score":0.095891505,"_source":{
"author": "Oldi",
"title": "I will Take It Myself",
"language": "ru",
"year of publishing": 1998,
"genre": "fantastic"  
},"_explanation":{"value":0.095891505,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.095891505,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":0.30685282,"description":"idf(docFreq=1, maxDocs=1)",
"details":[]},{"value":0.3125,"description":"fieldNorm(doc=0)","details":[]}]}]}}]}}

Let's look at things more carefully. The value tf(freq = 1.0) is the same for all of the results because the word "fantastic" is included in each document only once.
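The nested _explanation trees in the response are easier to compare once they are flattened into indented lines. The helper below is a small sketch of our own (not part of any Elasticsearch client) that walks the value, description, and details keys seen in the output above; the sample input is the explanation of the first hit.

```python
def flatten_explanation(node, depth=0):
    """Return (depth, value, description) rows for an _explanation tree."""
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows


# _explanation of the first hit, copied from the response above.
explanation = {
    "value": 0.3125,
    "description": "weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
    "details": [{
        "value": 0.3125,
        "description": "fieldWeight in 0, product of:",
        "details": [
            {"value": 1.0, "description": "tf(freq=1.0), with freq of:",
             "details": [{"value": 1.0, "description": "termFreq=1.0",
                          "details": []}]},
            {"value": 1.0, "description": "idf(docFreq=1, maxDocs=2)",
             "details": []},
            {"value": 0.3125, "description": "fieldNorm(doc=0)",
             "details": []},
        ],
    }],
}

rows = flatten_explanation(explanation)
for depth, value, description in rows:
    print("  " * depth + f"{value}  {description}")
```

Each level of indentation corresponds to one level of nesting, so the three factors under fieldWeight (tf, idf, and fieldNorm) line up under their parent.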

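Now let's talk about the differences. The idf value is calculated from per-shard statistics, and each of the three hits came from a different shard. The first hit has an idf of 1.0 because its shard reports maxDocs=2, while the second and third hits share the value 0.30685282 because their shards report maxDocs=1. The documents also differ in the number of words they contain and in other fields, such as the language and year of publication.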

Elasticsearch also calculated a field-length normalization factor for each document: 0.375 for the second hit and 0.3125 for the first and third. Because of these distinctions, the scores are different for each document.
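As a quick check, the reported scores can be reproduced from the numbers in the explain output, where each fieldWeight node lists three factors: tf, idf, and fieldNorm. The sketch below recomputes tf and idf from the reported freq, docFreq, and maxDocs, and takes fieldNorm as reported. A small tolerance is used because Lucene works with single-precision floats.

```python
import math


def idf(doc_freq, max_docs):
    # 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# (_id, freq, docFreq, maxDocs, fieldNorm, reported _score) from the explain output.
hits = [
    ("2", 1.0, 1, 2, 0.3125, 0.3125),
    ("1", 1.0, 1, 1, 0.375, 0.11506981),
    ("5", 1.0, 1, 1, 0.3125, 0.095891505),
]

computed = {}
for doc_id, freq, doc_freq, max_docs, field_norm, reported in hits:
    score = math.sqrt(freq) * idf(doc_freq, max_docs) * field_norm
    computed[doc_id] = score
    print(doc_id, score, reported)
```

The first hit wins mainly because of its idf of 1.0, even though its fieldNorm is not the largest of the three.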


Now let's talk a bit about sorting. As we have already learned, the relevance score is represented by the floating-point number returned in the search results as the _score, so the default sort order is _score descending.

Sometimes, though, you don't have a meaningful relevance score. For example, we will create a query that retrieves the book with _id = 2.

curl -XGET 'localhost:9200/lib/_search?pretty=1' -d '{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "_id": 2
                }
            }
        }
    }
}'

[Screenshot: result of the filter query for _id = 2]

There isn’t a meaningful score here. Because we are using a filter, we are indicating that we just want the documents that match _id = 2 with no attempt to determine relevance. Documents will be returned in effectively random order, and each document will have a score of zero.

Since our example is a library, it makes sense to sort all the books alphabetically by the author field. We can do this with the sort parameter:

curl -XGET 'localhost:9200/lib/_search?sort=author.raw'

Results:

[Screenshot: search results sorted by author.raw]

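So the query is quite simple. But if we go back to the top of the article, where we created the index and mapping, you will see that the groundwork was done there. The mapping gives the author field two sub-fields that can be used for sorting. The author.raw sub-field, used in the query above, is not analyzed, so it sorts on the exact stored value and uppercase letters come before lowercase ones. The author.keyword sub-field uses the case_insensitive_sort analyzer, which converts the whole value to lowercase first, so sorting on it ignores case. If you want to go deeper into sorting, see the documentation.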
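Sorting on the exact value and sorting on a lowercased value can give different orders. The sketch below uses plain Python string comparison as a stand-in for how the terms are ordered; for ASCII text the two behave the same way. The lowercase author "asimov" is a hypothetical extra entry, added only because all five sample authors are capitalized and would otherwise sort identically either way.

```python
# The five sample authors plus one hypothetical lowercase entry.
authors = ["Gromyko", "Strugatsky", "Marquez", "Hemingway", "Oldi", "asimov"]

# Exact-value order: uppercase letters sort before lowercase ones.
exact_order = sorted(authors)

# Lowercase first, then sort: the order no longer depends on case.
case_insensitive_order = sorted(authors, key=str.lower)

print(exact_order)
print(case_insensitive_order)
```

In the exact-value order "asimov" lands at the end of the list, while the lowercased comparison moves it to the front where a reader would expect it.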

We hope that reviewing some core concepts and walking through a simple example has helped clarify how the default scoring works in Elasticsearch. We also talked a bit about sorting and its usage. Questions or comments? Drop us a line below.
