In previous tutorials, we have considered many concepts from Elasticsearch and studied various filters, aggregations, and search parameters. In this article, we discuss another quite important topic: scoring and relevance of documents.

In general, scoring in Elasticsearch is a process to determine the relevance of retrieved documents based on user queries, term frequencies, and other important parameters. Scoring is performed using nuanced mathematical formulae that assign different weights to terms of the user query. To make our discussion more concrete, let’s see how Elasticsearch scoring works in practice. 

To begin, let’s find out what happens during a search. First, Elasticsearch finds all the documents that match the user query. It does this using the Lucene Boolean model, which returns 0 if a document does not match the query and 1 if it does. For example, a “relevance AND scoring” query will return only those documents that contain both “relevance” and “scoring.” Simple Boolean operators (AND, OR, NOT) are used at this stage of the analysis.
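To make the match/no-match idea concrete, here is a minimal Python sketch of AND-style Boolean matching. This is only an illustration of the concept, not Lucene’s actual implementation:

```python
# Hypothetical sketch of the Boolean matching stage: a document either
# matches the query (1) or it does not (0); no ranking happens yet.
def matches_all(doc_terms, query_terms):
    """Return 1 if every query term appears in the document, else 0 (AND semantics)."""
    return 1 if set(query_terms) <= set(doc_terms) else 0

doc = ["an", "article", "about", "relevance", "and", "scoring"]
print(matches_all(doc, ["relevance", "scoring"]))   # 1: both terms present
print(matches_all(doc, ["relevance", "boosting"]))  # 0: "boosting" is missing
```

Only the documents that pass this yes/no test move on to the scoring stage.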

Next, for all documents with the positive response, Elasticsearch applies the Lucene Practical Scoring Function that scores the documents and sorts them in accordance with their relevance. 

Now let’s take a look at how this scoring function works. 

In general, the scoring function is based on the Term Frequency/Inverse Document Frequency (TF/IDF) model that is popular in information retrieval, natural language processing, and other algorithmic approaches to language analysis. In essence, this model says that (a) documents in which a query term appears more frequently, and (b) documents containing terms that are rare across the rest of the index, are more relevant.

Under the hood, the Lucene scoring formula based on this model represents the relevance score of each document with a positive floating-point number named _score. The higher the _score, the more relevant the document. A query clause generates a _score for each document, and the calculation of that score depends on the type of query clause.

Elasticsearch enriches the discussed model with modern features such as coordination factor, term and query clause boosting, field normalization, and other parameters that may sound esoteric but will be discussed later in this article.

At this point, the reader may ask a simple question: why worry about scoring and not just return all matching documents?

The problem with this naive approach is that an exact match is not always relevant. For example, a user who enters “Black Diamond” may be looking for information about diamonds, not about the famous climbing and clothing gear brand. The point we’re trying to make here is that matching one or more terms in a document field does not equate to relevance. Likewise, the absence of a match does not mean the document is irrelevant.

The Elasticsearch scoring function helps avoid retrieving irrelevant documents by using algorithms and formulae that are quite easy to understand once we look at some practical examples.

In what follows, we will consider in detail the main factors and some of the nuances that will help you understand scoring and relevance in Elasticsearch.

Tutorial

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

Let us proceed with the creation and preparation of the data that will be used for scoring. As an example, consider the library search across a collection of books. Let’s create an index and the mapping for our library:

curl -X PUT "http://localhost:9200/lib" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

curl -X PUT "http://localhost:9200/lib/books/_mapping" -d '{
  "books": {
    "properties": {
      "author": {"type": "string"},
      "title": {"type": "string"},
      "language": {"type": "string"},
      "year of publishing": {"type": "integer"},
      "genre": {"type": "string"}
    }
  }
}'

Now, let’s add some books to our “library” index:

curl -XPUT 'localhost:9200/lib/books/1?pretty' -d '{
"author": "Gromyko",
"title": "True enemies",
"language": "ru",
"year of publishing": 2014,
"genre": "fantastic"
}'
curl -XPUT 'localhost:9200/lib/books/2?pretty' -d '{
"author": "Strugatsky",
"title": "The Final Circle of Paradise",
"language": "en",
"year of publishing": 1965,
"genre": "fantastic"
}'
curl -XPUT 'localhost:9200/lib/books/3?pretty' -d '{
"author": "Marquez",
"title": "One Hundred Years of Solitude",
"language": "sp",
"year of publishing": 1967,
"genre": "magical realist"  
}'
curl -XPUT 'localhost:9200/lib/books/4?pretty' -d '{
"author": "Hemingway",
"title": "For Whom the Bell Tolls",
"language": "en",
"year of publishing": 1940,
"genre": "realist"  
}'
curl -XPUT 'localhost:9200/lib/books/5?pretty' -d '{
"author": "Oldi",
"title": "I will Take It Myself",
"language": "ru",
"year of publishing": 1998,
"genre": "fantastic"  
}'

Great! Now that we have some documents in our index, let’s delve into the formula for scoring them. The Elasticsearch scoring formula appears in the section on relevance from the Elasticsearch – The Definitive Guide:

score(q,d) = queryNorm(q) · coord(q,d) · ∑ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]

*Source: Lucene’s Practical Scoring Function

The terms of this function mean the following:

  • score(q,d): the relevance score of document d for query q
  • queryNorm(q): the query normalization factor
  • coord(q,d): the query coordination factor
  • tf(t in d): the term frequency of term t in document d
  • idf(t): the inverse document frequency of term t
  • t.getBoost(): the boost applied to the query clause for term t
  • norm(t,d): the field-length norm

Now, let’s discuss the components for this formula from the conceptual standpoint. As we’ve noted, the Elasticsearch scoring model rests on the TF/IDF concept, so let’s start with that.

Term Frequency

Term Frequency (TF) is quite a simple concept to start with. It is simply the number of times a given term appears in a document. The assumption is that a document containing five occurrences of a query term is more relevant than a document containing just one.

The TF is calculated using the following formula:

tf(t in d) = √frequency

In plain English, the term frequency (tf) for the term t in document d is the square root of the number of times the term appears in the document.
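This formula is easy to check in a few lines of Python (a sketch of the formula itself, not of Lucene internals):

```python
import math

def tf(term, doc_terms):
    """Term frequency: square root of how often the term occurs in the document."""
    return math.sqrt(doc_terms.count(term))

doc = ["scoring", "in", "elasticsearch", "scoring", "is", "scoring"]
print(tf("scoring", doc))  # sqrt(3) ≈ 1.732
```

The square root dampens the effect of repetition: the third occurrence of a term adds less to the score than the first.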

Inverse Document Frequency

Inverse document frequency (IDF) assigns low weight/relevance to terms that appear frequently in all of the documents in the index. For example, the terms “and” and “in” have low relevance because normally they are not unique to any document of the index.

IDF is calculated using the following formula:

idf = 1 + ln(numDocs/(docFreq + 1))

where ln is the natural logarithm, numDocs is the total number of documents in the index, and docFreq is the number of documents that contain the term.
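A quick Python sketch of this formula shows why rare terms dominate (illustrative document counts, not taken from our index):

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: terms that are rare across the index weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# A term found in 1 of 100 documents scores far higher than one found in 99:
print(idf(100, 1))   # ≈ 4.912
print(idf(100, 99))  # 1.0
```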

Field Length Normalization

Field length normalization (norm) is the inverse square root of the number of terms in the field:

norm = 1/sqrt(numFieldTerms)

The value of this parameter depends on the length of the field in which the match was found: the shorter the field, the higher the value. Intuitively, a query term matched in a short “title” field says more about the document than the same match buried in a long “description” field. This parameter is important for full-text search, but it is often better to disable it in other cases. For example, when searching logs, you are interested in the occurrence of a particular error code, not in the length of the log line.
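The title-versus-description intuition is easy to see numerically (a sketch of the formula with made-up field lengths):

```python
import math

def field_norm(num_field_terms):
    """Field-length norm: matches in shorter fields weigh more."""
    return 1 / math.sqrt(num_field_terms)

print(field_norm(4))    # 0.5  -- a short "title" field of 4 terms
print(field_norm(100))  # 0.1  -- a long "description" field of 100 terms
```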

Query Normalization Factor

The query normalization factor (queryNorm) aims to make the results of different queries comparable. It is calculated at the beginning of each query using the following formula:

queryNorm = 1 / √sumOfSquaredWeights

where the sumOfSquaredWeights is computed by adding together the IDF of each term in the query, squared.
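In Python, with two hypothetical term IDF weights, the computation looks like this:

```python
import math

def query_norm(idf_weights):
    """Query normalization factor: 1 / sqrt of the sum of squared term IDFs."""
    return 1 / math.sqrt(sum(w ** 2 for w in idf_weights))

# Two query terms with illustrative IDF weights of 4.0 and 3.0:
print(query_norm([4.0, 3.0]))  # 1 / sqrt(16 + 9) = 0.2
```

Because every term score in a query is multiplied by the same queryNorm, it does not change the ranking within one query; it only scales scores so different queries are roughly comparable.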

Query Coordination Factor

In the case of a multi-term query, the coordination factor rewards documents that contain more of the query’s terms. The more query terms appear in a document, the more relevant it is likely to be.

For simplicity, let’s say you have a query with three terms: “nice,” “red,” and “carpet,” each contributing a score of 1.5.

The coordination factor uses the following formula:

(sum of matching term scores) * number of matching terms / total number of terms in the query

So, for example, a document that matches “nice red” will score (1.5 + 1.5) * 2 / 3 = 2.0, while a document containing all three terms will score (1.5 + 1.5 + 1.5) * 3 / 3 = 4.5, making it considerably more relevant.
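The same arithmetic, as a small Python sketch of the coordination factor (term scores are the illustrative 1.5 values from above):

```python
def coord_score(matching_term_scores, total_query_terms):
    """Sum the matching term scores, then scale by the fraction of query terms matched."""
    coord = len(matching_term_scores) / total_query_terms
    return sum(matching_term_scores) * coord

# "nice red carpet" query, each term worth 1.5; document matches "nice red":
print(coord_score([1.5, 1.5], 3))       # 3.0 * 2/3 = 2.0
# A document matching all three terms:
print(coord_score([1.5, 1.5, 1.5], 3))  # 4.5 * 3/3 = 4.5
```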

If you are interested in learning more about the elements of the scoring formula, refer to the official Elasticsearch website.

To illustrate how the discussed formula works, let’s try a simple search through our library. We will find all the books in the genre of “fantastic.”

curl -XGET 'localhost:9200/lib/_search?q=fantastic'

This simple search shows us the importance of scoring. When indexing, we assigned the same genre to three books, so why did Elasticsearch calculate a different score for each of them? To find out, we add the explain parameter to the usual request. Now we shall see the details of the calculation.

curl -XGET 'localhost:9200/lib/_search?explain' -d '{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "fantastic"
        }
      }
    }
  }
}'

Let’s consider all the results.

{"took":106,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.3125, 
"hits":[{"_shard":2,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"2","_score":0.3125,"_source":{
"author": "strugatsky",
"title": "The Final Circle of Paradise",
"language": "en",
"year of publishing": 1965,
"genre": "fantastic"
},
"_explanation":{"value":0.3125,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.3125,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":1.0,"description":"idf(docFreq=1, maxDocs=2)",
"details":[]},{"value":0.3125,"description":"fieldNorm(doc=0)","details":[]}]}]}},
{"_shard":3,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"1","_score":0.11506981,"_source":{
"author": "Gromyko",
"title": "true enemies",
"language": "ru",
"year of publishing": 2014,
"genre": "fantastic"
},
"_explanation":{"value":0.11506981,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.11506981,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":0.30685282,"description":"idf(docFreq=1, maxDocs=1)",
"details":[]},{"value":0.375,"description":"fieldNorm(doc=0)", "details":[]}]}]}},
{"_shard":1,"_node":"5KnmYpn2RyeAH3uWWRQM9A","_index":"lib","_type":"books","_id":"5","_score":0.095891505,"_source":{
"author": "Oldi",
"title": "I will Take It Myself",
"language": "ru",
"year of publishing": 1998,
"genre": "fantastic"  
},"_explanation":{"value":0.095891505,"description":"weight(_all:fantastic in 0) [PerFieldSimilarity], result of:",
"details":[{"value":0.095891505,"description":"fieldWeight in 0, product of:",
"details":[{"value":1.0,"description":"tf(freq=1.0), with freq of:",
"details":[{"value":1.0,"description":"termFreq=1.0",
"details":[]}]},{"value":0.30685282,"description":"idf(docFreq=1, maxDocs=1)",
"details":[]},{"value":0.3125,"description":"fieldNorm(doc=0)","details":[]}]}]}}]}}

Now let’s look at things more carefully. The value tf(freq=1.0) is the same for all of the results because the word “fantastic” appears exactly once in each document.

Next, let’s talk about the differences. The documents with ids 1 and 5 have the same IDF value, while the top result was scored on a shard with different term statistics (note the docFreq and maxDocs values in the explanation), so its IDF is higher. The documents also contain a different number of words, so Elasticsearch calculated a different field-length normalization factor for each of them. Because of these distinctions, the scores differ from document to document.
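We can verify this arithmetic ourselves: each _score above is simply the product of the tf, idf, and fieldNorm values reported in its _explanation. A quick Python check using those reported values:

```python
# Document with id 2: tf = 1.0, idf = 1.0, fieldNorm = 0.3125
tf, idf, field_norm = 1.0, 1.0, 0.3125
print(tf * idf * field_norm)  # 0.3125, matching the reported _score

# Document with id 1: tf = 1.0, idf = 0.30685282, fieldNorm = 0.375
print(round(1.0 * 0.30685282 * 0.375, 8))  # 0.11506981
```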

Now let’s talk a bit about sorting. As we have already learned, the relevance score is represented by the floating-point number returned in the search results as the _score, so the default sort order is _score descending.

Sometimes, though, you don’t have a meaningful relevance score. For example, consider a query that retrieves all information about the book with id = 2:

curl -XGET 'localhost:9200/lib/_search?pretty=1' -d '{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "_id": 2
        }
      }
    }
  }
}'

There isn’t a meaningful score in the results. Because we are using a filter, we are indicating that we just want the documents that match _id = 2, with no attempt to determine relevance. Documents will be returned in an effectively random order, and each document will have a score of zero.

In many other scenarios, user queries can benefit from scoring settings. To learn more about configuring query boosting and other interesting scoring parameters, you can refer to the official documentation. 

We hope that reviewing some core concepts and walking through a simple example in this article has helped clarify how default scoring works in Elasticsearch. Even though the Elasticsearch scoring formula is based on concepts from linear algebra and statistics, it is quite easy to understand with real-world examples.


Give It a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon or Microsoft Azure data centers. And you can now provision a replicated cluster.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.