We’re well along now in our series on Elasticsearch scripting. In the previous article, we covered the various types of filters that you can perform with scripts. In this article, our focus turns to scoring in Elasticsearch.

We generally define scoring as giving a higher weight to documents (or data) that meet specific criteria. The objective is often to get a list of documents sorted by relevance to the search. Typically, relevance is the numerical output of an algorithm that determines which documents are most textually similar to the query. Elasticsearch employs and enhances standard scoring algorithms and exposes them through script_score and function_score.

This article introduces these highly valuable features and provides some examples that you can take with you and apply in your development efforts.

When Elasticsearch returns one or more documents after a search operation, it presents documents that match the query. Each document is given a score, and higher-scoring documents are those that have the highest relevance to the query. A document score is query-dependent: Elasticsearch is likely to assign a different score to a particular document when returning that document to queries that differ in structure.

A bit of technical background: there are three factors that Elasticsearch calculates and stores at index time: term frequency, inverse document frequency, and field-length norm.

Together, these combine into a calculation of the weight of a single term in a particular document. By default, Elasticsearch makes use of the Lucene scoring formula, which represents the relevance score of each document with a positive floating-point number known as the _score. The higher the _score, the higher the relevance of the document. A query clause generates a _score for each document, and the calculation for that score depends on the type of query clause.
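To make the three factors concrete, here is a minimal Python sketch of how they might combine into a single term weight. It assumes the classic Lucene TF/IDF definitions (square root of the raw frequency, a logarithmic inverse document frequency, and an inverse square root of the field length). The function names are illustrative only, and the full Lucene formula also applies query normalization, coordination, and boosts that are left out here.

```python
import math


def term_frequency(freq: int) -> float:
    """Classic TF: square root of how often the term occurs in the field."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """Classic IDF: rarer terms (low doc_freq) receive a higher value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """Shorter fields receive a higher norm than longer ones."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Simplified weight of one term in one document (an assumption of this sketch)."""
    return (
        term_frequency(freq)
        * inverse_document_frequency(num_docs, doc_freq)
        * field_length_norm(num_terms)
    )


# A term occurring 4 times in a 16-term field, present in 2 of 3 documents.
weight = term_weight(freq=4, num_docs=3, doc_freq=2, num_terms=16)
print(weight)
```

Raising the frequency or shortening the field increases the weight, while a term that appears in most documents contributes less.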

Modeling the Data

Scoring is often done according to multiple parameters—including numbers, strings, and word counts for a specific field. We want to provide an adequate tutorial on scoring, so in this article we switch to a document set that is more elaborate than those that we use elsewhere in this series.

Let’s look at an index of famous authors—an index of documents containing author birthdates, place of birth, a popular quote, most famous works, average pricing for their works, and other data elements. Below we present three such documents for our index, which we name famousauthors.

Document 1

curl -XPOST 'http://localhost:9200/famousauthors/authors/1' -d '{
  "author": "William Shakespeare",
  "born": "1564-04-23T00:00:00.000Z",
  "died": "1616-04-23T00:00:00.000Z",
  "country": "UK",
  "works": {
    "popularQuote": "All the world\u0027s a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts, his acts being seven ages.",
    "popularWork": "Hamlet",
    "averagePrice": 10,
    "numberOfWorks": 43
  }
}'

 

Document 2

curl -XPOST 'http://localhost:9200/famousauthors/authors/2' -d '{
  "author": "Leo Tolstoy",
  "born": "1828-09-09T00:00:00.000Z",
  "died": "1910-11-20T00:00:00.000Z",
  "country": "Russia",
  "works": {
    "popularQuote": "The chief difference between words and deeds is that words are always intended for men for their approbation, but deeds can be done only for God.",
    "popularWork": "War and Peace",
    "averagePrice": 9,
    "numberOfWorks": 53
  }
}'

 

Document 3

curl -XPOST 'http://localhost:9200/famousauthors/authors/3' -d '{
  "author": "Charles Dickens",
  "born": "1812-02-07T00:00:00.000Z",
  "died": "1870-06-09T00:00:00.000Z",
  "country": "UK",
  "works": {
    "popularQuote": "I have always thought of Christmas time, when it has come round apart from the veneration due to its sacred name and origin, if anything belonging to it can be apart from that - as a good time; a kind, forgiving, charitable, pleasant time.",
    "popularWork": "A Christmas Carol",
    "averagePrice": 10,
    "numberOfWorks": 85
  }
}'
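Note that popularQuote, popularWork, averagePrice, and numberOfWorks all sit inside the works object, so queries and scripts address them by their full dotted path, such as works.averagePrice. The following Python sketch illustrates that path resolution on a plain dictionary. The helper get_by_path is hypothetical and is not part of any Elasticsearch API.

```python
# One of the sample documents, trimmed to the fields used below.
dickens = {
    "author": "Charles Dickens",
    "country": "UK",
    "works": {
        "averagePrice": 10,
        "numberOfWorks": 85,
    },
}


def get_by_path(document: dict, path: str):
    """Follow a dotted path such as 'works.averagePrice' through nested dicts."""
    value = document
    for key in path.split("."):
        value = value[key]  # raises KeyError if any part of the path is missing
    return value


print(get_by_path(dickens, "works.averagePrice"))
print(get_by_path(dickens, "country"))
```

Looking up the bare name averagePrice at the top level fails in this sketch, which mirrors why the queries in this article use the full path.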

 

Boost the Score of Matching Documents

Arguably, the primary advantage of scoring is that we can give higher weight to the scores of the documents that are more relevant to our specific search criteria. This gives us more control and flexibility in displaying the results we seek.

Suppose that we need to score this document set according to the authors having a value of 10 dollars as the average price for their books. We want documents that are most relevant to this criterion to surface in the search before others that have little or no relevance. How can we achieve this? Have a look at the following script:

curl -XGET 'http://localhost:9200/famousauthors/authors/_search?pretty=true&size=3' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "works.averagePrice": 10
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score * boostBy",
            "params": {
              "boostBy": 2
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}'

 

This script finds documents that have an averagePrice value of 10, and it then replaces the old score with the product of boostBy and the old score.

After running this script, we see that all the matching documents (those whose averagePrice value equals 10) get a score of 2, assuming the old score was 1.

Look more closely at two parameters in the query: score_mode and boost_mode. Each function in the functions array produces a score for every document that the query matches, and score_mode specifies how those function scores are combined into a single value (here, sum). boost_mode then specifies how that combined function score is merged with the original query score; with replace, the function score takes the place of the query score in the final result. Read the Elasticsearch docs to learn more about the parameters for score_mode and boost_mode.
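The interplay of the two parameters can be sketched in plain Python. This is a simplified model, not Elasticsearch code: the function names are made up for illustration, and the avg mode here is an unweighted mean, whereas Elasticsearch takes function weights into account.

```python
from functools import reduce
import operator

# How a list of scores is reduced to one number for each mode name.
REDUCERS = {
    "multiply": lambda values: reduce(operator.mul, values, 1.0),
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "first": lambda values: values[0],  # valid for score_mode only
}


def function_score(query_score, function_scores,
                   score_mode="multiply", boost_mode="multiply"):
    """Model of function_score: combine function scores, then merge with the query score."""
    combined = REDUCERS[score_mode](function_scores)
    if boost_mode == "replace":
        return combined  # the query score is dropped from the final result
    if boost_mode == "first":
        raise ValueError("'first' is not a boost_mode")
    return REDUCERS[boost_mode]([query_score, combined])


# The query above: one script function returning _score * boostBy = 1.0 * 2.
old_score = 1.0
script_result = old_score * 2
print(function_score(old_score, [script_result], "sum", "replace"))
```

With a single function, score_mode has no visible effect; it only matters once the functions array contains several entries.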

Boosting the Score According to Term Frequency (Recurrence of a Term)

Term frequency, or term recurrence, is another common requirement for scoring documents or data. In such cases, we want a count of the number of occurrences of a specific word or sequence of characters.

Suppose, for example, that our team lead instructs us to score our documents according to the number of occurrences of the word “time” in the popularQuote field. The task is to find the term frequency of the word “time” in the popularQuote field. We would then get a new score by multiplying the numeric value of that frequency by the old score, and then that new score would replace our old score. We can achieve this with the following script.

curl -XGET 'http://localhost:9200/famousauthors/authors/_search?pretty=true&size=3' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "works.popularQuote": "time"
        }
      },
      "boost_mode": "replace",
      "functions": [
        {
          "script_score": {
            "script": "_score * _index[\"works.popularQuote\"][\"time\"].tf()"
          }
        }
      ]
    }
  }
}'

The results would be a new score for each of the matching documents, as well as a sort according to the term repetition frequency of the word “time” found in each document.
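As a rough check of what to expect, the term frequency can be approximated outside Elasticsearch. The Python sketch below lowercases each quote and splits it into word tokens, which only approximates the analyzer Elasticsearch applies at index time; the quotes are written with normal word spacing, and the helper name is illustrative.

```python
import re

QUOTES = {
    "William Shakespeare": (
        "All the world's a stage, and all the men and women merely players: "
        "they have their exits and their entrances; and one man in his time "
        "plays many parts, his acts being seven ages."
    ),
    "Leo Tolstoy": (
        "The chief difference between words and deeds is that words are always "
        "intended for men for their approbation, but deeds can be done only for God."
    ),
    "Charles Dickens": (
        "I have always thought of Christmas time, when it has come round apart "
        "from the veneration due to its sacred name and origin, if anything "
        "belonging to it can be apart from that - as a good time; a kind, "
        "forgiving, charitable, pleasant time."
    ),
}


def term_frequency(text: str, term: str) -> int:
    """Count whole-word occurrences of term, ignoring case and punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower())


for author, quote in QUOTES.items():
    print(author, term_frequency(quote, "time"))
```

Under these assumptions, the Dickens quote contains “time” three times, the Shakespeare quote once, and the Tolstoy quote not at all, so the Tolstoy document would not match the query in the first place.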

Compute the Score According to Field Relationships

Suppose that our manager adds another requirement to get alternative scoring according to a relationship between the number of works the author has published (the numberOfWorks field) and the average price of the author’s books (the averagePrice field). We could do this by dividing the value in the numberOfWorks field by the value in the averagePrice field. Our script might look something like this:

curl -XGET 'http://localhost:9200/famousauthors/authors/_search?pretty=true&size=3' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": "_score * doc[\"works.numberOfWorks\"].value / doc[\"works.averagePrice\"].value"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}'
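The arithmetic behind this query is easy to reproduce by hand. The Python sketch below applies the same ratio to the three sample documents, assuming a query score of 1 for every document, since no query clause is given and all documents therefore match equally. The function name ratio_score is illustrative.

```python
AUTHORS = [
    {"author": "William Shakespeare", "works": {"averagePrice": 10, "numberOfWorks": 43}},
    {"author": "Leo Tolstoy", "works": {"averagePrice": 9, "numberOfWorks": 53}},
    {"author": "Charles Dickens", "works": {"averagePrice": 10, "numberOfWorks": 85}},
]


def ratio_score(doc: dict, query_score: float = 1.0) -> float:
    """Query score times number of works divided by average price."""
    works = doc["works"]
    return query_score * works["numberOfWorks"] / works["averagePrice"]


# Highest ratio first, as Elasticsearch would sort by descending score.
ranked = sorted(AUTHORS, key=ratio_score, reverse=True)
for doc in ranked:
    print(doc["author"], round(ratio_score(doc), 2))
```

With the sample data, Charles Dickens ranks first (85 / 10), followed by Leo Tolstoy (53 / 9) and William Shakespeare (43 / 10).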

We’ve come to the end of this article in which we’ve seen how to boost the score of matching documents, boost the score according to term frequency, and compute the score according to field relationships.

To learn more about scoring in Elasticsearch, read our advanced article on Optimizing Search Results in Elasticsearch with Scoring and Boosting. A more detailed discussion about scoring and the parameters on which the score of a document depends can be found in the Elasticsearch documentation.

Please let us know how this tutorial series has been helpful to you. We welcome comments and any questions using the links below, and also invite you to read more here: