We’ve recently posted articles on topics ranging from simple scripting to advanced scripting methods to eliminating duplicates in your indexes.

In this tutorial, we show you how to combine the script_fields and geo_point features to build examples that you can adapt for modeling in a wide variety of real-world applications. We also give a brief introduction to the Explain API, which is a good aid in understanding how Elasticsearch computes document scores.

Example Data

Our data set for this tutorial contains information for a few candidates who have submitted job applications to a particular company. Each document contains name, age, skills, and location details. Since the data includes location details—latitude and longitude—we need to explicitly define the mapping for the location field.

Important: We must define this mapping before indexing the documents. Otherwise, our queries below won’t function properly.

The name of our index is candidates, the name of the type is details, and we include these in our mapping request (see below). In the document, the name of the field containing the latitude/longitude coordinates is latlong. Our mapping request would be as follows:

curl -XPOST 'localhost:9200/candidates' -d '{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "details": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "latlong": {
          "type": "geo_point"
        }
      }
    }
  }
}'
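
If you want to confirm that the mapping took effect before indexing, you can retrieve it at any time as an optional sanity check:

curl -XGET 'localhost:9200/candidates/_mapping?pretty'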

 

NOTE: For the rest of this tutorial to work, it’s crucial that you apply the above mapping before indexing the documents below.

Document 1

curl -XPOST 'http://localhost:9200/candidates/details/1' -d '{
  "name": "James Smith",
  "age": 30,
  "skills": [
    "java",
    "elasticsearch"
  ],
  "city": "Burlington",
  "state": "Iowa",
  "latlong": "40.78,91.12"
}'

Document 2

curl -XPOST 'http://localhost:9200/candidates/details/2' -d '{
  "name": "Chris Anderson",
  "age": 32,
  "skills": [
    "c",
    "python"
  ],
  "city": "Elkhart",
  "state": "Indiana",
  "latlong": "41.72,86.00"
}'

Document 3

curl -XPOST 'http://localhost:9200/candidates/details/3' -d '{
  "name": "Mark White",
  "age": 26,
  "skills": [
    "scala",
    "java"
  ],
  "city": "Chanute,KS",
  "state": "Kansas",
  "latlong": "37.67,95.48"
}'

Document 4

curl -XPOST 'http://localhost:9200/candidates/details/4' -d '{
  "name": "Jeff Walker",
  "age": 26,
  "skills": [
    "c#",
    "php"
  ],
  "city": "Lexington",
  "state": "Kentucky",
  "latlong": "38.05,85.00"
}'
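
As an aside, the four indexing requests above can also be combined into a single Bulk API call. Here is a minimal sketch covering the first two documents (the bulk body is newline-delimited, with each action line followed by its document, and it must end with a newline):

curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "index": { "_index": "candidates", "_type": "details", "_id": "1" } }
{ "name": "James Smith", "age": 30, "skills": ["java", "elasticsearch"], "city": "Burlington", "state": "Iowa", "latlong": "40.78,91.12" }
{ "index": { "_index": "candidates", "_type": "details", "_id": "2" } }
{ "name": "Chris Anderson", "age": 32, "skills": ["c", "python"], "city": "Elkhart", "state": "Indiana", "latlong": "41.72,86.00" }
'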

 

Computing a Score according to Distance from a Reference Point

Now that we have our custom mapping and an index of four documents, we can proceed with the tutorial.

Our manager at the company wants to score the documents according to the distance between the main office and each candidate’s address. This scoring will, of course, drive the sorting of the results.

Let’s say that our company’s main office is in San Francisco, which in our data set has the latitude-longitude coordinates 37.75, 122.68. To get the distance between this central location and the address of each applicant, we can employ the function_score query with the gauss decay function. The following query scores the documents according to their distance from the main office:

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "query": {
    "function_score": {
      "functions": 
       [
        {
          "gauss": {
            "latlong": {
              "origin": {
                "lat": 37.75,
                "lon": 122.68
              },
              "offset": "2000km",
              "scale": "1200km"
            }
          }
        }
      ]
    }
  }
}'

 

In the query above, the gauss decay function has the following parameters:

  • origin — The origin is the reference point, which is the common endpoint in calculating each of the distances (the main office, in our case here).
  • offset — A positive value here will extend the origin to cover a range of values, instead of a single point. In this example, we specify the value to be 2000km—which expands our origin to a radius of 2,000 kilometers. All of the latlong values in the results that fall within this range would receive the maximum _score value of 1.
  • scale — The rate of decay: a measure of how quickly the _score drops as a document’s location moves beyond the offset distance from the origin.

Putting this together: any location within the 2,000 km offset of the origin receives the maximum score of 1. Beyond the offset the score decays, dropping to 0.5 (the default decay value) at offset + scale = 3,200 km and approaching 0 farther out—so locations within roughly 3,200 km of the origin score noticeably higher than those outside it. The formula behind this curve is sketched below.
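
For reference, this is the documented form of the gauss decay (with decay defaulting to 0.5):

score = exp( -max(0, distance - offset)^2 / (2 * sigma^2) ),   where   sigma^2 = -scale^2 / (2 * ln(decay))

Checking the endpoints: at or inside the offset the exponent is 0, giving a score of 1; at offset + scale the expression reduces to exp(ln(decay)) = decay = 0.5.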

After running the query, we get the response below.

The highest score belongs to the document with an _id of 3, whose candidate is located in Chanute, KS. Looking closely at all of the scores, we also see a wide gap between the first document and the rest; we’ll examine this in detail below.

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.9315066,
    "hits": [
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "3",
        "_score": 0.9315066,
        "_source": {
          "name": "Mark White",
          "age": 26,
          "skills": [
            "scala",
            "java"
          ],
          "city": "Chanute,KS",
          "state": "Kansas",
          "latlong": "37.67,95.48"
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "1",
        "_score": 0.7778743,
        "_source": {
          "name": "James Smith",
          "age": 30,
          "skills": [
            "java",
            "elasticsearch"
          ],
          "city": "Burlington",
          "state": "Iowa",
          "latlong": "40.78,91.12"
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "2",
        "_score": 0.5334505,
        "_source": {
          "name": "Chris Anderson",
          "age": 32,
          "skills": [
            "c",
            "python"
          ],
          "city": "Elkhart",
          "state": "Indiana",
          "latlong": "41.72,86.00"
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "4",
        "_score": 0.45290327,
        "_source": {
          "name": "Jeff Walker",
          "age": 26,
          "skills": [
            "c#",
            "php"
          ],
          "city": "Lexington",
          "state": "Kentucky",
          "latlong": "38.05,85.00"
        }
      }
    ]
  }
}

 

Computing Distance using script_fields

In the previous section, we saw how to score documents by distance using the function_score query with the gauss function. For many requirements, though, that approach alone isn’t adequate, because we never see the actual distance between the main office and each candidate’s address—and that is vital information for many companies.

To get these precise distances, we can use script_fields, which for each document returns the result of executing a script against the document’s fields and assigns that value to a named field in the hit. In the query below, we use a script to calculate the distance and script_fields to attach the value to each hit.

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "latlong": {
              "origin": {
                "lat": 37.75,
                "lon": 122.68
              },
              "offset": "2000km",
              "scale": "1200km"
            }
          }
        }
      ]
    }
  },
  "script_fields": {
    "distance": {
      "script": "doc['latlong'].distanceInKm(37.75, 122.68)"
    }
  }
}'

 

In the query above, note the distance field within the script_fields section; this is the field in which we store the calculated distances.

In the results (below), each hit has a fields section containing a distance field, whose value is the distance between the origin and the candidate’s address. Comparing each document’s _score with its distance, we can see clearly why the document with an _id of 3 scores higher than the others: its distance (about 3,028 km) is the only one within the 3,200 km range (offset + scale, as explained in the previous section).

{
  "took": 70,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.9315066,
    "hits": [
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "3",
        "_score": 0.9315066,
        "fields": {
          "distance": [
            3027.9032459594
          ]
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "1",
        "_score": 0.7778743,
        "fields": {
          "distance": [
            3529.3975783203
          ]
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "2",
        "_score": 0.5334505,
        "fields": {
          "distance": [
            4107.0455280403
          ]
        }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "4",
        "_score": 0.45290327,
        "fields": {
          "distance": [
            4194.6513562355
          ]
        }
      }
    ]
  }
}
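
As a side note: if all you need is the distance itself—for sorting or for display—the _geo_distance sort can return it per hit without a script. A minimal sketch, reusing the same origin:

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "sort": [
    {
      "_geo_distance": {
        "latlong": {
          "lat": 37.75,
          "lon": 122.68
        },
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}'

Each hit then carries its computed distance in the sort value of the response.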

 

NOTE: It may seem a bit odd that there is no _source field in the script_fields results above. This is the default behavior when a request specifies script_fields. If you need _source in the response, add "_source": true (or a list of fields to return) at the top level of the search request.
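
For example, adding "_source": true at the top of the earlier query brings the source back for every hit:

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "_source": true,
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "latlong": {
              "origin": { "lat": 37.75, "lon": 122.68 },
              "offset": "2000km",
              "scale": "1200km"
            }
          }
        }
      ]
    }
  },
  "script_fields": {
    "distance": {
      "script": "doc['latlong'].distanceInKm(37.75, 122.68)"
    }
  }
}'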

Explain API

The Explain API is an especially helpful Elasticsearch tool for understanding document score computation. We can readily demonstrate how to apply this API with a simple query on the term “elasticsearch,” such as this one:

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "explain": true,
  "query": {
    "match": {
      "skills": "elasticsearch"
    }
  }
}'
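
Incidentally, if you already know which document you want to inspect, Elasticsearch also exposes a per-document _explain endpoint that returns the same scoring breakdown for a single document; here’s a quick sketch against document 1:

curl -XGET 'http://localhost:9200/candidates/details/1/_explain?pretty=true' -d '{
  "query": {
    "match": {
      "skills": "elasticsearch"
    }
  }
}'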

 

The search query above returns a single document, shown below. Within the hits section, beneath _source, you’ll find the _explanation subsection, which shows the detailed results of the Elasticsearch scoring mechanism (in this case, it’s really the core Lucene scoring mechanism at work). Let’s look at the three basic factors that go into the score calculation; a worked check of the arithmetic follows the response.

  • tf — Term frequency: the number of times the term occurs in the document. Here tf = 1.0, since the term “elasticsearch” occurs only once.
  • idf — Inverse document frequency: an indication of how rare the term is across the index. In our results, the idf calculation is governed by two factors: docFreq (the number of documents containing the term), which is 1, and maxDocs (the total number of documents in the index), which is 4 in our example.
  • fieldNorm — A normalizing factor: shorter fields generally receive higher scores than longer fields.

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.058217,
    "hits": [
      {
        "_shard": 0,
        "_node": "5D3RjyscRNyycSwTdWWEuA",
        "_index": "candidates",
        "_type": "details",
        "_id": "1",
        "_score": 1.058217,
        "_source": {
          "name": "James Smith",
          "age": 30,
          "skills": [
            "java",
            "elasticsearch"
          ],
          "city": "Burlington",
          "state": "Iowa",
          "latlong": "40.78,91.12"
        },
        "_explanation": {
          "value": 1.058217,
          "description": "weight(skills:elasticsearch in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.058217,
              "description": "fieldWeight in 0, product of:",
              "details": [
                {
                  "value": 1,
                  "description": "tf(freq=1.0), with freq of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0"
                    }
                  ]
                },
                {
                  "value": 1.6931472,
                  "description": "idf(docFreq=1, maxDocs=4)"
                },
                {
                  "value": 0.625,
                  "description": "fieldNorm(doc=0)"
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
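
As a quick sanity check, we can reproduce the final score by hand. Lucene’s default similarity computes idf as 1 + ln(maxDocs / (docFreq + 1)) and multiplies the three factors together:

idf   = 1 + ln(4 / (1 + 1)) ≈ 1.6931472
score = tf × idf × fieldNorm = 1.0 × 1.6931472 × 0.625 ≈ 1.058217

This matches the _score in the response exactly.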

 

More details on the Elasticsearch scoring mechanism are beyond the scope of this article, since they require a deep dive into Lucene scoring techniques. If you need to go deeper, we recommend the Lucene documentation on Similarity and TFIDFSimilarity.

Switching our focus back to our candidates index: the query from the previous section (which employs script_fields) is more complex, which means the scoring is more complex as well. To see the details of the score computation for that query, we simply add the explain parameter (set to true) immediately before the query and run it.

curl -XGET 'http://localhost:9200/candidates/details/_search?pretty=true&size=4' -d '{
  "explain": true,
  "query": {
    "function_score": {
      ...
    }
  }
}'

 

When you compare the results of this explain query with the previous results, you’ll see the extensive _explanation section that provides the details of the score computation.

That brings this article to a close. We welcome your comments below.