Computing Distance from a Reference Point with Script Fields and the Explain API
Posted by Vineeth Mohan, September 28, 2015

We’ve recently posted articles on topics ranging from simple scripting to advanced methods in scripting to eliminating duplicates in your indexes.
Here in this tutorial, we help you learn how to combine the script_fields and geo_point methods to generate examples that you can use for modeling in a wide variety of real-world applications. We also give a brief introduction of the Explain API, which is a good aid in understanding how Elasticsearch computes document scores.
Example Data
Our data set for this tutorial contains information for a few candidates who have submitted job applications to a particular company. Each document contains name, age, skills, and location details. Since the data includes location details (latitude and longitude), we need to explicitly perform the mapping for the location field.
Important: We must create this mapping before indexing any documents. Otherwise, our queries below won’t function properly.
The name of our index is candidates, the name of the type is details, and we include these in our mapping request (see below). In the document, the name of the field containing the latitude/longitude coordinates is latlong. Our mapping request would be as follows:
curl -XPOST 'localhost:9200/candidates' -d '{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "details": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "latlong": {
          "type": "geo_point",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
NOTE: To succeed further along in this tutorial, it’s crucial that you perform the above mapping prior to indexing the documents below.
Document 1
curl -XPOST 'http://localhost:9200/candidates/details/1' -d '{ "name": "James Smith", "age": 30, "skills": [ "java", "elasticsearch" ], "city": "Burlington", "state": "Iowa", "latlong": "40.78,91.12" }'
Document 2
curl -XPOST 'http://localhost:9200/candidates/details/2' -d '{ "name": "Chris Anderson", "age": 32, "skills": [ "c", "python" ], "city": "Elkhart", "state": "Indiana", "latlong": "41.72,86.00" }'
Document 3
curl -XPOST 'http://localhost:9200/candidates/details/3' -d '{ "name": "Mark White", "age": 26, "skills": [ "scala", "java" ], "city": "Chanute,KS", "state": "Kansas", "latlong": "37.67,95.48" }'
Document 4
curl -XPOST 'http://localhost:9200/candidates/details/4' -d '{ "name": "Jeff Walker", "age": 26, "skills": [ "c#", "php" ], "city": "Lexington", "state": "Kentucky", "latlong": "38.05,85.00" }'
Computing a Score according to Distance from a Reference Point
Now that we have our custom mapping and an index of four documents, we can proceed with the tutorial.
Our manager at the company wants to score our documents according to the distance between the main office and each candidate’s address. Of course, this scoring will drive the sorting of the results.
Let’s say that our company’s main office is in San Francisco, which we place at latitude-longitude coordinates of 37.75, 122.68. To get the distance between this central location and the address of each applicant, we can employ the function_score query and the gauss function. The following query will give us a scoring on the documents according to their distance from the main office:
curl -XGET 'http://localhost:9200/candidates/details/_search?&pretty=true&size=4' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "latlong": {
              "origin": { "lat": 37.75, "lon": 122.68 },
              "offset": "2000km",
              "scale": "1200km"
            }
          }
        }
      ]
    }
  }
}'
In the query above, the gauss decay function has the following parameters:
- origin — The reference point, which is the common endpoint in calculating each of the distances (the main office, in our case here).
- offset — A positive value here extends the origin to cover a range of values instead of a single point. In this example, we specify 2000km, which expands our origin to a radius of 2,000 kilometers. Any latlong value in the results that falls within this range receives the maximum score of 1.
- scale — The rate of decay: a measure of how quickly the _score should drop the further from the origin that a document lies.
Taking all of this into consideration, documents that lie within the 2,000km offset receive the full score of 1, and beyond that the score decays with distance: by default, at offset + scale (3,200km in our case) the score has fallen to 0.5, and locations farther out score lower still.
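The behavior described above can be sketched in a few lines. The following is a minimal reimplementation of the standard Gaussian decay formula from the Elasticsearch documentation, using the default decay value of 0.5; the function name and structure are our own illustration, not part of any Elasticsearch API:

```python
import math

def gauss_decay(distance_km, offset=2000.0, scale=1200.0, decay=0.5):
    """Gaussian decay: full score of 1.0 inside the offset, then a
    bell-curve drop-off sized so the score equals `decay` once the
    distance exceeds the offset by `scale`."""
    adjusted = max(0.0, distance_km - offset)         # inside the offset -> 0
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))  # solve sigma^2 from decay at scale
    return math.exp(-adjusted ** 2 / (2.0 * sigma_sq))

print(gauss_decay(1500))            # inside the 2,000km offset: 1.0
print(round(gauss_decay(3200), 4))  # at offset + scale: 0.5 by construction
```

Plugging in the distance of each candidate reproduces the ordering of the scores we see below (the closer the document, the nearer its score to 1), though the exact values depend on how Elasticsearch measures the geo distance internally.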
After running the query, we get the response below.
The highest score is for the document with an _id of 3, whose candidate location is Chanute, KS. Looking closely at all of the scores, we see a wide variation between the first document and the rest. We’ll look at this in detail below.
{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 0.9315066,
    "hits": [
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "3",
        "_score": 0.9315066,
        "_source": { "name": "Mark White", "age": 26, "skills": [ "scala", "java" ], "city": "Chanute,KS", "state": "Kansas", "latlong": "37.67,95.48" }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "1",
        "_score": 0.7778743,
        "_source": { "name": "James Smith", "age": 30, "skills": [ "java", "elasticsearch" ], "city": "Burlington", "state": "Iowa", "latlong": "40.78,91.12" }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "2",
        "_score": 0.5334505,
        "_source": { "name": "Chris Anderson", "age": 32, "skills": [ "c", "python" ], "city": "Elkhart", "state": "Indiana", "latlong": "41.72,86.00" }
      },
      {
        "_index": "candidates",
        "_type": "details",
        "_id": "4",
        "_score": 0.45290327,
        "_source": { "name": "Jeff Walker", "age": 26, "skills": [ "c#", "php" ], "city": "Lexington", "state": "Kentucky", "latlong": "38.05,85.00" }
      }
    ]
  }
}
Computing Distance using script_fields
In the previous section we saw how to score documents according to distance by using the function_score query along with the gauss function. For our requirements, though, this approach is too basic: it doesn’t give us the actual distance between the main office and each candidate’s address, which is vital information for many companies.
To get these precise distances, we suggest using script_fields, which returns for each document the result of executing a script against the document’s fields, assigning that value to a named field in the response. In the query below, we use a script to calculate the distance, and then we apply script_fields to store the value for each hit.
curl -XGET 'http://localhost:9200/candidates/details/_search?&pretty=true&size=4' -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "latlong": {
              "origin": { "lat": 37.75, "lon": 122.68 },
              "offset": "2000km",
              "scale": "1200km"
            }
          }
        }
      ]
    }
  },
  "script_fields": {
    "distance": {
      "script": "doc['\''latlong'\''].distanceInKm(37.75, 122.68)"
    }
  }
}'
In the query above, find the distance field in the script_fields section. This is the field into which each document’s computed distance is placed.
In the results (below), we can see a fields section that contains a distance field. For each document, the value of this field is the distance between the origin and the candidate’s address. Comparing the _score of each document with its distance field, we can see clearly that the document with an _id of 3 has a higher score than any of the others because its distance is the only one that falls within the range of 3,200km (offset + scale, as explained in the previous section).
{
  "took": 70,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 0.9315066,
    "hits": [
      { "_index": "candidates", "_type": "details", "_id": "3", "_score": 0.9315066, "fields": { "distance": [ 3027.9032459594 ] } },
      { "_index": "candidates", "_type": "details", "_id": "1", "_score": 0.7778743, "fields": { "distance": [ 3529.3975783203 ] } },
      { "_index": "candidates", "_type": "details", "_id": "2", "_score": 0.5334505, "fields": { "distance": [ 4107.0455280403 ] } },
      { "_index": "candidates", "_type": "details", "_id": "4", "_score": 0.45290327, "fields": { "distance": [ 4194.6513562355 ] } }
    ]
  }
}
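As a sanity check on these numbers, the distance values in this response are consistent with a simple plane approximation: the straight-line distance in degrees multiplied by the length of one degree of arc at the equator (Earth’s equatorial circumference of about 40,075km divided by 360). The sketch below is our own reconstruction under that assumption, not the script’s actual source; note that a true great-circle calculation would give shorter distances here, because it shrinks longitude differences by the cosine of the latitude.

```python
import math

# Length of one degree of arc at the equator (equatorial radius 6378.137 km).
KM_PER_DEGREE = 2 * math.pi * 6378.137 / 360  # ~111.32 km

def plane_distance_km(lat1, lon1, lat2, lon2):
    """Plane approximation: treat latitude/longitude degrees as flat x/y."""
    return math.hypot(lat2 - lat1, lon2 - lon1) * KM_PER_DEGREE

# Document 3 (Chanute, KS) to the query origin:
print(round(plane_distance_km(37.67, 95.48, 37.75, 122.68), 4))  # 3027.9032, as in the response
```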
NOTE: It may seem a bit odd that there is no _source field in the results. This is the expected default behavior when using script_fields. If you need to see the _source in the response, add "_source": true at the top level of the search request body.
Explain API
The Explain API is an especially helpful Elasticsearch tool for understanding document score computation. We can readily demonstrate how to apply this API with a simple query on the term “elasticsearch,” such as this one:
curl -XGET 'http://localhost:9200/candidates/details/_search?&pretty=true&size=4' -d '{
  "explain": true,
  "query": {
    "match": { "skills": "elasticsearch" }
  }
}'
The result of this query is a single document, shown below. Look within the hits section, beneath _source, to find the _explanation subsection. There you can see the detailed results of the Elasticsearch scoring mechanism (in this case, it is actually the core Lucene scoring mechanism at work). Let’s look at the three basic and important factors that go into the score calculation:
- tf — Term frequency: the number of times the term occurs in the document. Here tf = 1.0, since the term “elasticsearch” occurs only once.
- idf — Inverse document frequency: an indication of how rare the term is within the index. In our results, the idf value is governed by two factors: docFreq (document frequency), which is 1 because only one document contains the term, and maxDocs, the total number of documents in the index, which is 4 in our example.
- fieldNorm — A normalizing factor: shorter fields generally get higher scores than longer fields.
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.058217,
    "hits": [
      {
        "_shard": 0,
        "_node": "5D3RjyscRNyycSwTdWWEuA",
        "_index": "candidates",
        "_type": "details",
        "_id": "1",
        "_score": 1.058217,
        "_source": { "name": "James Smith", "age": 30, "skills": [ "java", "elasticsearch" ], "city": "Burlington", "state": "Iowa", "latlong": "40.78,91.12" },
        "_explanation": {
          "value": 1.058217,
          "description": "weight(skills:elasticsearch in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.058217,
              "description": "fieldWeight in 0, product of:",
              "details": [
                {
                  "value": 1,
                  "description": "tf(freq=1.0), with freq of:",
                  "details": [ { "value": 1, "description": "termFreq=1.0" } ]
                },
                { "value": 1.6931472, "description": "idf(docFreq=1, maxDocs=4)" },
                { "value": 0.625, "description": "fieldNorm(doc=0)" }
              ]
            }
          ]
        }
      }
    ]
  }
}
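We can verify these numbers by hand. The sketch below applies Lucene’s classic TF-IDF formula (tf = √freq, idf = 1 + ln(maxDocs / (docFreq + 1)), fieldWeight = tf × idf × fieldNorm); the fieldNorm of 0.625 is taken directly from the explain output above, where Lucene has stored a lossy one-byte encoding of roughly 1/√2 for the two-term skills field:

```python
import math

freq = 1.0          # "elasticsearch" occurs once in the skills field
doc_freq = 1        # one document in the index contains the term
max_docs = 4        # four documents in the index
field_norm = 0.625  # read from the explain output (lossy norm encoding)

tf = math.sqrt(freq)                           # 1.0
idf = 1 + math.log(max_docs / (doc_freq + 1))  # 1 + ln(2) = 1.6931472
score = tf * idf * field_norm
print(round(score, 6))  # 1.058217, matching the _score above
```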
Further details on the ES scoring mechanism are beyond the scope of this article, as they require a deep dive into Lucene scoring techniques. If you need more, we recommend the Lucene documentation on Similarity and TFIDFSimilarity.
Switching our focus back to our candidates index, the query from the previous section (which employs script_fields) is more complex, which means the scoring is more complex, too. If we want to look at the details of the score computation for that query, we simply introduce the explain parameter (setting it to true) immediately before query and run it:
curl -XGET 'http://localhost:9200/candidates/details/_search?&pretty=true&size=4' -d '{ "explain": true, "query": { "function_score": { ... }
When you compare the results of this explain query with the previous results, you’ll see the extensive _explanation
section that provides the details of the score computation.
That brings us to a close on this article. We welcome your comments below.