We have covered a lot on parent-child relationships in Elasticsearch: indexing, searching, aggregations, and the challenges they can present. We shall continue our streak by exploring parent-child relationships further. The parent-child relationship is similar in nature to the nested model: both allow us to associate one entity with another. The difference is that, with nested objects, all entities live within the same document, while with parent-child, the parent and children are completely separate documents.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

The has_child query and filter can be used to find parent documents based on the contents of their children. Both accept the min_children and max_children parameters, which return the parent document only if the number of matching children falls within the specified range. In short, the has_child query returns parents based on data in their children, and the has_parent query returns children based on data in their parents.
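As a minimal sketch of both queries against the academy index we set up later in this post (the values are purely illustrative): the first returns only locations that have at least two matching players, and the second returns locations whose parent country matches a given name.

curl -XGET 'ES_HOST:ES_PORT/academy/location/_search' -d '{
  "query": {
    "has_child": {
      "type": "player",
      "min_children": 2,
      "query": { "match_all": {} }
    }
  }
}'

curl -XGET 'ES_HOST:ES_PORT/academy/location/_search' -d '{
  "query": {
    "has_parent": {
      "parent_type": "country",
      "query": { "match": { "name": "united" } }
    }
  }
}'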

When talking about aggregations, the children aggregation is a special single-bucket aggregation that enables aggregating from buckets on parent document types to buckets on child documents. This aggregation relies on the _parent field in the mapping, and it has a single option, type, which indicates the child type that the buckets in the parent space should be mapped to.
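As a quick sketch (using the location parent and player child types defined later in this post, and assuming the sport field is aggregatable), a children aggregation that maps location buckets to their player children and counts sports might look like this:

curl -XGET 'ES_HOST:ES_PORT/academy/location/_search' -d '{
  "size": 0,
  "aggs": {
    "to-players": {
      "children": { "type": "player" },
      "aggs": {
        "top-sports": {
          "terms": { "field": "sport" }
        }
      }
    }
  }
}'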

The parent-child relationship can extend across more than one generation: grandchildren can have grandparents. However, it requires an extra step to ensure that documents from all generations are indexed on the same shard.

Let’s change our previous example to make the country type a parent of the location type:

curl -XPUT 'ES_HOST:ES_PORT/academy' -d '{
 "mappings": {
   "country": {},
   "location": {
     "_parent": {
         "type": "country"
     }
   },
   "player": {
     "_parent": {
         "type": "location"
     }
   }
 }
}'

Countries and locations have a simple parent-child relationship, so we use the same process as we used in previous posts:

curl -XPOST 'ES_HOST:ES_PORT/academy/country/_bulk' -d '{ "index": { "_id": "usa"}}
{ "name": "United States of America"}
{ "index": { "_id": "uk"}}
{ "name": "United Kingdom" }
'
curl -XPOST 'ES_HOST:ES_PORT/academy/location/_bulk' -d '{ "index": { "_id": "newyork", "parent": "usa" }}
{ "name": "Manhattan Academy"}
{ "index": { "_id": "london", "parent": "uk" }}
{ "name": "London Central" }
{ "index": { "_id": "dallas", "parent": "usa" }}
{ "name": "Dallas Academy"}
'

Let's associate a few child documents (players) with our parent locations:

curl -XPOST 'ES_HOST:ES_PORT/academy/player/_bulk' -d '{ "index": { "_id": 1, "parent": "london" }}
{ "name": "John Doe", "dob": "1998-07-18", "sport": "volleyball" }
{ "index": { "_id": 2, "parent": "newyork" }}
{ "name": "William Smith", "dob": "1996-11-07", "sport": "basketball" }
{ "index": { "_id": 3, "parent": "dallas" }}
{ "name": "John Henry", "dob": "1995-07-15", "sport": "billiards" }
'

The parent ID has ensured that each location document is routed to the same shard as its parent country document. However, look what would happen if we were to use the same technique with the player grandchildren:

curl -XPUT 'ES_HOST:ES_PORT/academy/player/1?parent=newyork' -d '{
  "name":  "John Keats",
  "dob":   "1995-10-20",
  "sport": "table-tennis"
}'

The shard routing of the player document would be decided by the parent ID newyork, but the newyork document was itself routed to a shard by its own parent ID usa. It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning.

Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:

curl -XPUT 'ES_HOST:ES_PORT/academy/player/1?parent=newyork&routing=usa' -d '{
  "name":  "John Keats",
  "dob":   "1995-10-20",
  "sport": "table-tennis"
}'

Here, the routing value overrides the parent value for the purposes of shard routing. The parent parameter is still used to link the player document with its parent, but the routing parameter ensures that it is stored on the same shard as its parent and grandparent. Note that the routing value needs to be provided for all single-document requests.
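For example, fetching the grandchild we just indexed should also carry the routing value (a sketch; without it, the GET would be routed by the document ID and could miss the shard the document actually lives on):

curl -XGET 'ES_HOST:ES_PORT/academy/player/1?routing=usa'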

The reason for this has to do with how document routing is done. By default, routing is determined by taking the id of the document, hashing that value, and then taking the modulo of the hash with respect to the number of shards. For example, if we create an index with 5 shards and index a document with the id uk, Elasticsearch does the following:

shardNum = hash("uk") % 5

This gives us a number between 0 and 4. Let's say in this case shardNum is 3, so the document will be indexed on shard 3 of that index. The same applies when you GET a document. We send a request like the following:

curl -XGET "http://ES_HOST:ES_PORT/test_index/test_type/uk"

To serve this request, Elasticsearch needs to know which shard to fetch the document from, so it uses the same formula as above: it hashes "uk" and takes the modulo with respect to the number of shards (5), which again gives 3, and it can go to the correct shard to get the document.

With parent-child documents it works slightly differently: when indexing a child document, instead of hashing the document's own id as the routing value, Elasticsearch hashes the parent query parameter. This means that both the parent and the child end up on the same shard, because the result of the routing algorithm is the same for both.
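For instance, sticking with the hypothetical test_index above, indexing london as a child of uk means the routing value that gets hashed is "uk", not "london" (a sketch only; the child type name and body are made up for illustration):

# Sketch: assumes a child type (here called child_type) whose _parent is test_type.
curl -XPUT "http://ES_HOST:ES_PORT/test_index/child_type/london?parent=uk" -d '{
  "name": "London"
}'
# Routed with: shardNum = hash("uk") % 5  -- the same shard as the parent document uk.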

When you index or fetch a grandchild, the parent query parameter points to its immediate parent, and nothing in the request indicates that this parent has a parent of its own. A GET for the grandchild might look like this:

curl -XGET "http://ES_HOST:ES_PORT/test_index/test_type/1?parent=london"

Here, the request does not provide the information that the parent london has a parent of its own. Elasticsearch would use london as the routing value, so the grandchild would be stored on (and looked up from) a different shard than its parent and grandparent, which were routed using uk.

So we need to tell Elasticsearch not to use the immediate parent as the routing value for the grandchild document, but instead the id of the document at the top generation (in this case "uk"). We do this by setting a custom routing value:

curl -XGET "http://ES_HOST:ES_PORT/my_index/my_type/baz?parent=london&routing=uk"

Querying and aggregating across generations works, as long as you step through each generation. For instance, to find countries where players enjoy playing volleyball, we need to join countries with locations, and locations with players:

curl -XGET 'ES_HOST:ES_PORT/academy/country/_search' -d '{
  "query": {
    "has_child": {
      "type": "location",
      "query": {
        "has_child": {
          "type": "player",
          "query": {
            "match": {
              "sport": "volleyball"
            }
          }
        }
      }
    }
  }
}'

Similarly, in order to find countries where players enjoy playing basketball, we can run a query like this:

curl -XGET 'ES_HOST:ES_PORT/academy/country/_search' -d '{
  "query": {
    "has_child": {
      "type": "location",
      "query": {
        "has_child": {
          "type": "player",
          "query": {
            "match": {
              "sport": "basketball"
            }
          }
        }
      }
    }
  }
}'
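
The aggregation side follows the same step-through-each-generation rule. As a sketch (assuming the sport field is aggregatable), aggregating on player data from a country search means hopping through the location generation with one children aggregation nested inside another:

curl -XGET 'ES_HOST:ES_PORT/academy/country/_search' -d '{
  "size": 0,
  "aggs": {
    "to-locations": {
      "children": { "type": "location" },
      "aggs": {
        "to-players": {
          "children": { "type": "player" },
          "aggs": {
            "sports": {
              "terms": { "field": "sport" }
            }
          }
        }
      }
    }
  }
}'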

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
