So far in this series, we have discussed handling relationships and data modeling in Elasticsearch at length. The need to bridge the gap between flat mapping and the real world has led us to focus on the following techniques:

  • Application-side joins

  • Data denormalization

  • Nested objects

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We already have an introductory post on parent-child relationships in Elasticsearch and the challenges they can face. In this post, we continue by exploring parent-child relationships further. The parent-child relationship is similar in nature to the nested model: both allow us to associate one entity with another. The difference is that, with nested objects, all entities live within the same document, while with parent-child, the parent and children are completely separate documents.

When we index data, it is rarely as simple as each document existing in isolation. Sometimes we are better off denormalizing all the data into the child documents. For example, if we were modeling blog posts, adding an author field to each blog document could be a sensible choice, even if the authoritative data source (the database) splits the data into separate authors and blogs tables. It’s simple, and we can easily construct queries on both the attributes of the blogs and the author’s name.

That’s not always practical, though, as there might be too much data in the parent document to duplicate in each child document. Consider a simple blogging application where authors publish blog posts and viewers comment on them. Repeating the entire content of the blog post in each comment would be a poor choice, because the indexed data size would grow many times over. On the other hand, without indexing the content of the blog post in each comment, we can’t easily write queries to find comments on posts matching certain criteria.

The parent-child functionality allows us to associate one document type with another in a one-to-many relationship: one parent to many children. Parent-child has the following advantages over nested objects:

  • The parent document can be updated without reindexing the children.

  • Child documents can be added, changed, or deleted without affecting either the parent or other children. This is especially useful when child documents are large in number and need to be added or changed frequently.

  • Child documents can be returned as the results of a search request.

Elasticsearch maintains a map of which parents are associated with which children. It is thanks to this map that query-time joins are fast, but it does place a limitation on the parent-child relationship: the parent document and all of its children must live on the same shard.

The parent-child ID maps are stored in doc values, which allows them to be read quickly when fully hot in memory, while remaining scalable enough to spill to disk when the map is very large.

Parent-Child Mapping

The first step in establishing the parent-child relationship is to specify which document type should be the parent of a child type. This must be done at index creation time, or with the update-mapping API before the child type has been created.

As an example, let’s say that we have a sports academy with locations in many cities. We would like to associate players with the location where they play. We need to be able to search for locations, individual players, and players who play at particular locations, so the nested model will not help. We could, of course, use application-side joins or data denormalization here instead, but for demonstration purposes we will use parent-child.

All that we have to do is to tell Elasticsearch that the player type has the location document type as its _parent, which we can do when we create the index. Here, documents of type player are children of type location.

curl -XPUT 'ES_HOST:ES_PORT/academy' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "location": {},
    "player": {
      "_parent": {
        "type": "location"
      }
    }
  }
}'

The _parent Field

A parent-child relationship can be established between documents in the same index by making one mapping type the parent of another. In the mapping below, the type named parent is the parent of the type named child.

curl -XPUT 'ES_HOST:ES_PORT/test_index?pretty' -H 'Content-Type: application/json' -d'
{
 "mappings": {
   "parent": {},
   "child": {
     "_parent": {
       "type": "parent"
     }
   }
 }
}'

Now, we index a parent document.

curl -XPUT 'ES_HOST:ES_PORT/test_index/parent/1?pretty' -H 'Content-Type: application/json' -d'
{
 "text": "This is a parent document"
}'

Here, we index two child documents, specifying the parent document’s ID.

curl -XPUT 'ES_HOST:ES_PORT/test_index/child/2?parent=1&pretty' -H 'Content-Type: application/json' -d'
{
 "text": "This is a child document"
}'
curl -XPUT 'ES_HOST:ES_PORT/test_index/child/3?parent=1&refresh=true&pretty' -H 'Content-Type: application/json' -d'
{
 "text": "This is another child document"
}'

Now, let's find all parent documents that have children which match the query.

curl -XGET 'ES_HOST:ES_PORT/test_index/parent/_search?pretty' -H 'Content-Type: application/json' -d'
{
 "query": {
   "has_child": {
     "type": "child",
     "query": {
       "match": {
         "text": "child document"
       }
     }
   }
 }
}'
 

The value of the _parent field is accessible in aggregations and scripts, and may be queried with the parent_id query. Here, we are querying the id of the _parent field, aggregating on the _parent field and accessing the _parent field in scripts.

curl -XGET 'ES_HOST:ES_PORT/test_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
 "query": {
   "parent_id": {
     "type": "child",
     "id": "1"
   }
 },
 "aggs": {
   "parents": {
     "terms": {
       "field": "_parent",
       "size": 10
     }
   }
 },
 "script_fields": {
   "parent": {
     "script": {
        "source": "doc['_parent']"
     }
   }
 }
}'

Indexing Parents and Children

Indexing parent documents is no different from any other document. Parents don’t need to know anything about their children:

curl -XPOST 'ES_HOST:ES_PORT/academy/location/_bulk' -H 'Content-Type: application/json' -d '
{ "index": { "_id": "newyork" }}
{ "name": "Manhattan Academy", "state": "New York State", "country": "USA" }
{ "index": { "_id": "chicago" }}
{ "name": "Chicago Central", "state": "Illinois", "country": "USA" }
{ "index": { "_id": "dallas" }}
{ "name": "Dallas Academy", "state": "Texas", "country": "USA" }
'

When indexing child documents, you must specify the ID of the associated parent document. Here, this player document is a child of the newyork location.

curl -XPUT 'ES_HOST:ES_PORT/academy/player/1?parent=newyork' -H 'Content-Type: application/json' -d '{
  "name":  "Robert Frost",
  "dob":   "1992-10-25",
  "sport": "football"
}'

This parent ID serves two purposes: it creates the link between the parent and the child, and it ensures that the child document is stored on the same shard as the parent.

Elasticsearch uses a routing value, which defaults to the _id of the document, to decide which shard a document belongs to. The routing value is plugged into this simple formula:

shard = hash(routing) % number_of_primary_shards

However, if a parent ID is specified, it is used as the routing value instead of the _id. In other words, both the parent and the child use the same routing value (the _id of the parent), and so they are both stored on the same shard.
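To make the routing rule concrete, here is a minimal sketch of the formula in shell. It is only an illustration: the shard_for helper is hypothetical, and cksum stands in for the hash function, which is not the hash Elasticsearch uses internally, so the shard numbers printed here will not match a real cluster. The count of 5 primary shards is also an assumption.

```shell
# Toy model of: shard = hash(routing) % number_of_primary_shards
# ASSUMPTION: cksum (a CRC) is a stand-in hash, not Elasticsearch's real one.
shard_for() {
  # $1 = routing value, $2 = number of primary shards
  routing_hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( routing_hash % $2 ))
}

# The parent is routed by its own _id.
parent_shard=$(shard_for "newyork" 5)

# A child indexed with ?parent=newyork is routed by the parent's _id.
child_shard=$(shard_for "newyork" 5)

# Without a parent ID, the child's own _id ("1") would be the routing value.
unrouted_shard=$(shard_for "1" 5)

echo "parent=$parent_shard child=$child_shard child-without-parent=$unrouted_shard"
```

The parent and child values are always equal because they share the same routing value. The last value is computed from a different routing value, so it may point at a different shard.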

The parent ID needs to be specified on all single-document requests: when retrieving a child document with a GET request, or when indexing, updating, or deleting a child document. Unlike a search request, which is forwarded to all shards in an index, these single-document requests are forwarded only to the shard that holds the document. If the parent ID is not specified, the request will probably be forwarded to the wrong shard.
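As a rough sketch of what this looks like in practice, the snippet below builds the URLs for a GET and an update of the player indexed earlier, always carrying the parent parameter. The child_doc_url helper and the ES_URL variable are hypothetical names introduced here, and the update body is made-up sample data. The commands are printed rather than sent, so nothing touches a cluster until you remove the echo.

```shell
# Placeholder endpoint, as in the other examples; override ES_URL for a real cluster.
ES_URL="${ES_URL:-ES_HOST:ES_PORT}"

# Hypothetical helper: URL for a single-document request on a child document.
# $1 = child _id, $2 = parent _id, $3 = optional sub-endpoint such as "/_update"
child_doc_url() {
  printf '%s/academy/player/%s%s?parent=%s' "$ES_URL" "$1" "${3:-}" "$2"
}

# Dry run: print the commands for inspection (the printout drops the shell quoting).
# Remove "echo" to actually send them.
echo curl -XGET "$(child_doc_url 1 newyork)"
echo curl -XPOST "$(child_doc_url 1 newyork /_update)" \
  -H 'Content-Type: application/json' \
  -d '{"doc": {"sport": "rugby"}}'
```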

The parent ID should also be specified when using the bulk API:

curl -XPOST 'ES_HOST:ES_PORT/academy/player/_bulk' -H 'Content-Type: application/json' -d '
{ "index": { "_id": 2, "parent": "chicago" }}
{ "name": "John Doe", "dob": "1998-07-18", "sport": "volleyball" }
{ "index": { "_id": 3, "parent": "newyork" }}
{ "name": "William Smith", "dob": "1996-11-07", "sport": "basketball" }
{ "index": { "_id": 4, "parent": "dallas" }}
{ "name": "James Henry", "dob": "1995-07-15", "sport": "billiards" }
'

If you want to change the parent value of a child document, it is not sufficient to just reindex or update the child document, because the new parent document may be on a different shard. Instead, you must first delete the old child, and then index the new child.
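The delete-then-index sequence could be sketched as follows. This is an illustration under assumptions, not a definitive recipe: move_player, ES_URL, and RUN are hypothetical names introduced here, and the script runs in dry-run mode by default, printing the two curl commands instead of sending them.

```shell
# Placeholder endpoint, as in the other examples; override ES_URL for a real cluster.
ES_URL="${ES_URL:-ES_HOST:ES_PORT}"

# Dry run by default: commands are echoed. Set RUN to an empty string to execute them.
RUN="${RUN-echo}"

# Hypothetical helper: move a player to a new parent location.
# $1 = child _id, $2 = old parent _id, $3 = new parent _id, $4 = full document body (JSON)
move_player() {
  # Step 1: delete the child, routed via its OLD parent.
  $RUN curl -XDELETE "$ES_URL/academy/player/$1?parent=$2" &&
  # Step 2: index it again, routed via its NEW parent.
  $RUN curl -XPUT "$ES_URL/academy/player/$1?parent=$3" \
    -H 'Content-Type: application/json' -d "$4"
}

move_player 1 newyork chicago \
  '{"name": "Robert Frost", "dob": "1992-10-25", "sport": "football"}'
```

Note that the full document body has to be supplied again in the second step, since the delete removes the old copy entirely.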

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
