We have discussed indexing, searching, and aggregations for parent-child and grandparent-grandchild relationships in Elasticsearch. The parent-child functionality allows us to associate one document type with another in a one-to-many relationship: one parent to many children.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

The advantages that parent-child has over nested objects are as follows:

  • The parent document can be updated without reindexing the children.

  • Child documents can be added, changed, or deleted without affecting either the parent or other children. This is especially useful when child documents are large in number and need to be added or changed frequently.

  • Child documents can be returned as the results of a search request.
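
To make these advantages concrete, here is a minimal in-memory sketch in Python. It is not Elasticsearch code: the two dictionaries, the document IDs, and the field names are all hypothetical. It only models the idea that a nested child lives inside its parent's document, while a parent-child child is a separate document that records its parent's ID.

```python
# Toy model only: plain dictionaries stand in for an index.
# All IDs and field names below are made up for illustration.

# Nested model: children live inside the parent document.
nested_store = {
    "blog-1": {"title": "Hello", "comments": [{"text": "first"}]},
}

# Parent-child model: every document is stored on its own,
# and each child records the ID of its parent.
parent_child_store = {
    "blog-1": {"title": "Hello"},
    "comment-1": {"text": "first", "parent": "blog-1"},
}


def add_nested_comment(store, parent_id, comment):
    """Adding a nested child rewrites the whole parent document."""
    doc = dict(store[parent_id])
    doc["comments"] = doc["comments"] + [comment]
    store[parent_id] = doc  # the entire parent is replaced
    return [parent_id]      # IDs of the documents that were rewritten


def add_child_comment(store, child_id, parent_id, comment):
    """Adding a child writes one new document; the parent is untouched."""
    store[child_id] = {**comment, "parent": parent_id}
    return [child_id]


def children_of(store, parent_id):
    """Children are documents in their own right, so they can be returned as hits."""
    return sorted(k for k, v in store.items() if v.get("parent") == parent_id)


rewritten_nested = add_nested_comment(nested_store, "blog-1", {"text": "second"})
rewritten_pc = add_child_comment(
    parent_child_store, "comment-2", "blog-1", {"text": "second"}
)
print(rewritten_nested)  # the parent document was rewritten
print(rewritten_pc)      # only the new child was written
print(children_of(parent_child_store, "blog-1"))
```

In the nested case the parent is the unit of change, while in the parent-child case each child is.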

The parent-child relationship is similar in nature to the nested model: both allow us to associate one entity with another. The difference is that, with nested objects, all entities live within the same document, while with parent-child, the parent and children are completely separate documents. However, parent-child is also bound by some restrictions:

Parent-Child Restrictions

  • The parent and child types must be different: parent-child relationships cannot be established between documents of the same type.

  • The _parent.type setting can only point to a type that doesn’t exist yet. This means that a type cannot become a parent type after it has been created.

  • Parent and child documents must be indexed on the same shard. The parent ID is used as the routing value for the child, to ensure that the child is indexed on the same shard as the parent. This means that the same parent value needs to be provided when getting, deleting, or updating a child document.
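
The routing restriction can be illustrated with a small Python sketch. It assumes a simple "hash the routing value, then take it modulo the shard count" scheme. The `crc32` hash is only a stand-in so the example runs anywhere (Elasticsearch uses its own hash function internally), and the shard count and document IDs are hypothetical.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value.

    Assumption: crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shard_for_parent(parent_id):
    # A parent is routed by its own ID by default.
    return shard_for(parent_id)


def shard_for_child(child_id, parent_id):
    # A child is routed by its parent's ID, not by its own ID,
    # so child_id plays no part in choosing the shard.
    return shard_for(parent_id)


parent_id = "blog-1"
child_ids = ("comment-1", "comment-2", "comment-3")
child_shards = [shard_for_child(c, parent_id) for c in child_ids]
print(shard_for_parent(parent_id), child_shards)
```

Under this model, a get, update, or delete on a child that omitted the parent value would be routed by the child's own ID and could land on a different shard, which is why the parent value must always be supplied.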

The main differences between parent-child and nested relationships can be summarised as follows:

| #  | Nested Object | Parent-Child |
| 1. | Nested objects are saved in the same document. | Parent and child objects are saved separately in different documents. |
| 2. | A child object can have multiple parent objects. | A child object cannot have multiple parent objects. |
| 3. | Querying is relatively fast. | Querying is relatively slow because child and parent are stored separately. |
| 4. | Can easily maintain multiple nested levels. | Hard to maintain multiple nested levels. |
| 5. | Nested level querying is well-defined and simple to use for any number of nested objects. A query string can also be used to query the nested objects. | Querying gets complicated when multiple parent-child relationships are present. |
| 6. | Can retrieve all the data, since it resides in the same object. | Cannot retrieve both child and parent documents in a single query. |
| 7. | The nested object gets duplicated for each parent object. | No data duplication is involved, since the relationship is normalized. |
| 8. | If a nested object changes, all of the parent objects have to be re-indexed. | No need to re-index the parent, because only a connection is maintained between them. |

Nested objects may be preferred over parent-child for handling associations for the following reasons:

  • Database models often contain multiple nested associations at multiple levels, and these can be handled easily with the nested object approach.

  • In the parent-child approach, a child object cannot have multiple parent objects.

  • Queries are easier to perform in the nested object approach.

  • The parent-child approach cannot retrieve both child and parent fields in a single query.

Performance Considerations: Global Ordinals

Global ordinals are a data structure built on top of fielddata and doc values that maintains an incremental numbering for each unique term in lexicographic order. Each term has a unique number, and the number of term A is lower than the number of term B whenever A sorts before B. Global ordinals are only supported on text and keyword fields.

Fielddata and doc values also have ordinals, which are a unique numbering for all terms in a particular segment and field. Global ordinals build on top of these by providing a mapping between the segment ordinals and the global ordinals, the latter being unique across the entire shard.
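
The relationship between segment ordinals and global ordinals can be sketched in a few lines of Python. This is a toy model, not how Lucene stores the data: the two segments and their terms are hypothetical, and each segment is represented simply as its sorted list of unique terms.

```python
# Toy model: each segment holds its own sorted list of unique terms,
# so a term's segment ordinal is its position in that list.
segments = [
    ["apple", "cherry"],           # segment 0: apple=0, cherry=1
    ["banana", "cherry", "date"],  # segment 1: banana=0, cherry=1, date=2
]


def build_global_ordinals(segments):
    """Return (global_terms, mappings).

    global_terms: sorted unique terms across the shard; index = global ordinal.
    mappings[s][seg_ord]: the global ordinal for segment s, segment ordinal seg_ord.
    """
    global_terms = sorted({term for seg in segments for term in seg})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    mappings = [[global_ord[term] for term in seg] for seg in segments]
    return global_terms, mappings


global_terms, mappings = build_global_ordinals(segments)
print(global_terms)  # ['apple', 'banana', 'cherry', 'date']
print(mappings)      # [[0, 2], [1, 2, 3]]
```

Note that "cherry" has segment ordinal 1 in both segments but a single global ordinal, 2, across the shard. Any change to the segments means the mapping has to be built again.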

Global ordinals are used for features that use segment ordinals, such as sorting and the terms aggregation, to improve the execution time. A terms aggregation relies purely on global ordinals to perform the aggregation at the shard level, then converts global ordinals to the real term only for the final reduce phase, which combines results from different shards.
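
As a rough illustration of that flow, the Python sketch below counts documents per global ordinal on one shard and resolves ordinals to terms only at the end, before merging with another shard. The term list, the per-document ordinals, and the second shard's result are all hypothetical.

```python
from collections import Counter

# Hypothetical shard state: the global term list (index = global ordinal)
# and, for each document, the global ordinal of its field value.
global_terms = ["apple", "banana", "cherry", "date"]
doc_global_ords = [2, 0, 2, 3, 2, 0]


def shard_terms_agg(doc_ords):
    """Count documents per global ordinal; only integers are compared here."""
    return Counter(doc_ords)


def resolve(ordinal_counts, terms):
    """Convert ordinals back to real terms, as needed for the final reduce."""
    return {terms[o]: n for o, n in ordinal_counts.items()}


counts_by_ord = shard_terms_agg(doc_global_ords)
shard_result = resolve(counts_by_ord, global_terms)

# Ordinals are only meaningful within one shard, so results from
# different shards are combined by term, not by ordinal.
other_shard_result = {"banana": 4, "cherry": 1}
merged = Counter(shard_result) + Counter(other_shard_result)
print(dict(merged))
```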

Parent-child uses global ordinals to speed up joins. Global ordinals need to be rebuilt after any change to a shard. The more parent ID values are stored in a shard, the longer it takes to rebuild the global ordinals for the _parent field.

Global ordinals, by default, are built eagerly: if the index has changed, global ordinals for the _parent field will be rebuilt as part of the refresh. This can add significant time to the refresh. Most of the time, however, this is the right trade-off; otherwise, global ordinals are rebuilt when the first parent-child query or aggregation is used. This can introduce a significant latency spike for your users, and it is usually worse because, when many writes are occurring, several rebuilds of the global ordinals for the _parent field may be attempted within a single refresh interval.

When parent-child is used infrequently and writes occur frequently, it may make sense to disable eager loading:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d'
{
 "mappings": {
   "my_parent": {},
   "my_child": {
     "_parent": {
       "type": "my_parent",
       "eager_global_ordinals": false
     }
   }
 }
}
'

The amount of heap used by global ordinals can be checked as follows:

# Per-index
curl -XGET 'localhost:9200/_stats/fielddata?human&fields=_parent&pretty'
# Per-node per-index
curl -XGET 'localhost:9200/_nodes/stats/indices/fielddata?human&fields=_parent&pretty'

Conclusion

The ability to join multiple generations (like grandparents and grandchildren) sounds attractive until you think of the costs involved:

  • The more joins you have, the worse performance will be.

  • Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

  • Use parent-child relationships sparingly, and only when there are many more children than parents.

  • Avoid using multiple parent-child joins in a single query.

  • Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.

  • Keep the parent IDs short, so that they compress better in doc values, and use less memory when transiently loaded.
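
As an example of the scoring advice, the Python sketch below builds a search request body that uses the has_child query with score_mode set to none. The helper name `has_child_body` is made up for illustration, the child type `my_child` is taken from the mapping shown earlier, and the `text` field and its value are hypothetical.

```python
import json


def has_child_body(child_type, child_query, score_mode="none"):
    """Build a search body with a has_child query.

    With score_mode "none", the scores of matching children are
    not folded into the parent's score.
    """
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "score_mode": score_mode,
                "query": child_query,
            }
        }
    }


# "my_child" comes from the mapping example; the "text" field is hypothetical.
body = has_child_body("my_child", {"match": {"text": "elasticsearch"}})
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a search request against the index, for example with curl as in the earlier snippets.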

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. You can also now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
