When we index data, documents rarely exist in isolation. Sometimes we are better off denormalizing related data into each document. For example, if we were modeling blog posts, adding an author field to each blog document could be a sensible choice, even if the authoritative data source, the database, splits the data into separate authors and blogs tables. It’s simple, and we can easily construct queries on both the attributes of the blogs and the author’s name.

For this post, we will be using hosted Elasticsearch on Qbox.io. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

Denormalizing is not always practical, though, because there might be too much data in the parent document to duplicate in each child document. Let’s consider a simple blogging application where authors publish blog posts and viewers comment on them. It wouldn’t be a good choice to repeat the entire content of the blog post in each comment, as this could multiply the size of the indexed data. On the other hand, without indexing the content of the blog post in each comment, we can’t easily write queries to find comments on posts matching certain criteria.

Handling relationships between entities in Elasticsearch is not as obvious as it is with a dedicated relational store. The golden rule of a relational database, normalize your data, does not apply to Elasticsearch. We will walk through nested objects and parent-child relationships in our next few tutorials to discuss the pros and cons of each approach.

Because creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document. For instance, we could store an order and all of its order lines in one document, or we could store a blog post and all of its comments together, by passing an array of comments:

curl -XPUT 'ES_HOST:ES_PORT/blogs/series/1' -d '{
  "title": "Supergiant",
  "body":  "Kubernetes at scale ...",
  "tags":  [ "kubernetes", "cloud",  "container"],
  "comments": [
    {
      "name":    "Adam Vanderbush",
      "comment": "Automate orchestration and deployment",
      "age":     32,
      "rating":   3,
      "date":    "2017-09-05"
    },
    {
      "name":    "Brian Sage",
      "comment": "Supergiant helps ...",
      "age":     28,
      "rating":   5,
      "date":    "2017-09-15"
    }
  ]
}'

Since all of the content is in the same document, there is no need to join blog posts and comments at query time, so searches perform well. The problem is that the preceding document would match a query like this:

curl -XGET 'ES_HOST:ES_PORT/blogs/series/_search' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "comments.name": "Adam" }},
        { "match": { "comments.age":  28 }}
      ]
    }
  }
}'

Adam is 32 and not 28. The reason for this cross-object matching is that our beautifully structured JSON document is flattened into a simple key-value format in the index that looks like this:

{
  "title":            [ supergiant ],
  "body":             [ kubernetes, at, scale ],
  "tags":             [ kubernetes, cloud, container ],
  "comments.name":    [ adam, vanderbush, brian, sage ],
  "comments.comment": [ automate, orchestration, and, deployment, supergiant, helps ],
  "comments.age":     [ 32, 28 ],
  "comments.rating":  [ 3, 5 ],
  "comments.date":    [ 2017-09-05, 2017-09-15 ]
}

The correlation between Adam and 32, or between Brian and 2017-09-15, has been irretrievably lost. While fields of type object are useful for storing a single object, they are useless, from a search point of view, for storing an array of objects.
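The flattening described above can be imitated in a few lines of Python. This is only a rough sketch, not how Elasticsearch is implemented: the `flatten` and `matches_all` helpers are hypothetical names, and analysis is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def flatten(doc, prefix=""):
    """Flatten a JSON-like dict into {dotted_field: [values]}.

    Simplified model of how an object-typed field is stored:
    strings are lowercased and split on whitespace, other values kept as-is.
    """
    flat = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Inner objects contribute to the same dotted field names,
                # so values from different array elements are merged.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat[sub_path].extend(sub_values)
            elif isinstance(item, str):
                flat[path].extend(item.lower().split())
            else:
                flat[path].append(item)
    return dict(flat)

def matches_all(flat, criteria):
    """True if every wanted value appears somewhere in its flattened field."""
    return all(value in flat.get(field, []) for field, value in criteria.items())

blog = {
    "title": "Supergiant",
    "comments": [
        {"name": "Adam Vanderbush", "age": 32},
        {"name": "Brian Sage", "age": 28},
    ],
}

flat = flatten(blog)
print(flat["comments.name"])  # ['adam', 'vanderbush', 'brian', 'sage']
print(flat["comments.age"])   # [32, 28]
# "adam" and 28 come from different comments, yet both conditions hold:
print(matches_all(flat, {"comments.name": "adam", "comments.age": 28}))  # True
```

Once the values of both comments share the same lists, nothing records which name belonged to which age, which is why the query matches.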

This is the problem that nested objects are designed to solve. By mapping the comments field as type nested instead of type object, each nested object is indexed as a hidden separate document, something like this:

{
  "comments.name":    [ adam, vanderbush ],
  "comments.comment": [ Automate, orchestration, and, deployment ],
  "comments.age":     [ 32 ],
  "comments.rating":  [ 3 ],
  "comments.date":    [ 2017-09-05 ]
}
{
  "comments.name":    [ brian, sage ],
  "comments.comment": [ supergiant, helps ],
  "comments.age":     [ 28 ],
  "comments.rating":  [ 5 ],
  "comments.date":    [ 2017-09-15 ]
}
{
  "title":            [ supergiant ],
  "body":             [ kubernetes, at, scale ],
  "tags":             [ kubernetes, cloud, container ]
}

Indexing each nested object separately means that the fields within each object keep their relationships. We can run queries that match only if the match occurs within the same nested object. Moreover, because of the way nested objects are indexed, joining the nested documents to the root document at query time is almost as fast as if they were a single document.
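To make the per-object matching concrete, here is a small Python model of the idea. It is a sketch under simplifying assumptions, not Elasticsearch internals: `split_nested`, `term_matches`, and `nested_match` are hypothetical helpers, and text matching is reduced to lowercase whitespace tokens.

```python
def split_nested(doc, path):
    """Return (root, nested_docs): the root without the nested field,
    plus one separate 'hidden' document per object in that field."""
    root = {k: v for k, v in doc.items() if k != path}
    nested_docs = [
        {f"{path}.{k}": v for k, v in obj.items()} for obj in doc.get(path, [])
    ]
    return root, nested_docs

def term_matches(value, wanted):
    """Simplified match: token lookup for strings, equality otherwise."""
    if isinstance(value, str):
        return wanted in value.lower().split()
    return value == wanted

def nested_match(doc, path, criteria):
    """True only if a single nested object satisfies every criterion."""
    _, nested_docs = split_nested(doc, path)
    return any(
        all(
            field in nested and term_matches(nested[field], wanted)
            for field, wanted in criteria.items()
        )
        for nested in nested_docs
    )

blog = {
    "title": "Supergiant",
    "comments": [
        {"name": "Adam Vanderbush", "age": 32},
        {"name": "Brian Sage", "age": 28},
    ],
}

# Adam is 32, so name=adam AND age=28 no longer matches:
print(nested_match(blog, "comments", {"comments.name": "adam", "comments.age": 28}))  # False
print(nested_match(blog, "comments", {"comments.name": "adam", "comments.age": 32}))  # True
```

Because every condition is checked against one object at a time, values from different comments can no longer combine into a false match.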

These extra nested documents are hidden; we can’t access them directly. To update, add, or remove a nested object, we have to reindex the whole document. Note that the result returned by a search request is not the nested object alone; it is the whole document.
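In practice, this means an application that wants to add a comment works on the full document source. The sketch below shows only the client-side step of building the new document body; `with_comment_added` is a hypothetical helper, and sending the result back to Elasticsearch (for example with the same PUT request used earlier) is left out.

```python
import copy

def with_comment_added(source, comment):
    """Return a new full document body with one more comment.

    The whole body, not just the new comment, is what gets reindexed.
    """
    updated = copy.deepcopy(source)  # leave the caller's copy untouched
    updated.setdefault("comments", []).append(comment)
    return updated

current = {
    "title": "Supergiant",
    "comments": [{"name": "Adam Vanderbush", "age": 32}],
}

new_body = with_comment_added(current, {"name": "Brian Sage", "age": 28})
print(len(current["comments"]))   # 1
print(len(new_body["comments"]))  # 2
```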

Nested Mapping

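Setting up a nested field is simple: where you would normally specify type object, make it type nested instead. The type of an existing field cannot be changed in place, so if you already indexed the example document above, delete the blogs index before creating it with this mapping: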

curl -XPUT 'ES_HOST:ES_PORT/blogs' -d '{
  "mappings": {
    "series": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string"  },
            "comment": { "type": "string"  },
            "age":     { "type": "short"   },
            "rating":  { "type": "short"   },
            "date":    { "type": "date"    }
          }
        }
      }
    }
  }
}'

Querying a Nested Object

Since the nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query to access them:

curl -XGET 'ES_HOST:ES_PORT/blogs/series/_search' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "supergiant"
          }
        },
        {
          "nested": {
            "path": "comments",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "comments.name": "adam"
                    }
                  },
                  {
                    "match": {
                      "comments.age": 32
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}'

Here, the title clause operates on the root document. The nested clause “steps down” into the nested comments field. It no longer has access to fields in the root document, nor fields in any other nested document. The comments.name and comments.age clauses operate on the same nested document. A nested field can contain other nested fields. Similarly, a nested query can contain other nested queries. The nesting hierarchy is applied as you would expect.
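When such request bodies are built in application code, a small helper keeps the nesting readable. The following is a sketch that only assembles the JSON body; `nested_query` is a hypothetical helper, and the `comments.replies` field is an assumed example of a nested field inside a nested field that does not exist in the mapping above.

```python
import json

def nested_query(path, clauses):
    """Wrap a list of query clauses in a nested query for the given path."""
    return {"nested": {"path": path, "query": {"bool": {"must": clauses}}}}

# Same structure as the request above: one root clause, one nested clause.
search_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "supergiant"}},
                nested_query("comments", [
                    {"match": {"comments.name": "adam"}},
                    {"match": {"comments.age": 32}},
                ]),
            ]
        }
    }
}

# Hypothetical two-level case: a nested query inside a nested query.
# The inner path is the full dotted path of the inner nested field.
two_level = nested_query("comments", [
    {"match": {"comments.name": "adam"}},
    nested_query("comments.replies", [
        {"match": {"comments.replies.name": "brian"}},
    ]),
])

print(json.dumps(search_body, indent=2))
```

The printed JSON can be passed as the request body of the _search call.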

A nested query can match several nested documents. Each matching nested document would have its own relevance score, but these multiple scores need to be reduced to a single score that can be applied to the root document.

By default, the nested query averages the scores of the matching nested documents. This can be controlled by setting the score_mode parameter to avg, max, sum, or even none (in which case the root document gets a constant score of 1.0).

In the following query, score_mode is set to max, so the root document gets the _score of the best-matching nested document:

curl -XGET 'ES_HOST:ES_PORT/blogs/series/_search' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "supergiant"
          }
        },
        {
          "nested": {
            "path": "comments",
            "score_mode": "max",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "comments.name": "adam"
                    }
                  },
                  {
                    "match": {
                      "comments.age": 32
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}'
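The way score_mode reduces several nested scores to a single value can be sketched in plain Python. This illustrates the arithmetic only; `reduce_scores` is a hypothetical function, the sample scores are made up, and the constant returned for none simply follows the description above and should be treated as an assumption that may vary between Elasticsearch versions.

```python
def reduce_scores(scores, score_mode="avg"):
    """Combine the scores of matching nested documents into one root score."""
    if not scores:
        raise ValueError("at least one nested document must match")
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "none":
        return 1.0  # constant score, as described in the text (assumption)
    raise ValueError(f"unknown score_mode: {score_mode}")

nested_scores = [0.5, 1.5]  # made-up scores of two matching comments

print(reduce_scores(nested_scores))          # 1.0 (default: avg)
print(reduce_scores(nested_scores, "max"))   # 1.5
print(reduce_scores(nested_scores, "sum"))   # 2.0
print(reduce_scores(nested_scores, "none"))  # 1.0
```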

If placed inside the filter clause of a Boolean query, a nested query matches the same documents as before, but the score_mode parameter has no effect. Because the query is used in a non-scoring context, where it only includes or excludes documents, there is nothing for score_mode to combine.
