Elasticsearch is a different kind of beast, especially if you come from the world of SQL. It comes with many benefits: performance, scale, near real-time search, and analytics across massive amounts of data.

Handling relationships between entities is not as obvious as it is with a dedicated relational store. The golden rule of a relational database, i.e., normalize your data, does not apply to Elasticsearch. This tutorial series will walk through Handling Relationships, Nested Objects, and Parent-Child Relationship to discuss the pros and cons of each of the available approaches.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Handling Relationships

In the real world, relationships matter: blog posts have comments, bank accounts have transactions, customers have bank accounts, orders have order lines, and directories have files and subdirectories.

Relational databases are specifically designed to manage relationships:

  • Each entity (or row, in the relational world) can be uniquely identified by a primary key.

  • Entities are normalized. The data for a unique entity is stored only once, and related entities store just its primary key. Changing the data of an entity has to happen in only one place.

  • Entities can be joined at query time, allowing for cross-entity search.

  • Changes to a single entity are atomic, consistent, isolated, and durable.

  • Most relational databases support ACID transactions across multiple entities.

But relational databases have limitations beyond their poor support for full-text search. Joining entities at query time is expensive, and the more joins a query requires, the more expensive it becomes. Performing joins between entities that live on different hardware is so expensive that it is just not practical. This places a limit on the amount of data that can be stored on a single server.

Elasticsearch, like most NoSQL databases, treats the world as though it were flat. An index is a flat collection of independent documents. A single document should contain all of the information that is required to decide whether it matches a search request.

While changing the data of a single document in Elasticsearch is ACIDic, transactions involving multiple documents are not. There is no way to roll back the index to its previous state if part of a transaction fails.

This flat modelling has its advantages:

  • Indexing is fast and lock-free.

  • Searching is fast and lock-free.

  • Massive amounts of data can be spread across multiple nodes, because each document is independent of the others.

But relationships matter. Somehow, we need to bridge the gap between this flat model and the real world. Four common techniques are used to manage relational data in Elasticsearch:

  • Application-side joins
  • Data denormalization
  • Nested objects
  • Parent/child relationship

Application-side Joins

We can emulate a relational database by implementing joins in our application. For instance, let’s say we are indexing users and their blog posts. Following the relational approach, we would store each user and each blog post as a separate document. The index, type, and id of each document together function as a primary key:

curl -XPUT 'ES_HOST:ES_PORT/test_index/user/1' -d '{
  "name": "Robert Frost",
  "email": "robert@frost.com",
  "dob": "1990/07/12"
}'

The blogpost links to the user by storing the user’s id. The index and type aren’t required because they are hardcoded in our application.

curl -XPUT 'ES_HOST:ES_PORT/test_index/blogpost/2' -d '{
  "title": "Search Relevance",
  "body": "Ecommerce Search...",
  "user": 1
}'

Finding blog posts by user with ID 1 is easy:

curl -XGET 'ES_HOST:ES_PORT/test_index/blogpost/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": { "user": 1 }
      }
    }
  }
}'

In order to find blog posts by a user called Robert, we would need to run two queries: the first would look up all users called Robert in order to find their IDs, and the second would pass those IDs in a query similar to the preceding one:

curl -XGET 'ES_HOST:ES_PORT/test_index/user/_search' -d '{
  "query": {
    "match": {
      "name": "Robert"
    }
  }
}'

Here, the values in the terms filter would be populated with the results from the first query.

curl -XGET 'ES_HOST:ES_PORT/test_index/blogpost/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "terms": { "user": [1] }
      }
    }
  }
}'

The main advantage of application-side joins is that the data is normalized. Changing the user’s name has to happen in only one place: the user document. The disadvantage is that you have to run extra queries in order to join documents at search time.
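The two-step lookup above can be sketched in application code. This is only a minimal Python sketch under stated assumptions: `search` is a hypothetical callable standing in for whatever client call sends a request body to `test_index/<type>/_search`, and `find_posts_by_user_name` and `fake_search` are illustrative names, not part of any Elasticsearch client.

```python
def find_posts_by_user_name(search, name):
    """Application-side join: resolve user IDs first, then fetch their posts.

    `search(doc_type, body)` is a hypothetical callable that sends `body` to
    test_index/<doc_type>/_search and returns the parsed JSON response.
    """
    # Query 1: find every user whose name matches.
    users = search("user", {"query": {"match": {"name": name}}})
    # Assumes numeric document IDs, as in the examples above.
    user_ids = [int(hit["_id"]) for hit in users["hits"]["hits"]]
    if not user_ids:
        return []  # no matching users, so skip the second round trip

    # Query 2: pass the collected IDs to a terms filter on the blogpost type.
    body = {"query": {"filtered": {"filter": {"terms": {"user": user_ids}}}}}
    posts = search("blogpost", body)
    return [hit["_source"] for hit in posts["hits"]["hits"]]


def fake_search(doc_type, body):
    """In-memory stand-in for a real client call, for illustration only."""
    if doc_type == "user":
        return {"hits": {"hits": [{"_id": "1", "_source": {"name": "Robert Frost"}}]}}
    wanted = body["query"]["filtered"]["filter"]["terms"]["user"]
    posts = [{"_id": "2", "_source": {"title": "Search Relevance", "user": 1}}]
    return {"hits": {"hits": [p for p in posts if p["_source"]["user"] in wanted]}}


titles = [post["title"] for post in find_posts_by_user_name(fake_search, "Robert")]
print(titles)
```

Every search by user name costs two round trips, and the list of IDs passed to the second query grows with the number of matching users.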

Denormalizing Your Data

The way to get the best search performance out of Elasticsearch is to use it as it is intended, by denormalizing your data at index time. Having redundant copies of data in each document that requires access to it removes the need for joins.

If we want to be able to find a blog post by the name of the user who wrote it, include the user’s name in the blog-post document itself:

curl -XPUT 'ES_HOST:ES_PORT/test_index/user/1' -d '{
  "name": "Robert Frost",
  "email": "robert@frost.com",
  "dob": "1990/07/12"
}'

Here, part of the user’s data has been denormalized into the blog post document.

curl -XPUT 'ES_HOST:ES_PORT/test_index/blogpost/2' -d '{
  "title": "Search Relevance",
  "body": "Ecommerce Search...",
  "user": {
    "id": 1,
    "name": "Robert Frost"
  }
}'

Now, we can find blog posts about relevance by users called Robert with a single query:

curl -XGET 'ES_HOST:ES_PORT/test_index/blogpost/_search' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Relevance" }},
        { "match": { "user.name": "Robert"    }}
      ]
    }
  }
}'

The advantage of data denormalization is speed. Because each document contains all of the information that is required to determine whether it matches the query, there is no need for expensive joins.
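At index time, the application has to copy the relevant user fields into each blog post before sending it to Elasticsearch. A minimal Python sketch of that step follows. The helper name `denormalize_post` and its `fields` parameter are assumptions made for illustration, and the dictionaries mirror the sample documents used in this post.

```python
def denormalize_post(post, user, fields=("name",)):
    """Return a copy of `post` whose `user` ID is replaced by an embedded object.

    Only the user fields listed in `fields` are copied, so the blog post
    carries just the data that searches need.
    """
    embedded = {"id": post["user"]}
    for field in fields:
        embedded[field] = user[field]
    doc = dict(post)  # shallow copy; the input post is left untouched
    doc["user"] = embedded
    return doc


user = {"name": "Robert Frost", "email": "robert@frost.com", "dob": "1990/07/12"}
post = {"title": "Search Relevance", "body": "Ecommerce Search...", "user": 1}
doc = denormalize_post(post, user)
print(doc)
```

The resulting `doc` has the same shape as the blog post indexed above, with the user's id and name nested under `user`.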

Field Collapsing

A common requirement is the need to present search results grouped by a particular field. We might want to return the most relevant blog posts grouped by the user’s name. Grouping by name implies the need for a terms aggregation. In order to be able to group on the user’s name as it is, the name field should be available in its original not_analyzed form.

curl -XPUT 'ES_HOST:ES_PORT/test_index/_mapping/blogpost' -d '{
  "properties": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}'
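To see why the raw sub-field matters, the Python sketch below imitates how a terms aggregation would bucket two user names with and without analysis. It is a rough approximation only: `analyzed_terms` simply lowercases and splits on non-word characters, which is an assumption standing in for the real analyzer, and both function names are illustrative.

```python
import re
from collections import Counter


def analyzed_terms(value):
    """Rough approximation of analysis: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", value.lower()))


def bucket_counts(values, analyzed):
    """Count documents per term, the way a terms aggregation buckets them."""
    counts = Counter()
    for value in values:
        terms = analyzed_terms(value) if analyzed else {value}
        counts.update(terms)  # each document counts once per distinct term
    return dict(counts)


names = ["Robert Frost", "Robert Smith"]
by_token = bucket_counts(names, analyzed=True)
by_raw = bucket_counts(names, analyzed=False)
print(by_token)
print(by_raw)
```

Grouping on the analyzed field would produce a single `robert` bucket shared by both users, whereas the not_analyzed form keeps one bucket per full name.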

Then add some data:

curl -XPUT 'ES_HOST:ES_PORT/test_index/user/1' -d '{
  "name": "Robert Frost",
  "email": "robert@frost.com",
  "dob": "1990/07/12"
}'
curl -XPUT 'ES_HOST:ES_PORT/test_index/blogpost/2' -d '{
  "title":    "Search Relevance",
  "body":     "Ecommerce Search...",
  "user": {
    "id": 1,
    "name": "Robert Frost"
  }
}'
curl -XPUT 'ES_HOST:ES_PORT/test_index/user/3' -d '{
  "name": "Robert Smith",
  "email": "robert@smith.com",
  "dob": "1979/01/04"
}'
curl -XPUT 'ES_HOST:ES_PORT/test_index/blogpost/4' -d '{
  "title": "Scoring and Boosting",
  "body": "Scoring in Ecommerce...",
  "user": {
    "id": 3,
    "name": "Robert Smith"
  }
}'

Now we can run a query looking for blog posts about ecommerce by users called Robert, and group the results by user, thanks to the top_hits aggregation. Here, the top_score aggregation orders the terms in the users aggregation by the top-scoring document in each bucket, and the top_hits aggregation returns just the title field of the five most relevant blog posts for each user:

curl -XGET 'ES_HOST:ES_PORT/test_index/blogpost/_search' -d '{
  "size" : 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "Ecommerce" }},
        { "match": { "user.name": "Robert" }}
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "user.name.raw",
        "order": { "top_score": "desc" }
      },
      "aggs": {
        "top_score": { "max": { "script": "_score"}},
        "blogposts": { "top_hits": { "_source": "title", "size": 5 }}
      }
    }
  }
}'

The abbreviated response consists of a bucket for each user who appeared in the top results. Under each user bucket there is a blogposts.hits.hits array containing the top results for that user. The user buckets are sorted by the user’s most relevant blog post.

...
"hits": {
  "total":     2,
  "max_score": 0,
  "hits":      []
},
"aggregations": {
  "users": {
     "buckets": [
        {
           "key": "Robert Smith",
           "doc_count": 1,
           "blogposts": {
              "hits": {
                 "total":     1,
                 "max_score": 0.34534337,
                 "hits": [
                    {
                       "_index": "test_index",
                       "_type":  "blogpost",
                       "_id":    "4",
                       "_score": 0.34534337,
                       "_source": {
                          "title": "Scoring and Boosting"
                       }
                    }
                 ]
              }
           },
           "top_score": {
              "value": 0.34534332
           }
        },
...

Using the top_hits aggregation is the equivalent of running a query to return the names of the users with the most relevant blog posts and then running the same query for each user to get their best blog posts, but it is much more efficient.

The top hits returned in each bucket are the result of running a light mini-query based on the original main query. The mini-query supports the usual features that you would expect from search such as highlighting and pagination.
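On the application side, the grouped response still has to be unpacked. The Python sketch below walks the buckets and collects each user's top titles. The function name `top_posts_by_user` is hypothetical, and `sample` is a hand-written dictionary shaped like the abbreviated response above, not output captured from a cluster.

```python
def top_posts_by_user(response, agg="users", hits_agg="blogposts"):
    """Flatten the grouped response into (user, top_score, titles) tuples."""
    results = []
    for bucket in response["aggregations"][agg]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        titles = [hit["_source"]["title"] for hit in hits]
        results.append((bucket["key"], bucket["top_score"]["value"], titles))
    return results


# Hand-written sample shaped like the abbreviated response above.
sample = {
    "aggregations": {
        "users": {
            "buckets": [
                {
                    "key": "Robert Smith",
                    "doc_count": 1,
                    "blogposts": {"hits": {"total": 1, "hits": [
                        {"_id": "4", "_source": {"title": "Scoring and Boosting"}}
                    ]}},
                    "top_score": {"value": 0.34534332},
                }
            ]
        }
    }
}
grouped = top_posts_by_user(sample)
print(grouped)
```

Because the buckets arrive already ordered by top_score, the list keeps the users in order of their most relevant blog post.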

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
