Complex relational databases can lead to tortuous SQL queries and slow responses from the web application. If you’re trying to return a long list of objects that are built up from five, ten or even seventeen related tables your response times can be unacceptably slow. 

Such problems are encountered regularly in large and complex data modeling applications. We have found that using Elasticsearch along with some conventions for denormalising complex objects can make it easy to generate sufficiently speedy responses, even when they are returning lots of rows.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We have already discussed on modeling and handling relationships in elasticsearch in the post Modeling and Managing Relationships in Elasticsearch. We shall be carrying forward our discussion here with specific focus on denormalization.

Data denormalisation has downsides too. The first disadvantage is that the index will be bigger because the _source document for every blog post is bigger, and there are more indexed fields. This usually isn’t a huge problem. The data written to disk is highly compressed, and disk space is cheap. Elasticsearch can happily cope with the extra data.

The more important issue is that, if the user were to change his name, all of his blog posts would need to be updated too. Fortunately, users don’t often change names. Even if they did, it is unlikely that a user would have written more than a few thousand blog posts, so updating blog posts with the scroll and bulk APIs would take less than a second.

However, let’s consider a more complex scenario in which changes are common, far reaching, and, most important, concurrent.

In this example, we are going to emulate a filesystem with directory trees in Elasticsearch, much like a filesystem on Linux: the root of the directory is /, and each directory can contain files and sub directories.

We want to be able to search for files that live in a particular directory, the equivalent of this:

grep "some text" /qbox/cloud/elasticsearch/*

This requires us to index the path of the directory where the file lives:

curl -XPUT 'ES_HOST:ES_PORT/filesystem/file/1' -d '{
  "name":  "elasticsearch.yml",
  "path":  "/qbox/cloud/elasticsearch",
  "contents": "Launch our supported, fully-managed, RESTful search application instantly."
}'

We also want to be able to search for files that live anywhere in the directory tree below a particular directory, the equivalent of this:

grep -r "some text" /qbox

To support this, we need to index the path hierarchy:

  • /qbox

  • /qbox/cloud

  • /qbox/cloud/elasticsearch

 This hierarchy can be generated automatically from the path field using the path_hierarchy tokenizer:

The path_hierarchy tokenizer takes something like this:

/something/something/else

And produces following tokens:

/something
/something/something
/something/something/else
curl -XPUT 'ES_HOST:ES_PORT/filesystem' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}'

The name field will contain the exact name and the path field will contain the exact directory name, while the path.tree field will contain the path hierarchy. The mapping for the file type would look like this:

curl -XPUT 'ES_HOST:ES_PORT/filesystem/_mapping/file' -d '{
  "properties": {
    "name": {
      "type":  "string",
      "index": "not_analyzed"
    },
    "path": {
      "type":  "string",
      "index": "not_analyzed",
      "fields": {
        "tree": {
          "type":     "string",
          "analyzer": "paths"
        }
      }
    }
  }
}'

Once the index is set up and the files have been indexed, we can perform a search for files containing elasticsearch in just the /qbox/cloud/elasticsearch directory like this:

curl -XGET 'ES_HOST:ES_PORT/filesystem/file/_search'  -d '{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "contents": "elasticsearch"
        }
      },
      "filter": {
        "term": {
          "path": "/qbox/cloud/elasticsearch"
        }
      }
    }
  }
}'

Every file that lives in any subdirectory under /qbox will include the term /qbox in the path.tree field. So we can search for all files in any sub directory of /qbox as follows:

curl -XGET 'ES_HOST:ES_PORT/filesystem/file/_search' -d '{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "contents": "elasticsearch"
        }
      },
      "filter": {
        "term": {
          "path.tree": "/qbox"
        }
      }
    }
  }
}'

Renaming Files and Directories

So far, so good. Renaming a file is easy, a simple update or index request is all that is required. We can even use optimistic concurrency control to ensure that our change doesn’t conflict with the changes from another user.

Elasticsearch is distributed. When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel, and may arrive at their destination out of sequence. Elasticsearch needs a way of ensuring that an older version of a document never overwrites a newer version.

We can take advantage of the _version number to ensure that conflicting changes made by our application do not result in data loss. We do this by specifying the version number of the document that we wish to change. If that version is no longer current, our request fails.

Here, the version number ensures that the change is applied only if the document in the index has this same version number.

curl -XPUT 'ES_HOST:ES_PORT/filesystem/file/1?version=2' -d '{
  "name":  "elasticsearch.ym",
  "path": "/qbox/cloud/elasticsearch",
  "contents": "Launch our supported, fully-managed, RESTful search application instantly. "
}'

We can even rename a directory, but this means updating all of the files that exist anywhere in the path hierarchy beneath that directory. This may be quick or slow, depending on how many files need to be updated. All we would need to do is to use scroll to retrieve all the files, and the bulk API to update them. The process isn’t atomic, but all files will quickly move to their new home.

In this post, we have emulated a filesystem with directory trees in Elasticsearch, much like a filesystem on Linux: the root of the directory is /, and each directory can contain files and sub directories. The problem comes when we want to allow more than one person to rename files or directories at the same time. We shall be discussing about Concurrency issues in our upcoming post.

Conclusion

Denormalisation is a strategy used on a previously-normalised database to increase performance. The idea behind it is to add redundant data where we think it will help us the most. We can use extra attributes in an existing table, add new tables, or even create instances of existing tables. The usual goal is to decrease the running time of select queries by making data more accessible to the queries or by generating summarised reports in separate tables.

A normalized database is the starting point for the denormalization process. It’s important to differentiate from the database that has not been normalized and the database that was normalized first and then denormalized later. The second one is okay; the first is often the result of bad database design or a lack of knowledge.

Elasticsearch provides two concepts that can help with denormalization:

  1. Nested Document /Query

  2. Parent & Child Relationship.

We will be continue our discussion on these relationship modeling techniques in upcoming posts of this series.

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.

comments powered by Disqus