Here's a twist on the old adage: An ounce of prevention is worth a kiloton of user satisfaction.

It's no secret that we're big fans of Elasticsearch. But we've seen more than a few customers crash their clusters—in a variety of ways. Most of those failures are quite preventable. It's often a matter of a simple misunderstanding, and the remedy is usually fairly easy to apply.

As we did in our recent article on field data, we invite you to think through a number of other potential problems. With little effort, you can apply several key practices that will improve both the stability and the performance of your ES cluster.

In this article, we share our easy-to-implement advice for avoiding the serious problems of a sluggish, unresponsive, or dead Elasticsearch cluster.

1. Don't Allocate Far Too Many Shards

Without question, too many shards can be a bad thing. That's because there is an incremental, cumulative resource cost to every shard and every index, even if an index doesn't contain any documents. It's also critically important to realize that the number of primary shards for an index is fixed at the time the index is created. If you later find it necessary to change the number of shards, you will need to reindex all the source documents into a new index.
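
As a quick illustration, here is a minimal sketch using the official Python client (the index name and settings values are hypothetical): the primary shard count is set once at index creation, while the replica count can still be adjusted on a live index.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a cluster reachable at localhost:9200

    # The number of primary shards is fixed when the index is created.
    es.indices.create(
        index="products",  # hypothetical index name
        body={
            "settings": {
                "number_of_shards": 3,    # cannot be changed after creation
                "number_of_replicas": 1,  # can be updated on a live index
            }
        },
    )

    # Only the replica count can be adjusted in place later on.
    es.indices.put_settings(index="products", body={"number_of_replicas": 2})

If you later decide that three primaries was the wrong call, the path forward is a new index and a reindex of all the source documents.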

So, how many shards should you allocate? While we can't give you a one-size-fits-all answer, we do invite you to grab some coffee and work through our extensive treatment of this topic, Optimizing Elasticsearch: How Many Shards per Index?

2. Don't Let your Mappings Get out of Control

Your Elasticsearch cluster will surely and quickly consume all of its memory if you blatantly ignore the distinction between keys and values during indexing. If your keys are derived from your values (for example, using a user ID or a timestamp as a field name), then every new value introduces a new field, and the mapping that Elasticsearch builds from your data will grow without limit. If you have keys that change according to values, we strongly recommend that you restructure the documents to have fixed keys. We also recommend that you invest time exploring nested documents.
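
To make the distinction concrete, here is a small sketch (the field names and values are hypothetical). The first document uses a value, a user ID, as a key, so every new user adds a new field to the mapping; the other two keep the keys fixed, so the mapping stays the same no matter how many users you index.

    # Anti-pattern: the key is derived from a value, so each new user ID
    # becomes a brand-new field in the index mapping.
    bad_doc = {
        "user_47261": {"plan": "pro", "logins": 12},
    }

    # Better: fixed keys, with the variable part stored as a value.
    good_doc = {
        "user_id": "47261",
        "plan": "pro",
        "logins": 12,
    }

    # If one document must hold many such entries, a list of objects
    # (optionally mapped as a nested type) also keeps the keys fixed.
    good_doc_nested = {
        "users": [
            {"user_id": "47261", "plan": "pro", "logins": 12},
            {"user_id": "58310", "plan": "free", "logins": 3},
        ],
    }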

Read more about mappings and other troubleshooting techniques in our article Troubleshooting in Elasticsearch: Queries, Mappings, and Scoring. Also, have a look at our blog article on Parent-Child Relationships in Elasticsearch.

3. Don't Undersize your Cluster

Of course, an inordinate request load on a small-capacity cluster can cause a very painful and costly crash. Proper sizing is essential, so the foremost question here is typically one of bigger nodes or additional nodes.

An exhaustive answer would require a lengthy response, but here is some concise, useful guidance. Our team at Qbox recommends starting with at least a medium/large box (2 cores and 3-4 GB of RAM in a 2-node setup), and our ongoing experience continues to validate this seemingly arbitrary suggestion.

Why bigger boxes? Well, if you’ve got a small dataset that’s relatively static in size, maybe an m1.small or even a micro instance on EC2 would work. If the search load changes, you could just add or remove nodes.

However, most application datasets can be expected to grow, and resizing servers means at least some downtime. There's really no way around it without a tedious custom script to swap out nodes, and probably a massive headache. So choose a box size with some room to grow. Then you can add nodes for scale with little or no downtime.

Why more than one node? At first glance, it might seem economical to choose one large node instead of several smaller ones. It seems logical: big dataset, low search volume. Why complicate the setup with more nodes, you might ask.

While you may think you don't need multiple nodes to start, we've found that our customers usually overload a single-node cluster very quickly. Unthrottled bulk inserts come blazing in, GC duration spikes occur, queries hang in the balance waiting for IO access ... panic!

Elasticsearch was built to withstand situations like this. It is, after all, a distributed search engine! Nodes working in parallel make a happy cluster.
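
If you do begin with a single node, it's worth watching for exactly these symptoms before they turn into a crash. Here is a rough sketch, assuming the official Python client and a reachable cluster, that checks cluster health and per-node heap usage so you have some warning that it's time to add a node (the 85 percent threshold is only an illustrative value, not a hard rule):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a cluster reachable at localhost:9200

    # Overall cluster status: green, yellow, or red.
    health = es.cluster.health()
    print("status:", health["status"], "| nodes:", health["number_of_nodes"])

    # Per-node JVM stats reveal the heap pressure behind long GC pauses.
    stats = es.nodes.stats(metric="jvm")
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print(node["name"], "heap used:", heap_pct, "%")
        if heap_pct > 85:  # illustrative threshold
            print("  -> sustained heap pressure; consider adding a node")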

We offer many more thoughtful details on scaling your cluster in our article Thoughts on Launching and Scaling Elasticsearch.

4. Don't Over-size the Size Parameter for your Search Results

Simply put, many developers just don't give this much thought. Knowing that a search won't return that many hits, they still want all of them without any paging, so they jack the size parameter up to an insanely large value, such as Integer.MAX_VALUE.

When code is being optimized, it is common to assume that a query will usually have more hits than the number of results actually returned. Elasticsearch makes that assumption too, and it prepares internal data structures large enough to accommodate the number of documents specified by the size parameter. So if the size parameter is far too large, Elasticsearch will build a very large internal data structure. As you might now guess, there is a cost: a seemingly interminable delay for a relatively insignificant result.

In recent versions of Elasticsearch, this isn't as much of a problem as with earlier releases. But some risk remains. Even if you have done your testing and have high confidence that there is plenty of heap space, you can encounter performance degradation and induce a crash merely because one of those data structures requires contiguous memory on the heap. And, your cluster may be unable to provide that memory without performing intensive garbage collection.

We recommend a better alternative: use the scan and scroll API instead.
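
To sketch the difference with the official Python client (the index name and query are hypothetical): rather than cranking up the size parameter, the scan helper pages through the entire result set via the scroll API in modest batches.

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch()  # assumes a cluster reachable at localhost:9200

    query = {"query": {"match": {"status": "active"}}}  # hypothetical query

    # Anti-pattern: a huge size forces Elasticsearch to build internal
    # structures sized for an absurd number of hits in a single response.
    #   es.search(index="products", body={**query, "size": 2000000000})

    # Better: let the scroll API stream the results in small batches.
    for hit in scan(es, index="products", query=query, size=500, scroll="2m"):
        print(hit["_source"])  # replace with your own processing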

A Solid Strategy: Optimize your Cluster with Hosted Elasticsearch

To help you avoid various sorts of problems like field data that goes haywire, we've written a number of articles to help you configure, size, scale, and optimize your hosted Elasticsearch cluster.

We're always ready to help you achieve maximum success with your ES environment, and we hope that you find this article helpful. Stay tuned for more updates and technical bulletins.
