In Elasticsearch, field data is generated at query time by reading the inverted index, inverting its term-to-document relationship, and storing the result in memory. This operation can be quite slow, and it often consumes far too much valuable heap space. Before you know it, your cluster is grinding toward a state of complete lethargy, but it need not be so! In this article, we summarize a recent Elastic article on the challenges of overgrown field data and the corresponding remedies.
Elastic Support reports that the #1 problem users hit when scaling their environment is traceable to field data that grows out of control. Unlike simple search operations, sorts and aggregations need to discover which terms appear in a particular field of a specific document. For these tasks and others, it’s necessary to have a data structure that’s the opposite of the Elasticsearch (inverted) index: one that maps each document to its terms rather than each term to its documents. That’s the purpose of fielddata.
Fielddata: A Big Problem That Usually Gets Bigger
Field data is generated at query time by reading the inverted index, inverting that data structure, and then storing the results in memory. This operation can be slow, especially with big segments, and it often consumes too much valuable heap space. The problem also tends to arrive abruptly: field data suddenly appears in memory when a query needs it, and without specific attention it can quickly become critical and hard to recover from.
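To make the "inverting" step concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how Elasticsearch or Lucene implements it; the `uninvert` function and the toy index contents are hypothetical.

```python
from collections import defaultdict


def uninvert(inverted_index):
    """Turn a term -> [doc ids] mapping into a doc id -> [terms] mapping.

    The result is the kind of per-document view that sorts and
    aggregations need, and it must be held somewhere (here, in memory).
    """
    doc_to_terms = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            doc_to_terms[doc_id].append(term)
    # Sort each term list so the output is deterministic.
    return {doc_id: sorted(terms) for doc_id, terms in doc_to_terms.items()}


# Toy inverted index for a single field (hypothetical data).
inverted = {
    "red": [1, 3],
    "green": [2],
    "blue": [1, 2],
}

fielddata = uninvert(inverted)
print(fielddata)
```

Note that the loop has to visit every term and every posting before a single document can be answered, which is why building this structure at query time over big segments is slow, and why the result is expensive to keep on the heap.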
Minimizing Use of Field Data
Most importantly, avoid generating field data in the first place by explicitly mapping your fields to use doc values. For any new index, this approach shifts the load up front by writing the fielddata to disk at index time. As necessary, Elasticsearch will then load the values outside of your Java heap. This way, you get fast access to on-disk fielddata through the file system cache, which gives near in-memory performance without incurring the cost of garbage collection. It also leaves plenty of headroom on the Elasticsearch heap for other operations such as bulk indexing and concurrent searches.
The big caveat, however, is that you must enable doc values in the mapping before you index any data; for data that is already indexed, that means reindexing. Elastic is apparently working on a remedy that will be available in the upcoming release of Elasticsearch, version 2.0. You can read more about those plans in the ES fielddata article.
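As a rough sketch of what such a mapping might look like, the Python snippet below builds an index-creation body that sets `doc_values` on each field. It assumes the pre-2.0 mapping syntax, in which doc values are enabled per field on `not_analyzed` strings and numeric fields; the helper `doc_values_mapping`, the type name `log_entry`, and the field names are hypothetical, so check the mapping documentation for your Elasticsearch version before relying on it.

```python
import json


def doc_values_mapping(type_name, not_analyzed_strings, numeric_fields):
    """Build an index-creation body with doc values enabled on each field.

    Assumption: pre-2.0 mapping syntax, where "doc_values": true is set
    per field and string fields must be not_analyzed to use it.
    """
    properties = {}
    for name in not_analyzed_strings:
        properties[name] = {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": True,
        }
    for name, es_type in numeric_fields.items():
        properties[name] = {"type": es_type, "doc_values": True}
    return {"mappings": {type_name: {"properties": properties}}}


# Hypothetical type and field names, for illustration only.
body = doc_values_mapping(
    "log_entry",
    not_analyzed_strings=["status"],
    numeric_fields={"response_ms": "long"},
)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the request body when creating a new index, before any documents are indexed into it.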
Optimizing Your Cluster with Qbox
To help you avoid problems like runaway field data, we’ve written a number of articles on how to configure, size, scale, and optimize your hosted Elasticsearch cluster.
We hope that you find this article helpful, and we’re always ready to help you achieve maximum success with your ES environment. Stay tuned for more updates and technical bulletins.