Qbox users will often need to import an existing dataset from a primary data source or an external Elasticsearch cluster.
As you might expect, the indexing activity is much more intensive during an initial import of such data. This initial indexing places a significant load on the system. In this short article we show you how to change a couple of settings that will greatly improve the efficiency of your bulk data loads into Elasticsearch clusters.
We recommend that you make two setting changes to reduce the time and system load that result from the initial indexing:
- Set index.refresh_interval to -1 to disable refreshing during the initial data import. This setting controls how soon a newly indexed document becomes visible in search results, which is usually not a factor during a bulk data import. The default setting is “1s”; we recommend that you change it back to this value, or to another desired value, after the import is complete.
- Set index.number_of_replicas to 0 so Elasticsearch does not have to write multiple copies of each document during the indexing process. This will greatly reduce the time necessary for the initial import. In Qbox, the default value for this setting is number of nodes – 1, but you can change it to any value after the import. Once you raise the replica count, the replicas will copy data from the primary shards in the background.
You can set these values in the request body when you create an index, and you can update them on an existing index with the Elasticsearch Update Indices Settings API (a PUT request to the index's _settings endpoint).
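
As a rough illustration, the two requests could be built as in the Python sketch below. It uses only the standard library and constructs the requests without sending them. The endpoint URL, the index name my_index, the helper function names, and the replica count of 1 used for the restore step are all assumptions made for this example; substitute the values for your own cluster.

```python
import json
import urllib.request

# Assumed values for illustration only; replace with your own endpoint and index.
BASE_URL = "http://localhost:9200"
INDEX = "my_index"

# Settings recommended for the initial bulk import.
BULK_IMPORT_SETTINGS = {"refresh_interval": "-1", "number_of_replicas": 0}

# Settings to restore afterwards. "1s" is the default refresh interval;
# the replica count of 1 is an example value, not a recommendation.
NORMAL_SETTINGS = {"refresh_interval": "1s", "number_of_replicas": 1}


def _put(url, body):
    """Build (but do not send) a JSON PUT request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index_request(base_url, index, index_settings):
    """Request that creates an index with the given settings in its body."""
    return _put(f"{base_url}/{index}", {"settings": {"index": index_settings}})


def update_settings_request(base_url, index, index_settings):
    """Request that updates the settings of an existing index."""
    return _put(f"{base_url}/{index}/_settings", {"index": index_settings})


if __name__ == "__main__":
    create = create_index_request(BASE_URL, INDEX, BULK_IMPORT_SETTINGS)
    restore = update_settings_request(BASE_URL, INDEX, NORMAL_SETTINGS)
    for req in (create, restore):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually send a request: urllib.request.urlopen(create)
```

In this sketch you would send the first request before the import and the second one once the import has finished. Building the requests separately from sending them makes it easy to inspect the exact body before it reaches the cluster.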
NOTE: Neither of these setting changes is appropriate for operational indexing, since it’s necessary for the refreshing to be active and the replicas to be available for regular search requests.