Optimizing indexing speed during large bulk imports

Qbox users will often need to import an existing dataset from a primary data source or an external Elasticsearch cluster. As you might expect, the indexing activity is much more intensive during an initial import, in contrast to ongoing inserts/updates from your application. This initial indexing induces a significant load on the system.

To reduce the amount of time and system burden that results from preliminary indexing, we recommend that you make two setting changes:

  • Set index.refresh_interval to -1 to disable refreshing during the initial data import. This setting controls how soon a newly indexed document becomes visible to search, which is generally not a concern during the first bulk import. The default is "1s"; we recommend that you change it back to this value, or to another desired value, after the import is complete.
  • Set index.number_of_replicas to 0, so that Elasticsearch does not have to write multiple copies of each document during the indexing process. This will greatly reduce the time taken for the initial import. With Qbox, this setting defaults to the number of nodes minus 1, and you can change it to any desired value after the import. The replicas will then copy data from the primary shards in the background.

You can set these values in the request body when you create an index, and you can change them on an existing index with the Elasticsearch Update Indices Settings API.
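As a rough illustration of both approaches, the sketch below builds the HTTP requests with Python's standard library. The endpoint URL, the index name, and all function names are hypothetical placeholders, not part of Qbox or Elasticsearch; authentication is omitted, and the replica count to restore depends on your own cluster.

```python
import json
import urllib.request

# Hypothetical values for illustration; substitute your own endpoint and index.
ES_URL = "http://localhost:9200"
INDEX = "my_index"


def bulk_import_settings():
    """Index settings suggested for the duration of the initial import."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(number_of_replicas, refresh_interval="1s"):
    """Index settings to apply once the import has finished."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


def _json_put(url, payload):
    """Build (but do not send) a PUT request carrying a JSON body."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_create_index_request(base_url, index, settings):
    """Request that creates a new index with the given settings."""
    return _json_put(f"{base_url}/{index}", {"settings": settings})


def build_update_settings_request(base_url, index, settings):
    """Request that updates the settings of an existing index."""
    return _json_put(f"{base_url}/{index}/_settings", settings)


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical sequence would be to call send(build_update_settings_request(ES_URL, INDEX, bulk_import_settings())) before the import, run the import, and then call send(build_update_settings_request(ES_URL, INDEX, restore_settings(number_of_replicas=1))), where the replica count of 1 is only an example value.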

NOTE: Neither of these tactics is appropriate for operational indexing, because regular search requests need refreshing to be active and replicas to be available.

See this blog post and the official docs for more information:

Optimize Indexing during Bulk Data Imports

Elasticsearch Reference: Bulk API

Elasticsearch Reference: Update Indices Settings API