Elasticsearch 2.3.0 has been released, and it introduces some interesting APIs such as the reindex API, the update by query API, and the tasks API. In this blog series, we will see how these APIs are used and what extra capabilities they provide. This post focuses mainly on the reindex API and its capabilities.

Note: The reindex API was introduced in Elasticsearch 2.3.0 and will not work in earlier versions.

Setup

Elasticsearch 2.3.0

Since this series deals with the reindex, update by query, and tasks APIs, we need Elasticsearch 2.3.0, which can be downloaded from the Elastic website.

Enable Inline Scripting

The next thing we need to do is enable inline scripting in Elasticsearch, since we use inline scripts in this post. This can be done by adding the line "script.inline: true" to the elasticsearch.yml file and restarting the node.

Create a Test Index

Create a test index for this blog by running the following command in the terminal:

curl -XPUT 'localhost:9200/test-index'

Index Sample Data

Let us index some sample documents as below:

curl -XPOST 'localhost:9200/test-index/testtype/1' -d '{"gender":"Male","name":"Brian","points":21}'
curl -XPOST 'localhost:9200/test-index/testtype/2' -d '{"gender":"Male","name":"Ernest","points":32}'
curl -XPOST 'localhost:9200/test-index/testtype/3' -d '{"gender":"Female","name":"Christina","points":23}'

Reindexing

One of the issues Elasticsearch users have long faced is having to reindex their data. In most practical cases, somewhere in the data life cycle, we end up reindexing our documents in Elasticsearch. New field inclusion, mapping modifications, settings changes, or data type changes can all demand reindexing.

Until version 2.3.0, Elasticsearch had no built-in support for reindexing, and we had to depend on external tools such as Logstash or stream2es. Let us explore the reindex API.

Reindex API

The Reindex API can be applied in a variety of ways, and the most important ones are discussed below:

Simple Reindex Operation

Let us familiarize ourselves with the simplest reindex operation: plainly copying one index to another. Here we reindex the documents in the index "test-index" to another index, "test-index-new". Running the following command in the terminal does that for us:

curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": {
    "index": "test-index"
  },
  "dest": {
    "index": "test-index-new"
  }
}'

Running this command gives a response similar to the following (the counts depend on how many documents the source index holds when the command runs):

{
  "took": "1.5s",
  "timed_out": false,
  "total": 4,
  "updated": 0,
  "created": 4,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": 0,
  "failures": [
    
  ]
}

The fields in the above response are explained below:

  • "took" is the time taken by the whole reindexing operation (Elasticsearch reports it in milliseconds; the sample above shows it in a human-readable form).
  • "timed_out" indicates whether the operation timed out (true if it did).
  • "total" is the number of documents that were processed.
  • "updated" is the number of documents that were updated in the destination index.
  • "created" is the number of documents that were newly created in the destination index.
  • "batches" is the number of batches in which the entire operation was processed.
  • "version_conflicts" is the number of version conflicts that the operation hit.
  • "noops" is the number of documents that were skipped because the script marked them as a no-op.
  • "retries" is the number of re-attempts made while indexing.
  • "failures" lists the details of documents that failed to get reindexed.
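
When the reindex call is run from a script rather than typed by hand, it can help to check these fields before trusting the new index. The following is a minimal Python sketch and not part of the original walkthrough; the helper name summarize_reindex and the rule for what counts as a clean run (no timeout, no failures, no version conflicts) are assumptions made for illustration. It parses the sample response shown above.

```python
import json

# Sample response copied from the reindex call above.
SAMPLE_RESPONSE = """
{
  "took": "1.5s",
  "timed_out": false,
  "total": 4,
  "updated": 0,
  "created": 4,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": 0,
  "failures": []
}
"""


def summarize_reindex(response_text):
    """Return (ok, message) for a reindex response body.

    Hypothetical helper: a run is treated as clean when it did not time
    out, reported no failures, and hit no version conflicts.
    """
    body = json.loads(response_text)
    ok = (
        not body.get("timed_out", False)
        and not body.get("failures")
        and body.get("version_conflicts", 0) == 0
    )
    message = "created={} updated={} noops={} of total={}".format(
        body.get("created", 0),
        body.get("updated", 0),
        body.get("noops", 0),
        body.get("total", 0),
    )
    return ok, message


if __name__ == "__main__":
    print(summarize_reindex(SAMPLE_RESPONSE))
```

In practice the text passed to summarize_reindex would be the body returned by the curl command rather than a hard-coded string.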

Selective Reindex Operation

Now consider selective reindexing. Suppose we need to separate out the documents of female employees from the index "test-index" and index them to another index. Previously, we would have had to run a match query, collect the results, and then index them into the other index using the normal or bulk indexing operations. With the reindex API, we just pass the query along with the source and destination indices, as in the example below:

curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": {
    "index": "test-index",
    "query": {
      "match": {
        "gender": "female"
      }
    }
  },
  "dest": {
    "index": "test-index-new",
    "type": "female"
  }
}'

In the above example, you can see that not only the destination index but also the type can be provided.
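
The same request body can also be assembled programmatically, which is handy when the query or the destination type changes from run to run. Below is a rough Python sketch using only the standard library; the function names build_reindex_body and post_reindex are hypothetical, and the host is assumed to be the same localhost:9200 node used in the curl examples.

```python
import json
import urllib.request


def build_reindex_body(source_index, dest_index, query=None, dest_type=None):
    """Build the JSON body for POST /_reindex as a Python dict."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query      # only matching documents are copied
    dest = {"index": dest_index}
    if dest_type is not None:
        dest["type"] = dest_type     # optional destination type
    return {"source": source, "dest": dest}


def post_reindex(body, host="http://localhost:9200"):
    """Send the body to the _reindex endpoint and return the parsed response.

    Assumes a node is listening on `host`, as in the curl examples.
    """
    request = urllib.request.Request(
        host + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Same request as the selective reindex shown above.
body = build_reindex_body(
    "test-index",
    "test-index-new",
    query={"match": {"gender": "female"}},
    dest_type="female",
)
print(json.dumps(body, indent=2))
```

Calling post_reindex(body) against a running node should return the same kind of response as the curl command; it is left uncalled here so that the sketch runs without a cluster.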

Using Scripts with the Reindexing API

The reindex API also provides an option for cases where we need to modify a field's data while reindexing.

Suppose we want to increment the points of all female employees by 1. We need to query for female employees and use a script for incrementing the points field by 1 and then index the document. This can be achieved by the following query:

curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": {
    "index": "test-index",
    "query": {
      "match": {
        "gender": "female"
      }
    }
  },
  "dest": {
    "index": "test-index-new"
  },
  "script": {
    "inline": "ctx._source.points++"
  }
}'
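
To make the expected outcome concrete, here is a small local simulation in Python of what this request does to the three sample documents. It is only an illustration of the effect, not Elasticsearch code: the function simulate_reindex is hypothetical, and the case-insensitive comparison stands in for the default analysis of the "gender" field, which lowercases both the indexed value and the query text.

```python
# Sample documents indexed earlier in this post.
DOCS = [
    {"gender": "Male", "name": "Brian", "points": 21},
    {"gender": "Male", "name": "Ernest", "points": 32},
    {"gender": "Female", "name": "Christina", "points": 23},
]


def simulate_reindex(docs, gender):
    """Mimic the request locally: filter by gender, then add 1 to points."""
    result = []
    for doc in docs:
        # The match query behaves case-insensitively here because the
        # default analyzer lowercases the query text and the indexed value.
        if doc["gender"].lower() == gender.lower():
            copy = dict(doc)       # the source document is left untouched
            copy["points"] += 1    # same effect as ctx._source.points++
            result.append(copy)
    return result


print(simulate_reindex(DOCS, "female"))
```

Only Christina's document ends up in the destination, with 24 points, while the copy in "test-index" keeps its original value.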

Reindexing for Mapping Changes

In our sample data, every "name" field holds a single term. Suppose we index a fourth document with the "name" field set to "Daisy Moon" and then run a terms aggregation on the "name" field. There are only four names in four documents, but the aggregation will show five buckets. This is because, by default, the "name" field is analyzed, so "Daisy Moon" is split into the two terms "daisy" and "moon". To fix this, we need to change the mapping for the field and reindex.
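
The bucket count can be reproduced offline with a rough Python sketch. The analyzed_terms function below is only a simplified stand-in for the default analyzer (lowercase, then split into words), and both function names are hypothetical; the point is to show why four names produce five buckets when the field is analyzed.

```python
import re
from collections import Counter

# The three sample names plus the fourth document described above.
NAMES = ["Brian", "Ernest", "Christina", "Daisy Moon"]


def analyzed_terms(value):
    """Rough stand-in for the default analyzer: lowercase, split into words."""
    return re.findall(r"\w+", value.lower())


def terms_buckets(values, analyzed=True):
    """Count documents per term, the way a terms aggregation buckets them."""
    counts = Counter()
    for value in values:
        terms = analyzed_terms(value) if analyzed else [value]
        counts.update(set(terms))   # each document counts once per term
    return dict(counts)


print(len(terms_buckets(NAMES, analyzed=True)))    # analyzed field
print(len(terms_buckets(NAMES, analyzed=False)))   # not_analyzed field
```

With analysis the buckets are "brian", "ernest", "christina", "daisy" and "moon"; without it, each full name is kept as one bucket.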

Cases similar to this occur frequently in practice, and the reindex API can be a great help. First, we create another index with the required mapping. Here we create an index called "test-index-map" with the required mapping applied to its type "testtype", as shown below:

curl -XPUT 'localhost:9200/test-index-map' -d '{
  "mappings": {
    "testtype": {
      "properties": {
        "points": {
          "type": "long"
        },
        "gender": {
          "type": "string"
        },
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'

You can see that we have applied the "not_analyzed" setting to the field "name". Now we reindex our data from "test-index" into the type "testtype" of "test-index-map", as below:

curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": {
    "index": "test-index"
  },
  "dest": {
    "index": "test-index-map",
    "type": "testtype"
  }
}'

Now, if we aggregate on the "name" field in "test-index-map", we can see that there are only four names as buckets, with "Daisy Moon" as a single bucket.
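
The post does not show the aggregation request itself, so here is one possible shape for it, sketched in Python. The aggregation name "names" and the function name names_aggregation_body are assumptions for illustration; the body is a plain terms aggregation on the "name" field with the hits suppressed.

```python
import json


def names_aggregation_body(field="name", agg_name="names"):
    """Build a search body that returns only a terms aggregation on `field`."""
    return {
        "size": 0,  # skip the hits and return only the aggregation
        "aggs": {agg_name: {"terms": {"field": field}}},
    }


print(json.dumps(names_aggregation_body()))
```

The printed JSON can be passed as the -d body of a curl -XPOST request to 'localhost:9200/test-index-map/_search' to see the four buckets.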

Conclusion

In this post we have seen how the reindex API, which was introduced in Elasticsearch 2.3.0, functions, as well as scenarios in which it can be used. In the next post of this series, we will see the "update_by_query" API in detail.