In the previous post, we learned the capabilities and scenarios in which the reindex API is used.

This article introduces the “update_by_query” API, which was added in Elasticsearch 2.3.0. We will look at why the API was introduced, how it works, and the scenarios in which it is useful, along with examples.

Note: This API was released in Elasticsearch 2.3.0. Update_by_query will not work in previous versions.

Setup

We can use the same setup with our sample index and documents from the reindex API article.

Updating by Query

Elasticsearch is built on top of Lucene and uses its segment-based architecture. In this architecture, each document is stored in a segment, and many such segments make up an index. The advantage of this structure is that a segment's files never need to be modified after the segment is created. In other words, segments are immutable.

What happens when we need to delete a document in this structure? Lucene marks the document as deleted without actually removing it. Marked documents are filtered out of the results, so the user never sees them. This has the same effect as a deletion, except that the document still takes up space. When Lucene merges segments, which it does to optimize the index, the documents that were marked as deleted are physically removed.

In the case of updating a document, since the segments are immutable, the old document is marked as deleted in the segment in which it exists. The updated document is then indexed into a new segment.

Sometimes we need to update a large number of documents that match specific conditions. Before update_by_query, this was basically a three-step process: run the query, collect the results, and then update the matching documents one after another or with the bulk API. With the new update_by_query API, we can update documents in bulk much more quickly, because the query and the code describing the change are sent together as a single request. Elasticsearch updates the documents as it processes this request, which removes the overhead of collecting the results and updating them separately.

The two ways of updating documents by query, in versions before 2.3.0 and from 2.3.0 on, are shown below:

[Figure: Updating Elasticsearch documents in versions prior to 2.3.0]

[Figure: Updating Elasticsearch documents in versions 2.3.0 and on, using update_by_query]
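
As a rough sketch of the client-side workflow used before 2.3.0, the shell function below builds a bulk request body with one scripted update per document ID. The function name build_bulk_body, the example IDs, and the "test" type in the usage comment are assumptions made for illustration. The IDs would come from a search you ran beforehand.

```shell
# Sketch of the pre-2.3.0 approach: after a search has returned the matching
# document IDs, build one scripted update action per ID for the bulk API.
# build_bulk_body and the IDs passed to it are hypothetical.
build_bulk_body() {
  for id in "$@"; do
    # Action line: which document to update.
    printf '{"update":{"_id":"%s"}}\n' "$id"
    # Source line: the script to run against that document.
    printf '{"script":{"inline":"ctx._source.points++"}}\n'
  done
}

# Usage (needs a running cluster; the "test" type is an assumption):
#   build_bulk_body 1 2 3 > bulk-body.json
#   curl -XPOST 'localhost:9200/test-index/test/_bulk' --data-binary @bulk-body.json
```

Even with the bulk API, the client still has to run the search, extract the IDs, and send a second request. update_by_query folds these steps into a single call.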

Update_by_query API

Simple Update_by_query Operation

In its most basic form, with no query or script, update_by_query reindexes every document in the target index in place, which increments the version number of each document.

curl -XPOST 'localhost:9200/test-index/_update_by_query?pretty'

Updating Fields

Let us see how the update_by_query API works with a query and an update script. Suppose we need to add one point for the employee named "Ernest". The term query below uses the lowercase "ernest" because, assuming the default analyzer, the values of the "name" field are indexed in lowercase. Using the update_by_query API, we can run the following command in the terminal:

curl -XPOST 'localhost:9200/test-index/_update_by_query' -d '{
  "script": {
    "inline": "ctx._source.points++"
  },
  "query": {
    "term": {
      "name": "ernest"
    }
  }
}'
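
If the increment should not be hard-coded, the amount can be passed as a script parameter. The following is a minimal sketch rather than a definitive recipe: the helper name points_update_body and the "bonus" parameter are made up for this example, and it assumes inline scripting is enabled on the cluster, which the request above also relies on.

```shell
# Print an update_by_query body that adds a given amount to "points" for
# documents whose "name" field contains the given term.
# points_update_body and the "bonus" parameter name are hypothetical.
points_update_body() {
  name="$1"
  bonus="$2"
  printf '{"script":{"inline":"ctx._source.points += bonus","params":{"bonus":%d}},"query":{"term":{"name":"%s"}}}\n' "$bonus" "$name"
}

# Usage (needs a running cluster):
#   curl -XPOST 'localhost:9200/test-index/_update_by_query' -d "$(points_update_body ernest 5)"
```

Keeping the value in "params" means the script text stays the same from one call to the next, and only the parameter changes.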

Updating Mapping

The "update_by_query" API is also useful for picking up mapping changes. Normally, if we change the mapping for an existing field in an index, for example by adding a multi-field, the change only takes effect for a document once that document is created or updated. This is explained in the following example.

Create a sample index named "test-index-mapping" and load the sample data given in the "Setup" section. Then add another document, as shown below.

curl -XPOST 'localhost:9200/test-index-mapping/test/4' -d '{"gender":"Female","name":"Daisy Moon","age":32}'

In the above document, the "name" field contains both the first name and the last name. Suppose we decide to add a multi-field to "name" so that the full, unanalyzed value is also indexed. To do that, we update the mapping for the "name" field by adding a "not_analyzed" sub-field called "raw". This can be done from the command line with the following:

curl -XPOST 'localhost:9200/test-index-mapping/_mapping/test' -d '{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}' 

After applying the mapping change, run a terms aggregation on the field “name.raw” and look at the results. Here is the aggregation we run on the index:

curl -XPOST 'localhost:9200/test-index-mapping/test/_search' -d '{
  "size": 0,
  "aggs": {
    "aggs-demo": {
      "terms": {
        "field": "name.raw"
      }
    }
  }
}'

The above aggregation returns zero results, because the "raw" field is not populated for existing documents when we update the mapping. As said earlier, the new mapping is applied to a document only when it is created or updated. Instead of updating every document by hand, we can use the "update_by_query" API to reindex them all in place. In the request below, conflicts=proceed tells Elasticsearch to keep going if it hits a version conflict instead of aborting, and refresh makes the changes visible to search as soon as the request finishes:

curl -XPOST 'localhost:9200/test-index-mapping/_update_by_query?pretty&conflicts=proceed&refresh'

Now run the "aggs-demo" aggregation we tried earlier once more. This time the aggregation returns results.

Conclusion

In this post we have seen the operations using the “update_by_query” API. In the next post of this series, we will see how to check the status of an update or reindexing operation using the "tasks" API, and also ways to cancel these operations using the "cancel" API.