An Elasticsearch cluster may consist of a single node with a single index. Or it may have a hundred data nodes, three dedicated masters, a few dozen client nodes—all operating on a thousand indices (and tens of thousands of shards). No matter the scale of the cluster, you’ll want a quick way to assess the status of your cluster. The Cluster Health API fills that role. It can reassure you that everything is alright, or alert you to a problem somewhere in your cluster.

Our Goal

The goal of this tutorial is to highlight the power of the Explain API by going through some examples of how to use it to diagnose shard allocation problems on Qbox-provisioned Elasticsearch.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Let’s execute a Cluster Health API request and see what the response looks like:

curl -XGET 'http://ES_HOST:ES_PORT/_cluster/health?pretty'

{
  "cluster_name" : "escluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 16,
  "active_primary_shards" : 2558,
  "active_shards" : 5628,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 22
}

The most important piece of information in the response is the status field. The status may be one of three values:

  • green All primary and replica shards are allocated. The cluster is 100% operational.

  • yellow All primary shards are allocated, but at least one replica is missing. No data is missing, so search results will still be complete. However, your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation.

  • red At least one primary shard (and all of its replicas) is missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception.

A red status is a sign that the cluster is missing some primary shards. As a consequence, queries against that data will fail, and indexing into the affected shards will return exceptions. Once all the primary shards are back, the cluster switches to yellow to warn you that it is still recovering replicas, but your data is present.

The green/yellow/red status is a great way to glance at your cluster and understand what’s going on. The rest of the metrics give you a general summary of your cluster: 

  • number_of_nodes and number_of_data_nodes are fairly self-descriptive.

  • active_primary_shards indicates the number of primary shards in your cluster. This is an aggregate total across all indices.

  • active_shards is an aggregate total of all shards across all indices, which includes replica shards.

  • relocating_shards shows the number of shards that are currently moving from one node to another node. 

  • initializing_shards is a count of shards that are being freshly created. For example, when you first create an index, the shards will all briefly reside in initializing state.

  • unassigned_shards are shards that exist in the cluster state but cannot be found in the cluster itself. A common source of unassigned shards is unassigned replicas.
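The fields above are easy to check programmatically. Below is a minimal Python sketch (not part of any Elasticsearch client; the sample JSON is the response shown earlier, abridged) that turns a health response into a one-line summary:

```python
import json

# Sample Cluster Health API response (taken from the output above).
health_json = '''
{
  "cluster_name": "escluster",
  "status": "red",
  "number_of_nodes": 20,
  "number_of_data_nodes": 16,
  "active_primary_shards": 2558,
  "active_shards": 5628,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 22
}
'''

def summarize_health(raw):
    """Return a one-line, human-readable summary of a health response."""
    health = json.loads(raw)
    meanings = {
        "green": "all primary and replica shards allocated",
        "yellow": "all primaries allocated, some replicas missing",
        "red": "at least one primary shard is missing",
    }
    return "cluster %s is %s (%s); %d shard(s) unassigned" % (
        health["cluster_name"],
        health["status"],
        meanings[health["status"]],
        health["unassigned_shards"],
    )

print(summarize_health(health_json))
```

A check like this is a natural fit for a monitoring script that alerts whenever the status leaves green.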

In earlier versions of Elasticsearch, figuring out why shards were not being allocated required the analytical skills of a bomb-disposal expert.  You’d look through the cluster state API, the cat-shards API, the cat-allocation API, the cat-indices API, the indices-recovery API, the indices-shard-stores API and wonder what it all means.

The cluster allocation explain API (henceforth referred to as the explain API) was introduced as an experimental API in v5.0 and reworked into its current form in v5.2.  The explain API was designed to answer two fundamental questions:

  • For Unassigned shards: “Why are my shards unassigned?”

  • For Assigned shards: “Why are my shards assigned to this particular node?”

The purpose of the cluster allocation explain API is to provide explanations for shard allocations in the cluster. For unassigned shards, the explain API provides an explanation for why the shard is unassigned. For assigned shards, the explain API provides an explanation for why the shard is remaining on its current node and has not moved or rebalanced to another node. This API can be very useful when attempting to diagnose why a shard is unassigned or why a shard continues to remain on its current node when you might expect otherwise.

Explain API

To explain the allocation of a shard, first an index should exist:

curl -XPUT 'ES_HOST:ES_PORT/myindex'

And then the allocation for shards of that index can be explained:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{
 "index": "myindex",
 "shard": 0,
 "primary": true
}'

Specify the index and shard id of the shard you would like an explanation for, as well as the primary flag to indicate whether to explain the primary shard for the given shard id or one of its replica shards. These three request parameters are required.

We may also specify an optional current_node request parameter to only explain a shard that is currently located on the given node. The current_node can be specified as either the node id or the node name:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{
 "index": "myindex",
 "shard": 0,
 "primary": false,
 "current_node": "nodeA"                         
}'

We can also have Elasticsearch explain the allocation of the first unassigned shard that it finds by sending an empty body for the request:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain?pretty'
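All three request shapes share the same body format, so a small helper can make the required/optional parameter rules explicit. This is an illustrative Python sketch, not an official client API; the function name is made up:

```python
def explain_request_body(index=None, shard=None, primary=None, current_node=None):
    """Build the JSON body for _cluster/allocation/explain.

    An empty body asks Elasticsearch to explain the first unassigned
    shard it finds.  Otherwise index, shard, and primary are all
    required, and current_node is optional.
    """
    if index is None and shard is None and primary is None:
        return {}  # explain the first unassigned shard found
    if index is None or shard is None or primary is None:
        raise ValueError("index, shard, and primary must be given together")
    body = {"index": index, "shard": shard, "primary": primary}
    if current_node is not None:
        body["current_node"] = current_node
    return body

# Matches the first curl example above:
print(explain_request_body("myindex", 0, True))
```

Note the `is None` checks: `shard=0` and `primary=False` are falsy but perfectly valid values, so a naive truthiness test would reject them.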

Allocators and Deciders

Allocating shards and assigning them to the best node possible is of fundamental importance within Elasticsearch. The shard allocation process differs for newly created indices and existing indices.  In both cases, Elasticsearch has two main components at work: allocators and deciders.  Allocators try to find the best nodes to hold the shard, and deciders determine whether allocating to a particular node is allowed.

  • New Indices - The allocator looks for the nodes with the fewest shards and returns a list of nodes sorted by shard weight in ascending order, so that the index’s shards are assigned in the way that best balances the cluster. The allocator only considers the number of shards per node, not their sizes: a node’s weight increases with its shard count regardless of how large each shard is. The deciders then take each node in order and decide whether the shard may be allocated to it, based on filter allocation rules, disk occupancy thresholds, and so on.
  • Existing Indices - For a primary shard, the allocator will only allow allocation to a node that already holds a known good copy of the shard.  If the allocator did not take such a step, then allocating a primary shard to a node that does not already have an up-to-date copy of the shard will result in data loss.  In the case of replica shards, the allocator first looks to see if there are already copies of the shard (even stale copies) on other nodes.  If so, the allocator will prioritise assigning the shard to one of the nodes holding a copy, because the replica needs to get in-sync with the primary once it is allocated, and the fact that a node already has some of the shard data means (hopefully) a lot less data has to be copied over to the replica from the primary.  This can speed up the recovery process for the replica shard significantly.
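The two-stage allocator/decider flow described above can be sketched in a few lines of Python. This is a toy model for intuition only; the real allocators and deciders are far more involved:

```python
def filter_decider(excluded):
    """Mirrors index.routing.allocation.exclude._name: node names in
    `excluded` may not hold the shard."""
    return lambda node, shard: node["name"] not in excluded

def same_shard_decider(node, shard):
    """Mirrors the same_shard decider: two copies of one shard never
    share a node."""
    return shard not in node["shards"]

def pick_node(nodes, shard, deciders):
    """Toy allocator: visit candidate nodes in ascending order of shard
    count (the 'weight'), and return the first one every decider allows."""
    for node in sorted(nodes, key=lambda n: len(n["shards"])):
        if all(decider(node, shard) for decider in deciders):
            return node["name"]
    return None  # no node qualified: the shard stays unassigned

# Two-node cluster: A holds the primary of test_index shard 0; B is empty
# but excluded by a filter rule -- so a replica has nowhere to go.
nodes = [
    {"name": "A", "shards": [("test_index", 0)]},
    {"name": "B", "shards": []},
]
deciders = [filter_decider({"B"}), same_shard_decider]
print(pick_node(nodes, ("test_index", 0), deciders))  # prints: None
```

Drop the filter decider from the list and the toy allocator picks `B`, the emptier node, exactly as the weight-based ordering predicts.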

Cluster State (Red) Use Cases

Unassigned Primary Shards 

If the unassigned primary is on a newly created index, no documents can be written to that index.  If the unassigned primary is on an existing index, then not only can the index not be written to, but all data previously indexed is unavailable for searching. 

Let’s start by creating a new index named `test_index`, with 1 shard and 0 replicas per shard, in a two-node cluster (with node names `A` and `B`), but assigning filter allocation rules at index creation time so that shards for the index cannot be assigned to nodes `A` and `B`.  The following curl command accomplishes this: 

curl -XPUT 'ES_HOST:ES_PORT/test_index?wait_for_active_shards=0' -d '{
   "settings":
   {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.routing.allocation.exclude._name": "A,B"
   }
}'

The filter allocation rules will prevent allocation of the newly created index test_index to the only two nodes in the cluster.  The cluster thus turns RED.  We can get an explanation for the first unassigned shard found in the cluster (like in this case) by invoking the explain API with an empty request body:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain'

Which produces the following output:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned", 
  "unassigned_info" : {
    "reason" : "INDEX_CREATED", 
    "at" : "2017-06-05T14:12:39.401Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",   
  "node_allocation_decisions" : [ 
    {
      "node_id" : "ksdnkn3kskSDSJKAsskskSksd",
      "node_name" : "A", 
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "filter",  
          "decision" : "NO", 
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]" 
        }
      ]
    },
    {
      "node_id" : "dhHJHDffDA4fssfsKDJBJS",
      "node_name" : "B", 
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]"
        }
      ]
    }
  ]
}

The shard cannot be allocated because none of the nodes permit allocation.  Drilling down to each node’s decision (see "node_allocation_decisions"), we observe that node A received a NO decision from the filter decider, because the filter allocation settings exclude nodes `A` and `B` from holding a copy of the shard (see "explanation" inside the "deciders" section).  The explanation also names the exact setting to change to allow the shard to be allocated in the cluster.

Updating the filter allocation settings via: 

curl -XPUT 'ES_HOST:ES_PORT/test_index/_settings' -d '{
   "index.routing.allocation.exclude._name": null
}'

Re-running the explain API now returns an error saying it is unable to find any unassigned shards to explain. This is because the only shard for `test_index` has been assigned.

Running the explain API on the primary shard:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain' -d '{ 
   "index": "test_index", 
   "shard": 0, 
   "primary": true
}'

The response allows us to see which node the shard was assigned to:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "ksdnkn3kskSDSJKAsskskSksd",
    "name" : "A",
    "transport_address" : "127.0.0.1:9300",
    "weight_ranking" : 1
  },
  …
}

We can see that the shard is now in the allocated state and assigned to node `A`. Now, let’s index some data into `test_index`.  Next, we will stop node A so that the primary shard is no longer in the cluster. Since there are no replica copies at the moment, the shard will remain unassigned and the cluster health will be RED. Rerunning the above explain API command on primary shard 0 will return:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {             
    "reason" : "NODE_LEFT",    
    "at" : "2017-06-05T15:24:21.157Z",
    "details" : "node_left[wessk7BJBJ5sernknSEkas]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy", 
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster" 
}

The output tells us that the primary shard is currently unassigned because the node holding the primary left the cluster. The reason is that there is no longer any valid copy for shard 0 of `test_index` in the cluster (see "can_allocate") along with the explanation of why the shard cannot be allocated (see "allocate_explanation").  

We know we have lost all of the nodes that held a valid shard copy when the explain API tells us that no valid shard copy remains for our primary shard. At this point, the only recourse is to wait for those nodes to come back to life and rejoin the cluster. In the unlikely event that every node holding a copy of this shard is permanently gone, the only option is to use the reroute commands to allocate an empty or stale primary shard and accept the fact that data has been lost.
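If you do decide to accept the loss, the reroute API's `allocate_empty_primary` command is the tool for the job. As a sketch (in Python rather than curl, so the safety flag stands out), here is a helper that builds the `POST _cluster/reroute` request body; the node name is illustrative:

```python
import json

def allocate_empty_primary(index, shard, node):
    """Build the body for POST _cluster/reroute that forces an empty
    primary for the given shard onto `node`.  accept_data_loss must be
    set explicitly -- any data previously held by the shard is gone."""
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }

print(json.dumps(allocate_empty_primary("test_index", 0, "B"), indent=2))
```

The explicit `accept_data_loss: true` requirement is deliberate: Elasticsearch refuses the command without it, so the destructive step can never happen by accident.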

Unassigned Replica Shards

Let’s take our existing `test_index` and increase the number of replicas to 1:

curl -XPUT 'ES_HOST:ES_PORT/test_index/_settings' -d '{
   "number_of_replicas": 1
}'

This will give us a total of 2 shards for `test_index` - the primary for shard 0 and the replica for shard 0.  Since node `A` already holds the primary, the replica should be allocated to node `B`, to form a balanced cluster.  Running the explain API on the replica shard confirms this:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain' -d '{
   "index": "test_index",
   "shard": 0,
   "primary": false
}'

The response to the above request is:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "started",
  "current_node" : {
    "id" : "dhHJHDffDA4fssfsKDJBJS",
    "name" : "B",
    "transport_address" : "127.0.0.1:9301",
    "weight_ranking" : 1
  },
  …
}

The output shows that the shard is in the started state and assigned to node `B`.

Next, we will again set the filter allocation settings on the index, but this time, we will only prevent shard allocation to node `B`:

curl -XPUT 'ES_HOST:ES_PORT/test_index/_settings' -d '{
   "index.routing.allocation.exclude._name": "B"
}'

Now, restart node `B`. Once it rejoins the cluster, re-running the explain API command for the replica shard returns:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-06-05T15:35:34.478Z",
    "details" : "node_left[dhHJHDffDA4fssfsKDJBJS]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no", 
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "dhHJHDffDA4fssfsKDJBJS",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",  
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"B\"]" 
        }
      ]
    },
    {
      "node_id" : "ksdnkn3kskSDSJKAsskskSksd",
      "node_name" : "A",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",  
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_index][0], node[ksdnkn3kskSDSJKAsskskSksd], [P], s[STARTED], a[id=JNODiTgYTrSp8N2s0Q7MrQ]]" 
        }
      ]
    }
  ]
}

We learn from this output that the replica shard cannot be allocated to node `B` because the filter allocation rules prevent it.  And since node `A` already holds the primary shard, another copy of the shard cannot be assigned to it (see "explanation" for the same_shard decider). Elasticsearch avoids this because there is no point in having two copies of the same data live on the same node.

Assigned Shards

If a shard is assigned, why might you care about its allocation explanation?  One common reason would be that a shard (primary or replica) of an index is allocated to a node, and you just set up filtering rules to move the shard from its current node to another node but for some reason, the shard remains on its current node.  This is another situation where the explain API can shed light on the shard allocation process.

Let’s again clear the filter allocation rules so that both primary and replica in our `test_index` are assigned:

curl -XPUT 'ES_HOST:ES_PORT/test_index/_settings' -d '{
    "index.routing.allocation.exclude._name": null
}'

Now, let’s set the filter allocation rules so that the primary shard cannot remain on its current node (in my case, node `A`): 

curl -XPUT 'ES_HOST:ES_PORT/test_index/_settings' -d '{
    "index.routing.allocation.exclude._name": "A"
}'

One might expect at this point that the filter allocation rules will cause the primary to move away from its current node to another node, but in fact it does not.  We can run the explain API on the primary shard to see why:

curl -XGET 'ES_HOST:ES_PORT/_cluster/allocation/explain' -d '{
    "index": "test_index",
    "shard": 0,
    "primary": true
}'

The response is:

{
  "index" : "test_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "ksdnkn3kskSDSJKAsskskSksd",
    "name" : "A",  
    "transport_address" : "127.0.0.1:9300"
  },
  "can_remain_on_current_node" : "no", 
  "can_remain_decisions" : [   
    {
      "decider" : "filter",
      "decision" : "NO",
      "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A\"]"   
    }
  ],
  "can_move_to_other_node" : "no", 
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  "node_allocation_decisions" : [
    {
      "node_id" : "dhHJHDffDA4fssfsKDJBJS",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard", 
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_index][0], node[dhHJHDffDA4fssfsKDJBJS], [R], s[STARTED], a[id=jhdjKDSDJKJDSIEI3-DSNK]]" 
        }
      ]
    }
  ]
}

We can see that the primary shard is still assigned to node `A`.  The cluster correctly acknowledges that the shard can no longer remain on its current node, with the reason given that the current node matches the filter allocation exclude rules.  Despite not being allowed to remain on its current node, the explain API tells us that the shard cannot be allocated to a different node because, for the only other node in the cluster (node `B`), that node already contains an active shard copy, and the same shard copy cannot be allocated to the same node more than once (see the explanation for the same shard decider in "node_allocation_decisions").
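A response like this can be reduced to a checklist mechanically. The following Python sketch (illustrative only; the sample dict is a trimmed version of the response above) walks the `can_remain_decisions` and `node_allocation_decisions` sections and collects every NO decision:

```python
# Trimmed version of the explain API response shown above.
explain = {
    "can_remain_on_current_node": "no",
    "can_remain_decisions": [
        {"decider": "filter", "decision": "NO",
         "explanation": "node matches index setting "
                        "[index.routing.allocation.exclude.] filters [_name:\"A\"]"}
    ],
    "can_move_to_other_node": "no",
    "node_allocation_decisions": [
        {"node_name": "B",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "a copy of the shard already exists on this node"}]}
    ],
}

def why_stuck(explain):
    """List the reasons an assigned shard can neither stay nor move."""
    reasons = []
    if explain.get("can_remain_on_current_node") == "no":
        for d in explain.get("can_remain_decisions", []):
            reasons.append("cannot remain (%s): %s" % (d["decider"], d["explanation"]))
    if explain.get("can_move_to_other_node") == "no":
        for node in explain.get("node_allocation_decisions", []):
            for d in node["deciders"]:
                if d["decision"] == "NO":
                    reasons.append("cannot move to %s (%s): %s"
                                   % (node["node_name"], d["decider"], d["explanation"]))
    return reasons

for reason in why_stuck(explain):
    print(reason)
```

In this case the checklist has two entries, which together explain the deadlock: the filter rule evicts the shard from `A`, and the same_shard decider bars it from `B`.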

Conclusion

The cluster allocation explain API is designed to answer the question "why is this shard unassigned?". It is very useful in helping an administrator understand the shard allocation process in an Elasticsearch cluster. The explain API covers many use cases, including showing the node weights to explain why an assigned shard remains on its current node instead of rebalancing to another node. The explain API is a great tool for troubleshooting issues with one’s cluster. It has already produced big benefits and time savings, both during internal development and when diagnosing cluster state inconsistencies.

Give Qbox a Try

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
