For Qbox clusters that have multiple nodes, requests are rerouted to other nodes when one becomes unavailable. For example, if node one goes down at the process level, its requests will route to either node two or node three.
Node failures or timeouts can result from crashes of the Elasticsearch process or of the nodes themselves. This type of failure is less common and is usually the result of resource strain, such as overloading a cluster that’s too small for the request/data volume. We recommend staying within a 1-5x ratio of data to RAM. If request response time is important, you should be at a 1:1 ratio, depending on the complexity of your queries.
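As a rough illustration of the sizing guideline above, the following sketch checks a cluster's data-to-RAM ratio against the 5x ceiling, or against 1:1 when response time matters. The function names and the gigabyte figures are hypothetical, and the check only restates the rule of thumb; it does not account for query complexity.

```python
def data_to_ram_ratio(data_gb: float, ram_gb: float) -> float:
    """Return how many GB of indexed data there are per GB of cluster RAM."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return data_gb / ram_gb


def within_guideline(data_gb: float, ram_gb: float, latency_sensitive: bool = False) -> bool:
    """Check the rule of thumb: at most 5x data to RAM, or 1x if latency matters."""
    limit = 1.0 if latency_sensitive else 5.0
    return data_to_ram_ratio(data_gb, ram_gb) <= limit


if __name__ == "__main__":
    # Hypothetical cluster: 40 GB of data on 8 GB of total RAM (a 5x ratio).
    print(within_guideline(40, 8))
    print(within_guideline(40, 8, latency_sensitive=True))
```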
If your cluster isn’t fully replicated, you will experience request failures in the event of a node outage. You will also see failures if the request volume is too high for your cluster’s thread pools. For some tasks, Elasticsearch has built-in tools to configure retrying requests on failure. For the _update API, a value can be provided for retry_on_conflict to make updates more robust: https://www.elastic.co/guide/e…
In addition, some Elasticsearch language clients have their own retry_count options, which relay any request failures to other available nodes. In all node-loss scenarios, our automated systems will immediately begin recovery procedures and send alerts to our support engineers and to the user.
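The two retry mechanisms described above can be sketched with the Python standard library alone. This is a minimal illustration, not the behavior of any particular Elasticsearch client: build_update_request and send_with_failover are hypothetical helper names, the node URLs are placeholders, and the path assumes the typeless /<index>/_update/<id> form used by recent Elasticsearch versions (older versions include a type segment). The only Elasticsearch-specific pieces are the retry_on_conflict query parameter and the {"doc": ...} body of a partial update.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_update_request(node_url, index, doc_id, partial_doc, retry_on_conflict=3):
    """Build (but do not send) a partial-update request.

    retry_on_conflict tells Elasticsearch how many times to retry the
    update internally if it hits a version conflict.
    """
    path = f"/{quote(index, safe='')}/_update/{quote(str(doc_id), safe='')}"
    query = urlencode({"retry_on_conflict": retry_on_conflict})
    body = json.dumps({"doc": partial_doc}).encode("utf-8")
    return Request(
        url=f"{node_url.rstrip('/')}{path}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_with_failover(node_urls, send):
    """Try each node in order and return the first successful result.

    `send` is any callable that takes a node URL and raises OSError
    (which covers urllib's URLError and socket timeouts) on failure.
    """
    errors = []
    for node_url in node_urls:
        try:
            return send(node_url)
        except OSError as exc:
            errors.append((node_url, exc))
    raise ConnectionError(f"all {len(node_urls)} nodes failed: {errors}")
```

A caller would pass a `send` function that builds the request for the given node and submits it, for example with urllib.request.urlopen and a timeout. Server-side retry_on_conflict and client-side failover address different failures: the first handles concurrent writes to the same document, the second handles an unreachable node.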
Thread pool capacity is determined by the number of CPU cores; you can read about Elasticsearch’s various thread pools here: https://www.elastic.co/guide/e…
NOTE: For production clusters, two replicas (that is, two replica shards per index) are necessary for a cluster to have effective failover. Select 2 replicas on the cluster create page as shown below:
Two nodes and one replica for each index is still better than one node and zero replicas, but a two-node cluster with one replica will only benefit from data redundancy, not increased uptime. In a Qbox cluster, a majority (or quorum) of nodes must be available and responsive to continue serving requests. Since a majority of two is two, three nodes is the minimum necessary to prevent downtime during a node failure.
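The quorum arithmetic above can be checked with a short sketch. It assumes the usual definition of a majority, floor(n / 2) + 1; the function names are illustrative only and are not part of any Qbox or Elasticsearch API.

```python
def quorum(node_count: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_count // 2 + 1


def tolerable_failures(node_count: int) -> int:
    """How many nodes can be lost while a majority is still available."""
    return node_count - quorum(node_count)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s): quorum={quorum(n)}, can lose {tolerable_failures(n)}")
```

With two nodes the quorum is two, so losing either node stops the cluster from serving requests; with three nodes the quorum is still two, so one node can fail.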