Last week was not a good week for the public cloud infrastructure business or for the companies that depend on it.
To some degree, the struggle continues.
First came the so-called Shellshock Bash bug (CVE-2014-6271), a vulnerability in the Bash shell found on just about every server running Unix and its various flavors (including Linux, which is the OS on every Qbox node). It required a system-wide patch, not only at the VM level but also at the host machine level.
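For readers who want to verify their own machines, the widely circulated probe for this bug can be wrapped in a short script. This is a sketch for checking the local `bash` binary only; any fleet-wide or remote-execution plumbing is left out, and the variable name `testvar` is arbitrary.

```python
import os
import subprocess

# Shellshock (CVE-2014-6271) probe: a patched bash ignores a function
# definition smuggled in through an ordinary environment variable, while
# a vulnerable bash executes the trailing command as it starts up.
env = dict(os.environ, testvar="() { :;}; echo VULNERABLE")
result = subprocess.run(
    ["bash", "-c", "echo probe complete"],
    env=env, capture_output=True, text=True,
)

if "VULNERABLE" in result.stdout:
    print("bash is vulnerable to CVE-2014-6271; patch immediately")
else:
    print("bash appears patched against CVE-2014-6271")
```

On a patched system, only "probe complete" reaches stdout from the child shell; on an unpatched 2014-era bash, the injected `echo VULNERABLE` runs first.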
Then, later in the week, an undisclosed vulnerability was discovered in the Xen hypervisor, the virtualization engine underpinning much of the public cloud, including most VMs at AWS and all Rackspace VMs. The Xen Foundation has not made details of this vulnerability public, and all partners who use Xen are bound by non-disclosure agreements, so both companies were secretive about the nature of the vulnerability and announced, quite suddenly, the necessity of rebooting entire datacenters.
Although both Rackspace and the Xen Foundation deny that the events are connected, the vulnerability must have been serious, as both companies will surely take a hit on their uptime SLAs. A Forbes technology pundit called it “an example of how NOT to respond to a critical fault.”
This chain of events resulted in unexpected downtime for some of our customers, compounded by mistakes of our own. This post is therefore both explanation and mea culpa.
To provide some background, Qbox is architected to let customers locate their Elasticsearch indexes in the same data center as their primary data store and other infrastructure. This not only vastly improves performance over a remote service provider but also removes abstractions between you and the underlying application.
We also learned early on that pricing based on documents, queries, or storage limited the number of use cases in which a hosted model made sense, at least for this particular technology. To us, a shared model is sub-optimal when the underlying technology involves bulk indexing and memory-intensive facets that can, at any moment, hoover up resources from everyone else and degrade performance system-wide.
Therefore, selling our service by the node has been our approach, and for the most part, we think it has enabled us to offer the most cost-effective and most production-ready service available.
Like many customers, we received the notifications from both infrastructure partners late. For a great many applications, a little scheduled downtime is not a huge problem, usually costing only a few seconds to a minute or so. For clustered applications that act as their own load balancer, such as Elasticsearch, however, these reboots present a bigger problem. Rebalancing takes time, during which performance may suffer, especially if a node was on the edge of being under-resourced in the first place.
For example, if a 5-node cluster loses one node, it will likely just rebalance and not lose much in the way of performance. If several nodes are lost at once, or if the cluster is in a constant state of rebalancing, the remaining nodes can become overloaded and unresponsive. And if one node out of three goes down, a full third of the cluster's computing resources vanishes.
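To make that arithmetic concrete, here is a small sketch of how load on the surviving nodes grows as nodes drop out, assuming work rebalances evenly. The utilization figures are hypothetical illustrations, not measurements from our fleet.

```python
def per_node_utilization(healthy_load, total_nodes, nodes_down):
    """Fraction of one node's capacity each survivor must carry,
    given the per-node load when the full cluster is healthy and
    assuming the cluster's work rebalances evenly."""
    survivors = total_nodes - nodes_down
    if survivors <= 0:
        raise ValueError("no nodes left to serve traffic")
    return healthy_load * total_nodes / survivors

# A hypothetical 3-node cluster running each node at 60% capacity:
healthy = per_node_utilization(0.60, 3, 0)   # still 60% per node
one_down = per_node_utilization(0.60, 3, 1)  # 90% per node: survivable, but tight
two_down = per_node_utilization(0.60, 3, 2)  # 180% per node: overloaded, likely unresponsive
```

The same math explains why a node that was already near its resource limits before the reboots had so little headroom to absorb its neighbors' shards.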
So this particular action by our cloud partners not only took some nodes down; it also left some clusters overloaded. To make matters worse, some nodes (approximately 20 percent) that were shut down simply never came back up, for reasons that are still being investigated. They stayed down for hours, with no explanation or response from the provider. As a result, some of our most important customers with multi-node clusters were especially affected.
We compounded the problem by assuming that this scheduled maintenance would be more or less routine. It turned out to be anything but. We kept assuming that the providers' recovery processes, coupled with our own recovery scripts, would let the downtime correct itself. Our misreading of the situation led to poor communication with the customers experiencing downtime, and that lack of communication understandably angered some with production instances.
We lost some customers over this issue, and this upsets us. However, it must be said that self-hosted deployments would have likely faced the same problems if they were using the same cloud data centers.
As of now (Sunday afternoon, US Central time), none of our three cloud partners has completed this maintenance. SoftLayer has indicated that its cloud may not be vulnerable, although it leaves open the possibility of maintenance in the future. Rackspace is taking a data-center-by-data-center approach, having completed Northern Virginia and now progressing to DFW, Chicago, and so on; their status page gives a good window into what to expect. AWS claims to be rebooting only approximately 10% of its nodes, with those reboots completing at 1:00 UTC. We will communicate with our affected AWS customers as information becomes available.
We would like to invite any and all customers to share their feedback, good or bad. More importantly, this is also a good opportunity to review your use of Elasticsearch to ensure that you have an appropriately sized cluster, or to review your security posture.