A Xen Security Bulletin warning of a vulnerability that is not yet public has again forced the major public clouds that rely on the Xen hypervisor, including all three Qbox infrastructure partners, to issue rolling reboots of a portion of their VMs.
This must be done prior to March 10 when the details of the vulnerability will be released, ringing the dinner bell for all manner of bad actors to feast upon unpatched instances.
This is the third such “cloud reboot” since October 2014, so the providers are getting better at dealing with a crisis, as is Qbox.
Security is of utmost concern to us, as is uptime and availability, so please take a moment to read this bulletin.
There is more specific information regarding each infrastructure partner below, but in all cases, here are some high-level recommendations to help us help you minimize downtime.
1) Qbox backs up nightly, but only if you are on Elasticsearch v1.1.x or higher. If you are on an older version, you should upgrade anyway. If you are able to upgrade, open a support ticket, but check the version notes for any breaking changes first. (There is a black Zendesk tab on the left side of your dashboard.)
2) Whether or not a reboot is underway, a truly production-worthy high-availability setup has 3 nodes. Elasticsearch is made to be distributed, but it requires a majority of nodes in order to continue serving search requests while a node is recovering. With fewer than 3 nodes, losing one leaves no majority. We do not prevent a smaller setup because so many of our customers want a dev or staging environment at a low price point, but if high availability is of utmost importance, you should have 3 nodes. If you have fewer than 3, add a node from your dashboard.
3) We are proactively monitoring for downtime, but our infrastructure partners are not giving us any information about exact schedules beyond the publicly available windows.
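To make the majority arithmetic in point 2 concrete, here is a minimal sketch. It assumes every node in the cluster is master-eligible, and the function names are ours for illustration, not part of Elasticsearch. (In Elasticsearch 1.x the majority is typically configured through the discovery.zen.minimum_master_nodes setting.)

```python
def quorum(node_count: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return node_count // 2 + 1


def survives_one_node_loss(node_count: int) -> bool:
    """True if a majority is still present after one node goes offline."""
    return node_count - 1 >= quorum(node_count)


# Hypothetical cluster sizes, for illustration only.
for nodes in (1, 2, 3):
    print(nodes, quorum(nodes), survives_one_node_loss(nodes))
```

With 1 or 2 nodes, a single reboot leaves the cluster without a majority; with 3 nodes, the remaining 2 still form one.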
If your Qbox Cluster is on AWS
According to AWS’ security bulletin, only around 10% of their VMs were affected. Further, they identified a way to live-patch the fix. As they say, “This means that over 99.9% of our total EC2 instances will receive the live-update and avoid a reboot.” We have not yet received any notifications of AWS nodes that will be rebooted, although we will certainly stay on the lookout.
Amazon has also confirmed that new assets that are spun up will be pre-patched. (The other two partners have not confirmed this.) Thus, if it turns out your nodes are affected, you can control the upgrade by migrating to a new cluster. If you have 3 nodes, this can be done with very little downtime. With fewer than 3, it can take longer; see my point #2 above if this is the case.
If your Qbox Cluster is on Rackspace
According to Rackspace’s status page, a schedule has been announced, and their rebooting is already underway. The vast majority of Qbox customers have nodes in either Chicago or DFW, and those are scheduled for March 3 and 4, respectively. So far, our alerts have flagged basically all of our nodes as ones that “MAY BE” affected.
If your Qbox Cluster is on Softlayer
Very few Qbox on Softlayer customers will be affected. The updates begin March 4 and end March 9, following this schedule:
HKG02 (Hong Kong): 04-Mar-2015 16:00 UTC / 05-Mar-2015 midnight local datacenter time
AMS01 (Amsterdam): 04-Mar-2015 23:00 UTC / 05-Mar-2015 midnight local datacenter time
HOU02 (Houston): 05-Mar-2015 06:00 UTC / 05-Mar-2015 midnight local datacenter time
DAL01 (Dallas #1): 05-Mar-2015 07:00 UTC / 05-Mar-2015 1:00AM local datacenter time
SJC01 (San Jose): 05-Mar-2015 08:00 UTC / 05-Mar-2015 midnight local datacenter time
MEL02 (Melbourne): 05-Mar-2015 13:00 UTC / 06-Mar-2015 midnight local datacenter time
SNG01 (Singapore): 05-Mar-2015 16:00 UTC / 06-Mar-2015 midnight local datacenter time
FRA02 (Frankfurt): 05-Mar-2015 23:00 UTC / 06-Mar-2015 midnight local datacenter time
WDC01 (Washington, DC): 06-Mar-2015 05:00 UTC / 06-Mar-2015 midnight local datacenter time
DAL09 (Dallas #9): 06-Mar-2015 06:00 UTC / 06-Mar-2015 midnight local datacenter time
TOK02 (Tokyo): 06-Mar-2015 15:00 UTC / 07-Mar-2015 midnight local datacenter time
PAR01 (Paris): 06-Mar-2015 23:00 UTC / 07-Mar-2015 midnight local datacenter time
TOR01 (Toronto): 07-Mar-2015 05:00 UTC / 07-Mar-2015 midnight local datacenter time
DAL06 (Dallas #6): 07-Mar-2015 06:00 UTC / 07-Mar-2015 midnight local datacenter time
SEA01 (Seattle): 07-Mar-2015 08:00 UTC / 07-Mar-2015 midnight local datacenter time
MEX01 (Mexico City): 07-Mar-2015 08:00 UTC / 07-Mar-2015 2:00AM local datacenter time
LON02 (London): 08-Mar-2015 00:00 UTC / 08-Mar-2015 midnight local datacenter time
DAL05 (Dallas #5): 08-Mar-2015 06:00 UTC / 08-Mar-2015 midnight local datacenter time
Again, very few Qbox customers are affected. Softlayer also indicates that it will be live-patching, which means that an extraordinarily small group of nodes will require downtime.
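If you want to double-check a window from the schedule above against your own clock, the conversion can be sketched as follows. The UTC offsets are our assumption, inferred from the schedule entries themselves for early March 2015. They are fixed offsets with no daylight-saving handling, so a time zone database would be the more robust choice for other dates. The helper name local_start is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# UTC offsets in hours, inferred from the schedule for early March 2015
# (assumption: fixed offsets, no daylight-saving handling).
UTC_OFFSET_HOURS = {"HKG02": 8, "AMS01": 1, "DAL01": -6}


def local_start(datacenter: str, utc_text: str) -> datetime:
    """Convert a 'DD-Mon-YYYY HH:MM' UTC start time to datacenter local time."""
    utc_time = datetime.strptime(utc_text, "%d-%b-%Y %H:%M").replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=UTC_OFFSET_HOURS[datacenter]))
    return utc_time.astimezone(local_zone)


# DAL01 starts at 07:00 UTC, which is 1:00AM local time on the same date.
print(local_start("DAL01", "05-Mar-2015 07:00").strftime("%d-%b-%Y %H:%M"))
```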