Crisis? What crisis?

Repeat after me: I will not let your lack of planning become my crisis. Keith Mitchell at PIPEX taught me this. - Bill Thompson

I don’t care about downtime nearly so much as I care about data loss, so, from my point of view, I’m pretty happy with the outcome of the Amazon Web Services problems last week.

To recap: AWS services hosted in Virginia suffered connectivity problems that weren’t fully resolved for several days. This affected a number of large sites, such as Reddit, as well as the hosting platform Heroku.

The reasons for using AWS instead of dedicated hosting are still valid: it’s quick, easy, and cheap to provision servers and storage because it’s an anonymous, automated system. Of course, those advantages turn to disadvantages when something goes wrong: it’s still an anonymous automated system that doesn’t really tell you much about what’s going wrong or when it’s going to be fixed.

But you know what? I think that’s OK. Computer systems fail, sometimes catastrophically. Hardware breaks. The only reason anyone noticed the Amazon outage is that it affected a large number of sites simultaneously. Running your own systems may make it easier to find someone to blame when they go down, and to nag while they try to fix it, but would it really be better? Can you hire the networking and database expertise you’d need at the price you’re willing to pay? Can you set up a multiple-location backup system that actually works?

A sense of perspective is important here. We’re talking about hosting websites, not brain surgery or moon landings. No one’s dying. People overreact, it may be a bit embarrassing, and it’s not nice to be the target of frothy-mouthed panic from jittery middle managers, but that’s not really critical. If you didn’t have a continuity plan a week ago, then you didn’t think the service was that important. As the quote above says, your lack of planning is not my crisis.

The worst possible outcome of the AWS downtime would be a knee-jerk exodus from AWS to another hosting platform that hasn’t failed yet. It won’t be any better.

Now, you might complain that Amazon said that this outage wouldn’t happen, so you didn’t plan for it. Have you planned for nuclear strikes on the eastern seaboard of the United States of America? If you have, then congratulations: the AWS outage probably didn’t affect you. If you haven’t, then you implicitly accepted the risk of downtime. In most cases, that’s the rational choice, but it’s also an acceptance of the fact that your website is not critical.

Amazon’s recent outage affected connectivity to virtual servers on the US east coast, and, by extension, RDS, their hosted database service. One of the interesting features of RDS is that it provides automated snapshots of the database at daily intervals. During the downtime, anyone using an affected RDS database was able to create a new working database instance from the last snapshot with two or three mouse clicks. Far from being a failure, that’s actually a remarkably resilient system, and a quality of backup far better than most people would manage running their own database servers. However, given the potential loss of up to 24 hours’ data, or the difficulty of re-integrating the data from that period later, I suspect that many people would have chosen simply to wait for connectivity to be restored. That’s an acknowledgement that preserving data is more important than uptime.
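
To make that trade-off concrete, here is a minimal sketch of the arithmetic, not anything AWS provides. The function name and the timestamps are hypothetical, chosen only for illustration; the one assumption taken from above is that snapshots are daily, so the gap can never exceed 24 hours.

```python
from datetime import datetime, timedelta

# Assumption from the text: automated snapshots are taken once a day.
SNAPSHOT_INTERVAL = timedelta(hours=24)


def data_loss_window(last_snapshot: datetime, outage_start: datetime) -> timedelta:
    """Return how much data is lost by restoring from the last snapshot.

    Everything written between the snapshot and the start of the outage
    is missing from a database restored from that snapshot.
    """
    if outage_start < last_snapshot:
        raise ValueError("the snapshot cannot be newer than the outage start")
    return outage_start - last_snapshot


# Hypothetical timestamps, for illustration only.
last_snapshot = datetime(2000, 1, 1, 3, 0)
outage_start = datetime(2000, 1, 2, 1, 0)

lost = data_loss_window(last_snapshot, outage_start)
print(f"Restoring now discards {lost} of writes")
```

Restoring gets you back online within minutes but costs you that window of writes. Waiting costs you nothing but uptime. Which number hurts more tells you which one you actually care about.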

Take fright at ‘the cloud’ if you will. Run back to traditional hosting providers. Pay them lots of money. Wait days for them to rack up servers. Fill in change requests in Word documents whenever you want to do something. Make sure that they set up regular off-site backups of your databases and storage. Ensure that those backups work. And then, next time Amazon goes down, you’ll be safe. But don’t fool yourself that you won’t have downtime, or that it will be any easier or more reliable.

There are ways to architect applications for high availability, but they come with costs and trade-offs of their own. It’s your choice.