Roughly 48 hours after its major service outage, Amazon is admitting what caused the problem. Apparently, some poor engineer at Amazon Web Services (AWS) did an oopsie and brought the internet to its knees. Oopsies are the worst!
In all seriousness, it’s a sobering story. Here’s how Amazon described it in a recent blog post:
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
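To picture how one bad input balloons like that, here's a toy sketch. Everything in it is invented for illustration (the host names, the remove_capacity helper, the prefix-matching behavior); Amazon hasn't published its actual playbook tooling. The point is just how a single truncated argument turns "remove a few servers" into "remove most of them":

```python
# Hypothetical illustration only: none of these names come from Amazon's
# postmortem. A fleet of 100 imaginary S3 index hosts, and a command that
# pulls every host matching a name prefix out of service.

FLEET = [f"index-host-{n:03d}" for n in range(100)]

def remove_capacity(prefix: str) -> list[str]:
    """Take every host whose name starts with `prefix` out of service."""
    return [host for host in FLEET if host.startswith(prefix)]

# Intended command: remove one specific host.
print(len(remove_capacity("index-host-042")))  # 1 host removed

# Fat-fingered command: one missing character matches ten times as many.
print(len(remove_capacity("index-host-04")))   # 10 hosts removed
print(len(remove_capacity("index-host-")))     # ...or all 100 of them
```

One dropped character and the command's scope grows by an order of magnitude, which is roughly the shape of the mistake Amazon is describing.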
We’ve all been there. You push the wrong button and end up with Sprite instead of Coke. But this poor guy or gal probably made an errant keystroke that crippled S3, and the many services built on top of it, for at least four hours. Since about a third of all internet traffic reportedly flows through AWS servers, knocking a whole bunch of them offline screwed up a few people’s days.