The incident that made some of the web inaccessible this week was a result of human error.
Amazon published an explanation of Tuesday’s disruption to S3, part of Amazon Web Services, which provides hosting for hundreds of thousands of websites and apps.
Turns out, it was a typo.
In a statement on Thursday, Amazon said an employee on its S3 team, working on an issue with the billing system, meant to take a small number of servers offline but entered the command incorrectly and removed a much larger set of servers.
Amazon is “making several changes” to its system to prevent a similar event in the future, acknowledging that “the tool used allowed too much capacity to be removed too quickly.”
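Amazon’s statement doesn’t describe the tool itself, but the class of safeguard it points to is easy to picture: a removal command that refuses any request taking out more than a small fraction of the fleet at once, or dropping capacity below a safe floor. The sketch below is purely illustrative; the function names and thresholds are hypothetical and are not taken from Amazon’s tooling.

```python
# Hypothetical sketch of a guardrail for a capacity-removal tool.
# These names and limits are invented for illustration only; they show
# the kind of check that stops a mistyped argument from removing far
# more servers than intended.

MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of the fleet at once
MIN_ACTIVE_SERVERS = 1000     # never drop below a minimum safe capacity


def remove_capacity(active_servers: int, servers_to_remove: int) -> int:
    """Validate a removal request before executing it.

    Returns the number of servers approved for removal, or raises
    if the request would breach either safety limit.
    """
    if servers_to_remove > active_servers * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {servers_to_remove} servers: request exceeds "
            f"{MAX_REMOVAL_FRACTION:.0%} of current capacity."
        )
    if active_servers - servers_to_remove < MIN_ACTIVE_SERVERS:
        raise ValueError(
            f"Refusing removal: capacity would fall below the "
            f"{MIN_ACTIVE_SERVERS}-server floor."
        )
    return servers_to_remove


if __name__ == "__main__":
    # A small, intended removal passes the checks...
    print(remove_capacity(active_servers=20000, servers_to_remove=50))
    # ...while an oversized, mistyped request is rejected.
    try:
        remove_capacity(active_servers=20000, servers_to_remove=5000)
    except ValueError as err:
        print(err)
```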
According to Synergy Research Group, AWS holds 40% of the cloud services market, meaning it is responsible for keeping large swaths of popular websites running. So if AWS goes down, it takes a huge number of businesses, apps, and publishers with it.
That’s why so many sites struggled with slow or reduced capacity during Tuesday’s outage. Some news organizations couldn’t publish stories, and file sharing was disabled on the enterprise chat app Slack. Other affected services included GitHub, Trello, and Venmo. It took Amazon almost four hours to resolve the issue.
“While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses,” the company said. “We will do everything we can to learn from this event and use it to improve our availability even further.”