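The biggest news on the Internet lately is probably that Amazon's AWS went down and did not fully recover for several days. The whole Internet is talking about it, the Internet is not happy, and the consequences could be serious. Perhaps because the incident had no impact on China, there are few Chinese-language articles about it. For one, see Hexun's piece 《伤不起!亚马逊史前最大宕机事件的启示》 (roughly, "Can't Take It! Lessons from Amazon's Biggest-Ever Outage").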

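Someone abroad has collected all the posts related to this incident. They are quite good posts and articles, especially the ones on lessons learned, which I found very instructive, so I am passing them along. The source of this roundup is here.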

Contents

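Experiences of Individual Companies, Some Good, Some Bad
Amazon Web Services Discussion Forum
Summaries
Position: It's the Users' Fault
Position: It's Amazon's Fault
Lessons and Takeaways
Vendors Are Angry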

Experiences of Individual Companies, Some Good, Some Bad

How Heroku Survived the Amazon Outage on the Heroku status page
How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
How Bizo survived the Great AWS Outage of 2011 relatively unscathed… by Someone at Bizo
Joe Stump’s explanation of how SimpleGeo survived
How Netflix Survived the Outage
Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering’s Blog (Hacker News thread)
On reddit’s outage
What caused the Quora problems/outage in April 2011?
Recovering from Amazon cloud outage by Drew Engelson of PBS.

PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being.  From Comment

Amazon Web Services Discussion Forum

Some experienced people shared many quite good accounts of the outage.

Amazon Web Services Discussion Forum
Cost-effective backup plan from now on?
Life of our patients is at stake – I am desperately asking you to contact
Why did the EBS, RDS, Cloudformation, Cloudwatch and Beanstalk all fail?
Moved all resources off of AWS
Any success stories?
Is the mass exodus from East going to cause demand problems in the West?
Finally back online after about 71 hours
Amazon EC2 features vs windows azure
Aren’t Availability Zones supposed to be “insulated from failures”?
What a lot of people aren’t realizing about the downtime:
ELB CNAME
Availability Zones were used in a misleading manner
Tip: How to recover your instance
Crying in Forum Gets Results, Silver-level AWS Premium Support Doesn’t
Well-worth reading: “design for failure” cloud deployment strategy
New best practice
Don’t bother with Premium Support
Best practices for multi-region redundancy
“Postmortum“
Learning from this case
Amazon, still no instructions what to do?
Anyone else prepared for an all-nighter?
Is Jeff Bezos going to give a public statement?
Rackspace, GoGrid, StormonDemand and Others
Jeff Barr, Werner Vogels and other AWS persons – where have you been???
After you guys fix EBS do I have do anything on my side?
Need Help!!! Lives of people and billions in revenue are at risk now!!!
I’ve Got A Suspicion
Farewell EC2, Farewell

There were also many, many instances of support and help in the log.

Summaries

Amazon EC2 outage: summary and lessons learned by RightScale
AWS outage timeline & downtimes by recovery strategy by Eric Kidd
The Aftermath of Amazon’s Cloud Outage by Rich Miller

Position: It's the Users' Fault

So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
The AWS Outage: The Cloud’s Shining Moment by George Reese (Hacker News discussion)
Failing to Plan is Planning to Fail by Ted Theodoropoulos
Get a life and build redundancy/resiliency in your apps on the Cloud Computing group

Position: It's Amazon's Fault

Stop Blaming the Customers – the Fault is on Amazon Web Services by Klint Finley
AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
Amazon Web Services are down – Huge Hacker News thread

Lessons and Takeaways

People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
Basic scalability principles to avert downtime by Ronald Bradford
Amazon crash reveals ‘cloud’ computing actually based on data centers by Kevin Fogarty
Seven lessons to learn from Amazon’s outage By Phil Wainewright
The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
Some thoughts on outages by Till Klampaeckel
Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
Single Points of Failure by Mat
Coping with Cloud Downtime with Puppet
Amazon Outage Concerns Are Overblown by Tim Crawford
Where There Are Clouds, It Sometimes Rains by Clay Loveless
Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz

Vendors Are Angry

Amazon Outage Proves Value of Riak’s Vision by Basho
Magical Block Store: When Abstractions Fail Us by Mark Joyent (Hacker News discussion)
On Cascading Failures and Amazon’s Elastic Block Store by Jason
An unofficial EC2 outage postmortem – the sky is not falling from CloudHarmony

Reposted from CoolShell (酷壳) without changes, in memory of Chen Hao (陈皓, 左耳朵耗子).