Re: One region? What's the big deal?
One of the issues with EBS is that in certain failures it doesn't break cleanly. Sometimes you just get a 100x slowdown on one disk, with no actual failure reported, until you detect it and route around the slowdown. Other times Amazon reports that a disk transaction has been committed, but when you read it back you get corrupted data every single time, indicating the write itself was corrupted in the first place (this is rare, but anyone who has spent much time with hefty EBS instances will have seen it). So you can end up with corrupted data on both master and slave copies before any of Amazon's monitoring reports that anything might be wrong, whilst what you have left runs like a drain.
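Because the volume can lie about both integrity and latency, the only defence I know of is to verify your own reads. A minimal sketch of read-after-write verification (the path, thresholds, and function names here are all made up for illustration, not any AWS API):

```python
import hashlib
import os
import time

# Hypothetical canary: write a block, fsync it, then read it back and check
# both that the bytes match and that the read wasn't absurdly slow.
BLOCK = os.urandom(1 << 20)  # 1 MiB test block
EXPECTED = hashlib.sha256(BLOCK).hexdigest()

def verified_write(path: str, data: bytes) -> None:
    """Write and fsync, so 'committed' at least means the OS flushed it."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

def verified_read(path: str, expected_sha256: str,
                  baseline_secs: float = 0.05,
                  slow_factor: float = 100.0) -> bytes:
    """Read back, raising on checksum mismatch or a ~100x slowdown."""
    start = time.monotonic()
    with open(path, "rb") as f:
        data = f.read()
    elapsed = time.monotonic() - start
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise IOError(f"corrupted read from {path}: checksum mismatch")
    if elapsed > baseline_secs * slow_factor:
        raise IOError(f"degraded read from {path}: took {elapsed:.2f}s")
    return data

verified_write("/tmp/ebs_canary.bin", BLOCK)
data = verified_read("/tmp/ebs_canary.bin", EXPECTED)
```

Run something like this periodically against each volume and you at least notice the acid trip yourself, rather than waiting for Amazon's status checks to.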
This does not make reliability engineering very easy.
Basically, it's not that EBS fails (you should plan for that, as has been mentioned); it's that it sometimes fails in completely unexpected ways that contradict the documentation on how it should fail, and you don't necessarily notice that it's gone off on an acid trip until it's too late.
It also doesn't help when multiple availability zones fail at the same time, or when your failover takes forever because everyone else is failing over along with you.
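That "everyone failing over at once" stampede is usually tamed with jittered exponential backoff, so clients don't all retry in lockstep. A minimal sketch (function name, attempt counts, and delay caps are my own assumptions, not anything AWS-specific):

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6):
    """Yield a 'full jitter' delay per retry: random in [0, min(cap, base*2^n)].

    Each client picks a random point under the exponential ceiling, which
    spreads a thundering herd of failovers across the retry window.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)

delays = list(backoff_delays())
```

It doesn't fix correlated AZ failures, but it keeps your own fleet from making the pile-up worse.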
I should point out that I haven't touched AWS recently, so Amazon may have miraculously fixed all of this, and this latest issue is some unrelated problem.
Anyone with more up-to-date knowledge of AWS care to weigh in?