One of the biggest fails here...
Seems to be ignoring the simple rule: "don't rely on an item to report news about said item".
That being said, they're learning from it.
A 43-second loss of connectivity on the US East Coast helped trigger GitHub's 24-hour TITSUP (Total Inability To Support User Pulls) earlier this month. The bit bucket today published a detailed analysis of the outage, and explained that the brief loss of connectivity between its US East Coast network hub and the primary US …
“To improve performance at scale … we use Orchestrator to manage our MySQL cluster topologies and handle automated failover.”
Demonstrating the potential instability introduced by excessive complexity in a system. There's a name for this that escapes me at this time.
I've had fewer brown-trouser moments BECAUSE of clustering my resources. We aren't large scale, but we do use MS clustering and it has kept things going: when the core switches went down due to a faulty UPS, clustering on the 2nd site failed over nicely and replicated storage back once we fixed the UPS. SQL Server has failed over when a spod accidentally triggered an update that rebooted one of the clusters - this was a "nice" failover but worked just as well.
big boys tend to have bigger problems though.
"Demonstrating the potential instability introduced by excessive complexity in a system"
The term I'm thinking of starts with 'cluster' (how aproh-pooh)
Admittedly, I've never had to set one of these up. I would think that a transaction-based system could simply apply the backlogged transactions, right???
[you know, like taking a failed RAID drive off-line and swapping in a new one]
DB replication can be a bastard. Think monotonically increasing ID-style keys. If both nodes have different data records for the same primary key entry, ID 12345 say, that are then linked into other data records, then you have to untangle the mess. In dual-node systems you can do odds & evens, but in multi-node clusters it gets a lot uglier.
Not a DBA, but I pretend sometimes when the real DBAs are away.
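The odds & evens scheme above can be sketched in a few lines; MySQL itself exposes the same idea through its auto_increment_increment and auto_increment_offset settings. A minimal Python sketch, with illustrative names:

```python
# Offset-based key allocation to avoid primary-key collisions between
# two nodes that both accept writes (the "odds & evens" scheme).
# MySQL exposes this idea via auto_increment_increment/auto_increment_offset.

class NodeIdAllocator:
    def __init__(self, node_offset: int, node_count: int):
        # node_offset: 1-based position of this node (1 = odds, 2 = evens)
        self.next_id = node_offset
        self.step = node_count  # total nodes writing concurrently

    def allocate(self) -> int:
        current = self.next_id
        self.next_id += self.step
        return current

odds = NodeIdAllocator(node_offset=1, node_count=2)
evens = NodeIdAllocator(node_offset=2, node_count=2)
print([odds.allocate() for _ in range(3)])   # [1, 3, 5]
print([evens.allocate() for _ in range(3)])  # [2, 4, 6]
```

As the comment says, this only dodges key collisions; it does nothing about the harder problem of linked records diverging on each side.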
And this is why you shouldn't have automatic failover in disaster recovery situations. For local HA, with redundant equipment, automatic is fine when a disk, switch or server goes down. For long-distance DR it's well-nigh impossible for an automated system to have a full view of what happened (recoverable network outage versus primary site disappearing in a ball of nuclear fire, for example). With a person in the loop, they could have looked at the situation, perhaps called an admin on the other coast, and said "oh, it's just a transient network outage, the best solution is to wait until it comes back." Automate the changeover by all means, once the decision is taken, but don't make it automatic.
What you also need is a mechanism which *guarantees* that there is no split-brain scenario: a provably-correct consensus protocol like Paxos or Raft. You want writes to be committed everywhere or not at all.
Some databases like CockroachDB integrate this at a very low level; whether it is fast enough for Github's use case is another question.
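The safety property being described - at most one side of a partition may keep committing - reduces to majority quorum, which is the core of both Paxos and Raft. A minimal sketch of just that rule (illustrative only, not a consensus implementation):

```python
# Majority-quorum commit rule: a write commits only if a strict majority
# of nodes acknowledges it. Two partitions can never both commit,
# because at most one partition can contain a majority.

def can_commit(acks: int, cluster_size: int) -> bool:
    return acks > cluster_size // 2

# A 5-node cluster split 3/2 by a network partition:
print(can_commit(acks=3, cluster_size=5))  # True  - majority side keeps writing
print(can_commit(acks=2, cluster_size=5))  # False - minority side must refuse writes
```

The price is that the minority side must stop serving writes entirely, which is exactly the trade-off GitHub's setup didn't enforce.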
Sometimes time is of the essence with a failover. If you do it correctly and fast you might see zero disruption and when it's the storage that is at issue then even delaying for a few seconds can cause you to lose the LUN or cause the application to go awry.
Maybe just a bit more intelligence in the failover, done well it could make more intelligent failover decisions with a human just needing to disable the system if there is known engineering works going on that might accidentally trigger it.
Were these 2 datacenters really 11k miles apart?
Latency usually means round-trip time, and the speed of light in fibre is only 2x10^8 m/s, so 60 ms is more like 6000 km of pure fibre separation. That is roughly the distance between the US East and West coasts by fibre, but your point is still valid. It's too slow for synchronous replication, so you cannot guarantee that both sides are 100% in sync. Any failover solution has to take that into account.
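A quick sanity check of those figures:

```python
# Back-of-envelope: how far apart are two sites with a given round-trip
# latency over fibre? Signal speed in glass is roughly 2e8 m/s (~2/3 c).
FIBRE_SPEED_M_PER_S = 2e8

def one_way_distance_km(round_trip_ms: float) -> float:
    one_way_s = (round_trip_ms / 1000) / 2  # halve the RTT, convert to seconds
    return one_way_s * FIBRE_SPEED_M_PER_S / 1000

print(one_way_distance_km(60))  # 6000.0 km of fibre for a 60 ms RTT
```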
For local HA, with redundant equipment, when a disk, switch or server goes down automatic is fine.
This strikes me as being a LAN issue, though, at one or the other site: a fault that caused connectivity *to*, rather than *on*, the WAN link to be lost. The reason I suggest this is that, in default trim, a 43-second outage is suspiciously close to the time traditional STP takes to kick in and rearrange the active links.
These days you would expect people to be using at least RSTP, which is almost infinitely faster to respond, but only for some of the more common faults. In particular, it doesn't detect the case where a link fails in one direction in a consistent manner; in that case you are still dependent upon the original protocol to sort things out, with the delays that come with it.
One concept Microsoft (afaik) came up with is that of a RID master. It gives out blocks of numbers to other servers upon request. When the server passes the watermark it will preemptively request a new block. In the case of a loss of connectivity, it can still create new objects until the block is exhausted. I thought this could well be applied to database replication.
Do you then cluster the RID master over geographic locations to ensure redundancy? Do you then need a RID master master to oversee your RID cluster?
How will the ID blocks work when they are re-merged? The transactions will be all over the place. If you don't need the ID for anything useful outside of a unique index, then you could just do a compound index with the server name, or start your index blocks at different starting points that will never overlap.
In reality it isn't so much about inserting data once, or reading data; it is about changing data, or a set of transactional commands that needs to be done in a set sequence, where some of that sequence may exist on one server and some on another and the timestamps could be milliseconds out. Or where some data is changed that only exists on one system, or has only made it into one index.
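For what it's worth, the RID-master scheme described above can be sketched as a block allocator with a pre-fetch watermark (all names here are illustrative, not Microsoft's actual protocol):

```python
# RID-master-style allocation: a central master hands out non-overlapping
# blocks of IDs, and each node pre-fetches a new block once it crosses a
# watermark, so it can keep creating records through a short loss of
# connectivity to the master.

class BlockMaster:
    def __init__(self, block_size: int = 1000):
        self.block_size = block_size
        self.next_block_start = 0

    def grant_block(self) -> range:
        block = range(self.next_block_start, self.next_block_start + self.block_size)
        self.next_block_start += self.block_size
        return block

class Node:
    def __init__(self, master: BlockMaster, watermark: int = 100):
        self.master = master
        self.watermark = watermark          # refill when this few IDs remain
        self.ids = list(master.grant_block())

    def new_id(self) -> int:
        if len(self.ids) <= self.watermark:
            # pre-emptively fetch the next block while still connected
            self.ids.extend(self.master.grant_block())
        return self.ids.pop(0)

master = BlockMaster()
a, b = Node(master), Node(master)
# Blocks never overlap, so IDs minted on a and b can be merged safely.
print(a.new_id(), b.new_id())  # 0 1000
```

Note this only guarantees unique IDs; as the comment above says, it does nothing about ordering the transactions that used those IDs once the sites re-merge.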
"Yeah, with the right webscale blockchain system this could never have happened! :-)"
Technically you're correct - once the webscale blockchain consultants legged it with all the money, you'd be forced to use a single host and bits of string to connect everything together, completely eliminating the possibility of split brain.
The average business that is hell-bent on sticking data in the cloud needs to bear this in mind. You're buying into whole layers of sophistication that are totally unnecessary. One data repository and backups taken during pauses of data writing is all that many of us really want.
Sure, outages consequently mean inaccessibility to your data, but this is the price paid for being hell-bent on sticking data in the cloud. Count up the points of failure of a cloud solution, compare that with the points of failure of an on-premises solution, and draw your own conclusions.
I'll agree that GitHub et al. are slightly different animals to, say, accounts data, but anything basic that can be done in the cloud can be done on-prem. If bells and whistles are important, then you're incrementing that points-of-failure counter.
The brief outage ... caused problems in the organisation's complex MySQL replication architecture
If you have to define a system as "Complex", you can guarantee that when it goes wrong, it will go wrong in a manner that will take a long time to clean up afterwards.
I know you're operating at scale, but Keep It Simple, Stupid. Simple is the only way you stand a chance of keeping big things like this running.
Most of the apparent complexity you see in IT is fake, you can easily complicate trivial matters and make them look complex to the lazy eye, just ask a lawyer how it's done.
Complication is not the same as complexity. Complexity, to be real, must be an essential attribute of the problem, while complication results from laziness, ignorance or ill intent.
I'll explain that problem to you in two words: split brain.
You had two places that both thought they had the "definitive" copy of the database, but didn't, because they didn't have what the other side had, because both were pretending to be in charge and taking any orders that came to them and applying them, even if they could never tell the other side about those orders.
Note that this is perfectly possible with ANY replication setup that works in a failover mode whereby one place - upon detecting that it can't talk to the other place - becomes a full-service node. It starts taking orders from the waiters and giving them to its own chefs, without realising that other places are also taking orders and giving them to their chefs, and then you try to merge the kitchens back together and you just get chaos.
It's so prevalent that you can do it in Hyper-V failover replicas, DFS, MySQL or anything else that tries to "take over" when a site goes down without proper shared "journalling" of some kind, or a forcible master server handing off work.
If you chop your network in two, and expect both halves to get full service, you need a way to resolve split-brain afterwards. That can either be something like DFS or Offline Files does (hey, we have these conflicts, sorry, nothing we can do and you need to manually check what you wanted), or you have to literally put in intermediary services that can handle and clean up the situation.
The job is almost impossible to do automatically... someone commits something to site A; it times out because of the fault but reaches site A's storage. They retry, get balanced over to site B, and you now have an *almost* identical commit on site B - but the two differ. Or one developer commits his branch to site A, another to site B, they conflict, and now you've messed up both sides' entire trees. Leave it for 40 minutes with a crowd of developers and before you know it you have two completely conflicting trees that can't be merged, because the patches change the same parts of the code - and who do you reject now? Plus one of those developers is going to have to rebase their tree, but may have done thousands of hours of work based on the deprecated tree, and they won't be happy.
And I've tried to explain this to people too... yes, just slap in a failover / replica, magic happens and it all works when you join them back.
No. It doesn't. The only way to do that is to have a load-balanced queuing/transaction system whereby the underlying databases are separate, but there's only ever one "real" master, and that gets committed to by a single ordered list of processes that will always feed that data back in the same order to the same system. Literally, one side "takes orders" but does nothing with them. Until the join is fixed and then they hand them off to the shared kitchen. You don't lose any orders, but they don't get acted upon immediately (i.e. you accept the commit, but on the failed site, it's never reflected in the tree). Even there, you have problems (maybe the commit you accepted wouldn't be valid against what is NOW the master tree that's taken other commits in the meantime).
Such things - and their solutions - introduce all kinds of problems with your "distributed / fail-safe" setup.
And all because you didn't think it through and just assumed it would all carry on working perfectly like magic. If you have a blip, and you failover, the failover will work perfectly. But before you can ever resume service, you have work to do that if you haven't considered it in your design turns into a mess with hours of downtime and potentially accepted-but-then-disappearing commits.
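The "take orders but do nothing with them" pattern described above can be sketched as a queue that is only replayed against the single real master once the partition heals, with conflicts surfaced rather than silently merged. A hedged sketch - the conflict rule here is purely illustrative:

```python
# During a partition the secondary only queues writes (acknowledging
# receipt, applying nothing). After the link returns, queued writes are
# replayed against the master in order; any that no longer apply cleanly
# are rejected and surfaced to the user instead of being guessed at.
from collections import deque

class QueuedSecondary:
    def __init__(self):
        self.queue = deque()

    def accept_write(self, write):
        # Acknowledge receipt only - nothing is applied locally.
        self.queue.append(write)

class Master:
    def __init__(self):
        self.log = []

    def apply(self, write):
        self.log.append(write)

    def conflicts(self, write):
        # Illustrative rule: a write conflicts if it targets a key the
        # master already changed while the secondary was cut off.
        return write["key"] in {w["key"] for w in self.log}

    def replay(self, secondary: QueuedSecondary):
        rejected = []
        while secondary.queue:
            write = secondary.queue.popleft()
            if self.conflicts(write):
                rejected.append(write)  # surface it, don't merge blindly
            else:
                self.apply(write)
        return rejected

master = Master()
secondary = QueuedSecondary()
master.apply({"key": "README", "data": "edit on master"})
secondary.accept_write({"key": "README", "data": "edit during outage"})
secondary.accept_write({"key": "NEWFILE", "data": "safe to replay"})
rejected = master.replay(secondary)
print(len(master.log), len(rejected))  # 2 1
```

This illustrates the commenter's caveat too: the queued commit on "README" was accepted at the time but can still bounce later, because the master moved on while the secondary was cut off.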
Agreed, distributed databases are a hard problem. The fact that they used MySQL does not make it any better. The solution you mention requires a totally asynchronous client, which may not work for the database users. Another solution is to use Convergent Replicated Data Types (CRDTs), and yet another is to simply fail one side due to lack of quorum.
"The fact that they used MySQL does not make it any better."
Indeed. "Doesn't scale well" is what springs to mind, from experience.
MySQL is very good at what it's designed for (read heavy operations) but once you get past a few tens of millions of entries there are better DBs that have smaller memory/CPU footprints, which don't have these kinds of problems with replication.
People become highly resistant to change and attempt to keep banging the square peg into the round hole long after it's apparent it doesn't fit, rather than learn how to drive another DB. PostgreSQL and friends may seem big and scary upfront, but it's actually a helluva lot easier to run than trying to tune a large-scale MySQL installation - instead of lots of knobs, PgSQL tends to just get on with it.
(Personal experience: same hardware, 500-million-entry DB, write heavy - PgSQL ended up about 5 times faster than MySQL while using 20% of the resources. Replication was a doddle too, so we migrated to it. The next manager along didn't understand any of it, so ordered it ripped out in favour of MySQL, and then spent a year trying to make it work reliably.)
"Next manager along didn't understand any of it, so ordered it ripped it out in favour of..."
The mark of a truly gifted Manager is that they can come into a new job and immediately decide to rip out systems that have been reliably ticking over for 10 years. All because they read a "Best Practices" article on a forum somewhere. That's why they get the big money.
Me, bitter? no...
Failover works perfectly if you don't have automatic failover. It's the automatic bit that makes split brain possible. Our Hyper-V site failover is manual: we issue a command to the cluster telling it which site is active. When the main site goes down we issue the command to fail over to the backup site; when the main site comes back up we issue the return, and the backup falls back and replicates back. If we lose comms between backup and main, then it sucks to be at the backup site (as we don't issue the failover).
A solved problem in the dark days long before DevOps.
I seem to recall one mostly forgotten set of vendors calling it something like "disaster tolerance"; other similar options may have existed elsewhere.
Then the commodity side of the industry decided it was too hard, or too expensive, or too unfashionable, and started spouting BS like "eventual consistency". I mean, EC probably works OK for presentation-layer stuff where there is no correct answer and the money comes in anyway, but for stuff where things are either right or wrong and there's real stuff at stake, it seems like it might be time to look again at the techniques of a couple of decades ago, before the knowledge is lost forever.
I'd wager that Active Directory is one of the top multi-node database systems in use across the globe, and whilst I've seen issues in my two decades of experience, they've been so incredibly rare compared to this type of SQL split brain that I wouldn't count Microsoft out as being able to produce a DB which is top tier in terms of its resilience during a disaster.
Why not just use distributed transactions and two-phase commit? That's the way to guarantee consistency. Have they just prioritised write speed instead, and added a "we'll synchronise later" consistency kludge on top?
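For reference, the two-phase commit being suggested looks roughly like this (a minimal sketch, not any particular product's implementation). The blocking behaviour of phase 1 - every participant must answer before anything commits - is exactly why it hurts over a 60 ms WAN link:

```python
# Two-phase commit: a coordinator asks every participant to prepare;
# only if all vote yes does it send commit, otherwise everyone aborts.
# Consistency is guaranteed, at the cost of blocking on the slowest
# (or a failed) participant.

class Participant:
    def __init__(self, will_vote_yes: bool = True):
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: durably stage the write, then vote.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1: voting
    if all(p.prepare() for p in participants):
        # Phase 2: unanimous yes -> commit everywhere
        for p in participants:
            p.commit()
        return True
    # Any "no" vote (or timeout, in a real system) -> abort everywhere
    for p in participants:
        p.abort()
    return False

print(two_phase_commit([Participant(), Participant(), Participant()]))      # True
print(two_phase_commit([Participant(), Participant(will_vote_yes=False)]))  # False
```

With participants on opposite coasts, every transaction pays at least one full round trip before it can commit, which is presumably the write-speed trade-off the comment is asking about.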
The region specific nature of this fuckup will have impressed the new owners though. See, it happens to everyone!
Yes, VMS cluster, Solaris cluster, Veritas cluster, they all do that for local HA, and it works. Disk vendors like EMC and HDS can also do it for their distributed storage.
It gets much harder once you bring geographical distances into the picture. Fully synchronised updates, so that both sites always have identical copies of the data, get too slow once you go much past a few ms of latency, so you have to live with the fact that each site has a different view.
There are solutions, usually requiring more than three sites with a mix of local/remote and sync/async replication. That gets expensive.
Oracle DataGuard (yeah, I know it's the evil empire, but the DB is still good) has a useful feature called "far sync" where the data exists only at the two main sites, but there are additional copies of the redo logs maintained synchronously at nearby sites. If a main site is lost, the survivor will automatically fetch & replay the logs to make sure it has caught up before becoming active. Still needs multiple sites, but the intermediate ones only need small systems & storage for logs, not the big servers & TB disks it would require to keep another copy of the whole DB.
"Yes, VMS cluster, Solaris cluster, Veritas cluster, they all do that for local HA, and it works. Disk vendors like EMC and HDS can also do it for their distributed storage."
Don't forget Tandem NonStop systems, which distributed the whole kit and caboodle (CPUs, disks, disk controllers) over geographic distances as a networked set of up to 16 nodes, each containing between 2 and 16 CPUs with all hardware duplicated and hot-pluggable. The whole distributed system was fault tolerant by design with automatic fail-over for all components. Result: 99.99% up time.
... and that was early '80s technology, so there's really no excuse for current systems supporting worldwide 24x7 services to offer less guaranteed availability than was available in 1984.
"There's an effort underway to support “'N+1 redundancy at the facility level'"
1) If you're not running N+1, then you've not learned the first lesson of SRE.
2) For the repos (not all the cruft GitHub has added to them), unless you have a hash collision, resolving "some writes here, some writes there" is trivial by the design of git.
3) For the local cruft, adding shadow git repos to handle their changes should be at worst simple. Again trivial to recover from split brain.
4) For the non-local cruft (ie: webhooks), yes, there would be work to be done to deal with the fact that apparently there are people allowed near keyboards that think that IP is a reliable system for data transport.
What am I missing here?
Biting the hand that feeds IT © 1998–2019