Reply to post:

GitHub lost a network link for 43 seconds, went TITSUP for a day

Lee D Silver badge

I'll explain that problem to you in two words: split brain.

You had two sites that each thought they held the "definitive" copy of the database, but neither did: each was missing what the other had, because both were pretending to be in charge, taking any orders that came to them and applying them, even though they could never tell the other side about those orders.

Note that this is perfectly possible with ANY replication setup that works in a failover mode whereby one site - upon detecting that it can't talk to the other - promotes itself to a full-service node. It starts taking orders from the waiters and handing them to its own chefs, without realising that the other site is also taking orders and handing them to its chefs, and then when you try to merge the kitchens back together you just get chaos.
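The failure mode above can be sketched in a few lines (hypothetical class names, not any real replication product's API): two replicas that each promote themselves on losing contact, then turn out to hold divergent state once the partition heals.

```python
# Sketch of split-brain: both replicas promote themselves during a
# partition and keep accepting writes, so their states diverge.

class Replica:
    def __init__(self, name):
        self.name = name
        self.primary = False
        self.data = {}          # key -> value, this node's view of the DB

    def on_peer_unreachable(self):
        # Naive failover logic: "the other side is gone, I'm in charge now."
        self.primary = True

    def write(self, key, value):
        if not self.primary:
            raise RuntimeError("not primary, refusing write")
        self.data[key] = value

# Normal operation: A is primary, B is a replica.
a, b = Replica("A"), Replica("B")
a.primary = True

# The network link drops for 43 seconds...
b.on_peer_unreachable()      # B can't see A, so B promotes itself too.

# Both sides now happily take orders for the SAME key.
a.write("orders/1", "steak, from site A")
b.write("orders/1", "salad, from site B")

# Partition heals: two "definitive" copies that disagree.
print(a.data["orders/1"] != b.data["orders/1"])   # True: split brain
```

Nothing in either node's local logic is wrong, which is exactly why this is so easy to ship: the bug only exists in the pair.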

It's so prevalent that you can do it in Hyper-V failover replicas, DFS, MySQL or anything else that tries to "take over" when a site goes down without proper shared "journalling" of some kind, or a single authoritative master explicitly handing off work.

If you chop your network in two and expect both halves to give full service, you need a way to resolve the split-brain afterwards. That can either be what DFS or Offline Files do (hey, we have these conflicts, sorry, nothing we can do, you need to manually check what you wanted), or you have to literally put in intermediary services that can handle and clean up the situation.
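The DFS / Offline Files style of resolution amounts to: auto-merge whatever only one side touched, and dump everything else on a human. A minimal sketch of that idea (a hypothetical `reconcile` helper, not any real DFS interface):

```python
# Sketch of DFS/Offline-Files-style reconciliation: auto-merge keys that
# only one side changed, and flag the rest for manual resolution.

def reconcile(side_a, side_b):
    merged, conflicts = {}, {}
    for key in side_a.keys() | side_b.keys():
        in_a, in_b = key in side_a, key in side_b
        if in_a and in_b and side_a[key] != side_b[key]:
            # Both sides wrote different values: nothing we can do
            # automatically - a human has to pick one.
            conflicts[key] = (side_a[key], side_b[key])
        else:
            merged[key] = side_a[key] if in_a else side_b[key]
    return merged, conflicts

site_a = {"report.doc": "v2 (edited at A)", "notes.txt": "unchanged"}
site_b = {"report.doc": "v2 (edited at B)", "todo.txt": "new at B"}

merged, conflicts = reconcile(site_a, site_b)
print(conflicts)   # {'report.doc': ('v2 (edited at A)', 'v2 (edited at B)')}
```

Note that "sorry, you sort it out" is the honest answer here: without extra context (timestamps, causality, intent), no code can know which edit to `report.doc` the users actually wanted.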

The job is almost impossible to do automatically. Someone pushes a commit to site A; it times out because of the fault but still lands in site A's storage. They retry, get load-balanced over to site B, and now site B has an *almost* identical commit - but the two differ. Or one developer pushes his branch to site A, another to site B, they conflict, and now both sides' entire trees are in a mess. Leave it for 40 minutes with a crowd of developers and before you know it you have two completely conflicting trees that can't be merged, because the patches change the same parts of the code - and whose do you reject now? Worse, one of those developers is going to have to rebase, and they may have done thousands of hours of work on top of what is now the deprecated tree. They won't be happy.

And I've tried to explain this to people too... "yes, just slap in a failover / replica, magic happens, and it all works when you join them back."

No. It doesn't. The only way to do it is a load-balanced queuing/transaction system in which the underlying databases are separate but there's only ever one "real" master, and commits reach it through a single ordered list of processes that always feeds the data back, in the same order, to the same system. Literally, one side "takes orders" but does nothing with them until the link is restored, and then it hands them off to the shared kitchen. You don't lose any orders, but they aren't acted upon immediately (i.e. you accept the commit, but on the failed site it's never reflected in the tree). Even then you have problems: maybe the commit you accepted wouldn't be valid against what is NOW the master tree, which has taken other commits in the meantime.
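The "one side takes orders but does nothing with them" scheme can be sketched as an ordered journal replayed into the single master on rejoin (hypothetical `Master`/`QueuedSite` classes; a real system also needs durable storage, idempotent replay, and so on). It also shows the residual problem: an order you accepted can still be rejected at replay time because the master's tree has moved on.

```python
# Sketch of the queued-handoff approach: during a partition the cut-off
# site only journals writes; on rejoin it replays them, in order, into
# the single real master - which may still reject some of them.

class Master:
    def __init__(self):
        self.tree = {}

    def commit(self, key, base, value):
        # Reject commits made against a version the master has moved past.
        if self.tree.get(key, {}).get("version", 0) != base:
            return False
        self.tree[key] = {"version": base + 1, "value": value}
        return True

class QueuedSite:
    """Cut-off site: accepts orders but acts on none of them."""
    def __init__(self):
        self.journal = []            # ordered list of pending commits

    def accept(self, key, base, value):
        self.journal.append((key, base, value))   # acknowledged, not applied

    def replay(self, master):
        # Link restored: hand the orders to the shared kitchen, in order.
        return [master.commit(*entry) for entry in self.journal]

master = Master()
master.commit("file.c", 0, "original")           # master is now at version 1

site_b = QueuedSite()
site_b.accept("file.c", 1, "edit made during the outage")
master.commit("file.c", 1, "edit made at the master meanwhile")  # version 2

# The queued commit was acknowledged earlier, but it's now stale: the
# master tree has taken another commit, so replay rejects it.
print(site_b.replay(master))   # [False]
```

No orders are lost in transit, but "accepted" and "applied" have become different promises - which is precisely the trade-off the paragraph above describes.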

Such things - and their solutions - introduce all kinds of problems with your "distributed / fail-safe" setup.

And all because you didn't think it through and just assumed it would all carry on working perfectly, like magic. If you have a blip and you fail over, the failover itself will work perfectly. But before you can ever resume full service, you have reconciliation work to do - and if you haven't considered that in your design, it turns into a mess: hours of downtime and potentially accepted-but-then-disappearing commits.
