GitHub.com freezes up as techies race to fix dead data storage gear

GitHub's website remains broken after a data storage system failed hours ago. Depending on where you are, you may have been working on some Sunday evening programming, or getting up to speed with work on a Monday morning, using resources on GitHub.com – and possibly failing miserably as a result of the outage. From about 4pm …

asdfasdfasdf2015

storage

should have used zfs

Anonymous Coward

Re: storage

From https://blog.github.com/2018-10-21-october21-incident-report/

"At 10:52 pm Sunday UTC, multiple services on GitHub.com were affected by a network partition and subsequent database failure resulting in inconsistent information being presented on our website."

Anyone care to talk us through how zfs addresses this?

Anonymous Coward

Re: storage

Network partition followed by database "failure" suggests some sort of clustered system with no split-brain protection. Two halves of one cluster each thinking they are the "one true cluster" and writing data to shared or replicated storage with no protection. It's a typical mistake made by home-grown cluster software that doesn't fail safe.

Plenty of ways to avoid that with decent clustering software, with or without ZFS.

Nate Amsden
Silver badge

Re: storage

I remember nexenta's zfs solution to that problem was to corrupt the zfs file system and then get into a kernel panic reboot loop until the offending block device (and filesystem) was removed. Support's suggestion was "restore from backup". I tried using zfs debug tools (2012 time frame) to recover the fs. Was shocked how immature they were. Wanted a simple "zero out the bad shit and continue"; didn't exist at the time. Disabled nexenta HA after that. Left nexenta behind a while later.

Anonymous Coward

Re: storage

I tried using zfs debug tools (2012 time frame) to recover the fs. Was shocked how immature they were

One of the irritating things about ZFS design, even today, is the insistence that the combination of checksumming & copy-on-write mean that it "can't" get into a state of internal corruption, so fsck-like tools aren't necessary.

IMHO, it has never sufficiently taken into account the way that a FS can be externally corrupted by storage problems or administrative misconfiguration.

macjules
Silver badge
Happy

Re: storage

Never mind. Users can always switch to GitLab.

Oh, wait ..

tfb
Silver badge

Re: storage

The ZFS approach to checking if a filesystem (pool, really) is OK used to be (and may still be) to attempt to import that pool. In other words all the consistency checks run in the kernel where any kind of error in the checks (I mean a mistake in the code which checks for on-disk errors) is probably going to cause the machine to fall over in some horrible way. That kind of made it obvious to me that the ZFS people had never been very near a production environment and certainly never should be allowed near one.

FuzzyWuzzys
Silver badge
Alert

Cloud based services

They're great and I genuinely think they're the way forward but they're just not 100% there yet. I think that huge strides are being made and these are just teething problems as we all adjust to the new world, however we as customers need to use these services in tandem with our own on-prem, securing our own data and making sure we're fully ready before we make the complete move.

Ken Moorhouse
Silver badge

Re: Cloud based services

The analogy I have for most of us is that Cloud is akin to asking NASA to organise your weekly shop.

The complexity of what goes on in the cloud is not relevant to the average I.T. user, and can still result in data loss/data corruption.

Gene Cash
Silver badge
Facepalm

Re: Cloud based services

This isn't a "cloud" service... it's a "share your code" service.

Archtech
Silver badge

Re: Cloud based services

Yes, it's a "share your code" service - in the cloud.

bombastic bob
Silver badge
Meh

Re: Cloud based services

'The cloud' is once again overrated.

All eggs, in one basket [even a distributed basket] is not necessarily a good idea.

I like github but I don't stake my business on it always being there. A lot can go wrong between my computer and their servers. A lot.

TonyJ
Silver badge

Re: Cloud based services

"...'The cloud' is once again overrated.

All eggs, in one basket [even a distributed basket] is not necessarily a good idea.

I like github but I don't stake my business on it always being there. A lot can go wrong between my computer and their servers. A lot..."

Worryingly, that's twice in a single month you've not only made sense, but I find myself generally agreeing with you, bob.

However, like I've said before on here, just because something is in "the cloud" doesn't and shouldn't absolve the owners of the data/service of their responsibilities. These are usually the same people who wouldn't bat an eyelid if told - correctly - that you wouldn't trust the data/service to a single point of failure they own themselves.

And yet we still see this "throw it over the fence and it's someone else's issue" mentality time and time again.

"Cloud services" can work well. But they are not a panacea and they still require some levels of simple management and accountability.

Lee D
Silver badge

Re: Cloud based services

It isn't really cloud, though, is it?

Not if one data storage thing going offline causes the whole thing to fall over. It's more like a Drip. Maybe a Puddle.

Whether or not it's "cloud"... where's the failover? And I mean failover, not just "oh, have some stale data and we may be able to restore a backup"... but live storage somewhere else ready to take over. You'd think $7bn might be able to buy something like that, no?

It doesn't matter whether it's cloud or not - it's SHODDY. Storage failures should never get to the point where they affect users, because you should have enough redundant storage mirrored up to date, and via a versioned filesystem so even a "delete all" command can be undone, for it not to matter.

If you're basing your business on their services, immediately review that decision. From the looks of it, they are just running off stale caches at the moment. That might mean they have no data actually up at all.

tfb
Silver badge

Re: Cloud based services

'Live storage somewhere else ready to take over' is why banking IT is expensive. My guess is that what most of the cloud people do is, at best, 'storage somewhere else ready to take over, which is in a consistent state and no more than a few transactions (of whatever nature: git commits here) behind the current live storage'. Maybe that's enough.

ThomH
Silver badge

Re: Cloud based services

For GitHub consumers this is one of the lesser cloud deployments since cloning a Git repository by default involves making a full local copy, and all operations are performed locally and then merely synced to remote.

Git doesn't even enforce any sort of topology — e.g. an international company that used GitHub could have local copies of all repositories that act as remote for all local developers and which sync up to GitHub from that single point; GitHub would then be the thing that permits cross-site work, and the authoritative copy.

What you lose is GitHub's additions to Git: the pull requests, the issue tracking, etc. Or, in this case, I guess you can still see slightly historic versions of those things effectively in read-only mode.

So I don't think I'm ready to jump on the cloud-is-a-bad-thing bandwagon in this particular use case. It's slightly more of an adjunct rather than a full solution, but the downage needn't be an absolute stop to work like it would be if, say, you were in the business of modifying and reviewing legal documents, and were just keeping them all on One Drive/Google Drive/DropBox/whatever, which vanished from sight.

So, ummm, just think about what you're paying for and be sensible?

tfb
Silver badge

Re: Cloud based services

The problem is that although git can do all that -- you can ship updates by email I'm pretty sure (and not just the git format-patch thing but commits), so the connectivity requirements are tiny in theory -- people (a) really, really want the issue-tracking stuff (b) in practice treat git just the same way they treated subversion and CVS, with a central system which runs everything, and (c) want it to be free. And that central system, for many people, is GitHub, so when it goes away the same doom befalls them that befell them when google code went away and when sourceforge went away before that (I know it, sort of, came back). And there's almost no collective memory -- anything that happened more than a year or so ago is forgotten -- and so the wheel of reinvention turns forever.

ibmalone
Silver badge

Re: Cloud based services

The problem is that although git can do all that -- you can ship updates by email I'm pretty sure (and not just the git format-patch thing but commits), so the connectivity requirements are tiny in theory -- people (a) really, really want the issue-tracking stuff (b) in practice treat git just the same way they treated subversion and CVS, with a central system which runs everything, and (c) want it to be free. And that central system, for many people, is GitHub, so when it goes away the same doom befalls them that befell them when google code went away and when sourceforge went away before that (I know it, sort of, came back). And there's almost no collective memory -- anything that happened more than a year or so ago is forgotten -- and so the wheel of reinvention turns forever.

Don't entirely agree here. Even with a master repo (which you may put on github or not, as you wish), git still behaves fundamentally differently to both CVS and Subversion in a number of ways.

CVS is the most obvious: there is little concept of repository-level state unless you tag or branch, so a changeset will involve different revision changes for different files. History lives entirely in the repository, so lose it and all you have is a copy of the code; lose your connection to it and you can't commit or check out different versions.

SVN is a bit better, with the concept of a revision for the state of the whole tree. However, branching is mixed up with the directory structure of your repository. I haven't used SVN for ages, but the merge-tracking feature wasn't introduced until 2008, I guess inspired by git. Again, history is on the repo only.

Git: locally stored history, even if you don't use it. Branching and merging on a graph-based model, tools that let you diff freely across branches, commits and history. Using it with a central or master repository model doesn't actually detract from this. You can still commit, branch and checkout with no connection. Lose the remote and you still have all history and can make a new master. A central repo becomes a useful point of coordination, but it's no longer an Achilles' heel. Lose it and you lose the extra nice stuff that's tied to the web service, but not the actual commit information you would with Subversion or CVS.
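The "commit, branch, checkout with no connection" point is easy to demonstrate: everything below runs against an empty temp directory with no remote configured anywhere. (The repo name and identity are made up for the demo.)

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q -b main offline-demo && cd offline-demo
git config user.email dev@example.com   # throwaway identity for the demo
git config user.name "Offline Dev"

echo 'v1' > app.txt
git add app.txt && git commit -qm 'initial commit'

git switch -qc feature                  # branch, with no server anywhere
echo 'v2' > app.txt
git commit -qam 'feature work'

git switch -q main
git merge -q feature                    # merge locally too
git log --oneline                       # the full history lives in .git/
```

Try the equivalent with CVS or Subversion and the first commit already needs the server.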

tfb
Silver badge

Re: Cloud based services

It's not about what git can do, it's about how people use it, and particularly that they expect there to be a big central system and are lost without it.

I'm also pretty sure that unless you do work to avoid the problem you're also in trouble with git if the central system you rely on goes away because you generally won't have all its commits. The documentation for 'git fetch' says, in part,

Fetch branches and/or tags (collectively, "refs") from one or more other repositories, along with the objects necessary to complete their histories.

which, I think, means it only fetches the commits it needs, and not commits associated with refs you're not fetching. So I think that means that pulls generally don't pull branches &c which you aren't tracking. In a busy repo that could be a lot.

I might be wrong about that, but it would be easy to check, I think. I don't know, because I'd never use GitHub as my big central repo; I have origins which sit on storage I control, and I'm generally very careful about making sure I have complete clones when I need them.
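tfb's worry is checkable. With the default refspec a plain clone does track all branches, but the way to guarantee *every* ref (extra branches, tags, and anything else under `refs/`) comes across is a mirror clone. A toy check, with repo and branch names invented for the demo:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# A toy "central" repo with an extra branch and a tag.
git init -q -b main central
cd central
git config user.email dev@example.com; git config user.name Dev
echo hello > f; git add f; git commit -qm 'one'
git branch untracked-feature   # a branch nobody is tracking locally
git tag v1
cd ..

# --mirror copies every ref verbatim, not just the ones you happen to track.
git clone -q --mirror central complete-backup.git

git -C complete-backup.git for-each-ref --format='%(refname)'
# → refs/heads/main, refs/heads/untracked-feature, refs/tags/v1
```

This is the "complete clone" tfb mentions being careful about: a `--mirror` (or `--bare` plus a `refs/*:refs/*` fetch refspec) copy is safe to treat as a full replica of the central repo.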

Mark 85
Silver badge

The Microsoft Curse?

Seems everything MS touches or wants to touch goes to shit. Could just be an "oops" or hardware failure. As for conspiracy theories.. MS owns an insider and this will drop the sale price.

Phil Kingston
Silver badge

Re: The Microsoft Curse?

YAY! I win the office bet on how soon the first Microsoft bash would be. I said top 3.

Anonymous Coward

Re: The Microsoft Curse?

Poor Microsoft, you have to feel sorry for them. Maybe we shouldn't criticise companies that constantly make mistakes or that don't act in the best interest of their users.

stephanh
Silver badge

Re: The Microsoft Curse?

You should move to gitlab. They never screw up.

(At least it's git, you can still branch and merge locally, right? And pull & push from colleagues.)

DavCrav
Silver badge

Re: The Microsoft Curse?

"Poor Microsoft, you have to feel sorry for them. Maybe we shouldn't criticise companies that constantly make mistakes or that don't act in the best interest of their users."

Microsoft hasn't bought it yet.

Archtech
Silver badge

Re: The Microsoft Curse?

"Seems everything MS touches or wants to touch goes to shit".

That is axiomatic. But wise people have been debating for decades whether it's intentional or the result of incompetence.

Unless someone intentionally puts someone incompetent in charge...

Dan 55
Silver badge
Trollface

Re: The Microsoft Curse?

Microsoft hasn't bought it yet.

Even the mere threat of an MS buyout is enough to trigger Azure levels of reliability.

Lost In Clouds of Data
WTF?

Re: The Microsoft Curse?

So if the 2018 outage was a Microsoft curse, what caused the 2017 ones?

https://m.slashdot.org/story/329399

I'm no MS lover, but for Pete's sake this consistent MS bashing is just sad.

By all means slam them for the utter debacle that is Skype/...for Business/Lync/Teams, or for the sprawling mess that is Azure etc. But cloud services go down all the bloody time.

Even GitLab - https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Have some perspective, people. It's a cloud service, and that means, irrespective of the owners or prospective owners, dumb shit that affects a lot of people will happen.

Nick Stallman

Re: The Microsoft Curse?

Nah, it's probably not Microsoft's fault.

It was an issue with their MySQL cluster. So I say it was Oracle putting a knife in Microsoft's back when they weren't expecting it. :p

Tom 38
Silver badge
Joke

Re: The Microsoft Curse?

You should move to gitlab. They never screw up.

Not more than once an hour..

ibmalone
Silver badge

Re: The Microsoft Curse?

You should move to gitlab. They never screw up.

One nice thing about gitlab is that, if you don't trust all this cloud malarky, you can host your own instance.

Which you are then free to lose in the datacentre meltdown of your choice, but at least it will be your datacentre meltdown.

(At least it's git, you can still branch and merge locally, right? And pull & push from colleagues.)

That's the theory, but without a master repo things can get a bit hard to manage if you are pushing and pulling between multiple clones. I think this is a big reason for the success of GitHub and GitLab: the workflows you can build around having a master repository to manage the branches and merging. However, yes, just keep making commits and push once things are working again. If the master repo is utterly lost then spin up a new one and push one of your local clones to it (however, you'll have lost all issue and merge-request history, and will need to set up any CI again).
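The "spin up a new one and push one of your local clones to it" recovery is only a few commands. A sketch with local paths standing in for the replacement server (`work` and `new-master.git` are invented names):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# A surviving developer clone; pretend the old master repo is gone.
git init -q -b main work
cd work
git config user.email dev@example.com; git config user.name Dev
echo code > f; git add f; git commit -qm 'surviving history'
cd ..

# Create a fresh bare "master" and push every local ref into it.
git init -q --bare new-master.git
git -C work remote add origin ../new-master.git
git -C work push -q --mirror origin

# The central history is back; issues, merge requests and CI config
# are what's actually lost, exactly as noted above.
git -C new-master.git log --oneline main
```

Other developers then just `git remote set-url origin <new-master>` and carry on.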

JohnFen
Silver badge

Re: The Microsoft Curse?

As much as I'd love to use this as an opportunity to bash the Microsoft purchase of GitHub, I don't think there's any connection between that and this. This isn't Microsoft's fault.

JohnFen
Silver badge

Re: The Microsoft Curse?

"Which you are then free to lose in the datacentre meltdown of your choice, but at least it will be your datacentre meltdown."

You say that as a joke, but I really believe that's a huge advantage to on-prem hosting.

JLV
Silver badge

Re: The Microsoft Curse?

>MS bashing is just sad.

Well, technically speaking you are right since purchase is still ongoing. But, looking at the headline I just knew MS was going to get bashedpowershelled and was looking forward to it. And also to the logic contortions that the haters were going to be use to somehow blame them.

What, you’re always reasonable, fair and mature?

Joking aside, GitHub eff-ups are very high profile; MS had best take that into account once they do run it.

ibmalone
Silver badge

Re: The Microsoft Curse?

"Which you are then free to lose in the datacentre meltdown of your choice, but at least it will be your datacentre meltdown."

You say that as a joke, but I really believe that's a huge advantage to on-prem hosting.

Yes, only half a joke. If you can afford to do it right (including paying for the knowledge as well as equipment), then you know where your data is (hopefully offsite too...), what arrangements you have for backup and recovery, and what events and failures you're ready for. Cloud can be easier, and maybe give better availability than you can achieve for a sane level of cost+complexity, but aside from a number in a contract (optimistically) it's hard to be certain that Google or Amazon have more than one copy of your stuff, or what might bring it down.

Secta_Protecta

If Only....

Somebody would invent a HA storage system... ;)

Kabukiwookie
Bronze badge

Re: If Only....

The issue is probably human error: a misconfiguration, or alerts that were ignored or not received at all, because infrastructure is easy and any developer who can bang two lines of Java together automatically knows everything there is to know about infrastructure as well.

That's the Devops way.

JRW

Re: If Only....

Some people look at HA as protecting against single points of failure and stop there. You don't hear about all the times one thing happened and, some time later, normal protected running was re-established. You have to plan for more than one event. Three tips to start with:

1 - Don't turn checksums off. If a supplier suggests you turn checksums off on a production system, work out how quickly you can stop using that supplier. Check whether any benchmarks or certifications cited involved turning checksums off, and if they did, demand the ones with checksums enabled.

2 - A snapshot is not a backup.

3 - A replicated snapshot is not a backup.

SonOfDilbert
Pint

Re: If Only....

> 2 - A snapshot is not a backup.

> 3 - A replicated snapshot is not a backup.

Thank you! I've been trying to tell people this for years! Do they listen? Do they buggery.

Nate Amsden
Silver badge

Re: If Only....

What are you protecting against? Snapshots certainly are backups, as is RAID. They don't protect against everything, certainly.

Just today on an Isilon cluster I was deleting a bunch of shit and wasn't paying close attention. Realized I'd deleted some incorrect things (the data had been static for months). I just restored it from a snapshot in a few minutes.

I've been through 3 largish-scale storage outages (12+ hrs of downtime) in the past 15 years. It all depends on what you're trying to protect, and understanding that.

In my ideal world I'd have offline tape (LTFS+NFS) backups of everything critical, stored off site (offline being the key word, not online, where someone with compromised access can wipe out your backups). This is in addition to any offsite online backups. Something I've been requesting for years. Managers didn't understand, but did once I explained it. It's certainly an edge case for data security, but something I'd like to see done. Maybe next year..

Understand what you're protecting against and set your backups accordingly.

Phil O'Sophical
Silver badge

Re: If Only....

Snapshots certainly are backups

Not really. They are a frozen point-in-time image, but they work by storing original copies of blocks that get changed after the snapshot time. For any unchanged block in the snapshot, the original filesystem is still the underlying source of the data. Take a snapshot of a filesystem, then remove or corrupt the original filesystem, and your snapshot is worthless.

Anonymous Coward

Re: If Only....

For any unchanged block in the snapshot, the original filesystem is still the underlying source of the data

I guess it depends on how you understand the terminology. To me what you're describing is an incremental backup.

A snapshot to me is a backup of a moment in time but that backup contains everything needed to rebuild the system. E.g. dd if=/dev/sda of=/dev/sdc is a snapshot/backup, call it what you will, of sda that contains everything you need to rebuild that disc.

Phil O'Sophical
Silver badge

Re: If Only....

I guess it depends on how you understand the terminology. To me what you're describing is an incremental backup.

In a way it's the opposite. An incremental backup is a set of all the data which has changed since a particular moment, which can be added to a full backup to get the latest state. A snapshot is the reverse: it's also a list of the data which has changed, but it contains the unchanged data, not the changes. The idea with the snapshot is that you can go back to that point in time, even if things have changed in the meantime. In both cases, though, you need the full dataset as well; neither a snapshot nor an incremental backup is of any use without a full copy of the data, since they only reflect changes.

A snapshot to me is a backup of a moment in time but that backup contains everything needed to rebuild the system. E.g. dd if=/dev/sda of=/dev/sdc is a snapshot/backup, call it what you will, of sda that contains everything you need to rebuild that disc.

The thing about a snapshot is that it is instantaneous. A dd of a disk will take a finite and possibly long time to complete, during which you need to block all activity to keep it self-consistent. A snapshot freezes an instant of a running system without any visible impact. You can then, of course, make an offline copy of that snapshot, with dd or anything else and you'll get a complete and consistent copy of the filesystem at that frozen moment. It won't matter if the filesystem is changing while you're doing that, the snapshot will protect you from the changes. That offline copy is certainly an independent backup, but the snapshot alone isn't.
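The "offline copy of that snapshot" workflow described above is typically something like the following LVM sketch. All the names (`vg0`, `data`, `/mnt/snap`, `/backup`) are hypothetical, and it needs root and an LVM setup, so treat it as an illustration of the workflow rather than a script to paste:

```shell
# 1. Freeze an instant of the live filesystem (copy-on-write, near-instantaneous).
lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data

# 2. Copy the frozen image somewhere independent; the live FS can keep changing,
#    the snapshot shields the copy from those changes.
mount -o ro /dev/vg0/data-snap /mnt/snap
tar -C /mnt/snap -cf /backup/data-$(date +%F).tar .

# 3. Drop the snapshot. Only the offline tar is the actual backup;
#    the snapshot alone still depended on /dev/vg0/data.
umount /mnt/snap
lvremove -y /dev/vg0/data-snap
```

Step 2 is the part that turns a point-in-time view into an independent backup; skip it and you are back to "a snapshot is not a backup".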

tfb
Silver badge

Re: If Only....

I think a lot of people would define a backup as 'a (possibly partial, in the case of an incremental) copy of something on physically and logically independent storage'. In that sense neither a snapshot nor a RAID system is a backup (a detached mirror might be, in the right circumstances).

JoshOvki

Oh go do one Saishav!

"Can you restore your service already?" software dev Saishav Agarwal

FFS, does this guy think the GitHub people are just sat there thinking "oh well, we could get it working by pressing Enter... but hey, let's wait a while until someone asks"? We get it all the time when something is down: "Can't you just get it back up?!"... Oh, that is what we are supposed to be doing. fsck off.

Joseph Haig

Re: Oh go do one Saishav!

If only there were a way to work on a local copy of your repositories when the servers are not available.

Oh, wait ...

Korev
Silver badge
Joke

Re: Oh go do one Saishav!

Don't be a git...

stephanh
Silver badge

Re: Oh go do one Saishav!

That would require reading the git manpage. Better just whine on Twitter and generally have a day off. Nobody likes a showoff.

el_oscuro
Coat

Re: Oh go do one Saishav!

This man page?

https://xkcd.com/1597/

Hollerithevo
Silver badge

Re: Oh go do one Saishav!

He's the guy who honks his horn in a traffic jam. All the rest of the drivers: oh, if only we'd thought to do that, we'd all be on the move!

WallMeerkat
Bronze badge

Re: Oh go do one Saishav!

For all it's on films and TV, I've never encountered people beeping in traffic jams*. Maybe it's an American or Continental Europe thing, whereas on these isles we tend to resign ourselves to being stuck in traffic.

*(Unless the cause is that car in front that takes forever to move off on green, or someone deliberately blocking lanes/yellow box junctions)
