Power meltdown 'fries' SourceForge, knocks site's servers titsup

A crippling data center power failure knackered SourceForge's equipment yesterday and earlier today, knocking the site offline. The code repository for free and open-source software projects crashed yesterday morning (around 0645 Pacific Time) after unspecified "issues" hit its hosting provider's power distribution unit, …

Anonymous Coward

SourceFource

This week's edition of "Things you're surprised to find out still exist!"

27
11
Silver badge
WTF?

Re: SourceFource

Do you live under a rock? A lot of people rely on Sourceforge, for good or bad. Portable Apps is but one example.

22
5
Silver badge

Re: SourceFource

I’ve heard of it, obviously, just like I’ve heard of Tripod, Geocities, MySpace, Friends Reunited and other relics of the early internet.

25
12

Re: SourceFource

"Do you live under a rock? A lot of people rely on Sourceforge, for good or bad. Portable Apps is but one example."

There are a lot of people that used to rely on sourceforge, but have stopped using it as it has been banned by their current employer. I think the main reason a lot of people have stopped using it may be that sourceforge at one time started purposefully including spyware and viruses with their executables in order to "make" extra money. Those of us that stopped using it are not aware if they changed their deceptive practice.

9
0

Re: SourceFource

"Those of us that stopped using it are not aware if they changed their deceptive practice."

They've changed ownership since then, and the new ownership said none of that would be going on anymore. So far, that seems to be true. I no longer have to avoid Filezilla, for example.

6
0

It's worth noting that ./ (slashdot) is down as well.

11
0
Headmaster

Pedant's corner

I think you'll find that's dotslash not /.

:/

30
0
Anonymous Coward

"It's worth noting that ./ (slashdot) is down as well."

Unfortunately, it seems to have recovered

14
1

Re /. seems to have recovered

Hmmm - not for me. Got kicked out in the original outage. Now seeing a "404 File Not Found" on my browser tab and a "503 - Service Offline" message on the page when I try to log on again. Message follows:

"Slashdot is presently in offline mode. Only the front page and story pages linked from the front page are available in this mode. Please try again later."

0
0

Re: Pedant's corner

Maybe he's from an Arabian country

2
0

Impacted projects

I was actually feeling this yesterday as I was trying to download GParted Live and had to find another source.

Anyone else run into the downtime?

5
0

Re: Impacted projects

I haven't visited SF in ages, but yesterday needed additional textures for SweetHome3D and ran into the 404 page. It worked fine some minutes later, though.

3
0

Re: Impacted projects

I wanted to re-install PyScripter (or at least access the forum). I just updated my Python install and now PyScripter can't find it anymore.

0
0
Silver badge

Here's an idea!

> "We recognize there have always been issues with SourceForge and Slashdot, both with our current provider and within the infrastructure,"

If only someone would start a project to let websites make some sort of copies of themselves. ...

19
1
Anonymous Coward

Re: Here's an idea!

> "we had already decided to fund a complete rebuild of hardware and infrastructure with a new provider"

They've completely missed the point.

Right answer: "we have decided to rebuild with TWO new providers, so if one data centre goes down, we just switch over services to the other one"
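A minimal sketch of that switch-over idea, assuming two hypothetical sites ("provider-a" and "provider-b") and a stubbed health check. None of the names come from SourceForge's actual setup, and a real failover would also have to deal with DNS, data replication and half-finished transactions:

```python
from typing import Callable, Optional, Sequence


def pick_active_site(
    sites: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first site, in priority order, that passes its health check."""
    for site in sites:
        if is_healthy(site):
            return site
    # Every site failed its check: nothing left to fail over to.
    return None


# Hypothetical status table standing in for real health probes.
status = {"provider-a": False, "provider-b": True}

active = pick_active_site(["provider-a", "provider-b"], lambda s: status[s])
print(active)  # provider-b, because provider-a is marked down
```

The priority list keeps the primary in charge whenever it is healthy, so traffic only moves to the second provider when the first one fails its check.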

4
0
tfb
Bronze badge

Re: Here's an idea!

"We would like to use multiple providers, but there is no money."

0
0
Anonymous Coward

Re: Here's an idea!

And you'd fund that effort how, pray tell?

It's all well and good to come up with the best technical architecture, but in the real world bills have to be paid. SourceForge is long off the sweet teat of VC funds.

0
0
Silver badge

Just my luck

Yesterday was the first time in forever I wanted to download something from them, and every time I've tried (including just now again) it is still offline!

4
0
Anonymous Coward

Re: Just my luck

Maybe it was your attempt at downloading that caused all of this. I'd start watching for mysterious people in dark suits if I were you.

:)

12
0
Silver badge

Re: Just my luck

Ditto. Something I wanted to download, was on SourceForge, okay... wait... what?

"We're sorry -- the Sourceforge site is currently in Disaster Recovery mode".

Oops.

2
0
Anonymous Coward

A lot of cool kids have moved on to GitHub

Sourceforge has deteriorated over the years.

Scummy management always ruins a good thing.

https://notepad-plus-plus.org/news/notepad-plus-plus-leaves-sf.html

19
1
Bronze badge

Re: A lot of cool kids have moved on to GitHub

Sourceforge felt dirty when I had to use it recently. Had to upload what I downloaded to virustotal just to feel safe.

Not the sign of a good site.

5
1

Mirrors!

Luckily, Sourceforge downloads were always mirrored. Still are, I believe.

But last time I looked I had to fight through a barrage of JS/Ad farm mirror redirector pages, all refusing to give me a direct link.

Having a mirror for people to access critical projects = good.

Having ad bloat, tracking, JS, redirecting nonsense in front of your mirrors, that goes down when your site is down = poor. Really really poor.

14
0
Silver badge

Same everywhere

Anyone shopping for DC space should ask the proprietor when they last randomly flipped the master breakers with no advance notice[1] to test the auxiliary systems. This isn't because you expect an answer, it's for the amusement value of watching the Facilities guys turn grey in ten seconds.

Untested business continuity procedures are obviously likely to be worthless, but in fairness to the guys on the ground, actually running a test is likely to end your career. Identifying a critical weakness in the DR plan will not protect you from PHBs whose bonuses are linked to uptime metrics. This is why you hear of generators with no fuel, auxiliary power units that fail in seconds because the fuses have evaporated, and three-phase switchovers so wildly unbalanced that the upstream systems shut down.

[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.

22
1
pdh

Re: Same everywhere

Re: doing an unscheduled test with no advance notice: eventually you *will* face this sort of test, whether it's your hand on the breaker, or the hand of Chaos. Given that this is the second time that SF has failed the test, it's hard to blame anyone other than SF themselves.

9
0
Silver badge

Re: Same everywhere

As you say, you don't do unscheduled random failure tests unless you want to be fired. What you do do is a hell of a lot of prep work to work out what should happen in a certain scenario, and then plan a test of that at the least disruptive possible time for the business, with all hands on deck to clear up the mess when it goes to shit (it will - which is the whole point - so you can fix your plan).

In an ideal world (with bottomless pockets) you do this in a test environment that replicates Production, but few of us have that much money.

==

What surprises me in these news stories is the number of services for IT professionals that don't have a DR. The story is about getting their primary infrastructure back up, rather than their fail-over to secondary. But then you get what you pay for I guess.

12
0
Alert

Re: Same everywhere

Re: Lysenko

Something in your text reminded me of reading

https://en.wikipedia.org/wiki/Chernobyl_disaster

Skip down to section "Conditions before the accident".

4
0
Silver badge

Re: Same everywhere

@Mark 110

I agree. The only place I've seen do proper tests on a regular basis was military.

What you describe is the best that can be achieved in the commercial world, with the caveat that scheduling things at the least disruptive time for the business will often tend to invalidate tests because the least disruptive time is usually the same as minimal loading. The fact you can switch the Amazon purchasing DC to "B" feed at 2 am on a random Tuesday does not mean you can do the same in the middle of "Black Friday" and it is under maximal load that failure should be anticipated because that's when everything is as hot as it's going to get and your mechanical components (e.g. CRAC units) are most likely to lock up and start a failure cascade. Faking full load with dummy processes (assuming Ops even have the capability) is only a partial solution because of thermal inertia.

As for DR sites, I think the main reason they are avoided is that even if Facilities hands over correctly, Ops won't. The network probably won't re-route properly, and even if it does you end up with dangling partial transactions in the storage and database systems, a nightmare job reintegrating the datasets afterwards and inevitable data loss because there is so much lazy writing, RAM buffering and non-ACID data (I'm looking at you, Riak) floating about in modern systems.

8
0
Orv
Silver badge

Re: Same everywhere

It's often the things you don't expect. I was in a data center that ran fine for three years, then one day we suddenly lost one of our two power feeds in our rack. (Naturally it was Christmas week and everyone was on vacation.) Turned out someone had forgotten to tighten a nut on a connection in the breaker box, way back when the center was built. Things were fine as long as the row of racks it fed was mostly empty...but when they finally got around to filling it, the extra current caused the high-resistance connection to melt down. Unfortunately this was the part of the power distribution system between the UPS and the servers, so the UPSes didn't help. I believe they started doing regular IR scans of the breaker boxes after that.

I'm not sure anything will ever top the story I heard about a data center that did regular generator tests, always successful, but the generator failed after a few minutes when there was an actual power outage. Turns out no one had ever noticed that the fuel transfer pump was only wired to utility power...

9
0
Silver badge

Re: Same everywhere

"Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.

I can well believe that.

Related, and seen somewhere on Youtube recently:

"Staff will treat Penetration Testers the same way as they do auditors. The natural inclination is to hide anything embarrassing; they won't tell them everything."

5
0
Silver badge

Re: Same everywhere

Facilities tests are almost always run by Facilities people who have a vested interest (and therefore a cognitive bias) in successful results. The Military case I mentioned before was more like a penetration test. The resiliency team *delighted* in failure - they weren't trying to prove the systems worked, they were trying to break them. That shift in perspective can dramatically change the results.

6
0

Re: Same everywhere

I remember one of our staff doing a pull-the-plug test they were specifically told not to do and doing 10 grand's worth of damage to test systems. Test had been done before.

Trouble with disaster recovery tests is you don't want to test what would happen to your datacenter if someone took a baseball bat to the cage on the left by actually doing it.

7
0
Anonymous Coward

Re: Same everywhere

Every organisation should have a dedicated DR person who has the right to p*** anyone off. If they are smart they only p*** off the lazy ones that deserve it.

1
0
Silver badge

Re: Same everywhere

[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.

Very true, though scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.

3
0
tfb
Bronze badge

Re: Same everywhere

It can be worse than ending your career. If you're a systemically-important financial institution then a failed DR test can quite plausibly crash the economy. So, not surprisingly, they never get done: they do DR tests but they are very carefully rehearsed events, usually of a tiny number of services, which don't represent reality at all.

The end result of all this is kind of terrifying: in due course some such institution *is* going to lose a whole DC, and will thus be forced to do an entirely unrehearsed DR of a very large number of services. That DR will almost certainly fail, and the zombie apocalypse follows.

2
0
Bronze badge

Re: Same everywhere

And for years, apparently. I recall reading a similar "fuel pump not on the right side of UPS" story in Lessons Learned (or not) from the (1965) Great Northeast (US) Blackout. Also similarly about a sump-pump in one hospital being considered "not critical", at least until seepage from the nearby river rose to the level needed to short out the generator in the basement.

OTOH, there were rumors of a surge of births nine months later, although it's hard to imagine losing access to SourceForge and /. would have that effect.

2
0
Orv
Silver badge

Re: Same everywhere

Clearly whoever made the decision at the hospital never lived in a house with a full basement, in an area with a high water table. The sump pump was one of the first things we worried about when power failed. Some of our neighbors had backup battery-powered ones.

0
0
Orv
Silver badge

Re: Same everywhere

...scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.

That actually happened to a hospital in the rural Michigan town I used to live in. It was later determined that the maintenance staff had been pencil-whipping the generator tests for years.

0
0
Silver badge

"their redundancy failed us..."

Er... no... your COMPLETE LACK OF REDUNDANCY failed you.

9
0
Anonymous Coward

"Er... no... your COMPLETE LACK OF REDUNDANCY failed you."

I think you meant: "Er... no... YOUR complete lack of redundancy failed you."

Pleasure to be of service!

6
0
Anonymous Coward

The Cloud...

Other people's computers you have no control over.

7
3
Bronze badge

Hmm, SauceForge and Sloshdat are old favourites of mine, but they have been long suffering from atrocious management. Now it seems that the last person that had half a clue of how to run a web server has left.

5
1

slashdot down, not missing much

slashdot used to rip all their articles off elreg anyways, if they don't come back, i guess it's no huge loss.

4
0
Silver badge

Ah the classic

"We have the hardware on hand and are at the final stages of negotiations with the new provider."

What do you mean it costs double to have a fully redundant site? It's not like it's ever going to be used.

2
0

Data Center?

Does it still make economic sense to buy space in a data center for a workload like sourceforge or slashdot? I would have expected a cloud service like AWS and Azure to be a more (logical|scalable|inexpensive) choice.

(Full disclosure, I work for one of those cloud providers.)

1
0
Silver badge

Re: Data Center?

I don't know, they may prefer uptime (even with this outage) over price.

0
0
Thumb Up

Re: Data Center?

Exactly.

0
0

Re: Data Center?

Owning your own stack and staff is still likely cheaper, to a degree, than cloud. The storage portion is the most expensive part of AWS.

0
0
Mushroom

just remember

being in the cloud simply means someone else's computer ....

0
0


Biting the hand that feeds IT © 1998–2017