Google is developing some sort of back-end technology that automatically - and nearly instantly - redistributes live compute loads when a data center is in danger of overheating. Or maybe this is just talk. Google prefers to at least maintain the illusion of data-center nirvana. During a panel discussion last week at Structure …
Rise of the machines!
Data centres that automatically transfer processes across the globe to avoid local 'failures' - didn't Skynet do that? The end is nigh! :)
is this a complex problem that would take a lot of time and money and effort to solve?
modify the new server install procedure to add an extra step of "enter BIOS Setup, configure overheat shutdown settings", then allow the automatic redundancies they have designed for server failures to handle the rest (perhaps with a modified BIOS that has a wake-on-cool setting as well?)
can i haz huge piles of cash now for being smart enough to solve this complex problem?
They want to shift workload away from the datacentre BEFORE it hangs, not after. This is supposed to be pre-emptive avoidance, not post-disaster clear-up.
Now, if you could get a reliable temperature monitor for a CPU linked to every box in the datacentre, link it up to the load balancing system, AND have MapReduce redirect tasks from an overheating (but not failed) datacentre to many others across the globe, and THEN when that failing centre is repaired / back to normal operating temperature redirect tasks back to the datacentre, in REAL TIME, *THEN* u can haz moneez.
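For what it's worth, the "redirect away when hot, redirect back when cool" part can be sketched with two thresholds so the load doesn't flap back and forth around a single number. This is only a rough Python sketch: the `DataCentre` class, the threshold values and the `eligible` helper are all made up for illustration, and nothing here is claimed to be how Google (or MapReduce) actually does it.

```python
from dataclasses import dataclass

# Hypothetical thresholds: start draining at 35 C, only resume below 30 C.
DRAIN_ABOVE_C = 35.0
RESUME_BELOW_C = 30.0


@dataclass
class DataCentre:
    name: str
    temp_c: float
    draining: bool = False


def update_drain_state(dc: DataCentre) -> bool:
    """Apply hysteresis: the state only flips outside the 30-35 C band."""
    if dc.temp_c >= DRAIN_ABOVE_C:
        dc.draining = True
    elif dc.temp_c <= RESUME_BELOW_C:
        dc.draining = False
    # Between the two thresholds the previous state is kept.
    return dc.draining


def eligible(centres):
    """Names of the centres that may currently receive new tasks."""
    return [dc.name for dc in centres if not update_drain_state(dc)]
```

The gap between the two thresholds is the whole point: a centre that has cooled to 32 C stays out of rotation until it is properly back to normal.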
Been there done that
Look, call me a dinosaur if you will, but the nicer VAXes with VMSclusters in multi-site setups have been doing this kind of multi-datacentre loadbalancing thing since the last ice age, for customers who wanted it. And customers who want it today can still have it, so long as they're willing to buy Itanium and run VMS (which seems like a small price to pay for the functionality on offer).
I realise most Google architects and El Reg writers and readers weren't born back in that era, and that because VAXes predate the Interwebs they don't actually exist as far as Google architects and many others are concerned, but there's f*** all originality in what they're talking about.
That's exactly what I thought within 30 seconds of reading the article: they already have a technology for dealing with outright server failure, so why not have temp monitoring that gives a "virtual" fail and let the existing redundancy pick up the slack?
The devil is probably in the detail, however: providing a graceful fail rather than an actual catastrophic hardware failure that users notice as a glitch in the matrix (e.g. a search takes 5ms longer than normal :-p)
In any case, it doesn't seem such an amazing revolutionary idea, but that's easy for me to say after the fact, and probably why I don't work for Google. :-)
Sorry, I have prior art on this here post-it note that I have post-dated. I could post it to you but it'd get stuck on the inside of the envelope.
Or, another way (for Windows, or possibly using SNMP traps): write a script that fires every 3 seconds and checks the CPU and mobo temperature; if it is above a certain number, use WMI or PowerShell or something to fail over the node.
The problem is to cool the server quickly enough if all the nodes are under strain. Not a bad idea though.
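The every-3-seconds script idea above might look roughly like this. It's a Python sketch rather than WMI or PowerShell, and it deliberately doesn't talk to any real sensor or cluster API: `read_temp_c` and `fail_over` are hypothetical callables you would have to supply yourself, and the 80 C limit is just a placeholder.

```python
import time
from typing import Callable, Optional


def watch_temperature(
    read_temp_c: Callable[[], float],   # hypothetical: returns current temp in C
    fail_over: Callable[[], None],      # hypothetical: takes the node out of service
    limit_c: float = 80.0,              # placeholder limit, not a recommendation
    interval_s: float = 3.0,
    max_polls: Optional[int] = None,    # None means poll forever
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the temperature; fail over once and stop if it exceeds the limit.

    Returns True if a failover was triggered, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if read_temp_c() > limit_c:
            fail_over()
            return True
        polls += 1
        sleep(interval_s)
    return False
```

Passing `sleep` and `max_polls` in makes it easy to try out without waiting around, e.g. `watch_temperature(my_sensor, my_failover, max_polls=10)`.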
Because we'll always have Paris
Could just put a fan on the CPU? Better yet, a temperature controlled fan! Think I have a spare one that I can let them have for a nominal fee.
I know, I know, I'm going.
On a serious note, wouldn't you consider rotating the load in sync with the Earth, always keeping it on the night side? Possibly taking into account seasons if you were feeling flash. Assuming you had a truly global presence of course.
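A back-of-the-envelope version of the "keep it on the night side" idea, in Python. It approximates local solar time from longitude alone (15 degrees per hour) and calls 18:00 to 06:00 "night", so it ignores seasons and latitude entirely. The site names and longitudes are invented for the example.

```python
def local_solar_hour(utc_hour: float, longitude_deg: float) -> float:
    """Approximate local solar time: the Earth turns 15 degrees per hour."""
    return (utc_hour + longitude_deg / 15.0) % 24.0


def night_side(sites: dict, utc_hour: float) -> list:
    """Names of sites where it is roughly night (before 06:00 or from 18:00)."""
    result = []
    for name, longitude_deg in sites.items():
        hour = local_solar_hour(utc_hour, longitude_deg)
        if hour < 6.0 or hour >= 18.0:
            result.append(name)
    return result


# Hypothetical sites with rough longitudes (east positive, west negative).
SITES = {"europe_west": -6.0, "asia_southeast": 104.0, "us_west": -121.0}
```

To take seasons into account you would swap the fixed 06:00/18:00 window for real sunrise and sunset times, which needs latitude and date as well.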
re: Been there done that
Of course data centre load balancing has been done before - they're not talking about that. They're talking about widening the conditions of load balancing. Previously, load balancers tended either to be simple round-robin balancers or to act based on the load already being handled by each data centre.
The ones we're talking about here are able to take into account the temperature/power requirements and balance accordingly. It's just enhancing the load balancing algorithms.
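To make "enhancing the load balancing algorithms" a bit more concrete, here is one way a temperature term could be folded into an ordinary load-based weighting. It is only an illustrative Python sketch: the formula (spare capacity times thermal headroom) and the 35 C ceiling are assumptions of mine, not anything from the article.

```python
def routing_weights(centres: dict, max_temp_c: float = 35.0) -> dict:
    """Share of new work per centre, from load AND temperature.

    centres maps name -> (load_fraction, temp_c), load_fraction in 0..1.
    max_temp_c is a hypothetical ceiling; at or above it a centre gets nothing.
    """
    raw = {}
    for name, (load_fraction, temp_c) in centres.items():
        spare = max(0.0, 1.0 - load_fraction)                   # classic load term
        headroom = max(0.0, (max_temp_c - temp_c) / max_temp_c)  # new thermal term
        raw[name] = spare * headroom
    total = sum(raw.values())
    if total == 0:
        # Nowhere has any spare, cool capacity left.
        return {name: 0.0 for name in raw}
    return {name: weight / total for name, weight in raw.items()}
```

With equal load, the cooler centre ends up with the larger share, and a centre over the ceiling drops to zero without being treated as failed.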
Clock slowdown with temperature rise
A few years back, didn't we have CPUs with clocks that slow as temperature rises?
If used, these processors would process tasks more slowly, and the load balancing would direct jobs away from them.
You just need global load balancing, or do you? Perhaps the load balancing would add a distance cost/weighting and redirect to a neighbour.
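The throttling-plus-distance idea can be sketched as a simple cost function: a throttled site takes longer per job, a distant site adds network delay, and the job goes wherever the total is lowest. This is a Python sketch with invented numbers; `base_service_ms` and `ms_per_km` are assumptions, not measurements.

```python
def pick_site(sites: dict, base_service_ms: float = 10.0, ms_per_km: float = 0.01) -> str:
    """Pick the site with the lowest estimated cost for one job.

    sites maps name -> (clock_fraction, distance_km), where clock_fraction is
    the current clock speed as a fraction of full speed (must be > 0).
    """
    def cost(name: str) -> float:
        clock_fraction, distance_km = sites[name]
        service_ms = base_service_ms / clock_fraction  # slower clock, longer job
        network_ms = distance_km * ms_per_km           # distance weighting
        return service_ms + network_ms

    return min(sites, key=cost)
```

With everything at full speed the local site wins; throttle it to half speed and the job goes to the neighbour 200 km away rather than to a site on the other side of the planet.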
With a restartable process like building an index to dynamic web-pages, the requirements are less strict. Applications/services storing user data have a different set of requirements.
The big enabler for high flexibility is making the workloads very portable; that's accomplished by virtualisation (which these days is relatively easy).
That statement isn't meant to trivialise the scale of the supporting infrastructure. To do this properly, I guess you need lots of very standardised, very responsive, very high-capacity and very integrated technology. This is also quite obvious and, although not easy, it's not 'magic'.
Perhaps the magic ingredient is trending and predictive, as opposed to reactive, balancing of the workloads. This isn't my specialist area, but isn't that the way the electricity companies work?
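As a toy illustration of predictive versus reactive: fit a straight line through recent temperature samples and act on where the temperature is heading, not where it is now. This Python sketch uses plain least squares; the function names and the idea of a fixed look-ahead horizon are my own assumptions, and a real system would presumably use something far less naive.

```python
def predict_temp(samples: list, horizon_s: float) -> float:
    """Extrapolate temperature horizon_s seconds past the last sample.

    samples is a list of (time_s, temp_c) pairs; a least-squares line is fitted.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must not all share the same timestamp")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / spread
    target_t = samples[-1][0] + horizon_s
    return mean_y + slope * (target_t - mean_t)


def should_shift_load(samples: list, horizon_s: float, limit_c: float) -> bool:
    """Predictive trigger: act if the trend crosses the limit within the horizon."""
    return predict_temp(samples, horizon_s) >= limit_c
```

A purely reactive check would only look at `samples[-1][1]`; the predictive one starts moving work while the current reading is still comfortably under the limit.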
Data centers 101
They could, you know, not over-subscribe their cooling units?
No need to thank me, my consultancy fee is in the post
Am I the only one that smells bullshit?