Hyperscale data centres win between their ears, not on the racks

Organisations that hope to improve their own data centre operations by adopting the techniques used by hyperscale operators like Google or Facebook need to consider the stuff between their ears, not just the stuff on their racks, because changing data centre culture is more powerful than changing equipment. That was the gist …

  1. Anonymous Coward
    Anonymous Coward

    new metrics by which to measure on-premises data centre teams,

    Does "cost of failure, including some external costs" figure anywhere in calculations and decision making? I didn't see it, maybe I missed it.

    And how about who picks up that cost? The poor suckers paying for the undelivered service, or the IT Director's bonuses? Seems like an odd week to forget about those...

    And finally: why would on-prem and cloud providers not have the same fundamental metrics, based on service availability and impact (cost) of non-availability? Innovation is for innovators, coin-op consultants and their disciples, at least until it's been proven fit for use in critical services.

    1. Mark 110

      Re: new metrics by which to measure on-premises data centre teams,

      I think the 'blast radius' concept was addressing the cost of failure issue: ensuring you understand the implications of a failure and limiting its impact.

      Metrics that reward quick failure detection and recovery will also reduce the cost of failing.
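
      A back-of-the-envelope sketch of that point, with invented figures and a made-up incident_cost helper: the cost of an incident scales with how long it takes to detect and recover, and with how much of the service sits inside the blast radius.

          def incident_cost(detect_min, recover_min, cost_per_min, blast_fraction):
              # Cost grows with how long the failure goes undetected, how long
              # recovery takes, and how much of the service is affected.
              return (detect_min + recover_min) * cost_per_min * blast_fraction

          # Same underlying failure: contained and caught quickly vs. wide and slow.
          print(incident_cost(2, 10, 500, 0.05))   # 300.0
          print(incident_cost(30, 120, 500, 1.0))  # 75000.0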

  2. Denarius

    problem

    But the pair also declared that “the biggest opportunity is changing how people think and react”. The PHB class think? All I ever saw was a reflex action to protect bonuses.

    Not sure how much the hypergasm about outsourcery and cloud is dying, but it does seem that big organisations not in their death throes may want to keep everything, including staff, on their premises, given all the caveats.

  3. Doctor Syntax Silver badge

    It's all easily explained

    Just go back and look who these guys are working for. Gartner.

  4. Mad Mike

    And this says it all

    “They won't do 100 changes at once because the blast radius is big,” Zeng said.

    The above statement shows the level of thought applied here. 1 change or 100 changes don't change the blast radius unless you assume all the changes fail. A single small change could have a huge blast radius (if it hits something critical), but 100 small changes against low importance areas could have almost no blast radius, especially individually. The blast radius has nothing to do with the number of changes. Perhaps they need to go back to the drawing board if this is their thinking.
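
    A minimal sketch of that point, using a hypothetical Change record and made-up numbers: the worst-case blast radius is driven by what each change touches, not by how many changes are batched together.

        from dataclasses import dataclass

        @dataclass
        class Change:
            target: str       # what the change touches
            dependants: int   # services that depend on that target

        def blast_radius(changes):
            # Worst case if everything in the batch goes wrong: the damage is
            # set by the dependants of each target, not by len(changes).
            return sum(c.dependants for c in changes)

        one_big = [Change("core-router", dependants=400)]
        many_small = [Change(f"batch-host-{i}", dependants=1) for i in range(100)]

        print(blast_radius(one_big))     # 400: one change, huge blast radius
        print(blast_radius(many_small))  # 100: a hundred changes, modest radius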

  5. Terje

    I think that one point they do have (even if possibly by accident) is that in many situations it might be better to plan for and expect things to fail, and to have plans for quick response and recovery, than to spend all your effort on never failing.

  6. Androgynous Cow Herd

    Moderately interesting

    But while many a venture-backed startup talks or dreams of being the next hyperscale breakout, most of them are never going to need this sort of architecture. The one or two that make it to scale will end up defining things like "failure domains" specific to the workload they are delivering.

    Trying to design or run a normal-to-large data centre using methodology developed for a hyperscale solution is laughable, because at true hyperscale the failure domain is not the application, or the server, or even the rack: it is the entire room, if not the entire site. You can lose a rack or two of servers and maybe a load balancer twitches, potentially there is a slight performance degradation, and you schedule a rip-and-replace of those racks for the next maintenance window. But no one has to stay late or freak out simply because you lost a couple of racks' worth of processing.

  7. Anonymous Coward
    Anonymous Coward

    Disaster Recovery anyone?

    Everything will fail one day, so get ready now.

    A DRP should encompass all possible scenarios and be tested. It doesn't matter if it's big or small; it just takes some thought.

  8. Anonymous Coward
    Anonymous Coward

    MTBF, MTTR, resilience, availability

    "“In the enterprise we measure and pay people on mean time between failure,” Skorupa said. “The whole operating principle is to avoid risk at all cost.”"

    Shirley a sensible principle might be to minimise service disruption (ie maximise service availability), taking into account the impact (cost?) of service disruption. But the article talks about risk without explaining what today's meaning is?

    Stuff fails, inevitably. Some stuff fails more than others - to have advance knowledge of what's likely to fail (and what the associated effects might be) can sometimes be handy, but isn't always essential.

    If the overall system design includes suitable resilience, a single subsystem failure shouldn't lead to service disruption. Sometimes multiple failures can be tolerated without visible disruption. Sometimes transactional integrity is required and sometimes it's not, so the required designs may differ depending on a particular setup's needs.

    All of which might actually have been their point, but it's kind of hard to tell.

    Notice where the terms MTBF and MTTR appear in the description above, and where "service availability" appeared in the article?
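
    For reference, the standard steady-state relationship being gestured at here is availability = MTBF / (MTBF + MTTR); the figures below are invented purely to show how recovery time and redundancy move the number.

        def availability(mtbf_hours, mttr_hours):
            # Steady-state availability: uptime as a fraction of total time.
            return mtbf_hours / (mtbf_hours + mttr_hours)

        # Same MTBF, very different availability depending on time to recover.
        print(availability(1000, 8))    # ~0.9921: slow detection and recovery
        print(availability(1000, 0.5))  # ~0.9995: fast detection and recovery

        def redundant_availability(a, n):
            # With n independent redundant subsystems, the service is only
            # down when all of them are down at once.
            return 1 - (1 - a) ** n

        print(redundant_availability(0.99, 2))  # 0.9999: one failure stays invisible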

    It's almost as though the last couple of decades mostly never happened. Which might well be the case as far as lots of Gartner staff (and their MBA-indoctrinated clients) are concerned, but the two Gartneers in question here appear to have been around in the 1990s too:

    https://www.gartner.com/analyst/38674/Evan-Zeng

    https://www.gartner.com/analyst/24834/Joe-Skorupa

    https://www.cnet.com/uk/news/fore-execs-flee-in-wake-of-gec-reorganization/

  9. Anonymous Coward
    Anonymous Coward

    don't reboot memcache servers

    Meanwhile at my org, the devs brilliantly put persistent data in memcache with no backup (a brand-new e-commerce app built from nothing about three years ago). So while we wait for them to migrate to Redis (close to three years now), management has asked that we not reboot those memcache nodes for things like security updates. Lucky for them, those servers have uptimes of over two years at this point. (The last outage was to move them to a newer VMware cluster with new CPUs; no vMotion between those two CPU types.)
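
    For what it's worth, a sketch of the usual write-through fix, with stand-in DurableStore and Cache classes rather than any real memcached or Redis client: keep a durable system of record so the cache stays disposable and a node reboot for security patches loses nothing.

        class DurableStore:
            # Stand-in for the real system of record (a database, say).
            def __init__(self):
                self._rows = {}
            def put(self, key, value):
                self._rows[key] = value
            def get(self, key):
                return self._rows.get(key)

        class Cache:
            # Stand-in for a memcached/Redis client; a reboot wipes it.
            def __init__(self):
                self._entries = {}
            def set(self, key, value):
                self._entries[key] = value
            def get(self, key):
                return self._entries.get(key)
            def flush(self):
                self._entries.clear()

        def write(store, cache, key, value):
            store.put(key, value)   # durable write first
            cache.set(key, value)   # the cache is only an accelerator

        def read(store, cache, key):
            value = cache.get(key)
            if value is None:       # cache miss, e.g. after a node reboot
                value = store.get(key)
                if value is not None:
                    cache.set(key, value)
            return value

        store, cache = DurableStore(), Cache()
        write(store, cache, "basket:42", {"sku": "A1", "qty": 3})
        cache.flush()                           # simulate the security-update reboot
        print(read(store, cache, "basket:42"))  # the data survives the reboot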

    Just a small example. I laugh when people mention the possibility of using a public cloud provider for DR or for bursting into. Clueless... so clueless.

    The company originally used public cloud (years ago) and I moved them out before costs got out of control.
