729 teraflops, 71,000-core Super cost just US$5,500 to build

Cycle Computing has helped hard drive giant Western Digital shove a month's worth of simulations into eight hours on Amazon cores. The simulation workload was non-trivial: to check out new hard drive head designs, the company ran a million simulations, each involving a sweep of 22 head design parameters on three types …

  1. Notas Badoff

    I don'know, wha'd you wanna do tonight?

    "729 teraflops ...

    nearly 71,000 AWS cores for an eight-hour run ...

    completed nearly 620,000 compute-hours."

    Trying to figure out what this means for AWS's latent capacity. Am I really reading that AWS has many tens of thousands of cores just lying around - unused - waiting for customers? I can understand a couple of thousand cores just kind of loping along, poking the spare database or serving a web page, waiting for a 'real' question to come along. But towards and beyond 100K cores shooting the shit, waiting for a good stretch of the legs? How much capacity have they got just 'waiting'?

    1. Denarius
      Meh

      Re: I don'know, wha'd you wanna do tonight?

      Might have been a slow day. The NSA might be running short of crypto to crack, or Amazon might at least have broken even that day, but it does suggest a lot of capacity is lying around. It just seems odd, after a career of watching batch operation derided by client-server real-time interactive enthusiasts, that a solid batch-job success story outside a bank gets reported.

    2. Charlie Clark Silver badge

      Re: I don'know, wha'd you wanna do tonight?

      The genesis of AWS was just that: lots of capacity lying around that was only required for a few very busy periods of the year (Thanksgiving, Christmas, …).

      Businesses get to choose between operational and capital expenditure and pass the risk on to suppliers like Amazon. But don't worry: their risks are also limited, as data centres are usually funded by substantial subsidies.

  2. Trevor_Pott Gold badge

    Except... public cloud "doubters" never doubted this particular use case. The software was rewritten specifically to work with the public cloud; it is a definable, burstable workload; it runs as a batch (submit the workload, receive the result - you don't need to be connected to it all day); and it has a definable cost.

    That's completely different from taking a legacy "must be up 24/7" workload and tossing it into the public cloud. Especially one where the developer has no intention of (or can't, because they're out of business, lack the skills, etc.) rewriting the thing for the public cloud.

    The public cloud is not "pay for what you use", it is "pay for what you provision". If you need to provision the workload to be available 24/7, then the public cloud is terrible value for money. If you essentially need to run an HPC batch process, then it'll do you just fine.
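    To put rough numbers on that "pay for what you provision" point, here's a minimal back-of-envelope sketch. The per-core-hour rate is a made-up placeholder and the core count is rounded; only the "month of work squeezed into eight hours" ratio comes from the article, so treat the dollar figures as illustrative, not as AWS pricing.

        # Illustrative only: compares an 8-hour burst with keeping the same
        # capacity provisioned 24/7 for a month at the same (made-up) rate.
        HOURS_PER_MONTH = 30 * 24        # ~720 hours
        BURST_HOURS = 8                  # length of the WD run
        CORES = 70_000                   # roughly the size of the AWS cluster
        RATE_PER_CORE_HOUR = 0.01        # hypothetical rate in USD, not a real price

        burst_cost = CORES * BURST_HOURS * RATE_PER_CORE_HOUR
        always_on_cost = CORES * HOURS_PER_MONTH * RATE_PER_CORE_HOUR

        print(f"8-hour burst:      ${burst_cost:,.0f}")
        print(f"Provisioned 24/7:  ${always_on_cost:,.0f}")
        print(f"Ratio: {always_on_cost / burst_cost:.0f}x")   # ~90x

    Whatever the actual rate, provisioning the same capacity around the clock for a month costs roughly 90 times the eight-hour burst, which is the whole argument in one number.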

    1. Nate Amsden

      Absolutely, nail on the head there.

      This "supercomputer" didn't cost $5,500 to build, it cost $5,500 to rent. Big difference (duh). Obviously 99.9%+ of the workloads out there aren't suited to one-off runs of a few hours, never to be needed again.

      You've been able to "rent" supercomputer time for a long time, no news here.

      This is one of the very, very few good use cases of public cloud computing (IaaS anyway - SaaS is a good model, PaaS I'm not sold on).

      1. Anonymous Coward
        Anonymous Coward

        Agree with the above. I'm not against IaaS, but that "supercomputer" cost a hell of a lot more to build than $5,500.

        And no wonder AWS is losing money, if it has that many cores lying around waiting for Western Digital or someone else to come along for 8 hours.

    2. Anonymous Coward
      Happy

      Fits every big data predictive analytics job I've ever done (1975+) just fine.

    3. dan1980

      @Trevor_Pott

      Two things, mate:

      1. - Nail, head, etc...

      2. - Be nice to Richard.

      'Cloud' is perfect for such tasks, where you have defined - ahead of time - a set of operations, a program to carry them out and a method for running the workload in parallel, and then, at go time, you feed it the input and set it running.

      In a way, it's like outsourcing a big data entry job to a third party. That works well because the workload is one that lends itself to such an arrangement. It's also, generally, a 'burstable' workload in that data entry often comes in big loads at a time, meaning sometimes you need 5 people and sometimes you need 50. Keeping 50 drones on the payroll to cope with peak load is silly but keeping 5 and then 'bursting' the rest to an outsourcer is a good idea.

      You see similar things with call centers, where they have overflow to third parties.

      Once you've got your processes down then outsourcing data entry can make great sense, just like this application does. But this, of course, is not a new concept, and just as you have to assess things on a task-by-task basis to see if outsourcing is worthwhile (or even viable), so too must you do that with 'cloud' . . . stuff.

      1. Trevor_Pott Gold badge

        "Be nice to Richard"

        Always. He's fucking fantastic people. Without qualification, I'd be there for him, brother from another mother style. Doesn't mean we won't disagree about things from time to time.

        1. Ian Bush
          Paris Hilton

          "He's fucking fantastic people"

          Too much information ...

  3. pierce
    Paris Hilton

    Don't think of that as $5,500, think of it as $16,500/day. They just happened to use only a third of a day.
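    (For what it's worth, the figures quoted further up the thread let you extend that arithmetic; a quick sketch using only the numbers from the article - none of this is an official AWS price:)

        # Implied rates from the figures quoted above.
        total_cost = 5_500           # USD for the whole run
        run_hours = 8                # wall-clock length of the run
        compute_hours = 620_000      # "nearly 620,000 compute-hours"

        per_day = total_cost * 24 / run_hours        # ~$16,500/day, as above
        per_core_hour = total_cost / compute_hours   # ~$0.009 per core-hour

        print(f"Equivalent daily rate:      ${per_day:,.0f}")
        print(f"Implied cost per core-hour: ${per_core_hour:.4f}")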

    1. Anonymous Coward
      Anonymous Coward

      So?

      Even if it was $16.5K, it ran almost 100 times faster than their in-house solution.

  4. Denarius
    Trollface

    Wow, at last!

    something that can run M$ Office at a reasonable speed with their new Skype/Lync app. Nuff /bin/sed

  5. Anonymous Coward
    Anonymous Coward

    But what about security?

  6. Shannon Jacobs

    Pretty sure they optimized the scheduling

    Almost certain that they scheduled their job to run during slack periods. You can think of it as separate budgets for peak usage and slack time. Amazon would obviously charge much more when they have customers queued up, but if you're willing to wait for idle time, then the only marginal cost is the electricity. Ergo, getting $5,500 is better than getting nothing.

    1. Anonymous Coward
      Anonymous Coward

      Re: Pretty sure they optimized the scheduling

      Who says mainframe computing is dead? This would be CLASS M TIME=24.00.00 in IBM JCL.

  7. Infury8r

    Maybe the UK's Met Office should use it.

    And so save UK taxpayers £97,000,000.

    1. Ian Bush

      Re: Maybe the UK's Met Office should use it.

      And the Unified Model would run like a dog on it; you (and the Met Office's paying customers) probably wouldn't get tomorrow's prediction until next week at best, and it would cost a huge amount more in recurrent rather than capital expenditure.

      Comparing embarrassingly parallel workloads on loosely coupled, rented hardware with communication-sensitive codes that require tightly coupled, low-latency hardware is at best an exercise in futility.

      1. Jason Ozolins

        Re: Maybe the UK's Met Office should use it.

        Yup. People where I work are looking at how to remove a slowdown of up to 20% in large (>512-core) Unified Model runs that looks to be down to the rate at which the batch management system causes context switches on each node to track memory and disk usage. Never mind that the batch system doesn't use much CPU overall - the communication is so tightly coupled to the computation that a single slow process can make many other processes wait (and waste power/time). This is with all inter-node communication going over 56Gb/s InfiniBand. Not very suitable code to run on loosely coupled cloud nodes.

        (BTW, this is an example of an HPC job where turning on HyperThreading helps, as long as you only use one thread on each core for your compute job; the other hyperthreads get used to run OS/daemon/async comms stuff without causing context switches on the compute job threads. The observed performance hit from batch system accounting and other daemons is much lower with HT enabled.)
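        (For anyone wanting to try that at home: the trick of keeping one hardware thread per core free for OS/daemon work mostly comes down to pinning the compute processes to one logical CPU per core. A minimal Linux-only sketch, assuming the common layout where logical CPUs 0..N-1 are the first thread of each physical core and the siblings come after - check lscpu before trusting that on a given box:)

            # Pin this process to the first hardware thread of each physical core,
            # leaving the sibling hyperthreads free for OS/daemon/async-comms work.
            # Assumes logical CPUs 0..n_cores-1 map to distinct physical cores
            # (common on Linux, not guaranteed) and 2-way SMT. Linux-only API.
            import os

            n_logical = os.cpu_count()       # e.g. 32 logical CPUs with HT on
            n_cores = n_logical // 2         # physical cores, assuming 2-way SMT

            os.sched_setaffinity(0, set(range(n_cores)))   # 0 = current process
            print("Pinned to logical CPUs:", sorted(os.sched_getaffinity(0)))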

        Anyway, this WD workload sounds very much like the sort of thing companies have long farmed out to their engineering workstations overnight using HTCondor:

        http://en.wikipedia.org/wiki/HTCondor

        If it can run well under Condor, it'd run just fine in the cloud...
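        (To illustrate just how embarrassingly parallel that kind of sweep is - the parameter names and ranges below are invented placeholders, the article only says roughly 22 head-design parameters across three media types - generating the task list for Condor or a pile of cloud VMs is about this hard:)

            # Sketch of an embarrassingly parallel parameter sweep: every
            # (media type, parameter combination) pair is an independent job
            # that can run on any idle core. Values are placeholders, not WD's.
            import itertools

            media_types = ["media_a", "media_b", "media_c"]
            param_values = [(0, 1) for _ in range(19)]   # stand-in for the ~22 swept parameters

            tasks = [
                {"media": media, "params": combo}
                for media in media_types
                for combo in itertools.product(*param_values)
            ]

            print(f"{len(tasks):,} independent simulations queued")   # ~1.6 million here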

  8. Turtle

    “Gojira”...

    “Gojira” is the Japanese name for Godzilla, should there be anyone here who does not already know this li'l factlet.

    1. drand
      Headmaster

      Re: “Gojira”...

      Or 'Godzilla' is the English for 'Gojira'...

  9. phil dude
    Boffin

    back of envelope calc...

    OK: 729,000 GFLOPS / 71,000 cores ≈ 10 GFLOPS per core?

    Doesn't seem very efficient...? Intel's dual E5-v3-2687 gets 788 GFLOPS on LINPACK, and that is on 20 cores ≈ 40 GFLOPS/core.

    So this is 71,000 cores of some chip?

    or 925 dual Xeons (E5-v3-2687),

    or 700 nodes of ORNL's Titan.

    P.
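    (Same envelope, run as code - this just reproduces the figures quoted in this thread, nothing new:)

        # Re-running the back-of-envelope numbers above.
        total_gflops = 729_000        # 729 teraflops from the article
        aws_cores = 71_000            # cores in the eight-hour run

        per_aws_core = total_gflops / aws_cores    # ~10.3 GFLOPS per AWS core
        dual_xeon_gflops = 788                     # LINPACK figure quoted above
        per_xeon_core = dual_xeon_gflops / 20      # ~39 GFLOPS per Xeon core

        print(f"{per_aws_core:.1f} GFLOPS/core on AWS vs {per_xeon_core:.1f} on the dual Xeon")
        print(f"Equivalent dual-Xeon boxes: {total_gflops / dual_xeon_gflops:.0f}")   # ~925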

    1. SJG

      Re: back of envelope calc...

      An EC2 Compute Unit (aka virtual core) is, according to Wikipedia, based on a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. According to Amazon's instance definitions, you would probably need 16 of those to equate to an Intel Xeon E5-2680 v2 (Ivy Bridge) CPU.

  10. crediblywitless

    How much did the Matlab licence cost for that job?
