The hidden horse power driving Machine Learning models

Machine Learning is becoming the only realistically available method to perform many modern computational tasks in near real time. Machine vision, speech recognition and natural language processing have all proved difficult to crack without ML techniques. When it comes to hardware, the tasks themselves do not need a great deal of …

  1. Anonymous Coward

    "They, however, remained a relatively expensive device for the few, with demanding data needs."

    Doesn't help that all those hip young things are buying up every GPU they can get their hands on to mine cryptocurrency. Nigh on impossible to find decent cards these days, unless you're buying in the hundreds at a time.

    1. Korev Silver badge

      Intel are very keen to get into this game too; maybe if they push Phi/Nervana whilst there's a GPU shortage then they can catch up.

  2. DrBobK
    Headmaster

    Why is the DGX-1 so expensive? Why is it needed?

    I don't understand. A Titan-X has nearly 4,000 CUDA cores. A DGX-1 V100 has about 40,000 CUDA cores. A Titan-X costs about £1,000; a DGX-1 costs about £100,000. Are these things so limited by transfer rates between cards that the tenfold increase in price per core is worth it? I thought in a neural net architecture you could process data on sets of layers independently and only needed to transfer data across the connections at the top and bottom layers of each set? I am genuinely puzzled. Can someone tell me whether these nets really work differently to the multilayer back-prop I know of old (a minimal sketch of which follows below), and why the DGX-1 costs so much compared to the Titan-X?

    Yours, an academic who did NN stuff in the 1980s and 90s using such parallel compute monsters as the 16 CPU Encore Multimax!
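
    For anyone who wants the reference point, here is the multilayer back-prop of old as a minimal NumPy sketch - the layer sizes, learning rate and random data are made up purely for illustration. Even in this toy form, every update touches whole layer-sized weight matrices, which is the kind of work (and data movement) the GPUs are being thrown at.

      # Minimal two-layer back-prop sketch (NumPy). All sizes and the learning
      # rate are illustrative only.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.standard_normal((64, 100))      # 64 samples, 100 inputs
      y = rng.standard_normal((64, 10))       # 10 targets

      W1 = rng.standard_normal((100, 300)) * 0.01
      W2 = rng.standard_normal((300, 10)) * 0.01
      lr = 0.01

      for step in range(100):
          # forward pass
          h = np.tanh(X @ W1)                 # hidden-layer activations
          out = h @ W2                        # linear output layer
          err = out - y                       # squared-error gradient

          # backward pass: gradients flow layer by layer
          dW2 = h.T @ err
          dh = (err @ W2.T) * (1 - h**2)      # tanh derivative
          dW1 = X.T @ dh

          W1 -= lr * dW1
          W2 -= lr * dW2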

    1. DrBobK
      Headmaster

      Re: Why is the DGX-1 so expensive? Why is it needed?

      Probably just me stuck thinking in terms of Encore Multimaxes or a few Xeons - I guess with 40,000 cores each core is more likely to be simulating a single neuron rather than a layer - so lots of need for data transfer...

    2. Anonymous Coward

      Re: Why is the DGX-1 so expensive? Why is it needed?

      1) Data volumes are quite a bit bigger these days, so data transfers can be enormous. Even modest-sized models are measured in gigabytes - their training inputs are 10s, 100s or even 1000s of terabytes. Shuffling that volume of data quickly becomes the bottleneck. Fast as this appliance might be, it's still an appliance - you need distributed training to scale fully (and that's a Hard Problem; there's a sketch of the idea after point 2). So whether or not it's hugely faster is probably moot.

      2) The cost is probably motivated by this being an appliance. 10 cards in one box with networking, storage and compute from a single vendor. Probably not worth the 10x uplift but throw in some hefty discounting and it's arguably competitive for medium-scale workloads.
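
      As a rough sketch of what "distributed training" means here (every number below is made up): each worker computes a gradient on its own shard of the data, and the gradients are then averaged - the all-reduce. In a real cluster that averaging step is exactly what hammers NVLink / InfiniBand; this toy simulates it with a plain loop and np.mean.

        # Sketch of synchronous data-parallel training. Everything is illustrative.
        import numpy as np

        rng = np.random.default_rng(1)
        n_workers = 4
        W = rng.standard_normal((100, 10)) * 0.01            # shared model weights
        shards_X = [rng.standard_normal((256, 100)) for _ in range(n_workers)]
        shards_y = [rng.standard_normal((256, 10)) for _ in range(n_workers)]
        lr = 0.01

        for step in range(50):
            grads = []
            for Xs, ys in zip(shards_X, shards_y):           # in reality, run in parallel
                err = Xs @ W - ys                            # local forward pass + error
                grads.append(Xs.T @ err / len(Xs))           # local gradient
            W -= lr * np.mean(grads, axis=0)                 # "all-reduce", then update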

    3. Korev Silver badge
      Boffin

      Re: Why is the DGX-1 so expensive? Why is it needed?

      The Titan-X is a consumer card aimed at gamers; the DGX-1 has cards designed for compute servers, which cost rather a lot more money. There's also InfiniBand and NVLink.

      1. ArrZarr Silver badge
        Boffin

        Re: Why is the DGX-1 so expensive? Why is it needed?

        Going to nitpick that a bit - For gaming, a 1080Ti is a better card. I think the Titan range is more aimed at people doing video/image work as it has more features for that kind of precision work than the GTX range.

    4. Ian Michael Gumby

      @DrBobK ... Re: Why is the DGX-1 so expensive? Why is it needed?

      There's a bit more complexity under the covers when you go from 4K to 40K cores.

      That's why.

      You are also paying a premium for the latest and greatest kit. But the premium isn't as much as the cost increase due to complexity. That complexity could be in manufacturing (lower yields due to a higher percentage of defects, or something else), or in the design complexity of the interconnects, or something similar.

      Does that help?

  3. Korev Silver badge
    Boffin

    Only machine of its type?

    "The DGX-1 is the only machine of its type around at the moment. Sure, you can build your own machine with five GPU cards, but you still won't get close to the performance of the DGX-1 due to its custom bus features allowing data to be transferred to the GPU cards at impressive speeds."

    Cray sell a box with eight P100s and NVLink too. I'm sure other vendors are getting in on the party.

  4. John Smith 19 Gold badge
    Coat

    Why £100K. Convenience of course

    And the brand name.

    No, the raw hardware doesn't cost that much.

    But can you engineer the backplanes it's connected into to get the same data-loading bandwidth?

    Do you understand TensorFlow well enough to optimize its features to make use of that processing power? (There's a sketch of what that can look like below.)

    If you do, you can probably make one of your own, but then you're probably working for one of their competitors and are already doing so.
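
    By way of illustration only - not a claim about what Nvidia actually ship - this is roughly what "getting TensorFlow to use the processing power" can look like in a recent TensorFlow: MirroredStrategy replicates the model across whatever GPUs are visible and handles the gradient all-reduce for you. The model and the random data are placeholders.

      # Hedged sketch: multi-GPU data parallelism via tf.distribute.MirroredStrategy.
      import numpy as np
      import tensorflow as tf

      strategy = tf.distribute.MirroredStrategy()           # one replica per visible GPU
      print("replicas:", strategy.num_replicas_in_sync)

      with strategy.scope():                                # variables created in here
          model = tf.keras.Sequential([                     # are mirrored on every GPU
              tf.keras.Input(shape=(100,)),
              tf.keras.layers.Dense(256, activation="relu"),
              tf.keras.layers.Dense(10),
          ])
          model.compile(optimizer="adam", loss="mse")

      X = np.random.randn(1024, 100).astype("float32")
      y = np.random.randn(1024, 10).astype("float32")
      model.fit(X, y, batch_size=256, epochs=2)             # gradients averaged across replicas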

    1. Ian Michael Gumby
      Boffin

      @John Smith ... Re: Why £100K. Convenience of course

      The hardware cost is relative.

      You have expensive hardware, yes.

      But you also have the expense of the R&D in developing the hardware, which has to be expensed over the entire product line, and that's based on the estimated number of cards they expect to sell (toy example below).

      It's less about understanding TensorFlow than it is about providing the hardware and basic framework to let TensorFlow run.
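
      As a toy version of that amortisation point (every figure invented for illustration): the R&D bill gets spread over however many units they expect to shift, so the fewer the expected sales, the bigger the per-unit premium.

        # All figures invented for illustration.
        rnd_cost = 500_000_000        # assumed total R&D spend on the product line
        expected_units = 10_000       # assumed number of systems they expect to sell
        build_cost = 40_000           # assumed per-unit manufacturing cost

        price_floor = build_cost + rnd_cost / expected_units
        print(price_floor)            # 90,000 before margin, support, software, etc.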

  5. Anonymous Coward

    Perhaps I'm mistaken, but Golem, a new ICO attempting to connect computers around the world to form a supercomputer that anyone can use, seems to answer this quite well. See Golem.network - when I first read about them, machine learning was, for me, a no-brainer.

  6. richalt2

    Please consider evaluating the $100,000 in terms of the value of your output. If the output you can deliver in the timeframe the $100,000 machine makes possible isn't valued at more than that, then spend $10,000 and take ten times as long to deliver! This is really simple economics - balance your output value against the capital invested. Follow this path. Don't complain about the potential of spending more. After all, you could have bought two machines for $200,000 and doubled your complaints!
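
    A back-of-the-envelope version of that argument, with every number invented for illustration: the dearer box only pays off if the time it saves is worth more than the price difference.

      # Break-even sketch; all figures are made up.
      cheap_price, cheap_days = 10_000, 100        # slower kit, longer time to a result
      dear_price, dear_days = 100_000, 10          # DGX-class box, faster turnaround
      value_per_day_saved = 2_000                  # what a day's head start is worth to you

      extra_spend = dear_price - cheap_price
      value_of_time_saved = (cheap_days - dear_days) * value_per_day_saved
      print("extra spend:        ", extra_spend)           # 90,000
      print("value of time saved:", value_of_time_saved)   # 180,000 -> dear box wins here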

  7. Griffo

    What am I missing?

    The article said you couldn't get data to the Azure cloud fast enough, but apparently shipping hard disks was OK because you could write to them at 10Gbit. You can also get 10Gbit ExpressRoute connections straight to Azure. What am I missing?
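
    For a sense of scale (the dataset size below is an assumption, not a figure from the article): at 10Gbit line rate a petabyte-class training set takes over a week to move, which is where a box of disks in a van starts to look competitive.

      # Rough transfer-time arithmetic only.
      dataset_tb = 1000                                   # assumed dataset size, terabytes
      link_gbit_s = 10                                    # ExpressRoute-class link

      bits = dataset_tb * 1e12 * 8                        # terabytes -> bits
      seconds = bits / (link_gbit_s * 1e9)
      print(f"{seconds / 86400:.1f} days at line rate")   # ~9.3 days, before any overheads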

  8. aberglas

    Machine Learning <> (!= for some) Artificial Neural Networks

    There are many types of machine learning. ANNs have the buzz at the moment, and are often used inappropriately. And even if ANNs are used, there is (much) more to them than the naive back-propagation algorithm that is so, so slow to learn.

    Fix the algorithm before going crazy with hardware.
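
    A toy illustration of the "fix the algorithm first" point (the problem and all numbers are arbitrary): on the same badly conditioned quadratic, plain gradient descent against the same thing with momentum. The gap in final loss after the same number of steps is the whole argument.

      # Plain gradient descent vs gradient descent with momentum on an
      # ill-conditioned quadratic loss. Purely illustrative.
      import numpy as np

      A = np.diag([1.0, 100.0])                  # ill-conditioned loss surface
      x0 = np.array([1.0, 1.0])

      def run(momentum, lr=0.009, steps=200):
          x, v = x0.copy(), np.zeros_like(x0)
          for _ in range(steps):
              g = A @ x                          # gradient of 0.5 * x.T @ A @ x
              v = momentum * v - lr * g
              x = x + v
          return 0.5 * x @ A @ x                 # final loss

      print("plain GD loss:     ", run(momentum=0.0))
      print("GD + momentum loss:", run(momentum=0.9))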
